Can Aha Moments Be Fake?

Identifying True and Decorative Thinking Steps in Chain-of-Thought

Jiachen Zhao1*, Yiyou Sun2*, Weiyan Shi1, Dawn Song2
1 Northeastern University 2 UC Berkeley
* Co-authors | Co-advisors

Overview

Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning traces are often assumed to be a faithful reflection of LLMs’ internal thinking process and are used to monitor LLMs for unsafe intentions. However, by analyzing the step-wise causality of CoT on a model's prediction using the Average Treatment Effect (ATE), we design a True-Thinking Score (TTS) and show that LLMs often interleave two kinds of steps:

  • (1) true-thinking steps, which are faithfully used to generate the model’s final output.
  • (2) decorative-thinking steps, which give the appearance of reasoning but have minimal causal impact on the model’s final output.


We reveal that only a small subset of thinking steps have relatively high scores and causally drive the final prediction. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering LLMs' hidden states along or against this direction, we can control whether the model performs or disregards certain CoT steps when computing the result. Finally, we highlight that self-verification steps in CoT can also be decorative, where LLMs do not truly check their solution, while steering along the TrueThinking direction can force internal reasoning over these steps. Overall, our work reveals that LLMs can verbalize reasoning steps without performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.

Description of image
Figure 1: We find that reasoning steps in CoT may not always be true thinking but can instead function as decorative thinking, where the model is not internally using those steps to compute its answer. Taking self-verification steps as an example (often called “Aha moments”, where LLMs rethink their solution with phrases like “wait”), we first randomly replace the numerical values in the reasoning steps preceding the “Aha moment”, and then re-prompt the model for the answer using the modified CoT. In the left example, despite the correct reasoning in the self-verification steps, the model disregards them and still outputs the wrong result. In contrast, in the right example, the model truly follows its self-verification and arrives at the correct answer.

True-Thinking Score (TTS)

We use the Average Treatment Effect (ATE) to measure the causal impact of a reasoning step s. Let X=1 denote keeping the original step and X=0 denote replacing it with a minimally perturbed version. For each setting, we score the model by its confidence in the reference answer y* (obtained by early-exit on the full CoT):

$$ \mathrm{ATE}(c) \;=\; S_1(c) - S_0(c) $$

$$ S_x(c) \;=\; P\big(y^{*} \mid C=c,\ \mathrm{do}(X=x)\big), \qquad x, c \in \{0, 1\} $$

We evaluate two context regimes for the prefix before s: C=1 (intact context) tests whether the step is necessary when other supporting steps remain; C=0 (perturbed context) tests whether the step is sufficient when that support is weakened. This separates "needed with support" from "helpful on its own."

$$ \mathrm{TTS}(s) \;=\; \tfrac{1}{2}\left(\,\big|\mathrm{ATE}_{\text{nec}}(c{=}1)\big| + \big|\mathrm{ATE}_{\text{suf}}(c{=}0)\big|\,\right) $$

We take absolute ATE values to measure magnitude of causal influence (not direction). Higher TTS means the model truly uses this step in its computation; lower TTS means the step is more likely decorative.
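The two quantities above combine in a few lines. Below is a minimal sketch, where the hypothetical `confidence` dictionary stands in for the model-derived probabilities P(y* | C=c, do(X=x)):

```python
# Minimal sketch of the True-Thinking Score (TTS) for one reasoning step,
# assuming we already have the model's confidence in the reference answer y*
# under each (context c, step x) configuration. `confidence` is a hypothetical
# stand-in for P(y* | C=c, do(X=x)), keyed by (c, x).

def ate(confidence, c):
    """ATE(c) = S_1(c) - S_0(c): effect of keeping vs. perturbing the step."""
    return confidence[(c, 1)] - confidence[(c, 0)]

def true_thinking_score(confidence):
    """TTS(s) = 0.5 * (|ATE(c=1)| + |ATE(c=0)|)."""
    return 0.5 * (abs(ate(confidence, 1)) + abs(ate(confidence, 0)))

# Example: a step whose perturbation sharply drops confidence in both regimes.
conf = {
    (1, 1): 0.90, (1, 0): 0.40,  # intact context: necessity test
    (0, 1): 0.70, (0, 0): 0.20,  # perturbed context: sufficiency test
}
print(round(true_thinking_score(conf), 3))  # 0.5
```

A decorative step would leave the confidences unchanged under perturbation, giving both ATEs, and hence the TTS, near zero.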

Evaluation Results of Step-wise Causality in CoT

Distribution of TTS scores
Figure 2. (a) Dataset-level distribution of the TTS score on AIME dataset.
Distribution of ATE(c=1) vs ATE(c=0)
Figure 2. (b) Distribution of ATE(c=1) and ATE(c=0) scores.
Example CoT case for TTS
Figure 2. (c) Example CoT case and average TTS by step percentile on AIME dataset.

The distribution of TTS is long-tailed.

As shown in Figure 2. (a), most steps have low scores, while only a small fraction achieve very high scores. On the AIME dataset for Qwen-2.5, the mean is around 0.03: only 6.4% of steps exceed 0.3 and just 2.3% surpass 0.7. This suggests that only a handful of verbalized steps are truly critical, while many others are decorative. Moreover, Figure 2. (b) shows that relying only on intact context (c=1) can miss true-thinking steps that become visible under perturbed context (c=0). Steering experiments confirm that evaluations based solely on intact context can be misleading.

True- and decorative-thinking steps are interleaved.

As seen in Figure 2. (c), steps with high TTS appear throughout the CoT, though later steps tend to score higher on average. This shows that labeling an entire CoT as “faithful” or “unfaithful” is overly coarse. Our results also suggest that task difficulty does not guarantee more faithful reasoning: even on challenging datasets like AIME, many steps remain decorative, mirroring patterns on simpler math tasks.

Self-verification steps can be decorative.

Self-verification (“aha moment”) steps, such as “Wait, let me recompute...”, are where LLMs try to refine, check, or revise their solution. However, TTS reveals that many can be decorative: e.g., 12% of self-verification steps for Qwen-2.5 and 21% for Nemotron score below 0.005. In such cases, perturbing earlier context flips correct answers to wrong ones, while the self-check step itself contributes little. This raises concerns about the efficiency of reasoning, since models may appear to self-verify without actually doing so. An example is shown in Figure 3. More examples with different reasoning behaviors are provided in the Examples of Decorative Steps section.

Example CoT case for TTS
Figure 3: Example of unfaithful self-verification steps (highlighted in blue), where the TTS score of each step is below 0.005. Low TTS indicates that these steps are not truly engaged in computation; rather, they are likely decorative, giving the appearance of self-verification while contributing minimally to the model's final prediction.

True Thinking Can Be Mediated by a Steering Direction

We empirically show that whether an LLM truly reasons through a verbalized step or internally disregards it can be mediated by a steering direction in latent space. We first compute a TrueThinking direction per layer by subtracting the mean hidden state of low-TTS steps from that of high-TTS steps.

Direction at layer \(l\)
$$ v^{l}_{\textsf{TrueThinking}} \;=\; \mu^{l}_{\text{TT}} - \mu^{l}_{\text{DT}} \tag{1} $$
Activation addition (steering)
$$ \bar{h}^{l} \;=\; h^{l} + v^{l}_{\textsf{TrueThinking}} \tag{2} $$
(applied to all tokens in the target step)

For steering at test time, we apply activation addition at a single layer \(l\), adding the direction to the residual-stream hidden states of all tokens in the target step.
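Equations (1)–(2) amount to a mean-difference direction plus a broadcast addition. A toy NumPy sketch follows; the shapes and data are illustrative stand-ins, not the paper's actual activations:

```python
import numpy as np

# Illustrative sketch of Eqs. (1)-(2), assuming we have collected layer-l
# hidden states (one vector per token) for steps labeled high-TTS
# (true-thinking, TT) and low-TTS (decorative, DT).

rng = np.random.default_rng(0)
d = 8                                   # toy hidden size
h_true = rng.normal(1.0, 0.1, (50, d))  # hidden states from high-TTS steps
h_deco = rng.normal(0.0, 0.1, (50, d))  # hidden states from low-TTS steps

# Eq. (1): TrueThinking direction = difference of class means at layer l.
v_truethinking = h_true.mean(axis=0) - h_deco.mean(axis=0)

# Eq. (2): activation addition, applied to every token of the target step.
step_hidden = rng.normal(0.0, 0.1, (12, d))  # 12 tokens in the step
steered = step_hidden + v_truethinking       # broadcasts over all tokens

print(steered.shape)  # (12, 8)
```

Steering against the direction is the same operation with `-v_truethinking`. In practice this addition would be done inside the model's forward pass (e.g., via a hook on the chosen layer), which the sketch omits.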


Causal Tests

Two causal steering tests to validate the identified TrueThinking direction:

🧪 Test A: Engagement Test

Can steering make the model think through a step it normally ignores?

We focus on cases where the model normally ignores the error we inject in the perturbed step and outputs the right answer. We examine whether steering can make the model think through the perturbed step, follow the injected errors, and thus flip the right answer to the wrong answer.

🔄 Test B: Disengagement Test

Can reverse steering make the model disregard a step internally?

We focus on cases where the model normally thinks through the perturbed step and outputs the wrong answer. We examine whether steering can make the model disregard the perturbed step and flip the wrong answer to the right answer.

In both cases, flipping from correct→incorrect or incorrect→correct answers after steering demonstrates that we can causally mediate reasoning engagement.

main table
Table 1: Top-1 flip rate among all layers (%) $\uparrow$ on the Engagement Test (ET) and Disengagement Test (DT). We use flip rate as the metric, measuring how often steering changes the model's initial prediction. The AMC dataset is the in-domain evaluation, from which the TrueThinking directions are extracted, while the other two datasets serve as out-of-domain evaluation.
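The flip-rate metric can be sketched as follows; `before` and `after` are hypothetical per-example correctness labels, not the paper's data:

```python
# Sketch of the flip-rate metric from Table 1: the fraction of eligible cases
# where steering flips the model's initial answer to the opposite outcome.

def flip_rate(before, after, target):
    """Percentage of cases flipped to `target` correctness after steering.

    For the Engagement Test, target=False (correct -> incorrect);
    for the Disengagement Test, target=True (incorrect -> correct).
    Only cases starting on the opposite side of `target` are eligible.
    """
    cases = [(b, a) for b, a in zip(before, after) if b != target]
    flips = sum(1 for _, a in cases if a == target)
    return 100.0 * flips / len(cases)

# Disengagement Test example: 4 initially wrong answers, 2 become correct.
before = [False, False, False, False]
after  = [True,  False, True,  False]
print(flip_rate(before, after, target=True))  # 50.0
```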

Steering Reveals a Latent Signal of Thinking

As shown in Table 1, steering along the TrueThinking direction reliably flips predictions in both causal tests, while steering in the reverse direction suppresses the model’s use of the target reasoning step. These effects are far stronger than those of random vectors, confirming that the identified direction captures a genuine internal representation of reasoning rather than noise.

Experiments across datasets further show that the latent signal controlling step engagement is universal. A direction extracted on AMC generalizes effectively to MATH and AIME, revealing a model-internal mechanism of thinking rather than a dataset-specific artifact. As shown in Figure 4, in the Qwen models, layers 15–22 consistently yield the strongest intervention performance, suggesting that intermediate layers concentrate latent reasoning.

steering layer
Figure 4: Cross-domain results, where the TrueThinking direction is extracted on AMC and applied to MATH and AIME.

TrueThinking Direction Influences Attention. Steering along the TrueThinking direction increases a model’s attention to reasoning steps, while steering in the reverse direction decreases it. Example attention maps are shown in Figure 5. These results suggest that the TrueThinking direction may control the engagement of reasoning steps by modulating attention to them. However, as shown in Table 1, directly scaling attention weights does not reproduce true-thinking behavior. This may reveal a directional reasoning circuit: LLMs decide whether to engage in reasoning before modulating attention.

attn
Figure 5: Normalized attention scores of the step in Engagement Test (ET) and Disengagement Test (DT) before and after steering. (a–b) Applying the TrueThinking direction to a step increases the model’s attention to it. (c–d) Applying the reverse TrueThinking direction decreases the model’s attention.
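As an illustrative sketch of one plausible way to measure such a score (the exact normalization behind Figure 5 is an assumption here), one can sum the attention mass that query tokens assign to the step's key positions and normalize by step length:

```python
import numpy as np

# Toy sketch of a per-step normalized attention score: the mean attention
# mass that query tokens assign to the target step's key positions,
# normalized by the number of tokens in the step. `attn` is a synthetic
# attention matrix (queries x keys) for one head, rows summing to 1.

rng = np.random.default_rng(1)
raw = rng.random((6, 20))                     # 6 queries, 20 key positions
attn = raw / raw.sum(axis=1, keepdims=True)   # row-normalize like a softmax

step_tokens = slice(5, 10)                    # key positions of the step
step_len = 5
# Attention mass on the step per query, averaged and length-normalized.
step_attn = attn[:, step_tokens].sum(axis=1).mean() / step_len

print(round(step_attn, 3))
```

Comparing this score before and after adding (or subtracting) the TrueThinking direction would show the increase (or decrease) reported in Figure 5.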

Steering Decorative Self-Verification

Self-verification steps (e.g., “Wait, let me check…”) can appear logical yet remain decorative, having little influence on the model’s internal reasoning. By steering along the TrueThinking direction during these self-check steps, we can force the model to truly reason through them, restoring correct answers in up to 52% of initially unfaithful cases under perturbation.

Examples of Decorative Steps

Uncertainty-estimation example
Label Uncertainty-estimation
Context (last two sentences)
Another way is to divide the quadrilateral into two triangles and find the area of each triangle, then add them up. Looking at the points, maybe I can split the quadrilateral along the x-axis
Initial step — Uncertainty-estimation
The points (-1, 0) and (2, 0) are on the x-axis, so the quadrilateral is split into two triangles: one above the x-axis and one below. Wait, actually, the point (0, 1) is above the x-axis, and (0, -3) is below. So, if I connect (-1, 0) to (0, 1) to (2, 0) to (0, -3) and back to (-1, 0), it's a quadrilateral that crosses over the x-axis. Alternatively, maybe I can split it into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (2, 0), (0, -3), (-1, 0). Wait, but that might not cover the entire quadrilateral. Alternatively, perhaps split it into triangles with a common diagonal. Let me see. If I connect (0, 1) to (0, -3), that's a vertical line along the y-axis. Then, the quadrilateral is split into two parts: one on the left side of the y-axis and one on the right. But actually, the points are (-1, 0), (0, 1), (2, 0), (0, -3)
Uncertainty-estimation example
Label Uncertainty-estimation
Context (last two sentences)
Looking at the points, maybe I can split the quadrilateral along the x-axis. The points (-1, 0) and (2, 0) are on the x-axis, so the quadrilateral is split into two triangles: one above the x-axis and one below
Initial step — Uncertainty-estimation
Wait, actually, the point (0, 1) is above the x-axis, and (0, -3) is below. So, if I connect (-1, 0) to (0, 1) to (2, 0) to (0, -3) and back to (-1, 0), it's a quadrilateral that crosses over the x-axis. Alternatively, maybe I can split it into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (2, 0), (0, -3), (-1, 0). Wait, but that might not cover the entire quadrilateral. Alternatively, perhaps split it into triangles with a common diagonal. Let me see. If I connect (0, 1) to (0, -3), that's a vertical line along the y-axis. Then, the quadrilateral is split into two parts: one on the left side of the y-axis and one on the right. But actually, the points are (-1, 0), (0, 1), (2, 0), (0, -3)
Backtracking example
Label Backtracking
Context (last two sentences)
So, putting it all together, 20% of 50% of 80 is 8. That seems correct, but let me verify it another way to make sure I didn't make a mistake
Initial step — Backtracking
Another approach is to multiply all the percentages together first and then apply them to 80. So, 20% is 0.2, and 50% is 0.5. Multiplying those together: 0.2 * 0.5 = 0.1
Adding-knowledge example
Label Adding-knowledge
Context (last two sentences)
Wait, but that might not be accurate. Alternatively, maybe split the quadrilateral into two triangles by drawing a diagonal from (-1, 0) to (2, 0)
Initial step — Adding-knowledge
Then, the quadrilateral is split into two triangles: one with vertices (-1, 0), (0, 1), (2, 0) and another with vertices (-1, 0), (2, 0), (0, -3). Let me calculate the area of each triangle and add them. First triangle: (-1, 0), (0, 1), (2, 0) Using the formula for the area of a triangle with coordinates: 1/2 |x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2)| Plugging in the points: x1 = -1, y1 = 0 x2 = 0, y2 = 1 x3 = 2, y3 = 0 So, Area = 1/2 | (-1)(1 - 0) + 0(0 - 0) + 2(0 - 1) | = 1/2 | (-1)(1) + 0 + 2(-1) | = 1/2 | -1 + 0 - 2 | = 1/2 | -3 | = 1/2 * 3 = 1.5
Adding-knowledge example
Label Adding-knowledge
Context (last two sentences)
Let me calculate the area of each triangle and add them. First triangle: (-1, 0), (0, 1), (2, 0) Using the formula for the area of a triangle with coordinates: 1/2 |x1(y2 - y3) + x2(y3 - y1) + x3(y1 - y2)| Plugging in the points: x1 = -1, y1 = 0 x2 = 0, y2 = 1 x3 = 2, y3 = 0 So, Area = 1/2 | (-1)(1 - 0) + 0(0 - 0) + 2(0 - 1) | = 1/2 | (-1)(1) + 0 + 2(-1) | = 1/2 | -1 + 0 - 2 | = 1/2 | -3 | = 1/2 * 3 = 1.5
Initial step — Adding-knowledge
Second triangle: (-1, 0), (2, 0), (0, -3) Again, using the same formula: x1 = -1, y1 = 0 x2 = 2, y2 = 0 x3 = 0, y3 = -3 Area = 1/2 | (-1)(0 - (-3)) + 2((-3) - 0) + 0(0 - 0) | = 1/2 | (-1)(3) + 2(-3) + 0 | = 1/2 | -3 - 6 + 0 | = 1/2 | -9 | = 1/2 * 9 = 4.5 Adding both areas: 1.5 + 4.5 = 6 Okay, so that's the same result as before

BibTeX

@misc{zhao2025ahamomentsfakeidentifying,
      title={Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought}, 
      author={Jiachen Zhao and Yiyou Sun and Weiyan Shi and Dawn Song},
      year={2025},
      eprint={2510.24941},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.24941}, 
}