Ablation Analysis
Introduction
Ablations are structural or architectural variations of a base model. In the context of the source assignment, these were intended to test some foundational modifications to the transformer architecture. We'll walk through these variations, describe what each entails and its expected effect, and then examine the empirical results from our training runs.
- Post-Norm: The post-norm ablation moves the RMSNorm within the transformer block from before the attention computation (pre-norm) to after it—specifically, between the attention calculation and the feedforward step. This is expected to slightly degrade model stability by exposing the attention mechanism to un-normalized variance, but it shouldn't have a catastrophic effect on overall convergence for smaller models.
- SiLU: The SiLU ablation replaces the more complex SwiGLU activation function with a standard SiLU. That is, it removes the "gating" componentwise product and its associated weight matrix, relying solely on the "Swish" function, SiLU(x) = x · σ(x), where σ is the logistic sigmoid. Because this simplifies the feedforward network and reduces the overall parameter count, it provides fewer "levers" for backpropagation to pull. Consequently, we expect this to slightly decrease the convergence rate and lead to a higher final loss.
- NoPE: A fun name for excluding the RoPE (Rotary Position Embedding) mechanism altogether. Given our models are trained with a relatively short sequence length of 256, it's an open question how much impact the lack of explicit positional information will have. Nevertheless, we generally expect increased losses since the model must infer positional relationships purely from context.
- No Norm: This ablation removes the RMSNorm steps from the computation entirely. This variation is expected to have the most dramatic impact, as the "norm early, norm often" heuristic has become the rule of thumb for a reason: consistent conditioning of the residual stream plays a central role in preventing exploding gradients. We expect the model training to be highly unstable, potentially leading to a complete crash.
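The norm-placement variants above can be sketched in a few lines. This is a simplified NumPy illustration, not our training code: `f` stands in for the attention or feedforward sub-layer, and the learnable RMSNorm gain is omitted.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each row to (approximately) unit RMS; learnable gain omitted.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_sublayer(x, f):
    # Baseline: normalize the sub-layer input, then add to the residual stream.
    return x + f(rms_norm(x))

def post_norm_sublayer(x, f):
    # Ablation: run the sub-layer on the raw input, normalize after the residual add.
    return rms_norm(x + f(x))
```

Note that in the pre-norm form the residual stream itself is never rescaled, while the post-norm form repeatedly renormalizes the stream, and the No Norm ablation drops `rms_norm` from both paths entirely.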
Results
Here is a set of charts showing all tests:
Let's dig into each case.
Post-Norm
As expected, in our small model post-norm has little impact on the final loss; however, we see a persistently higher gradient norm relative to the baseline. This ablation retained the final RMSNorm before the language-head projection, so the observed difference is likely due to the un-normalized inputs entering the attention mechanism: larger variances flow into the heavy computations tracked by the auto-differentiation graph, which directly inflates the scale of the gradients.
SiLU
As expected, the SiLU ablation displays noticeably higher losses compared to the baseline, while the gradient norm remains relatively low and stable. The lack of a gating mechanism restricts the network's expressivity, confirming that the additional parameters in SwiGLU meaningfully contribute to the model's ability to minimize loss.
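To make the parameter difference concrete, here is a minimal NumPy sketch of the two feedforward variants (hypothetical weight shapes, not our actual model dimensions); the SwiGLU baseline carries one extra weight matrix, the gate, relative to the plain-SiLU ablation:

```python
import numpy as np

def silu(x):
    # SiLU / "Swish": x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def ffn_silu(x, W_in, W_out):
    # Ablation: plain SiLU feedforward, no gate.
    return silu(x @ W_in) @ W_out

def ffn_swiglu(x, W_gate, W_up, W_out):
    # Baseline: the SiLU branch gates the up-projection elementwise,
    # at the cost of one extra weight matrix.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_out
```

The componentwise product lets one branch modulate the other per-feature, which is the expressivity the ablation gives up.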
NoPE
This is probably the most interesting ablation. We might initially expect relatively minor changes to the observed metrics given the small sequence length. However, we see that losses start to climb higher than even the non-normed run by iteration 300, and ultimately remain higher than all other ablations throughout the rest of the run. Similarly, the gradient norm is elevated compared to other pre-normed runs, suggesting that without rotary embeddings, the attention mechanism struggles to efficiently route information, leading to less stable and less effective training.
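For reference, RoPE rotates pairs of query/key channels by position-dependent angles, so the attention dot product between two positions depends on their relative offset. A minimal NumPy sketch, using the split-halves pairing convention (interleaved pairing is also common):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, d) with even d. Pair channel i with channel i + d//2
    # and rotate each pair by an angle that grows with position.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Position 0 is left unchanged (all angles are zero), and because each pair undergoes a pure rotation, vector norms are preserved; only the relative phase between positions encodes order. The NoPE ablation skips this step, leaving queries and keys position-agnostic.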
No Norm
Removing normalization entirely, unsurprisingly, caused training to destabilize and crash. Around iteration 600, the gradient norm shows a small spike, then a larger one, and finally a massive spike, after which the loss (technically, the perplexity) hit a math range overflow and the run crashed. This catastrophic failure is clearly reflected in the loss metrics, which display a sharp, irrecoverable spike, illustrating the necessity of layer normalization, and of matrix conditioning more generally, for stable deep learning.
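The blow-up mechanism is easy to reproduce in miniature. This toy NumPy loop (hypothetical sizes, a fixed random linear map standing in for a trained sub-layer, no learning) iterates a residual update with and without RMSNorm:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)  # random stand-in "sub-layer"
x_raw = rng.standard_normal(d)
x_conditioned = x_raw.copy()
for _ in range(32):  # a stack of residual layers
    x_raw = x_raw + x_raw @ W                               # no conditioning
    x_conditioned = rms_norm(x_conditioned + x_conditioned @ W)
print(np.linalg.norm(x_raw), np.linalg.norm(x_conditioned))
```

The unconditioned stream grows geometrically (each layer multiplies it by roughly the dominant eigenvalue of I + W), while the normalized stream stays pinned at unit RMS; in real training, that forward growth feeds directly back into the gradients.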
Conclusion
Most of our empirical results aligned well with theoretical expectations. The most interesting finding was from the NoPE (No Positional Encoding) ablation, which demonstrated a significant and persistent increase in loss during the asymptotic phase of training. This strongly suggests that the relative positional information provided by the RoPE mechanism plays a crucial role in allowing the attention mechanism to properly organize and process sequence information throughout the training process, even for short sequence lengths.
Furthermore, the spectacular failure of the No Norm ablation serves as a clear illustration of why normalization techniques like RMSNorm are ubiquitous in modern transformer architectures. Without them, the delicate balance of forward activations and backward gradients quickly spirals out of control, highlighting the importance of persistent residual stream conditioning in preventing exploding gradients. Ultimately, these ablations validate the standard architectural choices in modern LLMs, proving that each component—from gating mechanisms to positional encodings and layer normalization—serves a distinct and necessary purpose for stable and efficient training.