Ablation Analysis

Introduction

Ablations are structural or architectural variations of a base model. In the context of the source assignment, they were intended to test some foundational modifications to the transformer architecture. We'll walk through each variation, describe what it entails and what effect we expect, and then examine the empirical results from our training runs.

Results

The charts below show the loss and gradient-norm curves for all runs.

Let's dig into each case.

Post-Norm

As expected, in our small model post-norm has little impact on the final loss; however, we see a persistently higher gradient norm relative to the baseline. This ablation retained the final RMSNorm before the "language head" up-projection, so the observed difference is likely due to the un-normalized inputs entering the attention mechanism. Larger variances flow into the matrix multiplications tracked by the auto-differentiation graph, which naturally inflates the scale of the backward gradients.
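The difference in ordering can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the actual training code: the block names and the unscaled rms_norm helper are assumptions made for clarity.

```python
import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain, for illustration only."""
    return x / x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()

class PreNormBlock(nn.Module):
    """Pre-norm (the baseline): the sublayer sees normalized inputs,
    while the residual stream itself stays un-normalized."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(rms_norm(x))

class PostNormBlock(nn.Module):
    """Post-norm (the ablation): the sublayer sees the raw residual
    stream, and normalization happens after the residual add."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        return rms_norm(x + self.sublayer(x))
```

In the post-norm arrangement the attention and MLP sublayers receive un-normalized activations, which is the most plausible origin of the elevated gradient norms.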

SiLU

As expected, the SiLU ablation displays noticeably higher losses compared to the baseline, while the gradient norm remains relatively low and stable. The lack of a gating mechanism restricts the network's expressivity, confirming that the additional parameters in SwiGLU meaningfully contribute to the model's ability to minimize loss.
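The two feed-forward variants differ only in the gate. A minimal sketch (class and parameter names are illustrative; real implementations often rescale the hidden size so parameter counts match, which this sketch deliberately does not do):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiLUFFN(nn.Module):
    """Ablation: a single up-projection through SiLU, no gate."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class SwiGLUFFN(nn.Module):
    """Baseline: a second projection gates the SiLU branch
    elementwise, adding parameters and multiplicative
    interactions between input features."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

At equal hidden width, SwiGLU carries one extra d_model × d_ff weight matrix plus the elementwise product, which is precisely the expressivity the ablation removes.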

NoPE

This is probably the most interesting ablation. We might initially expect relatively minor changes to the observed metrics given the small sequence length. However, we see that losses start to climb higher than even the non-normed run by iteration 300, and ultimately remain higher than all other ablations throughout the rest of the run. Similarly, the gradient norm is elevated compared to other pre-normed runs, suggesting that without rotary embeddings, the attention mechanism struggles to efficiently route information, leading to less stable and less effective training.
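To see what the model loses, here is a minimal sketch of rotary embeddings (the half-split variant; the function name and shapes are assumptions for illustration). The key property is that the dot product between a rotated query and a rotated key depends only on their relative offset, which is exactly the signal NoPE removes:

```python
import torch

def rope(x, base=10000.0):
    """Minimal rotary-embedding sketch. x: (seq, dim), dim even.
    Channel pairs (i, i + dim/2) are rotated by an angle that
    grows linearly with position and shrinks with frequency index."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)
```

Because each channel pair is rotated by position × frequency, the rotation applied to a query at position m and a key at position n composes into a net rotation by (m − n) × frequency, so attention logits encode relative position even though each token is transformed independently.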

No Norm

Removing normalization entirely, unsurprisingly, destabilized training and crashed the run. Around iteration 600, the gradient norm shows a small spike, then a larger one, and finally a massive one, after which the loss (technically, the perplexity) hit a math range overflow and the run crashed. This catastrophic failure is clearly reflected in the loss curve, which shows a sharp, irrecoverable spike, illustrating the necessity of layer normalization, and of matrix conditioning more generally, for stable deep learning.
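The failure mode can be reproduced in miniature. The toy stack below (illustrative only, not the actual training code; names and scales are assumptions) adds a random linear update to a residual stream with and without an RMSNorm on the sublayer input. Without it, each layer multiplies the activation scale by a factor greater than one on average, so the scale compounds exponentially with depth, mirroring the escalating spikes seen before the crash:

```python
import torch

def rms_norm(x, eps=1e-6):
    """Unscaled RMSNorm, for illustration."""
    return x / x.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()

def residual_scale(depth, d, use_norm, seed=0):
    """Run a toy residual stack of random linear sublayers and
    return the final RMS activation scale."""
    torch.manual_seed(seed)  # identical weights for both settings
    x = torch.randn(4, d)
    for _ in range(depth):
        w = torch.randn(d, d) / d ** 0.5
        h = rms_norm(x) if use_norm else x
        x = x + 0.5 * (h @ w)  # residual update
    return x.pow(2).mean().sqrt().item()
```

With normalization, each update has a bounded scale and the stream grows only slowly; without it, the update scales with the stream itself and the activations blow up, which is exactly the conditioning role the conclusion attributes to RMSNorm.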

Conclusion

Most of our empirical results aligned well with theoretical expectations. The most interesting finding was from the NoPE (No Positional Encoding) ablation, which demonstrated a significant and persistent increase in loss during the asymptotic phase of training. This strongly suggests that the relative positional information provided by the RoPE mechanism plays a crucial role in allowing the attention mechanism to properly organize and process sequence information throughout the training process, even for short sequence lengths.

Furthermore, the spectacular failure of the No Norm ablation serves as a clear illustration of why normalization techniques like RMSNorm are ubiquitous in modern transformer architectures. Without them, the delicate balance of forward activations and backward gradients quickly spirals out of control, highlighting the importance of persistent residual stream conditioning in preventing exploding gradients. Ultimately, these ablations validate the standard architectural choices in modern LLMs, proving that each component—from gating mechanisms to positional encodings and layer normalization—serves a distinct and necessary purpose for stable and efficient training.