Benchmarks and Empirical Performance Analysis

Introduction

Benchmarking work in this project was centered on two pedagogic goals. The first was simply gaining familiarity with benchmarking tools. To this end, the blocked_autorange method of PyTorch's benchmark.Timer class was initially used to execute a controlled, large-scale grid search across core model parameters (see below).

These results were then used to more robustly estimate model specifications that could be run on the available hardware, culminating in the selection of two model settings for comparison, one "wide" and one "deep", a relatively standard kind of comparison. The two models had similar parameter counts and compute budgets but differed in memory use.

After training on the Apple M4, the two models were trained again on the RTX 4090. The analysis concludes with a comparison of training performance between the two systems, followed by sample output from the final model weights.

The first step in benchmarking was a grid search across core model parameters, including

As the benchmark.Timer class primarily measures time and throughput, those are the metrics collected. The blocked_autorange method was applied to a single training-loop iteration, with the expectation that this would produce reliable and repeatable results. This holds insofar as the method ensures the memory space is clean and the device has been synchronized, but part of the method's reliability comes from repeated execution of the code under study, and in this case the minimum time allotted was often shorter than a single run. Nonetheless, the data were consistent with expectations, as can be seen below.
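The measurement setup described above can be sketched roughly as follows; the model, optimizer, and dimensions here are stand-ins for the project's actual configuration, not its real code:

```python
import torch
import torch.utils.benchmark as benchmark

# Illustrative stand-ins for the real model and optimizer under study.
model = torch.nn.Linear(256, 256)
opt = torch.optim.AdamW(model.parameters())
x = torch.randn(32, 256)

def train_step():
    # One training-loop iteration: forward, backward, optimizer step.
    opt.zero_grad()
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()

timer = benchmark.Timer(
    stmt="train_step()",
    globals={"train_step": train_step},
)
# blocked_autorange repeats the statement until min_run_time is filled;
# for a slow step, a single run may already exceed that budget, which is
# the limitation noted in the text.
measurement = timer.blocked_autorange(min_run_time=1.0)
print(measurement.median)  # median seconds per iteration
```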

Observations

Expected vs Actual Performance

It is important to note that these are "expected" in the sense of "calculated from listed system TFLOPs" rather than any more nuanced estimate. As such, actual performance is always worse than "expected", with the dashed line on the following charts marking the point where actual and expected would be equal.
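The "expected" figure is just the naive roofline bound: step FLOPs divided by the listed peak rate. A minimal sketch, with illustrative numbers that are assumptions rather than the project's actual measurements:

```python
def expected_step_seconds(flops_per_step: float, peak_tflops: float) -> float:
    """Lower bound on step time assuming 100% utilization of listed peak FLOPs."""
    return flops_per_step / (peak_tflops * 1e12)

# A common rough estimate: ~6 * params * tokens FLOPs per forward+backward pass.
params, tokens = 25e6, 32 * 256
flops = 6 * params * tokens
# peak_tflops is a hypothetical listed-spec value for illustration.
print(expected_step_seconds(flops, peak_tflops=80.0))
```

Any gap between a measured step time and this bound is the shortfall plotted against the dashed line.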

Observations

Conclusions

The observations broadly reflect expectations. The modifier "broadly" is used because there is a persistent level of uncertainty associated with Apple Silicon and the macOS/MPS stack: even with the controls of the PyTorch benchmark library, the Mac operates as a desktop system, with numerous processes that can cause unpredictable resource contention. We left the system to run these benchmarks overnight, turned off backups, and closed all apps, so the data emerged relatively clean.

Returning to core questions of training dynamics, the main question raised by research into the system is whether it is heavily memory-bottlenecked, preventing larger models from being trained, or whether it suffers from a poorly optimized stack that requires numerous MPS kernel launches, increasing overhead and causing increased batch sizes to improve throughput by amortizing that overhead.

In fact, both are true: increasing batch size does lead to improved throughput, while larger models bog down the system and show increasingly sub-optimal performance.
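A toy latency model makes the first effect concrete: if each step pays a fixed overhead (e.g. kernel launches) plus a per-item cost, throughput rises with batch size as the overhead is amortized. The constants below are purely illustrative:

```python
def step_time(batch: int, overhead_s: float = 0.010, per_item_s: float = 0.001) -> float:
    # Fixed per-step overhead (launch latency) plus linear per-item compute cost.
    return overhead_s + batch * per_item_s

def throughput(batch: int) -> float:
    # Items processed per second at a given batch size.
    return batch / step_time(batch)

# Throughput grows with batch size because the fixed overhead is shared
# across more items, approaching 1 / per_item_s asymptotically.
print(throughput(1), throughput(32))
```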

Model Selection

Ultimately, in order to provide a robust comparison while remaining within memory limits, two model specs were chosen, one of which actually fell outside the grid search, as its width made the feasible number of layers smaller than the grid's minimum.

The decision to use these models was reached after an attempt to train larger models took excessive time. Those models also varied across several parameters at once, which suits a more intuitive, exploratory approach, but a controlled "deep vs. wide" comparison is more appropriate for this project.

The two models selected were therefore matched on most parameters, with the choice made to keep head count equal while varying the head dimension along with the model and feedforward dimensions. To compensate for the difference in width, the model size estimator was used to select a number of layers for the wide model that would result in an equivalent parameter count.
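The matching logic can be sketched with a standard transformer-block parameter estimate; the dimensions below are hypothetical stand-ins for the project's actual configurations:

```python
def block_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Approximate transformer-block parameter count (biases and norms omitted)."""
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # up- and down-projections
    return n_layers * (attn + ffn)

# Illustrative "wide" vs "deep" configs: layer count on the wide model is
# chosen so the two parameter counts come out roughly equal.
wide = block_params(n_layers=4, d_model=768, d_ff=3072)
deep = block_params(n_layers=12, d_model=448, d_ff=1792)
print(wide, deep)
```

With matched parameter counts, any divergence in training behavior can be attributed to the depth-vs-width trade-off rather than raw capacity.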

Overall training memory could have been controlled further with gradient checkpointing, which would have significantly reduced the memory consumed by storing activation values across many layers. However, this was not part of the assignment spec, and the models were sized to fit comfortably in system memory, making it unnecessary.
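For reference, gradient checkpointing in PyTorch trades memory for compute by recomputing a block's activations during the backward pass instead of storing them. A minimal sketch (not part of the project's training code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A stand-in block whose intermediate activations would otherwise be stored.
layer = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `layer` are discarded after the forward pass and
# recomputed on-the-fly during backward, reducing peak memory.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```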

Based on existing literature, it was expected that model A would converge to its minimum loss more quickly due to having fewer layers, but, for the same reason, would likely reach a higher minimum loss, as the shallower architecture can only learn a relatively flat manifold.

Training Dynamics Across Architectures

Training Throughput Comparison

As expected, the RTX 4090 showed a significant speedup: about 20x for the wide model A and 16x for the deep model B. This is surprisingly close to the 20x speedup expected for float32 operations on the RTX, with the reduced performance on the deep model likely caused by memory latency.

Interestingly, the iteration times for the deep model on MPS show a period of increased step time, probably due to macOS housekeeping triggered automatically. This could also be a memory-management artifact: there are additional latency spikes, probably due to memory pressure, even though the model should, on paper, fit comfortably.

Training Curve Analysis

Training shows the expected dynamics: the wide model A converges more quickly but to a higher steady-state loss, while the deep model B converges more slowly to a lower one. The pre-clipping gradient norm is consistent with this dynamic, with the norm for the deep model remaining elevated for about 20 cycles longer than for the wide model.

Conclusions

Overall, this investigation successfully highlighted the practical constraints and trade-offs in LLM training. The hardware comparison showed a clear advantage for the RTX 4090, which delivered a consistent ~20x speedup over the Apple M4, effectively matching theoretical expectations for float32 performance. While the M4 was certainly functional for smaller-scale experimentation, the analysis also highlighted its susceptibility to memory management latency and kernel launch overheads, particularly with deeper network architectures.

On the architectural front, the controlled comparison between "wide" and "deep" models confirmed established intuitions: the wider, shallower model offered rapid convergence but limited capacity, while the deeper model achieved a lower steady-state loss at the cost of longer training horizons. Taken together, these results show that while consumer silicon can support initial development, scaling model depth and performance requires both specialized hardware and careful architectural tuning.

Please continue to the Optimizer Sweeps page.