Architecture and Expected Performance Analysis

Introduction

This document describes the architecture and expected performance. The basic model design is a standard transformer with a core model dimension (d_model), which the attention heads subdivide such that each layer's n_heads attention heads have head dimension d_head = d_model / n_heads. Each attention sublayer is followed by a feedforward layer that is up-projected to d_ff and down-projected back to d_model, so that the layer's results can be added back into the residual stream. This is repeated for n_layers layers, and the output is projected up to provide logits across the vocab_size vocabulary elements, which are then used for inference.

Model Architecture

More concretely, the model consists of a token embedding, a stack of transformer blocks (multi-head self-attention followed by a feedforward network, each added back into the residual stream), and a final projection from d_model to the vocabulary logits.
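The components above can be sketched in PyTorch as follows. This is a minimal illustration of the described structure, not the author's actual implementation; the hyperparameter names (d_model, n_heads, d_ff, n_layers) and default values are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm transformer block: self-attention and a feedforward
    network, each added back into the residual stream."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),   # up-projection to d_ff
            nn.GELU(),
            nn.Linear(d_ff, d_model),   # down-projection back to d_model
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                      # residual add (attention)
        x = x + self.ff(self.ln2(x))   # residual add (feedforward)
        return x

class TinyTransformerLM(nn.Module):
    """Embedding -> n_layers transformer blocks -> projection to vocab logits."""
    def __init__(self, vocab_size, d_model=64, n_heads=4, d_ff=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)  # project up to logits

    def forward(self, idx):
        x = self.embed(idx)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))  # (batch, seq, vocab_size)
```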

Loss Function, Optimizer, Scheduler, and Training Details

Each token in the batch of sequences is used to predict the next-token probabilities by converting the logits via softmax. The cross-entropy loss

L = -(1/N) Σ_t log p_θ(x_{t+1} | x_{1..t})

is then calculated based on the actual next-token values. This loss is then backpropagated, with the trainable parameters updated via the AdamW optimizer, which adds decoupled weight decay to the Adam optimizer.
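A single training step along these lines can be sketched as below. The function name and shapes are illustrative assumptions; `model` is any causal LM returning logits of shape (batch, seq, vocab).

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    """One optimization step: shift tokens by one to form next-token targets,
    compute cross-entropy over the flattened batch, backprop, and update."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                                  # (B, T-1, V)
    # F.cross_entropy applies log-softmax internally, matching the loss above.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()

# AdamW = Adam with decoupled weight decay; lr/weight_decay values are
# placeholders, not the author's settings:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```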

The optimizer's learning rate is supplied by a cosine-annealing scheduler: during the warmup iterations it increases linearly from the minimum to the maximum learning rate, then decreases along a cosine curve back to the minimum value at the final iteration.
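The schedule described above can be written as a small pure function. The parameter names are illustrative, not taken from the author's code:

```python
import math

def cosine_lr(it: int, max_it: int, warmup_it: int,
              min_lr: float, max_lr: float) -> float:
    """Linear warmup from min_lr to max_lr over warmup_it iterations,
    then cosine decay back to min_lr by the final iteration."""
    if it < warmup_it:
        frac = it / max(warmup_it, 1)
        return min_lr + frac * (max_lr - min_lr)
    progress = (it - warmup_it) / max(max_it - warmup_it, 1)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * progress)) * (max_lr - min_lr)
```

At iteration 0 this returns min_lr, at the end of warmup it returns max_lr, and at max_it it has decayed back to min_lr.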

The model was trained in this fashion, with evaluation loss and perplexity measured at intervals and checkpoints saved for both the latest and the best-evaluating parameters.

An important note regarding the optimization flow is that the gradients are subjected to norm clipping. For analysis, unclipped norms are often examined, as they reflect the actual magnitude of the gradients; but clipping plays a key role in stabilizing training, such as under learning rates that would otherwise cause divergence.
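Conveniently, PyTorch's `clip_grad_norm_` returns the total norm computed *before* clipping, so the unclipped magnitude can be logged in the same call. A minimal sketch (the toy parameter here is purely illustrative):

```python
import torch

# Toy parameter whose gradient norm we know analytically:
# loss = sum(p^2)  =>  grad = 2p = [6.0, 8.0]  =>  ||grad|| = 10.0
param = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
loss = (param ** 2).sum()
loss.backward()

# Returns the pre-clip total norm, then scales gradients in place
# so their norm does not exceed max_norm.
pre_clip_norm = torch.nn.utils.clip_grad_norm_([param], max_norm=1.0)
```

In a training loop, `pre_clip_norm` is what would be recorded as the "unclipped" gradient norm for analysis.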

Expected Performance

Platforms

Two platforms were used: an Apple M4 base model with 10 GPU cores and 24 GB of unified RAM, and an Nvidia RTX 4090, which also has 24 GB of RAM available. The M4 was used for local development, testing, and exploration; the debugged models were then run on the 4090 to explore performance differences, ..., and for a series of tests stressing the hardware in order to examine Nsight Systems (nsys) profiling traces.

Here is a summary of the systems' differences:

| Feature             | Apple M4 (10-core GPU, 24 GB unified) | NVIDIA RTX 4090                                      |
| ------------------- | ------------------------------------- | ---------------------------------------------------- |
| Architecture        | Apple Silicon (ARM + integrated GPU)  | Ada Lovelace (AD102)                                 |
| GPU Cores           | 10 integrated GPU cores               | 16,384 CUDA cores                                    |
| Tensor Units        | Apple AMX / GPU matrix units          | 4th-gen Tensor Cores                                 |
| VRAM / Memory       | 24 GB unified LPDDR5                  | 24 GB GDDR6X (dedicated)                             |
| Memory Bandwidth    | ~120 GB/s (shared)                    | ~1,008 GB/s                                          |
| Training Precision  | FP32 (for stability)                  | AMP (automatic mixed precision)                      |
| Software Stack      | PyTorch MPS, Metal                    | CUDA, cuDNN, TensorRT                                |
| Compilation         | No                                    | Yes (Inductor)                                       |
| FP32 Throughput     | ~3–4 TFLOPS (est., GPU)               | ~82 TFLOPS                                           |
| FP16 / BF16         | Accelerated (unstable)                | ~330 TFLOPS (Tensor Core)                            |
| INT8                | Limited acceleration                  | ~660+ TOPS (Tensor Core)                             |
| Power Draw          | ~20–30 W typical                      | ~450 W peak                                          |
| Multi-GPU Scaling   | No                                    | Yes (no NVLink on the 4090, but multi-GPU via PCIe)  |
| Primary Strength    | Efficiency, portability               | Raw training throughput                              |

While the available RAM on both systems is the same, the memory bandwidths differ by nearly an order of magnitude. Additionally, the RTX's automatic mixed precision significantly reduces the amount of memory needed, since model activations and gradients can be stored in the bfloat16 format, while the M4 requires full float32 for numerical stability.
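The mixed-precision pattern looks roughly like the following sketch (an assumed usage of `torch.autocast`, not the author's exact training loop): activations inside the autocast region run in bfloat16 while the master weights remain in float32.

```python
import torch

# Pick the best available device; on MPS the model would instead run in
# plain float32 for stability, as noted above.
device = "cuda" if torch.cuda.is_available() else "cpu"

layer = torch.nn.Linear(8, 8).to(device)   # weights stay float32
x = torch.randn(4, 8, device=device)

# Inside autocast, matmul-heavy ops such as Linear run in bfloat16,
# roughly halving activation memory relative to float32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = layer(x)
```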

The RTX enjoys similar benefits in compute. Not only is its base float32 throughput roughly 20x higher, but the software stack also supports torch.compile(), which does a good job of producing optimized CUDA kernels via the Inductor backend. The MPS backend does not support compilation, so that code runs in eager mode and is less optimized.
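A portable way to handle this split is to compile only when the backend supports it. This helper is a hedged sketch (the function name is an assumption), falling back to eager execution off-CUDA:

```python
import torch

def maybe_compile(model: torch.nn.Module) -> torch.nn.Module:
    """Compile via Inductor on CUDA; run eagerly elsewhere (e.g. MPS/CPU)."""
    if torch.cuda.is_available():
        return torch.compile(model)  # Inductor is the default backend
    return model  # MPS/CPU fallback: plain eager execution

model = maybe_compile(torch.nn.Linear(4, 4))
```

The compiled and eager variants are call-compatible, so the rest of the training loop is unchanged.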

Calculating Performance Expectations

Overall this is a standard transformer architecture with well-characterized performance. Since the point of the exercise was to implement it from scratch, there are a variety of optimizations that were left on the table, such as gradient checkpointing, or a more efficient code path for gradient clipping and reporting the gradient norm during training. For a deeper dive into the actual performance of this architecture under various settings, please continue to Benchmarks and Empirical Performance Analysis.