Architecture and Expected Performance Analysis
Introduction
This document describes the architecture and expected performance. The basic model design is a standard transformer with a core model dimension (d_model).
Model Architecture
More concretely, the model consists of:
- an embedding layer that brings batches of input (token index) sequences up to the model dimension d_model, creating the input and initializing the residual stream
- N transformer layers, each consisting of:
  - an RMS norm on the embedded tokens, with learnable weights
  - projection of the input matrix by the query, key, and value matrices W_Q, W_K, and W_V; these were concatenated to reduce the number of matrix multiplication operations, then split into the individual matrices for each head
  - the Q and K matrices have the RoPE transform applied, with each rotation block applied to dimension pairs, where m is the token position and i indexes the dimension pairs. This implementation conserves memory at the cost of additional compute (see below)
  - scaled dot-product attention is performed, with causal masking applied to the QK^T matrix
  - the attention outputs are concatenated and multiplied by the output projection weight matrix W_O
  - the result is added to the residual stream and an RMS norm is applied
  - this is then put through a SwiGLU feedforward layer (FFN(x) = (SiLU(x W_1) ⊙ x W_3) W_2, where SiLU(x) = x σ(x), σ is the sigmoid function, and ⊙ is the componentwise, or Hadamard, product)
- after N of these layers, the result is again added to the residual stream and an RMS norm is applied
- the result is projected up to the vocabulary dimension by the language head, producing the output logits
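As a sketch, the per-layer RMS norm and SwiGLU feedforward described above might look like the following in PyTorch (class names, dimensions, and variable names are illustrative, not taken from the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization with learnable gain weights."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its root-mean-square.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feedforward: FFN(x) = (SiLU(x W1) * x W3) W2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x) = x * sigmoid(x); the gate is a componentwise product.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 8, 64)              # (batch, sequence, d_model)
y = SwiGLU(64, 256)(RMSNorm(64)(x))    # shape-preserving: (2, 8, 64)
```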
Loss Function, Optimizer, Scheduler, and Training Details
Each token in the batch of sequences is used to predict the next-token probabilities by converting the logits via softmax. The cross-entropy loss is then calculated against the actual next-token values. This loss is backpropagated, with the trainable parameters updated via the AdamW optimizer, which adds decoupled weight decay to the Adam optimizer.
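A minimal sketch of this training step, using a toy stand-in for the model (the hyperparameters and variable names are illustrative, not from the source):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model = 100, 32

# Toy stand-in for the model: an embedding plus a language head.
emb = torch.nn.Embedding(vocab, d_model)
head = torch.nn.Linear(d_model, vocab)
opt = torch.optim.AdamW(
    list(emb.parameters()) + list(head.parameters()),
    lr=3e-4, weight_decay=0.1,   # decoupled weight decay (AdamW)
)

tokens = torch.randint(0, vocab, (4, 16))   # (batch, seq_len)
logits = head(emb(tokens[:, :-1]))          # each position predicts the next token
# cross_entropy applies log-softmax to the logits and compares them
# against the actual next-token indices.
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()
```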
The optimizer takes a learning rate parameter generated by a cosine-annealing learning rate scheduler, which increases the learning rate linearly from the minimum to the maximum value during its warmup iterations, then decreases it according to a cosine schedule back to the minimum value by the final iteration.
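This schedule can be sketched as a small function (the function and parameter names here are ours, not from the source code):

```python
import math

def lr_at(it: int, max_it: int, warmup_it: int, lr_min: float, lr_max: float) -> float:
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min."""
    if it < warmup_it:
        # Warmup: interpolate linearly between the minimum and maximum rates.
        return lr_min + (lr_max - lr_min) * it / warmup_it
    # Cosine anneal from lr_max down to lr_min over the remaining iterations.
    progress = (it - warmup_it) / (max_it - warmup_it)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```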
The model was trained in this fashion, with evaluation losses and perplexity measured at intervals, and checkpoints saved for both the latest and the best-evaluating parameters.
An important note regarding the optimization flow is that the gradients are subjected to norm clipping. For analysis, unclipped norms are often examined, as they reflect the actual magnitude of the gradients, but clipping plays a key role in stabilizing training, such as under learning rates that would otherwise cause divergence.
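This clip-then-report pattern can be sketched with PyTorch's `clip_grad_norm_`, which rescales the gradients in place and returns the pre-clipping norm (the setup here is a toy example):

```python
import torch

torch.manual_seed(0)
param = torch.nn.Parameter(torch.randn(10))
param.grad = torch.randn(10) * 100.0   # a deliberately oversized gradient

# clip_grad_norm_ rescales the gradients in place so their total norm is
# at most max_norm, and returns the norm measured *before* clipping --
# exactly the "unclipped norm" useful for analysis.
unclipped = torch.nn.utils.clip_grad_norm_([param], max_norm=1.0)
clipped = torch.linalg.vector_norm(param.grad)
```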
Expected Performance
Platforms
Two platforms were used: an Apple M4 base model with 10 GPU cores and 24 GB of unified RAM, and an Nvidia RTX 4090, which also has 24 GB of RAM available. The M4 was used for local development, testing, and exploration; the debugged models were then run on the 4090 to explore performance differences, ..., and for a series of tests stressing the hardware for the sake of examining Nsys (profiling) traces.
Here is a summary of the systems' differences:
| Feature | Apple M4 (10-core GPU, 24GB unified) | NVIDIA RTX 4090 |
|---|---|---|
| Architecture | Apple Silicon (ARM + integrated GPU) | Ada Lovelace (AD102) |
| GPU Cores | 10 integrated GPU cores | 16,384 CUDA cores |
| Tensor Units | Apple AMX / GPU matrix units | 4th-gen Tensor Cores |
| VRAM / Memory | 24 GB unified LPDDR5 | 24 GB GDDR6X (dedicated) |
| Memory Bandwidth | ~120 GB/s (shared) | ~1,008 GB/s |
| Training Precision | FP32 (for stability) | AMP (automatic mixed precision) |
| Software Stack | PyTorch MPS, Metal | CUDA, cuDNN, TensorRT |
| Compilation | No | Yes (Inductor) |
| FP32 Throughput | ~3–4 TFLOPS (est., GPU) | ~82 TFLOPS |
| FP16 / BF16 | Accelerated (unstable) | ~330 TFLOPS (Tensor Core) |
| INT8 | Limited acceleration | ~660+ TOPS (Tensor Core) |
| Power Draw | ~20–30W typical | ~450W peak |
| Multi-GPU Scaling | No | Yes (NVLink not on 4090, but multi-GPU via PCIe) |
| Primary Strength | Efficiency, portability | Raw training throughput |
While the available RAM on both systems is the same, the memory bandwidths differ by nearly an order of magnitude. Additionally, the RTX's automatic mixed precision significantly reduces the amount of memory needed, as the model activations and gradients can be stored in the bfloat16 format, while the M4 requires full float32 for numerical stability.
The RTX enjoys similar benefits in compute. Not only is its base float32 throughput roughly 20x higher, its software stack also supports torch.compile(), which does a good job of producing optimized CUDA code via the Inductor backend. The MPS backend does not support compilation, leaving that code path less optimized.
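A sketch of how such a device-dependent setup might be selected (the helper name and structure are illustrative assumptions, not the project's actual code):

```python
import contextlib
import torch

def configure(model: torch.nn.Module):
    """Pick device, compilation, and precision per platform (illustrative)."""
    if torch.cuda.is_available():
        model = torch.compile(model.to("cuda"))             # Inductor-optimized CUDA
        amp = torch.autocast("cuda", dtype=torch.bfloat16)  # mixed precision
    elif torch.backends.mps.is_available():
        model = model.to("mps")        # MPS: no torch.compile, full float32
        amp = contextlib.nullcontext()
    else:
        model = model.to("cpu")
        amp = contextlib.nullcontext()
    return model, amp

model, amp = configure(torch.nn.Linear(4, 4))
device = next(model.parameters()).device
with amp:
    out = model(torch.randn(2, 4, device=device))
```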
Calculating Performance Expectations
Overall this is a standard transformer architecture with well-characterized performance. Since the point of the exercise was to implement it from scratch, there are a variety of optimizations that were left on the table, such as gradient checkpointing, or a more efficient code path for gradient clipping and reporting the gradient norm during training. For a deeper dive into the actual performance of this architecture under various settings, please continue to Benchmarks and Empirical Performance Analysis.