Architecture and Expected Performance Analysis
Introduction
This document describes the architecture and expected performance. The basic model design is a standard transformer with a core model dimension (d_model).
Model Architecture
More concretely, the model consists of:
- an embedding layer that brings batches of input (token index) sequences up to the model dimension d_model, creating the input and initializing the residual stream
- N transformer layers, each consisting of:
  - an RMS norm on the embedded tokens, with learnable weights
  - projection of the input matrix by the query, key, and value matrices W_Q, W_K, and W_V; these were concatenated to reduce the number of matrix multiplication operations, then split into the individual matrices for each head
  - the Q and K matrices have the RoPE transform applied, with each rotation block applied to dimension pairs, where m is the token position and i indexes the dimension pairs. This implementation conserves memory at the cost of additional compute (see below)
  - scaled dot-product attention is performed, with causal masking applied to the QK^T matrix
  - the attention outputs are concatenated and multiplied by the output projection weight matrix W_O
  - the result is added to the residual stream and an RMS norm is applied
  - this is then put through a SwiGLU feedforward layer (FFN(x) = (SiLU(x W_1) ⊙ x W_3) W_2, where SiLU(x) = x σ(x), σ is the sigmoid function, and ⊙ is the componentwise, or Hadamard, product)
- after N of these layers, the result is again added to the residual stream and an RMS norm is applied
- the result is projected up to the vocabulary dimension by the language head, producing the output logits
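As a sketch, the per-layer RMS norm and SwiGLU feedforward described above might look like the following in PyTorch (class names, dimensions, and variable names are illustrative, not taken from the actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization with learnable gain weights."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector by the reciprocal of its root-mean-square.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feedforward: FFN(x) = (SiLU(x W1) * x W3) W2."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(x) = x * sigmoid(x); the gate is a componentwise product.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 8, 64)              # (batch, sequence, d_model)
y = SwiGLU(64, 256)(RMSNorm(64)(x))    # shape-preserving: (2, 8, 64)
```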
Loss Function, Optimizer, Scheduler, and Training Details
Each token in the batch of sequences is used to predict the next-token probabilities by converting the logits via softmax. The cross-entropy loss is then calculated against the actual next-token values. This loss is backpropagated, with the trainable parameters updated via the AdamW optimizer, which adds decoupled weight decay to the Adam optimizer.
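A minimal sketch of this training step, using a toy stand-in for the model (the hyperparameters and variable names are illustrative, not from the source):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model = 100, 32

# Toy stand-in for the model: an embedding plus a language head.
emb = torch.nn.Embedding(vocab, d_model)
head = torch.nn.Linear(d_model, vocab)
opt = torch.optim.AdamW(
    list(emb.parameters()) + list(head.parameters()),
    lr=3e-4, weight_decay=0.1,   # decoupled weight decay (AdamW)
)

tokens = torch.randint(0, vocab, (4, 16))   # (batch, seq_len)
logits = head(emb(tokens[:, :-1]))          # each position predicts the next token
# cross_entropy applies log-softmax to the logits and compares them
# against the actual next-token indices.
loss = F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()
opt.step()
opt.zero_grad()
```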
The optimizer takes a learning rate parameter generated by a cosine-annealing learning rate scheduler, which increases the learning rate linearly from the minimum to the maximum value during its warmup iterations, then decreases it according to a cosine schedule back to the minimum value by the final iteration.
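This schedule can be sketched as a small function (the function and parameter names here are ours, not from the source code):

```python
import math

def lr_at(it: int, max_it: int, warmup_it: int, lr_min: float, lr_max: float) -> float:
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min."""
    if it < warmup_it:
        # Warmup: interpolate linearly between the minimum and maximum rates.
        return lr_min + (lr_max - lr_min) * it / warmup_it
    # Cosine anneal from lr_max down to lr_min over the remaining iterations.
    progress = (it - warmup_it) / (max_it - warmup_it)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```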
The model was trained in this fashion, with evaluation losses and perplexity measured at intervals, and checkpoints saved for both the latest and the best-evaluating parameters.
An important note regarding the optimization flow is that the gradients are subjected to norm clipping. For analysis, unclipped norms are often examined, as they reflect the actual magnitude of the gradients, but clipping plays a key role in stabilizing training, such as under learning rates that would otherwise cause divergence.
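This clip-then-report pattern can be sketched with PyTorch's `clip_grad_norm_`, which rescales the gradients in place and returns the pre-clipping norm (the setup here is a toy example):

```python
import torch

torch.manual_seed(0)
param = torch.nn.Parameter(torch.randn(10))
param.grad = torch.randn(10) * 100.0   # a deliberately oversized gradient

# clip_grad_norm_ rescales the gradients in place so their total norm is
# at most max_norm, and returns the norm measured *before* clipping --
# exactly the "unclipped norm" useful for analysis.
unclipped = torch.nn.utils.clip_grad_norm_([param], max_norm=1.0)
clipped = torch.linalg.vector_norm(param.grad)
```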
Expected Performance
Platforms
Two platforms were used: an Apple M4 base model with 10 GPU cores and 24 GB of unified RAM, and an Nvidia RTX 4090, which also has 24 GB of RAM available. The M4 was used for local development, testing, and exploration; the debugged models were then run on the 4090 to explore performance differences, ..., and for a series of tests stressing the hardware for the sake of examining Nsys (profiling) traces.
Here is a summary of the systems' differences:
| Feature | Apple M4 (10-core GPU, 24GB unified) | NVIDIA RTX 4090 |
|---|---|---|
| Architecture | Apple Silicon (ARM + integrated GPU) | Ada Lovelace (AD102) |
| GPU Cores | 10 integrated GPU cores | 16,384 CUDA cores |
| Tensor Units | Apple AMX / GPU matrix units | 4th-gen Tensor Cores |
| VRAM / Memory | 24 GB unified LPDDR5 | 24 GB GDDR6X (dedicated) |
| Memory Bandwidth | ~120 GB/s (shared) | ~1,008 GB/s |
| Training Precision | FP32 (for stability) | AMP (automatic mixed precision) |
| Software Stack | PyTorch MPS, Metal | CUDA, cuDNN, TensorRT |
| Compilation | No | Yes (Inductor) |
| FP32 Throughput | ~3–4 TFLOPS (est., GPU) | ~82 TFLOPS |
| FP16 / BF16 | Accelerated (unstable) | ~330 TFLOPS (Tensor Core) |
| INT8 | Limited acceleration | ~660+ TOPS (Tensor Core) |
| Power Draw | ~20–30W typical | ~450W peak |
| Multi-GPU Scaling | No | Yes (NVLink not on 4090, but multi-GPU via PCIe) |
| Primary Strength | Efficiency, portability | Raw training throughput |
While the available RAM on both systems is the same, the memory bandwidths differ by nearly an order of magnitude. Additionally, the RTX's automatic mixed precision significantly reduces the amount of memory needed, as the model activations and gradients can be stored in the bfloat16 format, while the M4 requires full float32 for numerical stability.
The RTX enjoys similar benefits in compute. Not only is its base float32 throughput roughly 20x higher, its software stack also supports torch.compile(), which does a good job of producing optimized CUDA code via the Inductor backend. The MPS backend does not support compilation, leaving that code path less optimized.
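A sketch of how such a device-dependent setup might be selected (the helper name and structure are illustrative assumptions, not the project's actual code):

```python
import contextlib
import torch

def configure(model: torch.nn.Module):
    """Pick device, compilation, and precision per platform (illustrative)."""
    if torch.cuda.is_available():
        model = torch.compile(model.to("cuda"))             # Inductor-optimized CUDA
        amp = torch.autocast("cuda", dtype=torch.bfloat16)  # mixed precision
    elif torch.backends.mps.is_available():
        model = model.to("mps")        # MPS: no torch.compile, full float32
        amp = contextlib.nullcontext()
    else:
        model = model.to("cpu")
        amp = contextlib.nullcontext()
    return model, amp

model, amp = configure(torch.nn.Linear(4, 4))
device = next(model.parameters()).device
with amp:
    out = model(torch.randn(2, 4, device=device))
```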
Calculating Performance Expectations
Overall this is a standard transformer architecture with well-characterized performance. Since the point of the exercise was to implement it from scratch, there are a variety of optimizations that were left on the table, such as gradient checkpointing, or a more efficient code path for gradient clipping and reporting the gradient norm during training. For a deeper dive into the actual performance of this architecture under various settings, please continue to Benchmarks and Empirical Performance Analysis.