NSYS Trace Analysis (in progress)
Introduction
Profiling GPU behavior is an essential skill for an AI engineering practitioner. We ran nsys traces on RTX 4090 GPUs across a matrix of model configurations designed to stress different architectural bottlenecks. NCU profiling was also attempted, but NVIDIA performance counters were not available in the containerized environments in which the tests ran.
Nsys nevertheless provides fine-grained information on model performance. The configurations below were chosen to isolate specific performance characteristics: two baselines (wide-and-shallow vs. narrow-and-deep) and eight stress tests targeting latency, alignment, vocabulary size, FFN width, memory bandwidth, compute throughput, and sequential depth.
| Config | Batch | Seq | d_model | num_heads | d_head | num_layers | d_ff | FFN Ratio | Design Intent |
|---|---|---|---|---|---|---|---|---|---|
| model_a | 32 | 256 | 768 | 12 | 64 | 2 | 2048 | 2.7x | Wide & shallow baseline -- large d_model, few layers |
| model_b | 32 | 256 | 384 | 12 | 32 | 12 | 1024 | 2.7x | Narrow & deep baseline -- small d_model, many layers |
| latency_bound | 1 | 128 | 512 | 8 | 64 | 12 | 1536 | 3.0x | Minimal batch -- exposes per-step kernel launch overhead, no batching benefit |
| misaligned_dims | 34 | 257 | 514 | 2 | 257 | 6 | 1538 | 3.0x | Non-power-of-2 dims -- stresses tensor core alignment, exposes padding waste |
| bad_head_size | 32 | 256 | 672 | 12 | 56 | 6 | 1792 | 2.7x | d_head=56 (not 32/64/128) -- misaligned for tensor core tile sizes |
| vocab_bottleneck | 64 | 256 | 512 | 8 | 64 | 4 | 1536 | 3.0x | Large vocab (50,257 vs 10,000) -- stresses embedding/LM-head matmuls |
| wide_ffn | 32 | 256 | 768 | 12 | 64 | 6 | 4096 | 5.3x | Very wide FFN -- SwiGLU with 3x the usual d_ff, FFN-dominated compute |
| bandwidth_bound | 256 | 256 | 384 | 6 | 64 | 4 | 1024 | 2.7x | Huge batch, small model -- memory bandwidth bound (data movement dominates) |
| compute_bound | 32 | 256 | 1536 | 24 | 64 | 8 | 4096 | 2.7x | Very large model -- compute bound (GEMM-dominated, tests peak FLOP utilization) |
| deep_sequential | 32 | 256 | 512 | 8 | 64 | 32 | 1536 | 3.0x | Very deep -- 32 layers of sequential dependency, tests pipeline bubble effects |
Resource Envelope
The table below shows computed resource estimates for each configuration under RTX 4090 AMP assumptions (float32 weights, bfloat16 activations and gradients).
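The estimates can be sketched with the standard parameter-count and ~6 FLOPs-per-parameter-per-token heuristics. This is a minimal illustration, not the exact formula used for the table; the vocabulary size (10,000, per the `vocab_bottleneck` row) and the omission of biases and norm parameters are assumptions.

```python
# Rough resource estimates for one config, using the common
# ~6 * params * tokens heuristic for training-step FLOPs (fwd + bwd).

def transformer_params(d_model, num_layers, d_ff, vocab=10_000):
    """Approximate parameter count: attention + FFN + tied embeddings."""
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # up- and down-projections
    emb = vocab * d_model          # token embedding (LM head tied)
    return num_layers * (attn + ffn) + emb

def training_step_tflops(params, batch, seq):
    """~6 FLOPs per parameter per token for a full training step."""
    return 6 * params * batch * seq / 1e12

# model_a: d_model=768, 2 layers, d_ff=2048, batch=32, seq=256
p = transformer_params(d_model=768, num_layers=2, d_ff=2048)
print(f"params: {p / 1e6:.1f}M")                          # ~18.7M
print(f"step TFLOPs: {training_step_tflops(p, 32, 256):.2f}")
```

SwiGLU FFNs (as in `wide_ffn`) use three projections rather than two, so their FFN term would be `3 * d_model * d_ff` under the same sketch.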
Data Loading and Controls
High-Level Summary
How many events does each trace produce, and how much total GPU time do they consume? Traces with more layers (deep_sequential, model_b) naturally produce more kernel events, while bandwidth_bound's large batch size drives memory operation counts.
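Event counts and total GPU time can be pulled from an nsys sqlite export (`nsys export --type sqlite`). The `CUPTI_ACTIVITY_KIND_KERNEL` table and its `start`/`end` columns reflect recent nsys schemas but should be treated as assumptions; verify with `.schema` against your export. The sketch below builds a tiny in-memory stand-in rather than reading a real trace.

```python
# Summarize kernel event count and total GPU time from an
# nsys-style sqlite table (in-memory mock with ns timestamps).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (shortName INT, start INT, end INT)"
)
con.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?)",
    [(1, 0, 1_000), (1, 1_200, 2_500), (2, 2_600, 2_900)],
)

n_events, total_ns = con.execute(
    "SELECT COUNT(*), SUM(end - start) FROM CUPTI_ACTIVITY_KIND_KERNEL"
).fetchone()
print(n_events, total_ns)  # 3 events, 2600 ns of kernel time
```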
Resource Envelope Comparison
How do the model configurations compare in terms of estimated compute and memory requirements? The scatter below maps peak training memory against training step TFLOPs, revealing which configs are memory-limited vs compute-limited.
Alignment Diagnostics
Tensor core efficiency depends on dimension alignment. The d_head % 32 == 0 and d_ff % 64 == 0 flags below indicate which configs are well-aligned for RTX 4090 tensor core tile sizes. Misaligned configurations (bad_head_size, misaligned_dims) pay a padding tax on every matmul.
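The alignment flags amount to two modulo checks per config. The sketch below applies them to three rows from the table above; the flag thresholds (32 for d_head, 64 for d_ff) are the ones stated in this section.

```python
# Alignment diagnostics: d_head % 32 == 0 and d_ff % 64 == 0,
# using values from the configuration table.
configs = {
    "model_a":         {"d_head": 64,  "d_ff": 2048},
    "bad_head_size":   {"d_head": 56,  "d_ff": 1792},
    "misaligned_dims": {"d_head": 257, "d_ff": 1538},
}

flags = {
    name: {
        "d_head_aligned": c["d_head"] % 32 == 0,
        "d_ff_aligned": c["d_ff"] % 64 == 0,
    }
    for name, c in configs.items()
}

for name, f in flags.items():
    print(name, f)
```

Note that `bad_head_size` passes the d_ff check but fails the d_head check, so its padding tax falls on attention matmuls specifically, while `misaligned_dims` fails both.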
Kernel Analysis
Which GPU kernels dominate execution time? This section ranks kernels by total duration or invocation count across the selected traces.
Kernel Duration Distribution
Per-invocation duration distributions for the top kernels, faceted by trace name. Box plots reveal whether kernels have consistent or highly variable execution times.
Kernel Category Breakdown
Grouping kernels by functional category (GEMM, softmax, normalization, elementwise, memory-like, etc.) reveals the compute profile of each trace configuration. Compute-bound configs should be GEMM-dominated; bandwidth-bound configs will show more elementwise and memory overhead.
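One way to assign categories is substring matching against the demangled kernel name. The category list and patterns below are illustrative, not the exact scheme behind the charts; first match wins, so more specific patterns go first.

```python
# Bucket kernel names into functional categories by pattern matching.
import re

CATEGORIES = [
    ("gemm",        r"gemm|cutlass|matmul"),
    ("softmax",     r"softmax"),
    ("norm",        r"layer_?norm|rms_?norm"),
    ("memory-like", r"copy|transpose|cat_|contiguous"),
    ("elementwise", r"elementwise|vectorized|gelu|silu"),
]

def categorize(kernel_name: str) -> str:
    name = kernel_name.lower()
    for cat, pattern in CATEGORIES:
        if re.search(pattern, name):
            return cat
    return "other"

print(categorize("ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_tn"))
print(categorize("void at::native::vectorized_elementwise_kernel<4, ...>"))
```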
Memory Operations
Memory copy and set operations reveal data movement overhead. Configs with large batches (bandwidth_bound) or large vocabularies (vocab_bottleneck) tend to move more data; misaligned dimensions may require extra padding copies.
Bytes Transferred
Effective Bandwidth
Dividing total bytes by total duration gives an effective bandwidth metric (GB/s) for each memory operation type per trace.
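The computation itself is a single division; with durations in nanoseconds, bytes/ns is numerically equal to GB/s. The rows below are made-up illustrative numbers, not trace data.

```python
# Effective bandwidth per memory-operation kind: total bytes / total time.
memops = [
    # (kind, total_bytes, total_duration_ns)
    ("HtoD memcpy", 512 * 1024**2, 20_000_000),
    ("DtoH memcpy", 128 * 1024**2, 6_000_000),
]

for kind, nbytes, dur_ns in memops:
    gb_per_s = nbytes / dur_ns  # bytes per ns == GB/s
    print(f"{kind}: {gb_per_s:.1f} GB/s")
```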
Timeline Exploration
The timeline view plots GPU events over wall-clock time. In "rolled up" mode, events are bucketed to show density; in "raw" mode, individual events are shown. Look for kernel launch gaps (visible as white space between bars), memory transfer overlap with compute, and pipeline bubbles in deep architectures like deep_sequential.
Cross-Config Comparative Analysis
Finally, derived metrics help compare the GPU utilization profiles across all configurations. "Kernel density" is the ratio of total kernel execution time to trace wall-clock span -- higher means the GPU spent more of its time running kernels rather than waiting. "Memory overhead ratio" is the total memory operation time divided by total kernel time -- higher means more relative time spent moving data.
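Both derived metrics are simple ratios over aggregated trace times; the sketch below uses illustrative numbers rather than measured values.

```python
# Derived GPU utilization metrics as defined above.

def kernel_density(total_kernel_ns, wall_clock_ns):
    """Fraction of the trace span the GPU spent executing kernels."""
    return total_kernel_ns / wall_clock_ns

def memory_overhead_ratio(total_memop_ns, total_kernel_ns):
    """Time spent moving data relative to time spent in kernels."""
    return total_memop_ns / total_kernel_ns

print(kernel_density(8_000_000, 10_000_000))      # 0.8
print(memory_overhead_ratio(400_000, 8_000_000))  # 0.05
```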
Top-5 Kernel Duration Breakdown by Trace
Small multiples showing the top-5 kernels by duration for each selected trace, enabling direct visual comparison of where each config spends its GPU time.