NSYS Trace Analysis (in progress)
Introduction
Profiling GPU behavior is an essential skill for an AI engineering practitioner. We ran nsys traces on RTX 4090 GPUs across a matrix of model configurations designed to stress different architectural bottlenecks. NCU profiling was also attempted, but NVIDIA performance counters were not available in the containerized environments in which the tests ran.
Nsys nevertheless provides fine-grained information on model performance. The configurations below were chosen to isolate specific performance characteristics: two baselines (wide-and-shallow vs. narrow-and-deep) and eight stress tests targeting latency, alignment, vocabulary size, FFN width, memory bandwidth, compute throughput, and sequential depth.
| Config | Batch | Seq | d_model | num_heads | d_head | num_layers | d_ff | FFN Ratio | Design Intent |
|---|---|---|---|---|---|---|---|---|---|
| model_a | 32 | 256 | 768 | 12 | 64 | 2 | 2048 | 2.7x | Wide & shallow baseline -- large d_model, few layers |
| model_b | 32 | 256 | 384 | 12 | 32 | 12 | 1024 | 2.7x | Narrow & deep baseline -- small d_model, many layers |
| latency_bound | 1 | 128 | 512 | 8 | 64 | 12 | 1536 | 3.0x | Minimal batch -- exposes per-step kernel launch overhead, no batching benefit |
| misaligned_dims | 34 | 257 | 514 | 2 | 257 | 6 | 1538 | 3.0x | Non-power-of-2 dims -- stresses tensor core alignment, exposes padding waste |
| bad_head_size | 32 | 256 | 672 | 12 | 56 | 6 | 1792 | 2.7x | d_head=56 (not 32/64/128) -- misaligned for tensor core tile sizes |
| vocab_bottleneck | 64 | 256 | 512 | 8 | 64 | 4 | 1536 | 3.0x | Large vocab (50,257 vs 10,000) -- stresses embedding/LM-head matmuls |
| wide_ffn | 32 | 256 | 768 | 12 | 64 | 6 | 4096 | 5.3x | Very wide FFN -- SwiGLU with 3x the usual d_ff, FFN-dominated compute |
| bandwidth_bound | 256 | 256 | 384 | 6 | 64 | 4 | 1024 | 2.7x | Huge batch, small model -- memory bandwidth bound (data movement dominates) |
| compute_bound | 32 | 256 | 1536 | 24 | 64 | 8 | 4096 | 2.7x | Very large model -- compute bound (GEMM-dominated, tests peak FLOP utilization) |
| deep_sequential | 32 | 256 | 512 | 8 | 64 | 32 | 1536 | 3.0x | Very deep -- 32 layers of sequential dependency, tests pipeline bubble effects |
Resource Envelope
The table below shows computed resource estimates for each configuration under RTX 4090 AMP assumptions (float32 weights, bfloat16 activations and gradients).
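The estimates can be sketched with the standard parameter-count and ~6 FLOPs-per-parameter-per-token heuristics. This is a minimal illustration, not the exact formula used for the table; the vocabulary size (10,000, per the `vocab_bottleneck` row) and the omission of biases and norm parameters are assumptions.

```python
# Rough resource estimates for one config, using the common
# ~6 * params * tokens heuristic for training-step FLOPs (fwd + bwd).

def transformer_params(d_model, num_layers, d_ff, vocab=10_000):
    """Approximate parameter count: attention + FFN + tied embeddings."""
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # up- and down-projections
    emb = vocab * d_model          # token embedding (LM head tied)
    return num_layers * (attn + ffn) + emb

def training_step_tflops(params, batch, seq):
    """~6 FLOPs per parameter per token for a full training step."""
    return 6 * params * batch * seq / 1e12

# model_a: d_model=768, 2 layers, d_ff=2048, batch=32, seq=256
p = transformer_params(d_model=768, num_layers=2, d_ff=2048)
print(f"params: {p / 1e6:.1f}M")                          # ~18.7M
print(f"step TFLOPs: {training_step_tflops(p, 32, 256):.2f}")
```

SwiGLU FFNs (as in `wide_ffn`) use three projections rather than two, so their FFN term would be `3 * d_model * d_ff` under the same sketch.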
Data Loading and Controls
High-Level Summary
How many events does each trace produce, and how much total GPU time do they consume? Traces with more layers (deep_sequential, model_b) naturally produce more kernel events, while bandwidth_bound's large batch size drives memory operation counts.
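Event counts and total GPU time can be pulled from an nsys sqlite export (`nsys export --type sqlite`). The `CUPTI_ACTIVITY_KIND_KERNEL` table and its `start`/`end` columns reflect recent nsys schemas but should be treated as assumptions; verify with `.schema` against your export. The sketch below builds a tiny in-memory stand-in rather than reading a real trace.

```python
# Summarize kernel event count and total GPU time from an
# nsys-style sqlite table (in-memory mock with ns timestamps).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (shortName INT, start INT, end INT)"
)
con.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?)",
    [(1, 0, 1_000), (1, 1_200, 2_500), (2, 2_600, 2_900)],
)

n_events, total_ns = con.execute(
    "SELECT COUNT(*), SUM(end - start) FROM CUPTI_ACTIVITY_KIND_KERNEL"
).fetchone()
print(n_events, total_ns)  # 3 events, 2600 ns of kernel time
```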
Resource Envelope Comparison
How do the model configurations compare in terms of estimated compute and memory requirements? The scatter below maps peak training memory against training step TFLOPs, revealing which configs are memory-limited vs compute-limited.
Alignment Diagnostics
Tensor core efficiency depends on dimension alignment. The d_head % 32 == 0 and d_ff % 64 == 0 flags below indicate which configs are well-aligned for RTX 4090 tensor core tile sizes. Misaligned configurations (bad_head_size, misaligned_dims) pay a padding tax on every matmul.
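The alignment flags amount to two modulo checks per config. The sketch below applies them to three rows from the table above; the flag thresholds (32 for d_head, 64 for d_ff) are the ones stated in this section.

```python
# Alignment diagnostics: d_head % 32 == 0 and d_ff % 64 == 0,
# using values from the configuration table.
configs = {
    "model_a":         {"d_head": 64,  "d_ff": 2048},
    "bad_head_size":   {"d_head": 56,  "d_ff": 1792},
    "misaligned_dims": {"d_head": 257, "d_ff": 1538},
}

flags = {
    name: {
        "d_head_aligned": c["d_head"] % 32 == 0,
        "d_ff_aligned": c["d_ff"] % 64 == 0,
    }
    for name, c in configs.items()
}

for name, f in flags.items():
    print(name, f)
```

Note that `bad_head_size` passes the d_ff check but fails the d_head check, so its padding tax falls on attention matmuls specifically, while `misaligned_dims` fails both.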
Kernel Analysis
Which GPU kernels dominate execution time? This section ranks kernels by total duration or invocation count across the selected traces.
Kernel Duration Distribution
Per-invocation duration distributions for the top kernels, faceted by trace name. Box plots reveal whether kernels have consistent or highly variable execution times.
Kernel Category Breakdown
Grouping kernels by functional category (GEMM, softmax, normalization, elementwise, memory-like, etc.) reveals the compute profile of each trace configuration. Compute-bound configs should be GEMM-dominated; bandwidth-bound configs will show more elementwise and memory overhead.
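One way to assign categories is substring matching against the demangled kernel name. The category list and patterns below are illustrative, not the exact scheme behind the charts; first match wins, so more specific patterns go first.

```python
# Bucket kernel names into functional categories by pattern matching.
import re

CATEGORIES = [
    ("gemm",        r"gemm|cutlass|matmul"),
    ("softmax",     r"softmax"),
    ("norm",        r"layer_?norm|rms_?norm"),
    ("memory-like", r"copy|transpose|cat_|contiguous"),
    ("elementwise", r"elementwise|vectorized|gelu|silu"),
]

def categorize(kernel_name: str) -> str:
    name = kernel_name.lower()
    for cat, pattern in CATEGORIES:
        if re.search(pattern, name):
            return cat
    return "other"

print(categorize("ampere_bf16_s16816gemm_bf16_128x128_ldg8_f2f_tn"))
print(categorize("void at::native::vectorized_elementwise_kernel<4, ...>"))
```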
Memory Operations
Memory copy and set operations reveal data movement overhead. Configs with large batches (bandwidth_bound) or large vocabularies (vocab_bottleneck) tend to move more data; misaligned dimensions may require extra padding copies.
Bytes Transferred
Effective Bandwidth
Dividing total bytes by total duration gives an effective bandwidth metric (GB/s) for each memory operation type per trace.
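The computation itself is a single division; with durations in nanoseconds, bytes/ns is numerically equal to GB/s. The rows below are made-up illustrative numbers, not trace data.

```python
# Effective bandwidth per memory-operation kind: total bytes / total time.
memops = [
    # (kind, total_bytes, total_duration_ns)
    ("HtoD memcpy", 512 * 1024**2, 20_000_000),
    ("DtoH memcpy", 128 * 1024**2, 6_000_000),
]

for kind, nbytes, dur_ns in memops:
    gb_per_s = nbytes / dur_ns  # bytes per ns == GB/s
    print(f"{kind}: {gb_per_s:.1f} GB/s")
```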
Timeline Exploration
The timeline view plots GPU events over wall-clock time. In "rolled up" mode, events are bucketed to show density; in "raw" mode, individual events are shown. Look for kernel launch gaps (visible as white space between bars), memory transfer overlap with compute, and pipeline bubbles in deep architectures like deep_sequential.
Cross-Config Comparative Analysis
Finally, derived metrics help compare the GPU utilization profiles across all configurations. "Kernel density" is the ratio of total kernel execution time to trace wall-clock span -- higher means the GPU spent more of its time running kernels rather than waiting. "Memory overhead ratio" is the total memory operation time divided by total kernel time -- higher means more relative time spent moving data.
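Both derived metrics are simple ratios over aggregated trace times; the sketch below uses illustrative numbers rather than measured values.

```python
# Derived GPU utilization metrics as defined above.

def kernel_density(total_kernel_ns, wall_clock_ns):
    """Fraction of the trace span the GPU spent executing kernels."""
    return total_kernel_ns / wall_clock_ns

def memory_overhead_ratio(total_memop_ns, total_kernel_ns):
    """Time spent moving data relative to time spent in kernels."""
    return total_memop_ns / total_kernel_ns

print(kernel_density(8_000_000, 10_000_000))      # 0.8
print(memory_overhead_ratio(400_000, 8_000_000))  # 0.05
```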
Top-5 Kernel Duration Breakdown by Trace
Small multiples showing the top-5 kernels by duration for each selected trace, enabling direct visual comparison of where each config spends its GPU time.