Learning Rate Sweep Analysis
Introduction
For the parameter sweep exercise, the learning rate was chosen as the target because it can be varied without altering more fundamental aspects of the model, such as its dimensions. It also provides an opportunity to observe unstable training runs. The general goal was to find the highest maximum learning rate that still exhibited stable training behavior.
This exercise also provided an opportunity to become familiar with the W&B Sweeps API, which provides easy access to advanced features such as Hyperband early stopping and Bayesian parameter search schedulers.
To review, the model's AdamW optimizer accepts a learning rate parameter that provides bulk control over the size of model parameter updates. This learning rate is set during training by a learning rate scheduler routine, which executes a linear warmup from 0 to the maximum LR over the configured warmup iterations, then decays from the maximum LR to the minimum according to a cosine schedule over the remaining training iterations.
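The schedule described above can be sketched as follows (a minimal sketch; the function name and signature are illustrative, not the assignment's actual code):

```python
import math

def lr_at(step, max_lr, min_lr, warmup_iters, total_iters):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_iters:
        # linear ramp: 0 at step 0, max_lr at step == warmup_iters
        return max_lr * step / warmup_iters
    # cosine decay over the remaining iterations
    progress = (step - warmup_iters) / (total_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Plotting `lr_at` over `step` gives the familiar ramp-then-cosine curve; the maximum LR is the single knob the sweep varies.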
As such, the maximum learning rate is the key variable. The assignment default had this at
Initial Sweep Configuration and Summary
An initial set of sweeps was configured to run for a maximum of 1000 iterations with 100 iterations of warmup. Evaluation metrics such as eval loss were collected every 50 iterations, with early termination available after 250 iterations.
Early termination was handled by the Hyperband algorithm; however, due to an inconsistency in the way the number of iterations is tracked, with only one eval loss available for every 50 training iterations, early termination never kicked in.
Nonetheless, 23 runs completed. Here is a summary:
The outlier in run 4 jumps out. It displayed a nearly monotonic decline in both training loss and eval loss, and stabilized at a far lower eval loss than any other run. We confirmed our sweep settings, which look at the last eval loss value available for each run and try to minimize it when selecting the next run.
Although the run 4 learning rate is close to the bottom of the range, it is still within the allowed values; the use of a uniform sampling distribution instead of a log-uniform one caused values in this lower region to be underrepresented.
Nonetheless, both the assignment handout and our research indicate that a maximum learning rate "at the edge of stability" is ideal, due to a "catapult" effect that pushes parameters into a "wide", "general" valley of the loss landscape. So instead of treating the sweeps as finding definitive max LRs, we look at each run's step-level LR and the corresponding gradient norm, in order to orient ourselves toward the region with the desired almost-unstable characteristics.
Based on this analysis, it seems peak learning rates in the range of
The spikes between
We want a composite loss metric that requires an increase in loss near the max LR and otherwise rewards low eval loss. First, this means more data points, so we tighten the eval interval to once every 10 iterations. Then we can create a rule, first looking at a rolling average of, say, 5
and sum these to create the metric for the optimizer to minimize. This is fairly simple, and it leaves early stopping to the Hyperband setup, which we want to be aggressive in order to weed out runs that don't display the desired initial increase in loss. The concern is that, for the surviving runs, we want to weight the eval losses at the end of the run more heavily, so we add an exponential weighting factor
Which upon implementation was amended to use
which raises the question, what's a good
| κ | last/first (κ^194) | Interpretation |
|---|---|---|
| 1.01 | ~6.9x | Mild end-weighting -- warmup signal still meaningful for Hyperband |
| 1.015 | ~18x | Moderate -- warmup matters for Hyperband, final perf dominates total |
| 1.02 | ~47x | Firm tilt toward end -- last ~40 eval points dominate |
| 1.03 | ~310x | Aggressive -- warmup is noise, only last ~30 points matter |
| 1.05 | ~13,000x | Very aggressive -- effectively only the final handful of points count |
| 1.10 | ~10^8x | Extreme -- only the very last point matters |
ultimately preferring
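The exponential end-weighting described above can be sketched as follows. This is an illustrative reconstruction, not the exact metric code; the source's composite rule (the rolling-average increase check) is elided, so only the κ-weighted sum is shown, and the function name is our own:

```python
def end_weighted_loss(eval_losses, kappa=1.02):
    """Sum of eval losses with exponentially growing weights kappa**i,
    so later eval points dominate the metric the sweep minimizes."""
    return sum((kappa ** i) * loss for i, loss in enumerate(eval_losses))
```

With 195 eval points, the ratio of the last weight to the first is κ^194, which is exactly the "last/first" column in the table: e.g. 1.01^194 ≈ 6.9.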
We can then configure Hyperband with an eta of 2 and a min_iter of 25, which in the wandb Hyperband implementation means 25 eval iterations, giving kill points at 25, 50, 100, and 200 iterations.
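A minimal sketch of this sweep configuration, in the dict form the wandb client accepts (the metric name is illustrative):

```python
# Hedged sketch of a W&B sweep config with Hyperband early termination.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "weighted_eval_loss", "goal": "minimize"},  # name is an assumption
    "early_terminate": {"type": "hyperband", "min_iter": 25, "eta": 2},
}

# Hyperband considers stopping runs at brackets min_iter * eta**k,
# i.e. 25, 50, 100, 200 eval iterations.
brackets = [sweep_config["early_terminate"]["min_iter"] * 2 ** k for k in range(4)]
```

This config would then be registered with `wandb.sweep(sweep_config, project=...)` in the usual way.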
Second Sweep Configuration and Summary
We then experimented with a larger set of sweeps to assess
- whether the eval metric effectively encourages survival of runs that display an initial increase in loss
- how the Bayesian algorithm implements an "explore/exploit" approach to searching the hyperparameter space
- which hyperparameter values are optimal
Bayesian Sweep Analysis
A few additional hyperparameters were considered for the second phase of sweeps. Based on our research, we found that gradient norm clipping is almost always kept at 1, so the swept parameters were:
- the AdamW optimizer's $\beta_2$, from 0.95 to 0.999 in a uniform distribution,
- the weight decay parameter, from 0.001 to 0.1 in a log-uniform distribution,
- and of course the maximum learning rate, sampled log-uniformly.
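In wandb sweep-config form, these distributions could be expressed roughly as follows (the max LR bounds are placeholders, since the source elides the exact range, and the parameter keys are our own names):

```python
# Hedged sketch of the second sweep's parameter distributions.
parameters = {
    "beta2": {"distribution": "uniform", "min": 0.95, "max": 0.999},
    "weight_decay": {"distribution": "log_uniform_values", "min": 1e-3, "max": 1e-1},
    # Bounds below are illustrative only; the actual swept range is not stated here.
    "max_lr": {"distribution": "log_uniform_values", "min": 1e-4, "max": 1e-2},
}
```

Using `log_uniform_values` for weight decay and max LR avoids the underrepresentation of small values seen with the first sweep's uniform sampling.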
Viewing this dynamically, we can see that the search tended towards decent amounts of "exploit" when it found a well-functioning combination of hyperparameters, while as it got into the "excessive" zone of ~300 runs it began to more aggressively explore the edges of its allowed distributions.
We can also look at k-means clusters for runs that survived to the end of training:
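The clustering itself can be sketched with a minimal k-means, assuming each surviving run is summarized as a small feature vector (e.g., final eval loss and max LR); this is a generic implementation, not the analysis code:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's-algorithm k-means on rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

In practice one would standardize the features first so eval loss and LR contribute on comparable scales.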
Ultimately, we see that higher max learning rates did not directly lead to lower overall eval losses. We suspect this may be due to the relatively small model having a simpler loss function geometry, making extensive "catapulting" less necessary.
What we do see, when we examine the runs with the lowest overall eval losses, is that there is an initial flatness in the eval loss during warmup, then a steep drop, followed by a slight convexity in the curve. This suggests there is a level of loss space exploration taking place as the learning rate ramps up in these well-performing runs.
Assessment of Sweep Metric
A fully controlled assessment of the constructed metric
One thing we can do is run a rank correlation analysis on the
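Such a rank correlation can be computed with a hand-rolled Spearman coefficient (shown here without a SciPy dependency; `scipy.stats.spearmanr` would do the same):

```python
def ranks(xs):
    """Ranks of xs (1-based), with ties given their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A coefficient near 1 would mean that a run's early metric value is a reliable predictor of its final ranking, which is exactly what Hyperband's early kills rely on.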
A visual approach is to see which runs above or below the median eval loss or $Mu$ metric survived:
This shows that runs with an eval loss below the median often did, at iteration 125, have a
Conclusion
While experimenting with the wandb infrastructure, Bayesian algorithms, and Hyperband early stopping was interesting, and our
| Hyperparameter | Assignment default | Final selected value |
|---|---|---|
Here is the eval loss chart for sweep 148:
These settings were used to train the model weights used for inference, to good effect. Please continue to the discussion of ablations.