Learning Rate Sweep Analysis

Introduction

For the parameter sweep exercise, the learning rate was chosen as a target because it can be varied without altering more fundamental aspects of the model such as its dimensions, and because it provides an opportunity to observe unstable training runs. The general goal was to find the highest maximum learning rate that still exhibited stable training behavior.

This exercise also provided an opportunity to become familiar with the WandB Sweeps API, which provides easy access to advanced features such as Hyperband early stopping and Bayesian parameter-search schedulers.

To review, the model's AdamW optimizer accepts a learning rate parameter that provides bulk control over the size of each model parameter update. During training this learning rate is set by a scheduler routine, which executes a linear warmup from 0 to the maximum LR over the configured warmup iterations, then decays from the maximum LR to the minimum according to a cosine schedule over the remaining training iterations.
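The schedule described above can be sketched as follows (a minimal illustration; the names `max_lr`, `min_lr`, `warmup_iters`, and `total_iters` are placeholders, not the actual variable names from the training code):

```python
import math

def get_lr(it, max_lr, min_lr, warmup_iters, total_iters):
    """Linear warmup from 0 to max_lr, then cosine decay to min_lr."""
    if it < warmup_iters:
        # Linear ramp during warmup.
        return max_lr * it / warmup_iters
    # Cosine decay over the remaining iterations.
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

At iteration 0 this returns 0, reaches `max_lr` at the end of warmup, and lands exactly on `min_lr` at the final iteration.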

As such, the maximum learning rate is the key variable. The assignment default set this at , with the minimum going down to . For the sweeps, we configured maximum rates to fall in a range between and . In retrospect, since the default maximum learning rate was stable we could have used it as a floor, but since the sweeps were running locally overnight there was plenty of opportunity for the Bayesian algorithm to search in that region.

Initial Sweep Configuration and Summary

An initial set of sweeps was set to run for a maximum of 1000 iterations with 100 iterations of warmup. Evaluation metrics such as eval loss were collected every 50 iterations, with early termination available after 250 iterations.

Early termination was handled by the Hyperband algorithm; however, due to an inconsistency in how the number of iterations is tracked, with only one eval loss available every 50 iterations, early termination never kicked in.

Nonetheless, 23 runs completed. Here is a summary:

The outlier in run 4 jumps out. It displayed a nearly monotonic decline in both training loss and eval loss, and stabilized at a far lower eval loss than any other run. We confirmed that our run settings, with , should look at the last eval loss value available in the sweep and try to minimize it when selecting the next run.

Although the run 4 learning rate is close to the bottom of the range, it is still within the range of allowed values; the use of a uniform sampling distribution instead of a log-uniform one caused values in this lower region to be underrepresented.

Nonetheless, both the assignment handout and our research indicate that a maximum learning rate "at the edge of stability" is ideal, due to a "catapult" effect that pushes parameters into a "wide", "general" valley of the loss landscape. So instead of treating the sweeps as finding definitive maximum LRs, we look at each iteration's step-level LR and the corresponding gradient norm, in order to orient ourselves toward the region with the desired almost-unstable characteristics.

Based on this analysis, it seems peak learning rates in the range of to might be good. However, we should choose based on the impact the learning rate will actually have on our model, which depends on the gradient norm clipping.

The spikes between and are due to the lack of learning rate datapoints, which skews the ratios. The range from to seems promising, making the relative lack of datapoints there all the more unfortunate. Let's run another round of sweeps, this time including other optimizer parameters. But first, let's think of a metric that will deliver the behavior we want, so we can feed something useful into Hyperband.

We want a composite loss function that requires an increase in loss near the max LR and otherwise looks for low eval loss. First, this means more data points, so we tighten the eval interval to once every 10 iterations. Then we can create a rule: first, take a rolling average of, say, 5 values of eval loss as , and the of the learning rate between this and the last eval iteration as . So one possibility is to track the rolling average at any iteration (in the rolling average) and watch for increasing LR:

and sum these to create the metric for the optimizer to minimize. This is pretty simple, and leaves it to the Hyperband setup to manage early stopping, which we will want to be aggressive in order to weed out runs that don't display the desired initial increase in loss. The concern then becomes that, for the surviving runs, we want to weight the eval losses at the end of the run more heavily, and so we add an exponential weighting factor . We thus end up with our Hyperband minimization objective :
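As a schematic sketch of the end-weighted summation just described (the window of 5 eval points and the exponential factor follow the text; the function name `sweep_metric` is a placeholder, and the loss-increase penalty term is omitted here for brevity):

```python
def sweep_metric(eval_losses, window=5, kappa=1.02):
    """Sum of end-weighted rolling-average eval losses.

    eval_losses: per-eval-iteration loss series.
    window: rolling-average width in eval points.
    kappa: exponential end-weighting factor (>1 favors late losses).
    """
    total = 0.0
    for t in range(window, len(eval_losses) + 1):
        # Rolling average over the last `window` eval losses.
        rolling = sum(eval_losses[t - window:t]) / window
        # Later eval points receive exponentially larger weight.
        total += kappa ** (t - window) * rolling
    return total
```

Hyperband then minimizes this quantity, so runs whose late-stage eval loss stays high are penalized most.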

This was amended upon implementation to use

which raises the question: what's a good κ? Given sweeps of 1000 iterations with the metrics being calculated every 5 "eval iterations" and averaged over 5 eval iterations, for a total of 200 eval iterations with the eval metric not available for the first 5, we have

| κ | last/first (κ^194) | Interpretation |
| --- | --- | --- |
| 1.01 | ~6.9x | Mild end-weighting -- warmup signal still meaningful for Hyperband |
| 1.015 | ~18x | Moderate -- warmup matters for Hyperband, final perf dominates total |
| 1.02 | ~47x | Firm tilt toward end -- last ~40 eval points dominate |
| 1.03 | ~310x | Aggressive -- warmup is noise, only last ~30 points matter |
| 1.05 | ~13,000x | Very aggressive -- effectively only the final handful of points count |
| 1.10 | ~10^8x | Extreme -- only the very last point matters |
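The last/first column is easy to sanity-check directly (194 is the number of weighted eval points minus one):

```python
# Recompute the end-weighting ratio between the last and first eval point
# for each candidate kappa from the table above.
for kappa in (1.01, 1.015, 1.02, 1.03, 1.05, 1.10):
    print(f"kappa={kappa}: last/first = {kappa ** 194:,.1f}x")
```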

Ultimately, we preferred .

We can then configure Hyperband with an eta of 2 and a min_iter of 25, which for the wandb Hyperband implementation means 25 eval iterations, giving kill points at 25, 50, 100, and 200 iterations.
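In sweep-config terms this looks roughly like the following (the `early_terminate` keys follow the wandb sweep-config schema; the metric name and the parameter bounds shown are placeholders, not our actual values):

```python
# Sketch of a wandb sweep config with Hyperband early termination.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "sweep_metric", "goal": "minimize"},
    "early_terminate": {
        "type": "hyperband",
        "eta": 2,        # halving factor between brackets
        "min_iter": 25,  # first kill point; brackets at 25, 50, 100, 200
    },
    "parameters": {
        # Placeholder bounds; log-uniform avoids under-sampling small LRs.
        "max_lr": {"distribution": "log_uniform_values",
                   "min": 1e-4, "max": 1e-2},
    },
}
# sweep_id = wandb.sweep(sweep_config, project="lr-sweep")
```

With `eta=2` and `min_iter=25`, each successive bracket doubles the survival threshold, reproducing the 25/50/100/200 kill points.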

Second Sweep Configuration and Summary

We then experimented with a larger set of sweeps to assess

  1. whether the eval metric effectively encourages survival of runs that display initially increasing loss
  2. how the Bayesian algorithm implements an "explore/exploit" approach to searching the hyperparameter space
  3. which hyperparameter values are optimal

Bayesian Sweep Analysis

A few additional hyperparameters were considered for the second phase of sweeps. Based on our research, we found that gradient norm clipping is almost always kept at 1, and that the AdamW optimizer's value is similarly kept at 0.99 while is kept at . As such, the parameters swept were

Viewing this dynamically, we can see that the search tended toward decent amounts of "exploit" when it found a well-functioning combination of hyperparameters, while once it got into the "excessive" zone of ~300 runs it began to explore the edges of its allowed distributions more aggressively.

We can also look at k-means clusters for runs that survived to the end of training:
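To illustrate the clustering step, here is a minimal hand-rolled Lloyd's-iteration sketch over (max LR, final eval loss) pairs; in practice something like `sklearn.cluster.KMeans` on the normalized hyperparameters does this properly, and the data points here are illustrative:

```python
def kmeans(points, centers, iters=20):
    """Tiny Lloyd's k-means: points and centers are tuples of floats."""
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean).
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # Move each center to the mean of its assigned points.
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers
```

Running this on the surviving runs' (LR, loss) pairs separates the stable low-loss cluster from the unstable high-LR one.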

Ultimately, we see that higher max learning rates did not directly lead to lower overall eval losses. We suspect this may be due to the relatively smaller model having a simpler loss function geometry, making extensive "catapulting" less necessary.

What we do see, when we examine the runs with the lowest overall eval losses, is that there is an initial flatness in the eval loss during warmup, then a steep drop, followed by a slight convexity in the curve. This suggests there is a level of loss space exploration taking place as the learning rate ramps up in these well-performing runs.

Assessment of Sweep Metric

A fully controlled assessment of the constructed metric would require running the same sweep while only changing the sweep metric used. Without that, we're left with a counterfactual question of which runs would have survived had the metric used by Hyperband been different, which is impossible to directly answer due to the opacity of the service.

One thing we can do is run a rank correlation analysis on the metric against the eval loss near the points Hyperband would kill:
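Spearman's rank correlation is the natural tool here since it compares orderings rather than raw values. A hand-rolled version for clarity (in practice `scipy.stats.spearmanr` does the same and also handles ties properly):

```python
def spearman(xs, ys):
    """Spearman's rho between two equal-length series (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    # Pearson correlation computed on the ranks.
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A rho near 1 between our metric and the eval loss at a kill point would mean Hyperband kills roughly the same runs either way.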

A visual approach is to see which runs with above- or below-median eval loss or $\mu$ metric survived:

This shows that runs with a below-median eval loss at iteration 125 often had an above-median metric, suggesting that the metric helped these runs survive. We also see that many runs had a high learning rate and displayed unstable training characteristics, and that these areas of the hyperparameter space were explored repeatedly throughout the sweep process, from early until late, as desired.

Conclusion

While experimenting with the wandb infrastructure, Bayesian algorithms, and Hyperband early stopping was interesting, and our metric was a clever way to "juice" the desired behavior out of that infrastructure, ultimately we simply took the run with the lowest final eval loss, and used

| Hyperparameter | Assignment default | Final selected value |
| --- | --- | --- |
| (weight decay) | | |

Here is the eval loss chart for sweep 148:

These settings were used to train the model weights used for inference, to good effect. Please continue to the discussion of ablations.