Evolutionary Strategies as a PyTorch Optimizer

Motivation

First-order optimizers derive gradients through backprop, and these gradients are $epsilon$-local, where $epsilon$ is the cumulative error of the computation unit (e.g. IEEE 754). This means that the 'step' taken by SGD is effectively computed at $theta_epsilon ~ N(theta, epsilon d)$ rather than exactly at $theta$, where $d$ represents the computational depth from a fixed point (i.e. the input data/loss function). See GradIEEEnt half decent for a lighthearted explainer. The value of $epsilon$ is not the same for all elements of $theta$, but for f32 we can expect it to be <1e-8 in most cases; we leave verifying this as an exercise to the reader.

In $RR$, $epsilon$ would be exactly 0, which means the gradient for any optimization step would be based entirely on the current parameterization $theta_t$ rather than $theta_(t+1)$ (since we don't yet know what the next step will be).

This locality of the gradients is a large contributor to training instability and to getting stuck in local minima, so ways to work around it were devised: first momentum SGD, then a family of adaptive methods (AdaDelta, AdaBound, etc.), and finally the most popular optimizer today, Adam. Most high-performing optimizers today are loosely based on Adam because it is so effective. However, Adam still relies entirely on these local gradients and is therefore still prone to getting stuck in local minima.

Evolutionary Strategies (ES) offers an opportunity to obtain non-local gradient information and escape to a better region of the loss landscape. The main issue with evolutionary strategies is that they require a number of evaluations proportional to (we hypothesize) $O(log d)$ for $theta in RR^(d)$, which means on the order of $20c$ evaluations per batch (for some constant $c$) for a 1B-parameter model.

This article will focus on four proposed techniques to integrate Evolutionary Strategies into a 'regular' optimizer:

  • Evolutionary SGD (Using the backprop gradients from different perturbations of $theta$ and combining those)
  • Evolutionary Adam (Using ES as momentum data for an adaptive optimizer)
  • Evolutionary Strategies (Using ES as-is to feed a secondary optimizer)
  • Combined ES and ESGD

Evolutionary SGD

Evolutionary SGD tries to combine the best of both worlds: it uses back-prop for its gradient derivation rather than the rank-weighted sum of loss values used in regular ES, and it does not repeat the same batch across different $theta$s; rather, it still assumes stochastic gradients.

  • For $k$ mini-batches
    • Taking $theta_t$ we create a perturbed version $hat theta_t ~ N(theta_t, sigma)$,
    • We execute a single mini-batch of our regular back propagation protocol using $hat theta_t$
  • Take an SGD step

$k$ can be any integer, even 1 (step based on a randomly offset $theta$).

This method is almost as compute-efficient as batch-accumulated SGD; the only additional step is generating the perturbations, which makes it an attractive option to evaluate.
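A minimal PyTorch-style sketch of one such step follows; the names model, loss_fn, batches, and opt are placeholders (not from any library), and this is a sketch of the idea rather than a tuned implementation:

import torch

def esgd_step(model, loss_fn, batches, opt, sigma=1e-2):
    # One ESGD step: accumulate back-prop gradients from k perturbed copies
    # of theta, then take a single step with a plain SGD optimizer (opt).
    opt.zero_grad()
    params = [p for p in model.parameters() if p.requires_grad]
    for batch in batches:                                       # k mini-batches
        noise = [torch.randn_like(p) * sigma for p in params]  # hat theta ~ N(theta, sigma)
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(n)
        loss_fn(model, batch).backward()                        # grads accumulate in p.grad
        with torch.no_grad():
            for p, n in zip(params, noise):                     # restore the master weights
                p.sub_(n)
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.grad.div_(len(batches))                       # average over the k samples
    opt.step()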

One more parameter can be added to improve convergence guarantees: antithetical perturbation sampling. For any sampling of $hat theta_t$ you have a seed from which you get a delta $DeltaS ~ N(0, sigma)$; with antithetical sampling you sample using $hat theta_t larr theta_t +- DeltaS$ so that both opposite directions receive equal gradient accumulation. In practice this means you modify the algorithm as follows (see the sketch after the list):

  • For $k$ mini-batches
    • Sample $DeltaS ~ N(0, sigma)$
    • Execute a mini-batch of back-prop using $hat theta_t larr theta_t + DeltaS$
    • Execute a mini-batch of back-prop using $hat theta_t larr theta_t - DeltaS$
  • Take an SGD step
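The change relative to the previous sketch is confined to the inner loop; a sketch of the antithetical variant, reusing the same placeholder names:

for batch in batches:                                       # k mini-batches
    noise = [torch.randn_like(p) * sigma for p in params]  # DeltaS ~ N(0, sigma)
    for sign in (1.0, -1.0):                                # theta + DeltaS, then theta - DeltaS
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(sign * n)
        loss_fn(model, batch).backward()
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.sub_(sign * n)
# afterwards, average the accumulated gradients over 2k samples before stepping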

The advantages of this model are plentiful: fast, scalable, local-minima-avoiding optimization. However, we have strayed quite far from backprop-SGD and ES, so we have lost any convergence guarantees based on those theories. That said, since $epsilon_(Delta theta) ~ N(0, epsilon d)$ is already the case in backprop, we can view the modification as simply increasing the effective error that would normally come from numerical instability. Another way to look at it is that it dynamically increases the area in parameter space over which the gradient is (on average) calculated, from $epsilon d$ to $epsilon d sigma$.

Empirically this method shows unreasonably good results in MomoML's OptimizerBench, beating all 61 other optimizers on Ackley (for $d in ["10k", "100k"]$) without even using the master weights for the evaluation. To be exact, the average rank of ESGD is 60.25; the next best is SGD with 49.14. This is a strong indication that ESGD differs significantly from existing optimizers and can beat them in parametric-noise scenarios.

[TODO: table-ize the json with first/second/third bold/italic/underscore]

{'adabelief': 23.0, 'adabound': 48.0, 'pid': 19.879999999999985, 'adamp': 23.29, 'adai': 37.41000000000002, 'adan': 4.610000000000001, 'adamod': 3.8799999999999986, 'adapnm': 10.720000000000004, 'diffgrad': 31.590000000000003, 'lamb': 30.060000000000006, 'lars': 28.859999999999992, 'qhadam': 32.93999999999999, 'qhm': 32.54000000000001, 'madgrad': 33.77, 'nero': 34.19, 'pnm': 32.11, 'msvag': 36.779999999999994, 'radam': 40.65000000000001, 'ranger': 31.709999999999994, 'ranger21': 30.979999999999986, 'sgdp': 41.009999999999984, 'scalableshampoo': 41.089999999999975, 'dadaptadagrad': 37.03000000000001, 'fromage': 35.64999999999998, 'aggmo': 36.49000000000001, 'dadaptadam': 32.41, 'dadaptsgd': 34.6, 'dadaptadan': 36.28000000000001, 'adams': 34.87, 'adafactor': 39.12999999999999, 'apollo': 36.0, 'swats': 36.93999999999999, 'novograd': 33.28999999999999, 'lion': 30.120000000000005, 'sm3': 28.67000000000001, 'adanorm': 24.480000000000004, 'a2grad': 18.03000000000001, 'accsgd': 22.090000000000003, 'sgdw': 23.95000000000001, 'yogi': 22.26000000000001, 'asgd': 25.03, 'adamax': 21.72000000000001, 'gravity': 22.220000000000002, 'adasmooth': 26.439999999999994, 'srmm': 31.13999999999999, 'avagrad': 33.54000000000001, 'adashift': 27.019999999999996, 'adadelta': 18.080000000000005, 'amos': 24.61, 'sophiah': 28.680000000000014, 'signsgd': 26.91, 'prodigy': 19.55, 'padam': 22.989999999999995, 'tiger': 32.17999999999999, 'came': 40.25000000000001, 'dadaptlion': 45.10999999999999, 'aida': 40.26, 'galore': 15.449999999999998, 'adalite': 26.699999999999996, 'adam': 42.37, 'sgd': 49.14, 'esgd': 60.25000000000001}
final_ranking={'adamod': 0, 'adan': 1, 'adapnm': 2, 'galore': 3, 'a2grad': 4, 'adadelta': 5, 'prodigy': 6, 'pid': 7, 'adamax': 8, 'accsgd': 9, 'gravity': 10, 'yogi': 11, 'padam': 12, 'adabelief': 13, 'adamp': 14, 'sgdw': 15, 'adanorm': 16, 'amos': 17, 'asgd': 18, 'adasmooth': 19, 'adalite': 20, 'signsgd': 21, 'adashift': 22, 'sm3': 23, 'sophiah': 24, 'lars': 25, 'lamb': 26, 'lion': 27, 'ranger21': 28, 'srmm': 29, 'diffgrad': 30, 'ranger': 31, 'pnm': 32, 'tiger': 33, 'dadaptadam': 34, 'qhm': 35, 'qhadam': 36, 'novograd': 37, 'avagrad': 38, 'madgrad': 39, 'nero': 40, 'dadaptsgd': 41, 'adams': 42, 'fromage': 43, 'apollo': 44, 'dadaptadan': 45, 'aggmo': 46, 'msvag': 47, 'swats': 48, 'dadaptadagrad': 49, 'adai': 50, 'adafactor': 51, 'came': 52, 'aida': 53, 'radam': 54, 'sgdp': 55, 'scalableshampoo': 56, 'adam': 57, 'dadaptlion': 58, 'adabound': 59, 'sgd': 60, 'esgd': 61}

TODO: test/compare scaling gradients by their running_mean(loss)/ loss; or check gradients are normalized by default by their loss

TODO: create contour animation of a batch being accumulated (quiver at each sampled point + average quiver at $theta$)

TODO: create success-rate table based on MomoML tasks comparing to SGD, Adam (and top optimizer)

Combined ES and ESGD

Running ES and SGD at the same time efficiently can be difficult, since ES requires knowing all previously evaluated $hat theta_t$, which costs a lot of memory. This can be avoided by saving only the seed for each $DeltaS$ sampled. The final optimizer then becomes relatively straightforward: similar to ESGD, you run the model for $k$ steps to accumulate backprop gradients while keeping track of $<<"seed", "loss">>$ pairs; when it is time to take an optimization step you rank the losses and take a weighted average of the perturbations re-derived from the seeds:

$ L_"rank"(theta; f; R_theta) = |{r in R_theta: r <= f(theta)}|$

$R_theta$ is all the individual loss values in a batch. Then the optimization step is:

$ theta_(t+1) = theta_t - 1/k sum_(i=1)^(k) (alpha L_"rank"(theta_t + DeltaS_i) DeltaS_i + (1-alpha) grad_theta L(theta_t + DeltaS_i)): DeltaS_i ~ N(0, sigma)$

$alpha$ determines the weighting between backprop and ES in the final gradient.

Interestingly enough, this algorithm can be run for the same memory cost as SGD with momentum (i.e. 2x the trainable parameters), since all of the ES operations can be done in-place once the ranking is known and the perturbations can be re-derived from their seeds. The only limitation is that it requires gradient accumulation, and while this is already done by default in many distributed training settings, it is an additional inflexibility.
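A rough sketch of this combined step is below, reusing the placeholder names from earlier; the re-derivation of each DeltaS from its seed uses torch.Generator, the learning rate lr is applied to the whole combined update, and CPU tensors are assumed for brevity:

import torch

def combined_step(model, loss_fn, batches, lr=0.1, sigma=1e-2, alpha=0.5):
    params = [p for p in model.parameters() if p.requires_grad]
    for p in params:
        p.grad = None
    records = []                                   # (seed, loss) pairs; no perturbations stored
    for batch in batches:                          # k accumulation steps
        seed = int(torch.randint(0, 2**31 - 1, ()).item())
        gen = torch.Generator().manual_seed(seed)
        noise = [torch.randn(p.shape, generator=gen) * sigma for p in params]
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.add_(n)                          # evaluate at theta + DeltaS_i
        loss = loss_fn(model, batch)
        loss.backward()                            # back-prop part accumulates in p.grad
        with torch.no_grad():
            for p, n in zip(params, noise):
                p.sub_(n)
        records.append((seed, loss.item()))
    k = len(records)
    losses = [l for _, l in records]
    with torch.no_grad():
        for seed, loss in records:
            rank = sum(1 for r in losses if r <= loss)       # L_rank from the definition above
            gen = torch.Generator().manual_seed(seed)        # re-derive DeltaS_i in place
            for p in params:
                delta = torch.randn(p.shape, generator=gen) * sigma
                p.sub_(lr * alpha * rank * delta / k)        # ES component
        for p in params:
            if p.grad is not None:
                p.sub_(lr * (1 - alpha) * p.grad / k)        # back-prop component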

Appendix A: Experiment Details

OptimizerBench

For the experiments in this article, a hyperparameter optimization budget of 8 was used with the following search space:

import numpy as np
from itertools import product

# Three candidate values per beta, on a geometric grid; the exponent controls
# how close the candidates sit to 1.
beta_budget = 3
beta1 = 1 - np.geomspace(1e-3, 0.99, num=beta_budget) ** (1 / 3)
beta2 = 1 - np.geomspace(1e-3, 0.99, num=beta_budget) ** (1 / 1.5)
beta3 = 1 - np.geomspace(1e-3, 0.99, num=beta_budget)

def beta_sampler(default):
    # Pick a candidate grid per beta based on the optimizer's default value,
    # then return the Cartesian product of the selected grids.
    selection = []
    for v in default:
        if v >= 0.995:
            selection.append(beta3.tolist())
        elif v >= 0.985:
            selection.append(beta2.tolist())
        else:
            selection.append(beta1.tolist())
    return product(*selection)

lr_search_space = np.geomspace(1e-4, 128, num=16).tolist()
sigma_search_space = np.linspace(1e-2, 2, num=5).tolist()

{
     'lr': (lambda _: lr_search_space),
     'sigma': (lambda _: sigma_search_space),
     'update_period': (lambda _: [2]),
     'betas': beta_sampler,
     'num_iterations': (lambda _: [100]),
     'p': (lambda _: [1, 2])
 }
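As a concrete example (a hypothetical call, not part of the benchmark code), Adam-style defaults map beta1 to the low band and beta2 to the high band, so the sampler yields 3 x 3 = 9 candidate pairs:

for betas in beta_sampler((0.9, 0.999)):
    print(betas)    # 9 (beta1, beta2) combinations drawn from the beta1 and beta3 grids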

[TODO: provide optimal parameters for optimizer per task in tables]

Each optimizer was run for 100 seeds for 1000 steps. Our methods, which rely on a form of gradient accumulation, were always given the same number of batches as the other optimizers, effectively reducing the number of optimization steps they could take to $1/k$ of that of the other optimizers.

Appendix B: OptimizerBench Methodology

[Maybe separate article]

Finding the best optimizer is a bit of an impossible task, given that there are many hyper-parameters to tune, many tasks to test on, and divergent results across tasks. Making an informed judgement on the choice of optimizer requires re-framing what 'best' means. Likewise, 'loss' or 'error' are meaningless metrics on their own, as they are entirely dependent on the task.

OptimizerBench therefore focuses on ranks of optimizers, i.e. a method that can be used to reliably compare any pair of optimizers while also satisfying partial-ordering constraints.

The method we've settled on is the following:

$"Compare"(A, B) = (sum_(i->n) A_i > B_i) > (sum_(i->n) B_i > A_i): n = |A| = |B|$

Where $A$ & $B$ are the history of losses on on a particular task on the same seed. If multiple samples; you sample multiple $A$ and $B$ using the same seeds and average them; the exact method of averaging can be debated, but since in all runs they start at the same point it should give a fairly good overview.

You can also use the same method to find the best hyper-parameters for a single optimizer by comparing the results from different settings, treating them as separate optimizers (which they often, effectively, are).

Additionally, to prevent weight decay and similar biased estimators from contributing to the evaluation, we sample (again from the seed) a random offset; this shifts the entire problem space, including starting parameters and bounds, by a scalar sampled from $"offset" ~ N(0, 2)$. For unbiased optimizers such as SGD this offset is effectively a no-op.
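A sketch of how such a seeded offset might be applied to a benchmark problem; offset_problem, loss_fn, bounds, and x0 are hypothetical names, and 2 is treated as the scale of the normal here:

import numpy as np

def offset_problem(loss_fn, bounds, x0, seed):
    # One scalar offset ~ N(0, 2), derived from the seed, shifts the loss
    # landscape, the bounds, and the starting point, so a pull toward the
    # origin (e.g. from weight decay) no longer lines up with the optimum.
    offset = np.random.default_rng(seed).normal(0.0, 2.0)

    def shifted_loss(x):
        return loss_fn(x - offset)

    shifted_bounds = [(lo + offset, hi + offset) for lo, hi in bounds]
    return shifted_loss, shifted_bounds, x0 + offset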

In the process of building and testing OptimizerBench I also discovered a few bugs in pytorch-optimizer[1][2], indicating that many optimizers may be faulty at the implementation level and are simply not used enough to attract the attention needed to get them fixed or investigated.