# The case for unit-normalized loss (with back-propagation)

Imagine you're me - struggling to debug unstable training^{[1]} - then, while taking a detour^{[2]} into optimizers, you find out that not only does batch accumulation affect the step-size (the effective learning-rate), the scalar loss value you end up with also scales the gradients.

I had assumed that gradients were automagically normalized by either the framework or the optimizer; neither is the case. This was obvious in hindsight given the chain rule, but to me, coming from evolutionary strategies, it seemed like absolute insanity. How do you test optimizers if the step-size depends entirely on their in-training progress? How do you develop new optimizers, given that gradients can be anywhere from 1e-8 to 1e+8 depending on the loss function?

With this in mind I took a detour from my detour to test a new method: what if gradients, but normalized?

## Brief intermezzo

Since I was already comparing various optimizers for my previous detour, one came to mind that did have 'normalized' step-sizes (generously put): SignSGD.

SignSGD is a bit of an odd case in the optimizer world; it effectively does

`params = params - sign(grad) * alpha`

which is about as simple as it gets, but because the sign is always in $\{-1, 0, 1\}$ the step-size is (mostly) constant.
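As a minimal sketch (not the paper's full method, just the update rule above), a SignSGD step in PyTorch might look like:

```
import torch

def signsgd_step(params, alpha=1e-3):
    # Move each parameter by a fixed step alpha in the direction of the
    # sign of its gradient; the magnitude of the gradient is ignored.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= alpha * torch.sign(p.grad)
```

Every parameter with a non-zero gradient moves by exactly `alpha`, which is what makes the step-size (mostly) constant.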

Maybe you can also see the issue with this: it's terrible and it's optimizer-specific. What we actually need is for the gradient norm to be equal between steps for the same parameters - a part of the model that doesn't need to move much shouldn't have to, but the aggregate amount of movement (the norm of the gradients) should stay approximately the same.

## Unit-normalization

To make sure I didn't fool myself, I first verified the existing belief that scaling the loss does in fact scale the gradients produced by back-propagation:

```
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(10, 10)
y = torch.randn(10, 10)
print("MSE with batchsize=10")

# Mean reduction: the loss is averaged over all output elements
criterion = nn.MSELoss(reduction='mean')
out = model(x)
loss = criterion(out, y)
loss.backward()
print('mean batch\t\t', torch.norm(model.weight.grad))
model.zero_grad()

# Sum reduction: same errors, but the loss (and gradients) are 100x larger
criterion = nn.MSELoss(reduction='sum')
out = model(x)
loss = criterion(out, y)
loss.backward()
print('sum batch\t\t', torch.norm(model.weight.grad))
model.zero_grad()

# Dividing the summed loss by 100 recovers the 'mean' gradients exactly
out = model(x)
loss = criterion(out, y)
(loss / 100).backward()
print('sum batch/(batchsize*10)\t', torch.norm(model.weight.grad))
```

You may notice that we are dividing by `10*batchsize` (this holds for other values too): with `reduction='sum'`, MSE sums the squared error over every compared value - `batchsize * out_features = 10 * 10 = 100` of them - so dividing by that count recovers the mean. This makes normalization very annoying, since we need contextual information about what the loss function is doing, unless...
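To see that the factor really is `batchsize * out_features`, here is a quick sanity check (dimensions chosen to match the snippet above):

```
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 10)
x, y = torch.randn(10, 10), torch.randn(10, 10)

out = model(x)
sum_loss = nn.MSELoss(reduction='sum')(out, y)
mean_loss = nn.MSELoss(reduction='mean')(out, y)

# 'sum' adds up batchsize * out_features = 10 * 10 = 100 squared errors,
# so dividing by 100 recovers the 'mean' loss (and identical gradients).
print(torch.allclose(sum_loss / 100, mean_loss))
```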

What if we divide the loss by its own value (i.e. make the loss equal 1)? Since calling `backward()` on `loss / c` scales every gradient by `1/c`, the gradients end up the same regardless of how the batch is accumulated (pre-backprop):

```
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
for i in range(10):
    if i % 2 == 0:
        # Scale the data up every other step so the raw error keeps growing
        x = torch.randn(10, 10) * (i + 1)
        y = torch.randn(10, 10) * (i + 1)
    model.zero_grad()
    # Alternate between 'sum' and 'mean' accumulation on the same data
    criterion = nn.MSELoss(reduction='sum' if i % 2 == 0 else 'mean')
    out = model(x)
    loss = criterion(out, y)
    (loss / loss.item()).backward()
    print(f'unit-norm {"sum" if i % 2 == 0 else "mean"}\t\t',
          torch.norm(model.weight.grad), f'\terror={loss.item()}')
```

We have effectively normalized the gradients. Even as we increase the underlying error, the gradient norm stays stable.

Note: if you are doing batch accumulation you still need to divide by the number of steps you accumulate over: `(loss / loss.item() / grad_accum).backward()`. Likewise, if you're aggregating within a batch over samples with different expected loss magnitudes, it may help to use `reduction='none'` and do `(loss / loss.detach()).mean().backward()`.
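Putting the accumulation note together, here is a sketch of a unit-normalized accumulation loop (`grad_accum` and the micro-batch shapes are illustrative):

```
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
criterion = nn.MSELoss(reduction='sum')
grad_accum = 4  # number of micro-batches accumulated per optimizer step

model.zero_grad()
for _ in range(grad_accum):
    x, y = torch.randn(8, 10), torch.randn(8, 10)
    loss = criterion(model(x), y)
    # Unit-normalize each micro-batch's loss, then divide by the
    # accumulation count so the summed contribution stays comparable
    # to a single unit-normalized batch.
    (loss / loss.item() / grad_accum).backward()
# optimizer.step() would go here
```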

## But why?

You may ask why bother - a fair question if you're not in the trenches of optimizer and ML research. My response would be: we need to get away from the alchemy. Training can currently destabilize very easily, and as previously mentioned, I identified loss scaling and batch scaling as key factors preventing us from doing sound training comparisons and exploration.

If you have ever read an optimizer paper you'll notice that they always out-perform SGD and Adam, while in actual comparisons they typically under-perform. Here's AdaBelief (watch until the end):

There are arguments for certain optimizers being 'allowed' larger step-sizes - like being better able to handle stochastic noise - but from the video AdaBelief just looks like Adam sped up. So we can't tell whether AdaBelief is simply Adam with a 10x learning-rate or whether it's actually being intelligent about its speed.

For *real* applications this problem gets far worse, as there are many more things to factor in: batch size, batch accumulation, and the outsized effect of 'bad' samples in the dataset (their higher loss values translate directly into larger gradients). You really do not want to have to filter through all of that just to train models or to tune, develop, and test optimizers.

## Wait, doesn't this mess up current training schemes?

If you've been paying attention and thinking along, you may have spotted an issue with normalizing the gradients compared to how back-prop is currently utilized. Normally, as the loss gets lower the step-size automatically decreases (for better or worse); with unit-normalization it doesn't necessarily decrease. Wouldn't this mess up long-tail convergence? And since optimizers currently assume this decay, does this mean we need different optimizers?

To be honest, I don't know - I haven't run enough experiments with this method - but regardless, it's not hard to overcome. We've used `ReduceLROnPlateau` and learning-rate decay forever; the difference is that the schedule now actually does what you'd expect, rather than being modulated by the scaling of your loss function.
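As a sketch of what that looks like in practice (hyper-parameters and data are illustrative), the usual `ReduceLROnPlateau` pattern works unchanged - the scheduler watches the raw loss value while backward sees the unit-normalized one:

```
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=2)

criterion = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 10)
for step in range(20):
    opt.zero_grad()
    loss = criterion(model(x), y)
    (loss / loss.item()).backward()  # unit-normalized gradients
    opt.step()
    sched.step(loss.item())  # schedule on the raw, un-normalized loss
```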

## References?

I didn't actually reference many papers directly (only AdaBelief, SignSGD and Adam), but I did find some interesting things afterwards:

- Mention of the unnormalized gradient issue in Diffusion: https://youtu.be/T0Qxzf0eaio?si=BLpUE_4d5rOQ5nUD&t=2730