# Turning Adam Optimization into SGD

02 Jul 2018

## Motivation

This strange question came up while I was working on a machine learning project to generate embeddings. Working with the version of PyTorch available on our DGX (similar to version 0.3.1), I found there was an optimizer called SparseAdam but not one called SparseSGD. Since what I really wanted to do was use SGD, I wondered: could I turn the Adam optimizer into an SGD optimizer by setting the hyperparameters $\beta_1$, $\beta_2$, and $\epsilon$?

Probably not. Looking at the original paper for Adam, the formula for the parameter updates is:

$$\theta_t = \theta_{t-1} - \alpha \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

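To make this equal to plain gradient descent, $\theta_t = \theta_{t-1} - \alpha g_t$, we need the fraction multiplying the step size, $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, to equal the gradient $g_t$.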

Luckily, $m_t$ is directly related to the gradient, via the equations:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

Clearly, setting $\beta_1=0$ gives $m_t = g_t$. Note that the normalization then divides by $1 - \beta_1^t = 1$, so it doesn’t change $m_t$.
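
As a quick sanity check, here is a minimal pure-Python sketch of the first-moment update and its bias correction for a single scalar parameter, following the formulas from the Adam paper. The function name `first_moment` is made up for illustration and is not part of PyTorch.

```python
def first_moment(grads, beta1):
    """Return the bias-corrected first moment after each step (scalar case)."""
    m = 0.0  # Adam initializes the first moment to zero
    corrected = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving average of the gradient
        m = beta1 * m + (1.0 - beta1) * g
        # Bias correction; with beta1 = 0 the divisor is exactly 1
        corrected.append(m / (1.0 - beta1 ** t))
    return corrected


grads = [0.5, -2.0, 3.0]
print(first_moment(grads, beta1=0.0))  # [0.5, -2.0, 3.0]
```

With $\beta_1 = 0$ the output is just the list of gradients, so the numerator of the Adam update behaves exactly like SGD.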

The problem is the term $\hat{v}_t$, defined as:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

We want this term to equal 1, so that it disappears from the fraction. However, setting $\beta_2=0$ makes $\hat{v}_t$ equal to the square of the gradient, so the update becomes $g_t / (|g_t| + \epsilon)$, which is roughly the sign of the gradient. Setting $\beta_2 = 1$ makes the normalization divide by $1 - \beta_2^t = 0$. Because of this, I don’t see a way to convert Adam into SGD. The gradient normalization is just built too deeply into the algorithm.
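
To see this numerically, here is a rough pure-Python sketch of the Adam update for a single scalar parameter, again following the paper's formulas. The function `adam_updates` and its default values are my own choices for illustration, not PyTorch's implementation.

```python
import math


def adam_updates(grads, lr=0.1, beta1=0.0, beta2=0.0, eps=1e-8):
    """Return the amount subtracted from the parameter at each step."""
    m = v = 0.0  # both moment estimates start at zero
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)  # divides by zero when beta2 == 1
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.01, 1.0, -100.0]
print(adam_updates(grads))        # every entry is close to +lr or -lr
print([0.1 * g for g in grads])   # plain SGD scales with the gradient

try:
    adam_updates(grads, beta2=1.0)
except ZeroDivisionError:
    print("beta2 = 1 breaks the bias correction")
```

With $\beta_1 = \beta_2 = 0$ every step has nearly the same magnitude regardless of how large the gradient is, while SGD's steps span several orders of magnitude for the same gradients.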

## Conclusion

I don’t think it is possible. And after reading the docs again, SGD is already compatible with sparse matrices, so this was completely unnecessary. It was a fun thought exercise though.