Gradient approximations in derivative-free optimization

calvinmccarter.substack.com

Most of modern ML relies heavily on gradient descent and its variants. We have some loss function we want to minimize; if the loss function we actually care about is not differentiable, we modify it until we have something that is. Then we minimize it; if the loss function is non-convex, we don't worry too much about it. Typically, we do something like stochastic gradient descent (SGD) on a loss function that corresponds to empirical risk minimization (ERM). At each optimization step, we compute a noisy approximation of the loss function and its gradient from a random sample (mini-batch) of our dataset. We hope that this noise helps us overcome non-convexity issues. Actually, we typically don't do vanilla SGD; we use techniques like momentum and Adam, which reuse gradient estimates from previous optimization steps to smooth the updates and, in Adam's case, rescale each coordinate, which loosely accounts for the curvature of the loss function. On the rarer occasions where we don't have lots of samples, we skip the “stochastic” part of SGD and often use quasi-Newton methods like L-BFGS, which approximate the curvature (Hessian), again using gradients from previous steps.
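To make this recipe concrete, here is a minimal sketch of mini-batch SGD with heavy-ball momentum on a toy least-squares problem. The synthetic data, the function names (`make_data`, `minibatch_gradient`, `sgd_momentum`), and the hyperparameters are all assumptions chosen for illustration; none of them come from a particular library.

```python
import random


def make_data(n, true_w, true_b, noise, rng):
    """Synthetic 1-D regression data: y = true_w * x + true_b + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x + true_b + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def minibatch_gradient(w, b, batch):
    """Gradient of the mean squared error over one mini-batch.

    This is a noisy estimate of the gradient of the full empirical risk.
    """
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2.0 * err * x
        gb += 2.0 * err
    return gw / len(batch), gb / len(batch)


def sgd_momentum(data, steps=2000, batch_size=16, lr=0.02, beta=0.9, seed=0):
    """Mini-batch SGD with heavy-ball momentum (illustrative hyperparameters)."""
    rng = random.Random(seed)
    w = b = 0.0    # parameters
    vw = vb = 0.0  # velocity: exponentially decayed sum of past gradients
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random sample from the dataset
        gw, gb = minibatch_gradient(w, b, batch)
        vw = beta * vw + gw                   # reuse gradients from earlier steps
        vb = beta * vb + gb
        w -= lr * vw
        b -= lr * vb
    return w, b


data = make_data(500, true_w=3.0, true_b=-1.0, noise=0.1, rng=random.Random(42))
w, b = sgd_momentum(data)
print(f"estimated w={w:.3f}, b={b:.3f}")  # should land near the true values 3.0 and -1.0
```

The velocity terms are what carry information from previous steps. Setting `beta=0` recovers vanilla SGD, where each update depends only on the current mini-batch.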
