Deep Learning - Gradient Checking

First Step To Debugging A Neural Net

Posted by Rico's Nerd Cluster on January 4, 2022

How To Do Gradient Checking

In calculus, we all learned that a derivative is defined as the limit of a difference quotient; with a small but finite $\epsilon$, it can be approximated by the forward difference:

\[\begin{gather*} f'(\theta) = \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} \end{gather*}\]

Source: Andrew Ng's Deep Learning Class

In a neural net, the gradient of a parameter $w$ w.r.t. the cost function can be approximated the same way. However, the numerical error of this method is on the order of $\epsilon$: e.g., with $\epsilon = 0.01$, the approximation error is also on the order of $0.01$. Why? Please close your eyes and think for a moment before moving on.

Because:

\[\begin{gather*} f(\theta + \epsilon) = f(\theta) + \epsilon f'(\theta) + O(\epsilon^2) \\ \Rightarrow \\ \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} = f'(\theta) + \frac{O(\epsilon^2)}{\epsilon} = f'(\theta) + O(\epsilon) \end{gather*}\]

One way to reduce this error is to use the central difference formula, whose error is on the order of $\epsilon^2$, to verify your grads are correct:

\[\begin{gather*} f'(\theta) = \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon} \end{gather*}\]
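To see the two error orders concretely, here is a small sketch in plain Python, using $f = \sin$ (chosen here for illustration) so the true derivative is known:

```python
import math

# f(x) = sin(x), so f'(x) = cos(x); compare both schemes at x = 1.0
f, x = math.sin, 1.0
exact = math.cos(x)

for eps in (1e-1, 1e-2, 1e-3):
    forward = (f(x + eps) - f(x)) / eps                # error ~ O(eps)
    central = (f(x + eps) - f(x - eps)) / (2 * eps)    # error ~ O(eps^2)
    print(f"eps={eps:g}  forward err={abs(forward - exact):.2e}  "
          f"central err={abs(central - exact):.2e}")
```

Shrinking $\epsilon$ by 10x shrinks the forward-difference error roughly 10x, but the central-difference error roughly 100x.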

To apply gradient checking on a single parameter $w_i$:

  1. Run forward prop and backprop to get the analytic gradient of $w_i$, call it $g_i$.
  2. Perturb $w_i$ by $\pm \epsilon$ and run forward prop only (no backprop needed) to get the numerical gradient $g_i' = \frac{J(w_i + \epsilon) - J(w_i - \epsilon)}{2\epsilon}$.
  3. Calculate the relative difference:
\[\begin{gather*} \frac{||g_i - g_i'||}{||g_i|| + ||g_i'||} \end{gather*}\]

If the result is above $10^{-3}$, then we should worry about our backprop implementation.
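The three steps above can be sketched for a single scalar parameter (a self-contained example, assuming PyTorch; the cost $J(w) = w \sin w$ is made up for illustration):

```python
import torch

eps = 1e-4
# double precision keeps floating-point round-off well below the 1e-3 threshold
w = torch.tensor(0.7, dtype=torch.float64, requires_grad=True)
cost = lambda: torch.sin(w) * w          # J(w) = w * sin(w), illustrative only

# Step 1: analytic gradient from backprop
cost().backward()
g = w.grad.item()                        # sin(w) + w*cos(w)

# Step 2: numerical gradient from two forward passes (central difference)
with torch.no_grad():
    fp = (torch.sin(w + eps) * (w + eps)).item()
    fm = (torch.sin(w - eps) * (w - eps)).item()
g_num = (fp - fm) / (2 * eps)

# Step 3: relative difference between analytic and numerical gradients
rel = abs(g - g_num) / (abs(g) + abs(g_num))
assert rel < 1e-3   # anything above ~1e-3 would warrant investigation
```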

The code below computes the central difference for every entry of a parameter tensor:

\[\frac{f(x+\epsilon) - f(x - \epsilon)}{2 \epsilon}\]
import torch

def _numerical_grad(fn, x, eps=1e-3):
    # Central-difference estimate of d fn() / d x, one entry at a time
    grad = torch.zeros_like(x)
    x_flat = x.view(-1)
    grad_flat = grad.view(-1)
    with torch.no_grad():  # perturb x in place without tracking gradients
        for i in range(x.numel()):
            orig = x_flat[i].item()
            x_flat[i] = orig + eps
            fp = fn().item()
            x_flat[i] = orig - eps
            fm = fn().item()
            x_flat[i] = orig  # restore the original value
            grad_flat[i] = (fp - fm) / (2.0 * eps)
    return grad

Another Method - Apply Gradient And See Decrease In Loss

  • You can also apply the gradient to the inputs and evaluate the loss again; if the gradient is correct, a small step against it should decrease the loss. That is exactly one step of gradient descent.
loss_before = _chamfer_loss(p1, p2)
loss_before.backward()

with torch.no_grad():
    p1_stepped = p1 - 0.01 * p1.grad  # one small gradient-descent step
loss_after = _chamfer_loss(p1_stepped, p2)

assert loss_after.item() < loss_before.item(), \
    f"Gradient step did not reduce loss: {loss_before.item():.4f} -> {loss_after.item():.4f}"
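Since `_chamfer_loss`, `p1`, and `p2` are defined elsewhere, here is a self-contained version of the same check using a stand-in mean-squared-error loss (the names and loss are illustrative, not the original setup):

```python
import torch

p1 = torch.randn(10, 3, requires_grad=True)
p2 = torch.randn(10, 3)
loss_fn = lambda a, b: ((a - b) ** 2).mean()  # stand-in for the real loss

loss_before = loss_fn(p1, p2)
loss_before.backward()

with torch.no_grad():
    p1_stepped = p1 - 0.01 * p1.grad  # one small gradient-descent step
loss_after = loss_fn(p1_stepped, p2)

assert loss_after.item() < loss_before.item(), (
    f"Gradient step did not reduce loss: "
    f"{loss_before.item():.4f} -> {loss_after.item():.4f}")
```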

Things To Note In Gradient Checking

  • One thing to note is gradient check does NOT work with dropout, because dropout randomly turns off neurons, so the perturbed forward passes do not evaluate the same deterministic cost function. Disable dropout (e.g., set keep-prob to 1) while checking.
  • Run gradient checking again after the neural net has trained for a while. When $w$ and $b$ are still close to zero, wrong gradients may not surface immediately; the bug may only show once the parameters grow.
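For the dropout caveat, one common workaround in PyTorch (a sketch, assuming an `nn.Module` model; the toy model here is made up) is to switch the model to eval mode, which turns dropout into a no-op and makes the forward pass deterministic, before running the check:

```python
import torch
import torch.nn as nn

# Toy model for illustration: dropout sits between two linear layers
model = nn.Sequential(nn.Linear(4, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(2, 4)

model.eval()                      # dropout disabled: forward pass is deterministic
out1 = model(x)
out2 = model(x)
assert torch.equal(out1, out2)    # safe to run gradient checking now

model.train()                     # restore dropout for actual training
```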