How To Do Gradient Checking
In calculus, we all learned that the derivative is defined as the limit, as $\epsilon \to 0$, of:
\[\begin{gather*} f'(\theta) = \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} \end{gather*}\]
In a neural net, the gradient of a parameter $w$ w.r.t. the cost function can be approximated the same way, with a small but finite $\epsilon$. However, the numerical error of this one-sided (forward) difference is on the order of $\epsilon$. E.g., if $\epsilon = 0.1$, this method will yield an error on the order of 0.1. Why? Please close your eyes and think for a moment before moving on.
Because:
\[\begin{gather*} f(\theta + \epsilon) = f(\theta) + \epsilon f'(\theta) + O(\epsilon^2) \\ \implies \\ \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} = f'(\theta) + \frac{O(\epsilon^2)}{\epsilon} = f'(\theta) + O(\epsilon) \end{gather*}\]
One way to reduce this error is to use the central (two-sided) difference formula, whose error is on the order of $\epsilon^2$:
\[\begin{gather*} f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon} \end{gather*}\]
To apply gradient checking to a single parameter $w_i$:
- Run forward prop and backprop to get the analytic gradient of $w_i$, call it $g_i$.
- Perturb $w_i$ by $\pm\epsilon$, run forward prop for each perturbation, and form the numerical gradient:
\[g_i' = \frac{f(w_i + \epsilon) - f(w_i - \epsilon)}{2\epsilon}\]
- Calculate the relative difference:
\[\frac{|g_i - g_i'|}{|g_i| + |g_i'|}\]
If the result is above $10^{-3}$, then we should worry about it.
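To see the two error orders concretely, here is a small sketch in plain Python, using $f(\theta) = \theta^3$ as a stand-in cost (the function choice is just for illustration):

```python
def f(theta):
    return theta ** 3       # toy "cost"; true derivative is 3 * theta**2

def df(theta):
    return 3 * theta ** 2

theta = 1.0
fwd_errs, cen_errs = [], []
for eps in (1e-1, 1e-2, 1e-3):
    fwd = (f(theta + eps) - f(theta)) / eps              # one-sided: O(eps) error
    cen = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # central: O(eps^2) error
    fwd_errs.append(abs(fwd - df(theta)))
    cen_errs.append(abs(cen - df(theta)))
```

At $\epsilon = 0.1$ the forward-difference error is about 0.31 while the central-difference error is about 0.01, matching the $O(\epsilon)$ vs. $O(\epsilon^2)$ analysis above.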
```python
import torch

def _numerical_grad(fn, x, eps=1e-3):
    """Central-difference gradient of the scalar-valued fn() w.r.t. tensor x."""
    # This will store the numerical gradient
    grad = torch.zeros_like(x)
    x_flat = x.view(-1)
    grad_flat = grad.view(-1)
    with torch.no_grad():  # allow in-place edits of a leaf tensor
        for i in range(x_flat.numel()):
            orig = x_flat[i].item()
            x_flat[i] = orig + eps
            fp = fn().item()
            x_flat[i] = orig - eps
            fm = fn().item()
            x_flat[i] = orig  # restore the original value
            grad_flat[i] = (fp - fm) / (2.0 * eps)
    return grad
```
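A quick way to exercise the helper is to compare it against autograd on a function whose gradient we know, e.g. $f(x) = \sum_i x_i^2$ with gradient $2x$ (the helper body is repeated here so the snippet stands alone):

```python
import torch

def _numerical_grad(fn, x, eps=1e-3):
    # central-difference gradient, same as the helper above
    grad = torch.zeros_like(x)
    x_flat, grad_flat = x.view(-1), grad.view(-1)
    with torch.no_grad():
        for i in range(x_flat.numel()):
            orig = x_flat[i].item()
            x_flat[i] = orig + eps
            fp = fn().item()
            x_flat[i] = orig - eps
            fm = fn().item()
            x_flat[i] = orig
            grad_flat[i] = (fp - fm) / (2.0 * eps)
    return grad

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)
fn = lambda: (x ** 2).sum()

analytic = torch.autograd.grad(fn(), x)[0]  # exact gradient: 2 * x
numeric = _numerical_grad(fn, x)
rel = (analytic - numeric).norm() / (analytic.norm() + numeric.norm())
assert rel < 1e-3, f"gradient check failed: {rel.item():.2e}"
```

The relative-difference form keeps the check scale-free: it works whether the gradients are around $10^{-4}$ or $10^{4}$.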
Another Method - Apply Gradient And See Decrease In Loss
- As a sanity check, you can even take a gradient step on the inputs and evaluate the loss again; it should decrease. That's just one step of gradient descent.
```python
loss_before = _chamfer_loss(p1, p2)
loss_before.backward()
with torch.no_grad():
    p1_stepped = p1 - 0.01 * p1.grad
    loss_after = _chamfer_loss(p1_stepped, p2)
assert loss_after < loss_before, \
    f"Gradient step did not reduce loss: {loss_before.item():.4f} → {loss_after.item():.4f}"
```
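The snippet above assumes a `_chamfer_loss` and inputs `p1`, `p2`; here is a minimal self-contained sketch, with a symmetric Chamfer distance between two random point sets (the shapes, seed, and 0.01 step size are placeholders):

```python
import torch

def _chamfer_loss(p1, p2):
    # symmetric Chamfer distance between point sets p1 (N, D) and p2 (M, D)
    d = torch.cdist(p1, p2)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

torch.manual_seed(0)
p1 = torch.randn(64, 3, requires_grad=True)
p2 = torch.randn(64, 3)

loss_before = _chamfer_loss(p1, p2)
loss_before.backward()
with torch.no_grad():
    p1_stepped = p1 - 0.01 * p1.grad  # one small gradient-descent step
    loss_after = _chamfer_loss(p1_stepped, p2)
```

With a sufficiently small step, `loss_after` should come out below `loss_before`; if it does not, either the gradient is wrong or the step size is too large.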
Things To Note In Gradient Checking
- Gradient checking does NOT work with dropout: each forward pass randomly turns off a different set of neurons, so the two cost evaluations in the central difference are not evaluating the same function. Disable dropout (or fix the random seed) while checking.
- Let the neural net run for a while before checking. Right after initialization, when $w$ and $b$ are close to zero, wrong gradients may be too small to surface in the comparison.
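For the dropout caveat, one way to make repeated cost evaluations comparable in PyTorch is to switch the model to eval mode during the check (the toy model below is just an illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 1)
)
x = torch.randn(2, 4)

model.train()  # dropout active: repeated passes generally differ
y1, y2 = model(x), model(x)

model.eval()   # dropout disabled: passes are deterministic
y3, y4 = model(x), model(x)
assert torch.allclose(y3, y4)
```

Remember to call `model.train()` again before resuming training, since eval mode also changes the behavior of layers like batch norm.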