Robotics - Gauss-Newton (GN) and Levenberg-Marquardt (LM) Optimizers

Newton's Method, GN, LM optimizers

Posted by Rico's Nerd Cluster on April 14, 2024

Newton’s Method

Solving An Equation

To find a root of an arbitrary equation $f(x) = 0$:

  • We start from an arbitrary point $x_0$ that is hopefully close to the solution $x_s$.
  • The main idea of Newton's method is: we draw the tangent line at $x_0$; this line crosses $y = 0$ at $x_1$, which is most likely closer to $x_s$ than $x_0$. So, we solve $f'(x_0)(x_1 - x_0) + f(x_0) = 0$, and we get $x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$.

So at each iteration, the new estimate is:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Some caution-worthy notes are:

  • $f'(x_0)$ should be non-zero. Otherwise, the tangent line is horizontal and never crosses $y = 0$, so the estimate cannot be updated.
  • In vanilla gradient descent, we only find the gradient and issue an update $\Delta x = -\lambda g$ using a fixed step size $\lambda$. However, in Newton's method, we not only follow the gradient direction, but we also "solve" for the step size. That makes Newton's method faster.
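
For concreteness, here is a minimal sketch of a generic Newton root finder in C++ (my own illustration, not from the original post; the callback types, tolerance, and iteration cap are arbitrary choices):

#include <cmath>
#include <functional>

// Generic Newton root finder: the caller supplies f and its derivative df.
// Stops when the step is smaller than tol or after max_iter iterations.
double newton_root(const std::function<double(double)>& f,
                   const std::function<double(double)>& df,
                   double x0, double tol = 1e-8, int max_iter = 100) {
    double x = x0;
    for (int i = 0; i < max_iter; ++i) {
        const double d = df(x);
        if (std::fabs(d) < 1e-12) break;   // derivative ~ 0: cannot take a step
        const double dx = -f(x) / d;       // "solved" step, not a fixed learning rate
        x += dx;
        if (std::fabs(dx) < tol) break;
    }
    return x;
}

For example, newton_root([](double x){ return x * x - 2; }, [](double x){ return 2 * x; }, 1.0) converges to $\sqrt{2}$ within a handful of iterations.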

Example: Solve For Square-Roots

If we want to find $\sqrt{m}$, we can frame this problem as solving:

$$f(x) = x^2 - m = 0 \quad \Rightarrow \quad x_{n+1} = x_n - \frac{x_n^2 - m}{2 x_n} = \frac{x_n}{2} + \frac{m}{2 x_n}$$
#include <cmath>   // std::fabs

const double EPS = 0.0001;

// Newton's method for sqrt(m): x_{n+1} = x_n / 2 + m / (2 * x_n).
// Renamed from sqrt to avoid clashing with the standard library function.
double newton_sqrt(double m) {
    if (m == 0) return m;
    double x = m;                   // initial guess x0 = m
    double last_res;
    do {
        last_res = x;
        x = x / 2.0 + m / (2 * x);  // Newton update
    } while (std::fabs(last_res - x) > EPS);
    return x;
}

Faster version (TODO), inspired by the fast inverse square root popularized by John Carmack:

// Fast inverse square root (Quake III style): get a rough estimate of 1/sqrt(x)
// with a bit-level trick, refine it with Newton steps, then invert to get sqrt(x).
float sqrt_c(float x) {
    if (x == 0) return 0;
    float result = x;
    float xhalf = 0.5f * result;
    int i = *(int*)&result;                   // reinterpret the float's bits as an int
    i = 0x5f375a86 - (i >> 1);                // magic constant ("what the fuck?" in the original source)
    result = *(float*)&i;                     // back to float: rough estimate of 1/sqrt(x)
    result = result * (1.5f - xhalf * result * result);  // Newton step, repeating increases accuracy
    result = result * (1.5f - xhalf * result * result);
    return 1.0f / result;                     // 1 / (1/sqrt(x)) = sqrt(x)
}
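
A side note (my addition, not in the original post): the pointer casts above are the classic presentation, but they technically violate C++ strict-aliasing rules. A portable way to reinterpret the bits is std::memcpy (or std::bit_cast in C++20):

#include <cstdint>
#include <cstring>

// Portable reinterpretation of a float's bit pattern (assumes 32-bit IEEE-754 floats).
std::uint32_t float_bits(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof(i));   // well-defined, unlike *(int*)&f
    return i;
}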

Newton’s Method for Optimization

When optimizing $f(x)$, again we start from an arbitrary point $x_0$. Suppose the optimum is achieved at $x_0 + \Delta x$.

We think of it as:

$$
\begin{aligned}
\text{Taylor expansion: } f(x_0 + \Delta x) &\approx f(x_0) + f'(x_0)\Delta x + \frac{1}{2} f''(x_0) \Delta x^2 \\
\text{Optimum: } \frac{\partial f(x_0 + \Delta x)}{\partial \Delta x} = 0 &\Rightarrow f'(x_0) + f''(x_0)\Delta x = 0 \\
&\Rightarrow \Delta x = -\frac{f'(x_0)}{f''(x_0)} \\
&\Rightarrow x_1 = x_0 + \Delta x = x_0 - \frac{f'(x_0)}{f''(x_0)}
\end{aligned}
$$

So iteratively, at timestep $n$, we have:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

See how Newton’s method has similar forms for both equation solving and optimization? The better the function $f$ is approximated by its Taylor expansion, the faster we converge.

For higher dimensions, if $f$ is a scalar-valued function and $x$ is $[m, 1]$, we have:

$$x_{n+1} = x_n - H(x_n)^{-1} J(x_n)^T$$

  • $J$ (the gradient written as a row vector) is $[1, m]$; $H$ (the Hessian) is $[m, m]$
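
To make this concrete, here is a sketch of a single multivariate Newton step using Eigen (my own illustration; the gradient/Hessian inputs and the LDLT solve are assumptions, not from the post):

#include <Eigen/Dense>

// One Newton step for a scalar cost: x_{n+1} = x_n - H(x_n)^{-1} * grad(x_n).
// grad is the gradient as an [m, 1] column vector (i.e. J^T), H is the [m, m] Hessian.
Eigen::VectorXd newton_step(const Eigen::VectorXd& x,
                            const Eigen::VectorXd& grad,
                            const Eigen::MatrixXd& H) {
    // Solve H * dx = -grad rather than forming H^{-1} explicitly.
    const Eigen::VectorXd dx = H.ldlt().solve(-grad);
    return x + dx;
}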

Gauss-Newton Optimization

In Gauss-Newton, we specifically look at minimizing a least-squares problem. Assume we have:

  • a scalar-valued cost function $c(x)$,
  • a vector-valued residual function $f(x)$ of size $[m, 1]$, with $x$ of size $[n, 1]$,
  • the Jacobian $J_0$ of $f$ at $x_0$, which is consequently $[m, n]$,
  • the Hessian $H = \nabla^2 c(x)$, which is approximated as $J^T J$.
$$
\begin{aligned}
c(x) &= \|f(x)\|^2 \\
x^* &= \arg\min_x \|f(x)\|^2 \\
\text{First-order Taylor expansion: } \arg\min_{\Delta x} \|f(x_0 + \Delta x)\|^2
&\approx \arg\min_{\Delta x} \left[ (f(x_0) + J_0 \Delta x)^T (f(x_0) + J_0 \Delta x) \right] \\
&= \arg\min_{\Delta x} \left[ f(x_0)^T f(x_0) + f(x_0)^T J_0 \Delta x + (J_0 \Delta x)^T f(x_0) + (J_0 \Delta x)^T (J_0 \Delta x) \right] \\
&= \arg\min_{\Delta x} \left[ f(x_0)^T f(x_0) + 2 f(x_0)^T J_0 \Delta x + (J_0 \Delta x)^T (J_0 \Delta x) \right]
\end{aligned}
$$

Taking the derivative of the above with respect to $\Delta x$ and setting it to 0, we get:

$$
\frac{\partial \|f(x_0 + \Delta x)\|^2}{\partial \Delta x} = 2 J_0^T f(x_0) + \left[ (J_0^T J_0) + (J_0^T J_0)^T \right] \Delta x = 2 J_0^T f(x_0) + 2 (J_0^T J_0) \Delta x = 0
$$

So we can solve for $\Delta x$ with $H = J_0^T J_0$ and $g = -J_0^T f(x_0)$:

$$(J_0^T J_0) \Delta x = -J_0^T f(x_0) \iff H \Delta x = g$$
  • Note: because $J_0$ is generally not square and has no inverse, we cannot simply cancel $J_0^T$ here by multiplying with an inverse.
  • In fact, a unique $\Delta x$ exists if and only if $H$ is positive definite (i.e., $J_0$ has full column rank).
  • In least squares, $f(x)$ is a.k.a. the residual. Usually, it represents the error between a data point and its ground truth.
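
To show how the normal equation $H \Delta x = g$ is used in practice, here is a rough Gauss-Newton loop in Eigen (a sketch under my own assumptions: the residual/Jacobian callbacks, the LDLT solve, and the stopping rule are illustrative choices, not the post's code):

#include <Eigen/Dense>
#include <functional>

// Gauss-Newton: repeatedly build H = J^T J and g = -J^T f, then solve H * dx = g.
// residual(x) returns f(x) as [m, 1]; jacobian(x) returns J as [m, n].
Eigen::VectorXd gauss_newton(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& residual,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)>& jacobian,
    Eigen::VectorXd x, int max_iter = 20, double tol = 1e-8) {
    for (int it = 0; it < max_iter; ++it) {
        const Eigen::VectorXd f = residual(x);
        const Eigen::MatrixXd J = jacobian(x);
        const Eigen::MatrixXd H = J.transpose() * J;    // Hessian approximation
        const Eigen::VectorXd g = -J.transpose() * f;   // right-hand side
        const Eigen::VectorXd dx = H.ldlt().solve(g);   // needs H to be positive definite
        x += dx;
        if (dx.norm() < tol) break;                     // converged
    }
    return x;
}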

In SLAM, we always frame this least-squares problem with an error term e = [observed_landmark - predicted_landmark] at each landmark. Altogether, we want to minimize the total sum of squared differences between observations and predictions. Meanwhile, each landmark observation has an error covariance, and together they form a covariance matrix $\Sigma$. In the cost we weight by $\Sigma^{-1}$ (the information matrix), so the larger the error covariance, the lower the weight the corresponding difference gets.

With $e(x + \Delta x) \approx e(x) + J \Delta x$,

$$
\begin{aligned}
x^* &= \arg\min_x \left[ e(x)^T \Sigma^{-1} e(x) \right]
\approx \arg\min_{\Delta x} \left[ e(x_0 + \Delta x)^T \Sigma^{-1} e(x_0 + \Delta x) \right] \\
&\quad \text{(similar steps as above ...)} \\
&\Rightarrow (J_0^T \Sigma^{-1} J_0) \Delta x = -J_0^T \Sigma^{-1} e(x_0)
\end{aligned}
$$

Using the Cholesky decomposition, one can write $\Sigma^{-1} = A^T A$. Then we can write the above as:

$$\left( (A J_0)^T (A J_0) \right) \Delta x = -(A J_0)^T A \, e(x_0)$$
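
As a small sketch (my own, using Eigen's LLT; not from the post), this is how one could obtain $A$ from $\Sigma^{-1}$ and reduce the weighted problem to an ordinary one on $(A J_0, A e)$:

#include <Eigen/Dense>

// Whiten a residual/Jacobian pair with the information matrix Sigma_inv.
// Eigen's LLT gives Sigma_inv = U^T U with U upper-triangular, so we take A := U.
void whiten(const Eigen::MatrixXd& Sigma_inv,
            Eigen::MatrixXd& J, Eigen::VectorXd& e) {
    const Eigen::LLT<Eigen::MatrixXd> llt(Sigma_inv);   // Cholesky factorization
    const Eigen::MatrixXd A = llt.matrixU();            // Sigma_inv = A^T A
    J = A * J;                                          // whitened Jacobian A * J0
    e = A * e;                                          // whitened residual A * e
}

After whitening, the weighted normal equation is exactly the unweighted one from the previous section.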

For a more detailed derivation, please see here

When $\Delta x$ is small, or $f(x)$ is small, GN converges quickly. When either or both of them are large, or when the target function is highly non-linear, GN does not work well.

Levenberg-Marquardt (LM) Optimization

Again, the Taylor expansion approximates the function better when $\Delta x$ is small. So, similar in spirit to regularization techniques in deep learning, like L1 and L2 regularization, we can regularize the step size $\Delta x$. Here $\mu$ is called "the damping factor":

$$(H + \mu I) \Delta x = -J_0^T f(x_0) = g$$

Intuitively,

  • As $\mu$ grows, the damping term $\mu I$ dominates, so $H + \mu I \approx \mu I$. Then $\Delta x \approx (\mu I)^{-1} g = \frac{g}{\mu}$, which means $\Delta x$ becomes smaller. Meanwhile, the update direction approaches that of gradient descent.
  • As $\mu$ becomes smaller, $\Delta x$ approaches the Gauss-Newton step. Moreover, thanks to $\mu I$, $(H + \mu I)$ is guaranteed to be positive definite even when $H$ is only positive semi-definite, which provides more stability when solving for $\Delta x$.

LM in general is more stable than GN. However, it’s more computationally costly.
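
Below is a rough sketch of one damping iteration (my own simplification, not the post's implementation; production solvers such as Ceres or g2o use more careful gain-ratio-based updates of $\mu$):

#include <Eigen/Dense>
#include <functional>

// One simplified LM update: solve (H + mu*I) dx = -J^T f and accept the step
// only if the cost ||f||^2 actually decreases; otherwise increase the damping.
void lm_step(const std::function<Eigen::VectorXd(const Eigen::VectorXd&)>& residual,
             const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)>& jacobian,
             Eigen::VectorXd& x, double& mu) {
    const Eigen::VectorXd f = residual(x);
    const Eigen::MatrixXd J = jacobian(x);
    const Eigen::MatrixXd H = J.transpose() * J;
    const Eigen::VectorXd g = -J.transpose() * f;
    const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(x.size(), x.size());
    const Eigen::VectorXd dx = (H + mu * I).ldlt().solve(g);
    if (residual(x + dx).squaredNorm() < f.squaredNorm()) {
        x += dx;        // cost decreased: accept the step, trust the model more
        mu *= 0.5;
    } else {
        mu *= 2.0;      // cost increased: reject the step, damp harder
    }
}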

References

https://scm_mos.gitlab.io/algorithm/newton-and-gauss-newton/


