Newton’s Method
Solving An Equation
To find a root of an arbitrary equation $f(x) = 0$:

- We start from an arbitrary point $x_0$ that's hopefully close to the solution $x^*$.
- The main idea of Newton's method is: we draw a tangent line at $(x_n, f(x_n))$; this line will cross $y=0$ at $x_{n+1}$. It's most likely that this point is closer to $x^*$ than $x_n$. So, this is to solve $f(x_n) + f'(x_n)(x_{n+1} - x_n) = 0$, and we get $x_{n+1}$.

So at each iteration, the newer estimate is:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$

Some caution-worthy notes are:

- $f'(x_n)$ should be non-zero. Otherwise, the estimate will NOT move.
- In vanilla gradient descent, we only find the gradient and issue an update $x_{n+1} = x_n - \alpha f'(x_n)$ using a fixed step size $\alpha$. However, in Newton's method, we not only follow the gradient direction, but we also "solve" for the step size. That makes Newton's method faster. A minimal sketch of this update rule follows this list.
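To make the update rule concrete, here is a minimal C++ sketch. The function name `newton_root`, the example equation $x^3 - 2 = 0$, the tolerance, and the iteration cap are my own illustrative choices, not from the original post.

#include <cmath>
#include <functional>
#include <iostream>

// Newton's method for f(x) = 0: iterate x <- x - f(x)/f'(x).
double newton_root(const std::function<double(double)>& f,
                   const std::function<double(double)>& df,
                   double x0, double tol = 1e-8, int max_iter = 50) {
    double x = x0;
    for (int i = 0; i < max_iter; ++i) {
        double d = df(x);
        if (d == 0.0) break;               // f'(x) must be non-zero, otherwise the estimate won't move
        double step = f(x) / d;
        x -= step;
        if (std::fabs(step) < tol) break;  // converged
    }
    return x;
}

int main() {
    // Example: solve x^3 - 2 = 0, i.e. the cube root of 2 (~1.259921).
    auto f  = [](double x) { return x * x * x - 2.0; };
    auto df = [](double x) { return 3.0 * x * x; };
    std::cout << newton_root(f, df, 1.0) << std::endl;
    return 0;
}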
Example: Solve For Square-Roots
If we want to find $\sqrt{m}$, we can solve $f(x) = x^2 - m = 0$. With $f'(x) = 2x$, Newton's update becomes $x_{n+1} = x_n - \frac{x_n^2 - m}{2x_n} = \frac{x_n}{2} + \frac{m}{2x_n}$:
#include <cmath>  // for std::fabs

const double EPS = 0.0001;

// Newton's method on f(x) = x^2 - m, i.e. x <- x/2 + m/(2x).
// Named sqrt_newton to avoid clashing with std::sqrt / ::sqrt from <cmath>.
double sqrt_newton(double m){
    if (m == 0) return m;
    double x = m;  // initial guess
    double last_res;
    do {
        last_res = x;
        x = x / 2.0 + m / (2 * x);
    } while (std::fabs(last_res - x) > EPS);  // std::fabs: plain abs() would truncate to int
    return x;
}
A faster version, inspired by John Carmack's fast inverse square root: it builds a bit-level initial guess of $1/\sqrt{x}$, refines it with Newton steps, then inverts the result.
// Fast square root: estimate 1/sqrt(x) with the famous bit-level hack,
// refine it with two Newton steps, then invert the result.
float sqrt_c(float x) {
    if (x == 0) return 0;
    float result = x;
    float xhalf = 0.5f * result;
    int i = *(int*)&result;                   // reinterpret the float's bits as an int
    i = 0x5f375a86 - (i >> 1);                // magic constant: cheap initial guess of 1/sqrt(x)
    result = *(float*)&i;                     // reinterpret the bits back as a float
    result = result * (1.5f - xhalf * result * result);  // Newton step on g(r) = 1/r^2 - x
    result = result * (1.5f - xhalf * result * result);  // repeating increases accuracy
    return 1.0f / result;                     // 1 / (1/sqrt(x)) = sqrt(x)
}
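As a quick sanity check, here is a small usage sketch (mine, not from the original post), assuming the two functions above are defined in the same file:

#include <cmath>
#include <iostream>

int main() {
    double m = 2.0;
    std::cout << "Newton iteration: " << sqrt_newton(m) << std::endl;                 // ~1.41421
    std::cout << "Bit-hack version: " << sqrt_c(static_cast<float>(m)) << std::endl;  // ~1.41421
    std::cout << "std::sqrt:        " << std::sqrt(m) << std::endl;                   // 1.41421...
    return 0;
}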
Newton’s Method for Optimization
When optimizing $f(x)$, again we start from an arbitrary point $x_0$. We think of it as minimizing the second-order Taylor expansion around the current estimate:

$$f(x_n + \Delta x) \approx f(x_n) + f'(x_n)\Delta x + \frac{1}{2} f''(x_n) \Delta x^2$$

Setting its derivative with respect to $\Delta x$ to zero gives $f'(x_n) + f''(x_n)\Delta x = 0$. So iteratively, at timestep $n$, we have:

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}$$

See how Newton's method has similar forms for both equation solving and optimization? The better the function $f$ is approximated by the Taylor expansion, the faster we converge.

For higher dimensions, if $f$ is a scalar-valued function, and $x$ is $[m, 1]$, we have:

$$x_{n+1} = x_n - H^{-1} J^T$$

- $J = \frac{\partial f}{\partial x}$ is $[1, m]$, $H$ (the Hessian) is $[m, m]$
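For a concrete scalar example (my own toy problem, not from the original post), the sketch below minimizes the convex function $f(x) = x^2 + e^{-x}$ with the update $x \leftarrow x - f'(x)/f''(x)$:

#include <cmath>
#include <iostream>

int main() {
    // Minimize f(x) = x^2 + exp(-x): f'(x) = 2x - exp(-x), f''(x) = 2 + exp(-x) > 0.
    double x = 0.0;  // arbitrary starting point
    for (int i = 0; i < 20; ++i) {
        double g = 2.0 * x - std::exp(-x);  // first derivative
        double h = 2.0 + std::exp(-x);      // second derivative
        double step = g / h;                // Newton "solves" for the step size
        x -= step;
        if (std::fabs(step) < 1e-12) break;
    }
    std::cout << "minimizer x ~= " << x << std::endl;  // ~0.3517, where 2x = exp(-x)
    return 0;
}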
Gauss-Newton Optimization
In Gauss-Newton, we specifically look at minimizing a least squares problem. Assume we have:

- a scalar-valued cost function $F(x) = \frac{1}{2} f(x)^T f(x)$,
- a vector-valued function $f(x)$, $[m, 1]$ (with the parameter vector $x$ being $[n, 1]$),
- the Jacobian $J = \frac{\partial f}{\partial x}$ at $x$ is consequently $[m, n]$,
- the Hessian of $F$ is $[n, n]$. It's approximated as $H \approx J^T J$.

Linearizing $f(x + \Delta x) \approx f(x) + J \Delta x$, the cost is approximately $\frac{1}{2}(f + J\Delta x)^T(f + J\Delta x)$. Take the derivative of the above with respect to $\Delta x$ and set it to 0, we get

$$J^T J \Delta x = -J^T f$$

So we can solve for

$$\Delta x = -(J^T J)^{-1} J^T f$$

- Note: because $J$ is generally not square and may not have an inverse, here we cannot multiply $(J^T)^{-1}$ to eliminate $J^T$ on both sides.
- In fact, the solution $\Delta x$ is available if and only if $J^T J$ is positive definite.
- In least squares, $f$ is a.k.a. the residuals. Usually, it represents the error between a data point and its ground truth.
In SLAM, we always frame this least squares problem with e = [observed_landmark - predicted_landmark] at each landmark. So all together, we want to minimize the total least squares of the difference between observations and predictions. In the meantime, at each landmark there is an error covariance $\Sigma_i$, so all together there's an information matrix $\Omega_i = \Sigma_i^{-1}$ weighting each term:

$$F(x) = \frac{1}{2}\sum_i e_i^T \Omega_i e_i$$

With $H = \sum_i J_i^T \Omega_i J_i$ and $b = \sum_i J_i^T \Omega_i e_i$, the Gauss-Newton system becomes $H \Delta x = -b$. Using Cholesky decomposition, one can get $\Delta x$ efficiently. For a more detailed derivation, please see here.
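Below is a minimal Gauss-Newton sketch in C++. It is my own toy example rather than anything from the original post: it assumes the Eigen library is available, fits $y = \exp(ax + b)$ to synthetic noiseless data, and solves the normal equations with a Cholesky-type ($LDL^T$) factorization. Plain Gauss-Newton needs a reasonably close initial guess; with a poor one the step can overshoot, which is part of what the next method addresses.

#include <Eigen/Dense>   // assumption: Eigen is installed
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Synthetic, noiseless data from y = exp(a*x + b) with a = 1, b = 2.
    const double a_true = 1.0, b_true = 2.0;
    std::vector<double> xs, ys;
    for (int i = 0; i < 100; ++i) {
        double x = i / 100.0;
        xs.push_back(x);
        ys.push_back(std::exp(a_true * x + b_true));
    }

    Eigen::Vector2d params(0.8, 1.6);   // initial guess for (a, b); should be reasonably close
    for (int iter = 0; iter < 20; ++iter) {
        Eigen::Matrix2d H = Eigen::Matrix2d::Zero();   // approximate Hessian: sum of J_i^T J_i
        Eigen::Vector2d b = Eigen::Vector2d::Zero();   // right-hand side: -sum of J_i^T e_i
        for (size_t i = 0; i < xs.size(); ++i) {
            double pred = std::exp(params(0) * xs[i] + params(1));
            double e = ys[i] - pred;                   // residual
            Eigen::Vector2d J;                         // row i of the [m, 2] Jacobian, stored as a column
            J << -xs[i] * pred, -pred;
            // An information matrix Omega_i would weight these terms; identity is used here.
            H += J * J.transpose();
            b += -J * e;
        }
        Eigen::Vector2d dx = H.ldlt().solve(b);        // Cholesky-type (LDL^T) solve of H dx = b
        params += dx;
        if (dx.norm() < 1e-10) break;
    }
    std::cout << "estimated (a, b) = " << params.transpose() << std::endl;  // ~ (1, 2)
    return 0;
}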
When $J^T J$ is ill-conditioned or the initial estimate is far from the solution, the Gauss-Newton step can be unreliable, which motivates the following method.
Levenberg-Marquardt (LM) Optimization
Again, Taylor expansion works better when $\Delta x$ is small, i.e., when we stay close to the current linearization point. Levenberg-Marquardt therefore adds a damping term $\lambda I$ to the Gauss-Newton system:

$$(J^T J + \lambda I)\,\Delta x = -J^T f$$

Intuitively,

- as $\lambda$ grows, the diagonal identity term grows, so $J^T J + \lambda I \approx \lambda I$. So, $\Delta x \approx -\frac{1}{\lambda} J^T f$, which means $\Delta x$ grows smaller. In the meantime, the update will be similar to that in gradient descent (a small step along the negative gradient).
- as $\lambda$ becomes smaller, the update will become more like Gauss-Newton. However, due to the added $\lambda I$ (with $\lambda > 0$), $J^T J + \lambda I$ is positive definite even when $J^T J$ is only positive semi-definite, which provides more stability for solving for $\Delta x$.
LM in general is more stable than GN. However, it’s more computationally costly.
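Here is a compact LM loop on the same toy problem as the Gauss-Newton sketch above (again my own illustration, assuming Eigen; the multiply-or-divide-by-10 schedule for $\lambda$ is one common heuristic, not the only one). The damping keeps step sizes under control even when the initial guess is rough.

#include <Eigen/Dense>   // assumption: Eigen is installed
#include <cmath>
#include <iostream>
#include <vector>

// Total least-squares cost 0.5 * sum(e^2) for the model y = exp(a*x + b).
static double cost(const Eigen::Vector2d& p,
                   const std::vector<double>& xs, const std::vector<double>& ys) {
    double c = 0;
    for (size_t i = 0; i < xs.size(); ++i) {
        double e = ys[i] - std::exp(p(0) * xs[i] + p(1));
        c += 0.5 * e * e;
    }
    return c;
}

int main() {
    // Same synthetic data as before: y = exp(x + 2).
    std::vector<double> xs, ys;
    for (int i = 0; i < 100; ++i) {
        xs.push_back(i / 100.0);
        ys.push_back(std::exp(xs.back() + 2.0));
    }

    Eigen::Vector2d params(0.0, 0.0);   // deliberately rough initial guess
    double lambda = 1e-3;               // damping factor
    for (int iter = 0; iter < 100; ++iter) {
        Eigen::Matrix2d H = Eigen::Matrix2d::Zero();
        Eigen::Vector2d b = Eigen::Vector2d::Zero();
        for (size_t i = 0; i < xs.size(); ++i) {
            double pred = std::exp(params(0) * xs[i] + params(1));
            double e = ys[i] - pred;
            Eigen::Vector2d J;
            J << -xs[i] * pred, -pred;
            H += J * J.transpose();
            b += -J * e;
        }
        // Damped normal equations: (H + lambda * I) dx = b.
        Eigen::Vector2d dx = (H + lambda * Eigen::Matrix2d::Identity()).ldlt().solve(b);
        if (cost(params + dx, xs, ys) < cost(params, xs, ys)) {
            params += dx;     // cost went down: accept the step, trust the model more
            lambda *= 0.1;
        } else {
            lambda *= 10.0;   // cost went up: reject the step, damp harder
        }
        if (dx.norm() < 1e-10) break;
    }
    std::cout << "estimated (a, b) = " << params.transpose() << std::endl;  // ~ (1, 2)
    return 0;
}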
References
https://scm_mos.gitlab.io/algorithm/newton-and-gauss-newton/