Chapter 2 The Gauss-Newton Method
The Gauss-Newton method is an optimization technique used in non-linear regression to minimize the chi-squared cost function. It is especially useful for non-linear least-squares problems, where the model's prediction depends on the parameters in a nonlinear way, because it exploits an inexpensive approximation to second-order (curvature) information.
The key idea is that the Gauss-Newton method assumes the model prediction \(\hat{y}(\alpha)\) behaves approximately linearly around the current parameter \(\alpha\). This lets us approximate how the cost function \(\chi^2\) changes when \(\alpha\) is perturbed by a small amount \(h\).
Gradient descent, on the other hand, is a first-order optimization method that uses only the gradient (the first derivative) of the cost function to update the parameters.
2.1 Computation of the chi-squared function
We can approximate the chi-squared function for a small change in \(\alpha\):
\[
\chi^2(\alpha + h) \approx \chi^2(\alpha) + \frac{\partial \chi^2}{\partial \alpha} h
\]
This is a first-order Taylor expansion of the cost function, assuming that the change in \(\alpha\) is small enough that the higher-order terms are negligible. The term \(\frac{\partial \chi^2}{\partial \alpha}\) represents the gradient of the chi-squared function with respect to \(\alpha\), which tells us how the cost changes as we move in the direction of \(h\).
Let’s start with the chi-squared function, \(\chi^2(\alpha) = (y - \hat{y}(\alpha))^T W (y - \hat{y}(\alpha))\), which expands to:
\[ \chi^2(\alpha) = y^T W y - 2 y^T W \hat{y}(\alpha) + \hat{y}(\alpha)^T W \hat{y}(\alpha) \]
Where:
- \(y\) is the observed data,
- \(\hat{y}(\alpha)\) is the predicted value from the model for the current parameter \(\alpha\),
- \(W\) is a weight matrix (symmetric, and typically diagonal with entries given by the inverse measurement variances \(1/\sigma_i^2\)).
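As a quick sanity check, here is a minimal numeric sketch showing that the expanded form above agrees with the residual form \(\chi^2 = (y - \hat{y})^T W (y - \hat{y})\). The exponential model, the data values, and the diagonal weight matrix are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative one-parameter model y_hat(alpha) = exp(-alpha * x) and a diagonal
# weight matrix W = diag(1 / sigma^2); both are assumptions for this sketch.
x = np.linspace(0.0, 2.0, 5)                       # sample points
y = np.array([1.00, 0.62, 0.38, 0.22, 0.14])       # observed data
sigma = 0.05 * np.ones_like(y)                     # assumed measurement uncertainties
W = np.diag(1.0 / sigma**2)                        # weight matrix

def y_hat(alpha):
    return np.exp(-alpha * x)                      # model prediction

alpha = 0.9
residual = y - y_hat(alpha)

chi2_residual_form = residual @ W @ residual       # (y - y_hat)^T W (y - y_hat)
chi2_expanded_form = (y @ W @ y - 2.0 * y @ W @ y_hat(alpha)
                      + y_hat(alpha) @ W @ y_hat(alpha))
print(chi2_residual_form, chi2_expanded_form)      # the two forms agree
```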
The quantity Gauss-Newton actually linearizes is the model prediction \(\hat{y}(\alpha)\), not \(\chi^2\) itself.
Now, we can substitute the expression for \(\hat{y}(\alpha + h)\) into the cost function. Using the first-order approximation for \(\hat{y}(\alpha + h)\), which is:
\[ \hat{y}(\alpha + h)\approx \hat{y}(\alpha) + J h \]
where \(J = \partial \hat{y} / \partial \alpha\) is the Jacobian matrix, containing the partial derivatives of the model predictions with respect to the parameters \(\alpha\). Substituting this into the expanded chi-squared function gives:
\[ \chi^2(\alpha + h) \approx y^T W y - 2 y^T W (\hat{y}(\alpha) + J h) + (\hat{y}(\alpha) + J h)^T W (\hat{y}(\alpha) + J h) \]
This gives the following terms:
- The first term, \(y^T W y\), is constant and doesn’t change with \(\alpha\).
- The second term expands as:
\[ -2 y^T W (\hat{y}(\alpha) + J h) = -2 y^T W \hat{y}(\alpha) - 2 y^T W J h \]
- The third term is:
\[ (\hat{y}(\alpha) + J h)^T W (\hat{y}(\alpha) + J h) = \hat{y}(\alpha)^T W \hat{y}(\alpha) + \hat{y}(\alpha)^T W J h + h^T J^T W \hat{y}(\alpha) + h^T J^T W J h \]
We can simplify the expression further since:
- The terms \(y^T W y\) and \(\hat{y}(\alpha)^T W \hat{y}(\alpha)\) remain unchanged.
- The cross terms \(\hat{y}(\alpha)^T W J h\) and \(h^T J^T W \hat{y}(\alpha)\) are scalars and, since \(W\) is symmetric, each is the transpose of the other; they are therefore equal and combine into:
\[ 2 \hat{y}(\alpha)^T W J h \]
Thus, the simplified expression for \(\chi^2(\alpha + h)\) is:
\[ \chi^2(\alpha + h) \approx y^T W y - 2 y^T W \hat{y}(\alpha) - 2 y^T W J h + \hat{y}(\alpha)^T W \hat{y}(\alpha) + 2 \hat{y}(\alpha)^T W J h + h^T J^T W J h \]
This is the expanded chi-squared function for a small perturbation \(h\).
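To see that nothing was lost in the expansion, the following sketch checks numerically that the expression above is exactly \(\chi^2\) evaluated with the linearized model \(\hat{y}(\alpha) + Jh\). The model, data, weights, and step \(h\) are the same illustrative assumptions as in the previous snippet:

```python
import numpy as np

# Numeric check: the expanded quadratic in h equals chi^2 of the linearized model.
x = np.linspace(0.0, 2.0, 5)
y = np.array([1.00, 0.62, 0.38, 0.22, 0.14])
W = np.diag(400.0 * np.ones_like(y))               # diag(1 / 0.05^2)

def y_hat(alpha):
    return np.exp(-alpha * x)

def jacobian(alpha):
    # d y_hat / d alpha for this one-parameter model, as a (5, 1) column
    return (-x * np.exp(-alpha * x)).reshape(-1, 1)

alpha, h = 0.9, np.array([0.05])
yh, J = y_hat(alpha), jacobian(alpha)

lhs = (y - yh - J @ h) @ W @ (y - yh - J @ h)      # chi^2 with linearized model
rhs = (y @ W @ y - 2 * y @ W @ yh - 2 * y @ W @ J @ h
       + yh @ W @ yh + 2 * yh @ W @ J @ h + h @ J.T @ W @ J @ h)
print(np.isclose(lhs, rhs))                        # True
```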
2.2 Computation of the gradient
Let’s now take the derivative of the expanded form of \(\chi^2(\alpha + h)\) with respect to the update vector \(h\). Combining the two terms that are linear in \(h\), \(-2 y^T W J h + 2 \hat{y}(\alpha)^T W J h = -2 (y - \hat{y}(\alpha))^T W J h\), the expansion becomes:
\[ \chi^2(\alpha + h) \approx y^T W y + \hat{y}(\alpha)^T W \hat{y}(\alpha) - 2 y^T W \hat{y}(\alpha) - 2 (y - \hat{y}(\alpha))^T W J h + h^T J^T W J h \]
We differentiate this term by term with respect to \(h\):
- The first term, \(y^T W y\), does not depend on \(h\), so its derivative is 0.
- The second term, \(\hat{y}(\alpha)^T W \hat{y}(\alpha)\), is likewise independent of \(h\); its derivative is 0.
- The third term, \(-2 y^T W \hat{y}(\alpha)\), is also independent of \(h\); its derivative is 0.
- The fourth term, \(-2 (y - \hat{y}(\alpha))^T W J h\), is linear in \(h\):
\[ \frac{\partial}{\partial h} \left( -2 (y - \hat{y}(\alpha))^T W J h \right) = -2 J^T W (y - \hat{y}(\alpha)) \]
- For the last term, \(h^T J^T W J h\), which is quadratic in \(h\):
\[ \frac{\partial}{\partial h} (h^T J^T W J h) = 2 J^T W J h \]
- Combining all the terms, we get the derivative of \(\chi^2(\alpha + h)\) with respect to \(h\):
\[ \frac{\partial \chi^2(\alpha + h)}{\partial h} \approx -2 J^T W (y - \hat{y}(\alpha)) + 2 J^T W J h \]
- Written as a row vector (the transpose of the expression above), it is also commonly given as:
\[ \frac{\partial \chi^2(\alpha + h)}{\partial h} \approx -2 (y - \hat{y}(\alpha))^T W J + 2 h^T J^T W J \]
where \(J^T W J\) is known as the Gauss-Newton approximation of the Hessian matrix.
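This result can be verified with a finite-difference check. The sketch below computes the analytic gradient \(-2 J^T W (y - \hat{y}) + 2 J^T W J h\) of the quadratic approximation and compares it with a numerical derivative; again, the model, data, and weights are illustrative assumptions carried over from the earlier snippets:

```python
import numpy as np

# Finite-difference check of the gradient of the quadratic model of chi^2 in h.
x = np.linspace(0.0, 2.0, 5)
y = np.array([1.00, 0.62, 0.38, 0.22, 0.14])
W = np.diag(400.0 * np.ones_like(y))

def y_hat(alpha):
    return np.exp(-alpha * x)

def jacobian(alpha):
    return (-x * np.exp(-alpha * x)).reshape(-1, 1)

alpha = 0.9
yh, J = y_hat(alpha), jacobian(alpha)
r = y - yh                                         # residual y - y_hat(alpha)

def chi2_quadratic(h):
    # chi^2 with the linearized model, as a function of the step h
    d = r - J @ h
    return d @ W @ d

h = np.array([0.05])
grad_analytic = -2 * J.T @ W @ r + 2 * J.T @ W @ J @ h

eps = 1e-6
grad_numeric = (chi2_quadratic(h + eps) - chi2_quadratic(h - eps)) / (2 * eps)
print(grad_analytic, grad_numeric)                 # the two values agree closely
```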
2.3 The parameter updating
In the previous steps, we derived the gradient of the cost function \(\chi^2(\alpha + h)\) with respect to the update direction \(h\). This gradient provides insight into how the cost function changes as the parameters \(\alpha\) are adjusted by a small amount \(h\).
Here’s the expression for the gradient (in its row-vector form):
\[ \frac{\partial \chi^2(\alpha + h)}{\partial h} \approx -2 (y - \hat{y}(\alpha))^T W J + 2 h^T J^T W J \]
The goal is to find the update direction \(h_{\text{gn}}\) that minimizes the cost function. To do this, we set this derivative equal to zero:
\[ 0 = -2 (y - \hat{y}(\alpha))^T W J + 2 h^T J^T W J \]
This simplifies to:
\[ h^T J^T W J = (y - \hat{y}(\alpha))^T W J \]
Taking the transpose of both sides:
\[ \left[ h^T J^T W J \right]^T = \left[ (y - \hat{y}(\alpha))^T W J \right]^T \]
Which, using the symmetry of \(W\) (so that \(W^T = W\)), simplifies further to the normal equations:
\[ J^T W J h = J^T W (y - \hat{y}(\alpha)) \]
And finally, we can solve for the update direction \(h_{\text{gn}}\):
\[ h_{\text{gn}} = \left( J^T W J \right)^{-1} J^T W (y - \hat{y}(\alpha)) \]
This gives the update direction \(h_{\text{gn}}\), which tells us how to adjust the parameters to reduce the cost.
To update the parameters, we simply add this step to the current estimate:
\[ \alpha_{\text{new}} = \alpha_{\text{old}} + h_{\text{gn}} \]
where \(h_{\text{gn}}\) is the Gauss-Newton step derived above.
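Below is a minimal sketch of this update, assuming the same illustrative exponential model, data, and weights as in the earlier snippets. In practice \(h_{\text{gn}}\) is obtained by solving the normal equations \(J^T W J \, h = J^T W (y - \hat{y})\) directly, rather than by forming the explicit inverse:

```python
import numpy as np

def gauss_newton_step(alpha, y, W, y_hat, jacobian):
    """One Gauss-Newton step: solve J^T W J h = J^T W (y - y_hat(alpha))."""
    r = y - y_hat(alpha)                           # residual vector
    J = jacobian(alpha)                            # Jacobian at the current parameters
    A = J.T @ W @ J                                # Gauss-Newton Hessian approximation
    b = J.T @ W @ r                                # right-hand side J^T W (y - y_hat)
    return np.linalg.solve(A, b)                   # h_gn

# Illustrative exponential model and data, as in the earlier snippets:
x = np.linspace(0.0, 2.0, 5)
y = np.array([1.00, 0.62, 0.38, 0.22, 0.14])
W = np.diag(400.0 * np.ones_like(y))
y_hat = lambda a: np.exp(-a * x)
jacobian = lambda a: (-x * np.exp(-a * x)).reshape(-1, 1)

alpha = np.array([0.9])                            # initial guess
for _ in range(5):                                 # alpha_new = alpha_old + h_gn
    alpha = alpha + gauss_newton_step(alpha, y, W, y_hat, jacobian)
print(alpha)                                       # converged parameter estimate
```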
2.4 Summary and Comparison to Gradient Descent
In this section, we used the Gauss-Newton method to update the parameters. Here’s how it works compared to gradient descent:
Initial Guess: We start with an initial guess for the parameters, \(\alpha_{\text{old}}\), which might be arbitrary but should be reasonable enough to start the optimization process.
Compute the Gradient: The gradient (the first derivative of the cost function) tells us which direction to move in; it gives a sense of the slope of the “hill” we are trying to descend.
For gradient descent, the update rule would be:
\[ \alpha_{\text{new}} = \alpha_{\text{old}} - \eta \nabla_{\alpha} \chi^2(\alpha) \]
where \(\eta\) is the learning rate (controlling the size of the update step) and \(\nabla_{\alpha} \chi^2(\alpha)\) is the gradient of the cost function with respect to the parameters.
The Gauss-Newton method also uses a gradient-based update, but it additionally exploits approximate second-order information (through the Hessian approximation \(J^T W J\) built from the Jacobian \(J\)) to adjust the parameters more effectively, particularly when dealing with nonlinear models.
The Gauss-Newton update rule is:
\[ \alpha_{\text{new}} = \alpha_{\text{old}} + \left( J^T W J \right)^{-1} J^T W (y - \hat{y}(\alpha_{\text{old}})) \]
Learning Rate: In gradient descent, we often introduce a learning rate \(\eta\) to control the size of the updates. In Gauss-Newton, the update is determined by the matrix inversion, so we don’t need a separate learning rate, though it’s still possible to introduce one if necessary for stability.
For comparison, the corresponding gradient-descent update for this cost (absorbing the factor of 2 from the gradient into \(\eta\)) is:
\[ \alpha_{\text{new}} = \alpha_{\text{old}} + \eta \, J^T W (y - \hat{y}(\alpha_{\text{old}})) \]
Second-order Information: The key difference is that the Gauss-Newton method uses approximate second-order information (the Hessian approximation \(J^T W J\), built from the Jacobian \(J\)) to better model how the cost function behaves near the current parameter values. This typically results in more efficient updates than gradient descent, especially for nonlinear models.
In summary, both methods aim to update the parameters \(\alpha\) to minimize the cost function \(\chi^2(\alpha)\), but the Gauss-Newton method incorporates approximate second-order information (through \(J^T W J\) and its inverse), which generally allows more accurate steps and faster convergence, especially when dealing with nonlinear regression models.
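To make the comparison concrete, here is a small side-by-side sketch of both update rules on the same illustrative one-parameter exponential model used in the earlier snippets; the data, the learning rate \(\eta\), and the starting value are assumptions chosen for illustration:

```python
import numpy as np

# Gradient descent vs. Gauss-Newton on an illustrative one-parameter model.
x = np.linspace(0.0, 2.0, 5)
y = np.array([1.00, 0.62, 0.38, 0.22, 0.14])
W = np.diag(400.0 * np.ones_like(y))

y_hat = lambda a: np.exp(-a * x)
jacobian = lambda a: (-x * np.exp(-a * x)).reshape(-1, 1)
chi2 = lambda a: (y - y_hat(a)) @ W @ (y - y_hat(a))

alpha_gd = np.array([0.5])                         # starting guess (assumed)
alpha_gn = np.array([0.5])
eta = 1e-3                                         # learning rate for gradient descent

for _ in range(20):
    # Gradient descent: step along J^T W (y - y_hat), scaled by eta
    J, r = jacobian(alpha_gd), y - y_hat(alpha_gd)
    alpha_gd = alpha_gd + eta * (J.T @ W @ r)

    # Gauss-Newton: solve the normal equations for the full step
    J, r = jacobian(alpha_gn), y - y_hat(alpha_gn)
    alpha_gn = alpha_gn + np.linalg.solve(J.T @ W @ J, J.T @ W @ r)

print("gradient descent:", alpha_gd, chi2(alpha_gd))
print("Gauss-Newton:   ", alpha_gn, chi2(alpha_gn))
```

With these assumed settings, Gauss-Newton typically settles on the minimizing parameter within a few iterations, while gradient descent is still approaching it, which mirrors the convergence behaviour described above.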