An analysis of the Full-Matrix method, and of all gradient descent methods, begins with the Taylor's series expansion of the function being minimized. For a generic function $f(\mathbf{x})$ the Taylor's expansion is
\begin{equation}
f(\mathbf{x}) = f(\mathbf{x}_0)
 + (\mathbf{x}-\mathbf{x}_0)^{T}\,\nabla f(\mathbf{x}_0)
 + \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_0)^{T}\,\mathbf{H}(\mathbf{x}_0)\,(\mathbf{x}-\mathbf{x}_0)
 + \cdots
\end{equation}
where $\mathbf{x}_0$ is the current set of parameters of the model, $\nabla f(\mathbf{x}_0)$ is the gradient of the function, and $\mathbf{H}(\mathbf{x}_0)$ is the matrix of its second derivatives, all evaluated at $\mathbf{x}_0$. In all cases the additional, higher-order terms (represented by ``$\cdots$'') are ignored. This assumption has considerable consequences which will be discussed later.
We can change the nomenclature used in Equation 2 to more closely match that used in refinement by defining $\mathbf{x}_0$ to be the parameters of the current model and $\mathbf{s}$ to be a ``shift vector'' which we want to add to $\mathbf{x}_0$. $\mathbf{s}$ is equal to $\mathbf{x}-\mathbf{x}_0$. The new version of Equation 2 is
\begin{equation}
f(\mathbf{x}_0+\mathbf{s}) = f(\mathbf{x}_0)
 + \mathbf{s}^{T}\,\nabla f(\mathbf{x}_0)
 + \tfrac{1}{2}\,\mathbf{s}^{T}\,\mathbf{H}(\mathbf{x}_0)\,\mathbf{s}
 + \cdots
\end{equation}
and its derivative with respect to $\mathbf{s}$ is
\begin{equation}
\nabla f(\mathbf{x}_0+\mathbf{s}) = \nabla f(\mathbf{x}_0) + \mathbf{H}(\mathbf{x}_0)\,\mathbf{s} .
\end{equation}
Since the first and second derivatives can be calculated for any particular value of $\mathbf{x}_0$, this equation allows the gradient of the function to be calculated for any shift vector. In addition, the equation can be inverted to allow the shift vector to be calculated given the gradient of the function.
At the minimum (or maximum) of a function all components of the gradient are zero. Therefore we should be able to calculate the shift vector between the current model ($\mathbf{x}_0$) and the minimum. The equation for this is simple -- setting $\nabla f(\mathbf{x}_0+\mathbf{s})$ to zero and solving for the shift vector gives
\begin{equation}
\mathbf{s} = -\,\mathbf{H}(\mathbf{x}_0)^{-1}\,\nabla f(\mathbf{x}_0) .
\end{equation}
The Full-Matrix method is to use this equation, evaluated with the current parameters, to calculate $\mathbf{s}$. $\mathbf{s}$ is then added to $\mathbf{x}_0$ to give the set of parameters which causes the function to be minimal, and, in the case of refinement, the best fit to the observations.
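As a concrete illustration, the following is a minimal sketch of a single full-matrix step, assuming hypothetical routines \texttt{gradient} and \texttt{hessian} that return the first derivatives and the matrix of second derivatives of the function being minimized. It is only a sketch of the idea, not the implementation used by any particular refinement program.
\begin{verbatim}
import numpy as np

def full_matrix_step(gradient, hessian, x0):
    """One full-matrix shift:  s = -H(x0)^-1 grad f(x0)."""
    g = gradient(x0)            # first derivatives at the current parameters
    H = hessian(x0)             # matrix of second derivatives at the current parameters
    s = np.linalg.solve(H, -g)  # solve H s = -g rather than forming the inverse
    return x0 + s               # the new, hopefully improved, parameters
\end{verbatim}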
This method sounds great. One calculates a single expression and the minimum is discovered. When fitting a line to a set of points this is exactly what is done. In refinement something is obviously different. The difference arises from the ``$\cdots$'' terms which we choose to ignore. In the case of fitting a line to points the terms represented by ``$\cdots$'' are in fact zero. The truncated Taylor's series is exact and the shift vector is also exact. In refinement these terms are not equal to zero, so the shift vector gives only the approximate location of the minimum.
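To make the line-fitting case concrete, the sketch below (using made-up data points, for illustration only) fits $y = mx + b$ by least squares. Because the residuals are linear in the parameters, the matrix of second derivatives is constant, the ignored terms vanish, and a single full-matrix step from any starting guess lands exactly on the minimum.
\begin{verbatim}
import numpy as np

# Made-up observations for illustration only.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def gradient(p):
    m, b = p
    r = ys - (m * xs + b)                      # residuals
    return np.array([-2.0 * np.sum(r * xs),    # d f / d m
                     -2.0 * np.sum(r)])        # d f / d b

def hessian(p):
    # Constant for a sum of squared linear residuals,
    # so the truncated Taylor's series is exact.
    return 2.0 * np.array([[np.sum(xs * xs), np.sum(xs)],
                           [np.sum(xs),      float(len(xs))]])

p0 = np.array([0.0, 0.0])                       # arbitrary starting guess
s = np.linalg.solve(hessian(p0), -gradient(p0))
print(p0 + s)                                   # exact least-squares slope and intercept
\end{verbatim}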
The quality of the estimate is limited by the size of the terms which are ignored. The terms of the Taylor's series contain increasing powers of $\mathbf{s}$. The first term ignored is multiplied by the third power of $\mathbf{s}$ and the higher-order terms are multiplied by ever higher powers. If $\mathbf{s}$ is small these higher-order terms become quite small as well. Therefore the closer $\mathbf{x}_0$ is to the minimum the better the estimate becomes.
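A one-dimensional illustration of this point, assuming $f(x)=\cos x$ as a stand-in for a non-quadratic refinement target, is sketched below. The error of the truncated series grows roughly with the cube of the shift, so small shifts are modelled very well and large shifts rather poorly.
\begin{verbatim}
import numpy as np

def f(x):   return np.cos(x)
def df(x):  return -np.sin(x)
def d2f(x): return -np.cos(x)

x0 = 0.5
for s in (0.01, 0.1, 1.0):
    model = f(x0) + df(x0) * s + 0.5 * d2f(x0) * s ** 2
    print(f"s = {s:4.2f}   error of truncated series = {abs(f(x0 + s) - model):.1e}")
\end{verbatim}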
The Full-Matrix method, and all the gradient descent methods which are derived from it, become a series of successive approximations. An initial guess for the parameters of the model ($\mathbf{x}_0$) is manufactured somehow. For the shift vector to actually give an improved set of parameters the guess must be sufficiently close to the minimum. The ``sufficiently close'' criterion is rather difficult to calculate exactly.
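In practice, then, the full-matrix step shown earlier is simply repeated until the shifts become negligible. The loop below is a minimal sketch of such a cycle of successive approximations, again assuming hypothetical \texttt{gradient} and \texttt{hessian} routines; whether it converges at all depends on how good the initial guess is.
\begin{verbatim}
import numpy as np

def full_matrix_refine(gradient, hessian, x0, max_cycles=20, tol=1e-8):
    """Repeat the full-matrix step until the shift vector is negligible."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_cycles):
        s = np.linalg.solve(hessian(x), -gradient(x))
        x = x + s
        if np.linalg.norm(s) < tol:   # shifts shrink rapidly once near the minimum
            break
    return x
\end{verbatim}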
The distance from the minimum at which a minimization method breaks down is called the ``radius of convergence''. It is clear that the Full-Matrix method is much more restrictive than the gradient descent methods, and the gradient descent methods are more restrictive than simulated annealing, Metropolis, and Monte Carlo methods. Basically, the less information about the function is calculated at a particular point, the larger the radius of convergence will be.
The property of the Full-Matrix method which compensates for its restricted radius of convergence is its ``power of convergence''. If the starting model is within the radius of convergence of the Full-Matrix method, that method will be able to bring the model to the minimum more quickly than any other method.