Sequential quadratic programming

From optimization
Revision as of 23:55, 27 May 2015 by Ben Goodman (Talk | contribs)

Jump to: navigation, search

Authored by: Ben Goodman (ChE 345 Spring 2016)
Steward: Dajun Yue and Fenqi You



Sequential quadratic programming (SQP) is a class of algorithms for solving non-linear optimization problems (NLP) in the real world. It is powerful enough for real problems because it can handle any degree of non-linearity including non-linearity in the constraints. The main disadvantage is that the method incorporates several derivatives, which likely need to be worked analytically in advance of iterating to a solution, so SQP becomes quite cumbersome for large problems with many variables or constraints. SQP combines two fundamental algorithms for solving non-linear optimization problems: an active set method and Newton’s method, both of which are explained briefly below. Previous exposure to the component methods as well as to Lagrangian multipliers and Karush-Kuhn-Tucker (KKT) conditions is helpful in understanding SQP. The abstracted, general problem below will be used for the remainder of this page to explain and discuss SQP:
 \text{min f(x)}
 \text{s.t. h(x) = 0}
 \text{and g(x)} \le 0

with f(x), h(x), and g(x) each potentially non-linear. x is potentially a vector of many variables for the optimization, in which case h(x) and g(x) are systems.


Karush-Kuhn-Tucker (KKT) Conditions and the Lagrangian Function

The Lagrangian function combines all the information about the problem into one function using Lagrangian multipliers \lambda for equality constraints and \mu for inequality constraints: L(x,\lambda,\mu)\text{ = f(x) +}\sum_i\lambda_i h_i(x)+\sum_i\mu_i g_i(x)

A single function can be optimized by finding critical points where the gradient is zero. This procedure now includes \lambda and \mu as variables (which are vectors for multi-constraint NLP). The system formed from this gradient is given the label KKT conditions:

\nabla L =\begin{bmatrix} \frac{dL}{dx} \\ \frac{dL}{d\lambda} \\ \frac{dL}{d\mu} \end{bmatrix} = \begin{bmatrix} \nabla f + \lambda \nabla h + \mu \nabla g^* \\ h \\ g^* \end{bmatrix} =0

The second KKT condition is merely feasibility; h(x) were constrained to zero in the original NLP. The third KKT condition is a bit trickier in that only the set of active inequality constraints need satisfy this equality, the active set being denoted by g^*. Inequality constraints that are nowhere near the optimal solution are inconsequential, but constraints that actively participate in determining the optimal solution will be at their limit of zero, and thus the third KKT condition holds. Ultimately, the Lagrangian multipliers describe the change in the objective function with respect to a change in a constraint, so \mu is zero for inactive constraints, so those inactive constraints can be considered removed from the Lagrangian function before the gradient is even taken.

The Active Set Method and its Limitations

The active set method solves the KKT conditions using guess and check to find critical points. Guessing that every inequality constraints is inactive is conventionally the first step. After solving the remaining system for x, feasibility can be checked. If any constraints are violated, they should be considered active in the next iteration, and if any multipliers are found to be negative, their constraints should be considered inactive in the next iteration. Efficient convergence and potentially large systems of equations are of some concern, but the main limitation of the active set method is that many of the derivative expressions in the KKT conditions could still be highly non-linear and thus difficult to solve. Indeed, only quadratic problems seem reasonable to tackle with the active set method because the KKT conditions are linear. Sequential Quadratic Programming addresses this key limitation by incorporating a means of handling highly non-linear functions: Newton's Method.

Newton's Method

The main idea behind Newton's Method is to improve a guess in proportion to how quickly the function is changing at the guess and inversely proportional to how the function is accelerating at the guess. Walking through a few extreme scenarios makes this approach more intuitive: a long, steep incline in a function will not be close to a critical point, so the improvement should be large, and a shallow incline that is rapidly expiring is likely to be near a critical point, so the improvement should be small. The iterations converge to critical values of any function f with improvement steps that follow the form below:
x_{k+1} =  x_k - \frac{\nabla f}{\nabla^2 f}

The negative sign is important. Near minimums, a positive gradient should decrease the guess and vice versa, and the divergence is positive. Near maximums, a positive gradient should increase the guess and vice versa, but the divergence is negative. This sign convention also prevents the algorithm from escaping a single convex or concave region; the improvement will reverse direction if it overshoots. This is an important consideration in non-convex problems with multiple local maximums and minimums. Newton's method will find the critical point closest to the original guess. Incorporating Newton's Method into the active set method will transform the iteration above into a matrix equation.

The SQP Algorithm

Critical points of the objective function will also be critical points of the Lagrangian function and vice versa because the Lagrangian function is equal to the objective function at a KKT point; all constraints are either equal to zero or inactive. The algorithm is thus simply iterating Newton's method to find critical points of the Lagrangian function. Since the Lagrangian multipliers are additional variables, the iteration forms a system:
 \begin{bmatrix} x_{k+1} \\ \lambda_{k+1} \\ \mu_{k+1} \end{bmatrix} =  \begin{bmatrix} x_{k} \\ \lambda_{k} \\ \mu_{k} \end{bmatrix} -Failed to parse(PNG conversion failed; check for correct installation of latex and dvipng (or dvips + gs + convert)): \nabla^2^-1 L (\nabla L)

Recall: \nabla L =\begin{bmatrix} \frac{dL}{dx} \\ \frac{dL}{d\lambda} \\ \frac{dL}{d\mu} \end{bmatrix} = \begin{bmatrix} \nabla f + \lambda \nabla h + \mu \nabla g^* \\ h \\ g^* \end{bmatrix}

Then \nabla^2 L = \begin{bmatrix} \nabla_{xx}^2 L & \nabla h & \nabla g \\ \nabla h & 0 & 0 \\ \nabla g & 0 & 0 \end{bmatrix}

L(x,\lambda,\mu)\text{ = f(x) +}\sum_i\lambda_i h_i(x)+\sum_i\mu_i g_i(x)
Previous knowledge of the component methods is helpful in understanding sequential quadratic programming. Briefly, the active set method confines the search for optimal solutions to regions where the objective function is increasing significantly with respect to constraint functions. Lagrangian Parameters and KKT conditions provide the framework to find these regions and converge to the optimum solution. The basic idea is that Lagrangian parameters represent the change in the objective with respect to the constraint, which allows the chain rule and single-variable calculus optimization approaches to be invoked.

Briefly, the main idea behind Newton's Method is to improve a guess in proportion to how quickly the function is changing at the guess and inversely proportional to how the function is accelerating at the guess. The iterations converge to critical values of any function f with improvement steps that follow the form below:
x_{k+1} =  x_k - \frac{\nabla f}{\nabla^2 f}

The final concept fundamental to SQP is Taylor Series expansions; the idea that any function can be well represented by an infinite series of polynomial terms. This concept extends to expressing derivatives as a series of polynomial deviations from a given starting point with each term scaled by the analytical derivative evaluated at the starting point. When the deviation is small, one or two terms can be used with adequate accuracy. This concept allows highly non-linear problems to be handled using linear and quadratic methods.

SQP Algorithm

As with the active set method, the Lagrangian function forms the basis of SQP:

L(x,\lambda,\mu)\text{ = f(x) +}\sum_i\lambda_i h_i(x)+\sum_i\mu_i g_i(x)

The active set method alone must be performed with only the first order term of the Taylor Series for L so that the resulting sub-problem is linear. Newton's method in tandem allows the second order term of the Taylor Series to be added, forming a quadratic sub-problem, because Newton's method converges in one iteration for quadratic problems. The quadratic sub-problem is itself a minimization problems with an improvement parameter p_x and the Lagrangian operators \lambda and \mu as the as the variables. The functions in the problem have been fed incumbent guesses for x, \lambda, and\mu, and so the functions are denoted with a subscript "k," they have been fed x_k, \lambda_k, and\mu_k. The problem is:

 \text{min  Z} =L(x_k,\lambda_k, \mu_k) +\nabla L_k *p_x +\frac{1}{2} p_x^T \nabla_{xx}^2 L_k p_x
Z is, in essence, the derivative of the objective function and has a minimum of zero in this algorithm because the improvement parameters will solve to zero once this global convergence to a critical point has been achieved.

As with the active set method alone, the Lagrangian function is used under KKT conditions:
L(x,\lambda,\mu)\text{ = f(x) +}\sum_i\lambda_i h_i(x)+\sum_i\mu_i g_i(x)
The KKT conditions form the system of equations below. In the active set method alone, this system is solved directly for values of x, \lambda, and\mu.

\begin{bmatrix} \nabla f + \lambda \nabla h + \mu \nabla g \\ h \\ g \end{bmatrix} =0

In SQP, the system is not solved but rather fed the incumbent guess for x, \lambda, and\mu. Similarly, a "second" derivative matrix is also fed the incumbent guess, and these computed values then scale terms in the system of equation for the improvement parameters p_x and p_\lambda. The "second" derivative matrix does indeed include the divergence of the Lagrangian function with respect to x, but it can be understood more accurately to be the gradient of the KKT conditions system above; derivatives with respect to each variable. This matrix is then:

\begin{bmatrix} \nabla_{xx}^2 L & \nabla h & \nabla g \\ \nabla h & 0 & 0 \\ \nabla g & 0 & 0 \end{bmatrix}

The system to solve is then:

Convergence Analysis