Speech is a continuous signal, which means that consecutive samples of the signal are correlated (see figure on the right). In particular, if we know a previous sample \( x_{n-1} \), we can make a *prediction* of the current sample, \( \hat x_n = x_{n-1}, \) such that \( \hat x_n \approx x_n. \) By using more previous samples we have more information, which should help us make a better prediction. Specifically, we can define a predictor which uses \( M \) previous samples to predict the current sample \( x_n \) as

\[ \hat x_n = -\sum_{k=1}^{M} a_k x_{n-k}. \]

This is a *linear predictor* because it takes a linearly weighted sum of past components to predict the current one.

The error of the prediction, also known as the *prediction residual*, is

\[ e_n = x_n - \hat x_n = x_n + \sum_{k=1}^{M} a_k x_{n-k} = \sum_{k=0}^{M} a_k x_{n-k}, \]

where \( a_0 = 1 \). This explains why the definition of \( \hat x_n \) included a minus sign; when we calculate the residual, the double negative disappears and we can collate everything into one summation.
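As a concrete sketch, even the simplest predictor \( \hat x_n = x_{n-1} \) removes most of the energy of a slowly varying signal. The signal below (a low-frequency sinusoid standing in for a voiced speech segment) and all parameter values are illustrative assumptions, not taken from the text:

```python
import numpy as np

# A toy "speech-like" signal: a low-frequency sinusoid sampled densely,
# so consecutive samples are strongly correlated (hypothetical example).
fs = 8000
n = np.arange(256)
x = np.sin(2 * np.pi * 200 * n / fs)

# First-order predictor: predict each sample by the previous one.
x_hat = x[:-1]          # \hat{x}_n = x_{n-1}
e = x[1:] - x_hat       # prediction residual e_n = x_n - \hat{x}_n

# Ratio of residual energy to signal energy; well below 1, confirming
# that the previous sample carries most of the predictive information.
print(np.sum(e**2) / np.sum(x[1:]**2))
```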

A short segment of speech. Notice how consecutive samples are mostly near each other, which means that consecutive samples are correlated.

Using vector notation, we can make the expressions more compact:

\[ e = Xa, \]

where

\[ e = \begin{bmatrix}e_0\\e_1\\\vdots\\e_{N-1}\end{bmatrix},\qquad X = \begin{bmatrix}x_0 & x_{-1} & \dots & x_{-M} \\ x_1 & x_0 & \dots & x_{1-M} \\ \vdots & \vdots & & \vdots \\ x_{N-1} & x_{N-2} & \dots & x_{N-1-M} \end{bmatrix}, \qquad a = \begin{bmatrix}a_0\\a_1\\\vdots\\a_{M}\end{bmatrix}. \]

Here we calculated the residual for a length-\( N \) frame of the signal.
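The matrix form can be verified numerically. In the sketch below, samples before the frame start (\( x_{-1}, \dots, x_{-M} \)) are taken as zero; this zero-padding is an assumption for illustration (one could instead use samples from the previous frame). All data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 16, 3
x = rng.standard_normal(N)

# Build the N x (M+1) matrix X whose row n holds x_n, x_{n-1}, ..., x_{n-M};
# samples before the frame start are taken as zero (an assumption).
X = np.zeros((N, M + 1))
for k in range(M + 1):
    X[k:, k] = x[:N - k]

a = np.concatenate(([1.0], rng.standard_normal(M)))  # a_0 = 1

# e = X a reproduces the summation e_n = sum_{k=0}^{M} a_k x_{n-k}
e = X @ a
e_loop = np.array([sum(a[k] * (x[n - k] if n - k >= 0 else 0.0)
                       for k in range(M + 1)) for n in range(N)])
print(np.allclose(e, e_loop))  # → True
```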

Vector \( a \) holds the unknown coefficients of the predictor. To find the best possible predictor, we minimize the mean-square error; the result is the minimum mean-square error (MMSE) solution. The square error is the squared 2-norm of the residual, \( \|e\|^2 = e^T e \). The mean of that error is defined as the expectation

\[ E\left[\|e\|^2\right] = E\left[a^T X^T X a\right] = a^T E\left[X^T X\right] a = a^T R_x a, \]

where \( R_x = E\left[X^T X\right] \) and \( E\left[\cdot\right] \) is the expectation operator. Note that, as shown in the autocorrelation section, the matrix \( R_x \) can usually be assumed to have a symmetric Toeplitz structure.
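In practice, \( R_x \) is estimated from the sample autocorrelation of the frame, which yields the symmetric Toeplitz structure directly. A minimal sketch with synthetic data (the signal and lag count are assumptions for illustration):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(1)
N, M = 4000, 8
x = rng.standard_normal(N)

# Sample autocorrelation r[k] = (1/N) * sum_n x_n x_{n-k}, for lags 0..M.
r = np.array([x[k:] @ x[:N - k] for k in range(M + 1)]) / N

# Under wide-sense stationarity, the entries of E[X^T X] depend only on
# the lag difference, giving a symmetric Toeplitz matrix:
R = toeplitz(r)
print(R.shape)  # → (9, 9)
```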

If we were to directly minimize the mean-square error \( E\left[\|e\|^2\right] \), then clearly we would obtain the trivial solution \( a = 0 \), which is not particularly useful. That solution, however, contradicts the requirement that the first coefficient is unity, \( a_0 = 1 \). In vector notation, we can equivalently write this constraint as

\[ a^T u = 1, \qquad \text{where}\quad u = \begin{bmatrix}1 & 0 & \dots & 0\end{bmatrix}^T. \]

The standard method for quadratic minimization with constraints is to use a Lagrange multiplier, λ, such that the objective function is

\[ \eta(a,\lambda) = a^T R_x a - 2\lambda\left(a^T u - 1\right). \]

This function can be heuristically interpreted as follows: λ is a free parameter, so if \( a^T u - 1 \) were non-zero, then by choice of λ the objective function could be made arbitrarily large. For the objective to remain bounded for any value of λ, the constraint must therefore be zero.

The objective function is then minimized by setting its derivative with respect to \( a \) to zero:

\[ 0 = \frac{\partial}{\partial a}\,\eta(a,\lambda) = 2 R_x a - 2\lambda u. \]

It follows that the optimal predictor coefficients are found by solving

\[ R_x a = \lambda u. \]

Since \( R_x \) is symmetric and Toeplitz, the above system of equations can be efficiently solved using the Levinson-Durbin algorithm with algorithmic complexity \( O(M^2) \). However, note that with a direct solution we obtain \( a' := \frac1\lambda a = R_x^{-1}u; \) that is, instead of \( a \) we get \( a \) scaled by λ. Since we know that \( a_0 = 1 \), we can recover \( a \) by \( a = \lambda a' = \frac{a'}{a'_0}. \)
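The whole pipeline can be sketched end-to-end. `scipy.linalg.solve_toeplitz` uses a Levinson-type \( O(M^2) \) recursion, matching the discussion above; the AR(2) test signal and its coefficients are assumptions chosen so that the true predictor is known:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(2)
N, M = 4000, 10

# Synthetic correlated signal: x_n = 1.3 x_{n-1} - 0.4 x_{n-2} + w_n,
# so the true predictor coefficients are a = [1, -1.3, 0.4, 0, ...].
w = rng.standard_normal(N + 2)
x = np.copy(w)
for n in range(2, N + 2):
    x[n] = 1.3 * x[n - 1] - 0.4 * x[n - 2] + w[n]
x = x[2:]

# Sample autocorrelation for lags 0..M (first column of Toeplitz R_x).
r = np.array([x[k:] @ x[:N - k] for k in range(M + 1)]) / N

# Solve R_x a' = u with a Levinson-type O(M^2) solver, then rescale
# so that a_0 = 1.
u = np.zeros(M + 1)
u[0] = 1.0
a_prime = solve_toeplitz(r, u)
a = a_prime / a_prime[0]

print(a[:3])  # close to [1, -1.3, 0.4] up to estimation error
```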

Linear prediction is usually used to predict the current sample of a time-domain signal \( x_n \). The usefulness of linear prediction, however, becomes evident by studying its Fourier spectrum. Specifically, since \( e = Xa \), the corresponding Z-domain representation is

\[ E(z) = X(z)A(z), \]

where \( E(z) \), \( X(z) \), and \( A(z) \) are the Z-transforms of \( e_n \), \( x_n \), and \( a_n \), respectively. The residual \( E(z) \) is white noise, whereby the inverse \( A(z)^{-1} \) must follow the shape of \( X(z) \).

In other words, the linear predictor models the macro-shape or *envelope* of the spectrum.
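This whitening effect can be checked numerically: filtering the signal with \( A(z) \) should leave a residual whose spectrum is much flatter than the signal's. The AR(2) signal, its coefficients, and the spectral-flatness measure below are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)
N = 8192
w = rng.standard_normal(N)

# Synthetic signal with a strong spectral envelope: x = (1/A(z)) w.
x = lfilter([1.0], [1.0, -1.3, 0.4], w)

# Residual e_n = A(z) x_n: the predictor acts as a whitening filter.
e = lfilter([1.0, -1.3, 0.4], [1.0], x)

def flatness(s):
    """Spectral flatness: geometric / arithmetic mean of the power spectrum
    (1 for perfectly flat white noise, near 0 for a peaky spectrum)."""
    p = np.abs(np.fft.rfft(s))**2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

# The residual spectrum is far flatter than the signal spectrum, i.e.
# 1/|A(z)| tracks the spectral envelope of X(z).
print(flatness(e) > flatness(x))  # → True
```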