DISCRIM#

Overview#

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) are two classifiers, with, as their names suggest, a linear and a quadratic decision surface, respectively.

For a set of observations containing one or more quantitative variables and a classification variable defining groups of observations, the DISCRIM class develops a discriminant criterion to classify each observation into one of the groups.

The DISCRIM class assumes that the distribution within each group is multivariate normal. The discriminant function, also known as a classification criterion, is determined by a measure of generalized squared distance (Rao 1973). The classification criterion can be based on either the individual within-group covariance matrices (yielding a quadratic function) or the pooled covariance matrix (yielding a linear function); it also takes into account the prior probabilities of the groups.

Mathematical formula of the LDA and QDA#

Both LDA and QDA can be derived from simple probabilistic models that describe the class-conditional distribution of the data \(\mathbb{P}\left(X|Y=k\right)\) for each class \(k\). Predictions are then obtained by applying Bayes’ rule to each sample \(x \in \mathbb{R}^{p}\):

\[\mathbb{P}\left(y=k|x\right) = \dfrac{\mathbb{P}\left(x|y=k\right)\mathbb{P}(y=k)}{\mathbb{P}\left(x\right)} = \dfrac{\mathbb{P}\left(x|y=k\right)\mathbb{P}(y=k)}{\displaystyle \sum_{l}\mathbb{P}\left(x|y=l\right)\mathbb{P}(y=l)}\]

and we select the class \(k\) which maximizes this posterior probability.
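As a minimal sketch of this rule, the snippet below normalises the products of likelihood and prior and picks the most probable class. The function name `posterior` and the numeric values are hypothetical and chosen only for illustration.

```python
def posterior(likelihoods, priors):
    """Return P(y=k | x) for each class k via Bayes' rule.

    likelihoods[k] is P(x | y=k) and priors[k] is P(y=k).
    """
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x), the denominator of Bayes' rule
    return [j / evidence for j in joint]


# Hypothetical values for a three-class problem.
post = posterior(likelihoods=[0.02, 0.10, 0.05], priors=[0.5, 0.3, 0.2])

# The predicted class is the index with the largest posterior.
predicted_class = max(range(len(post)), key=post.__getitem__)
```

Because the denominator is the same for every class, the argmax can also be taken over the unnormalised products.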

More specifically, for linear and quadratic discriminant analysis, \(\mathbb{P}\left(x|y\right)\) is modeled as a multivariate Gaussian distribution with density:

\[\mathbb{P}\left(x|y=k\right) = \dfrac{1}{\left(2\pi\right)^{p/2}\lvert \Sigma_{k}\rvert^{1/2}}\text{exp}\left(-\dfrac{1}{2}\left(x-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(x-\mu_{k}\right)\right)\]
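The density above is usually evaluated on the log scale to avoid underflow. The following is a sketch using NumPy; `gaussian_log_density` is a hypothetical helper, not part of the DISCRIM class.

```python
import numpy as np


def gaussian_log_density(x, mu, sigma):
    """Log of the multivariate normal density N(mu, sigma) evaluated at x."""
    p = mu.shape[0]
    diff = x - mu
    # slogdet is more stable than log(det(sigma)) for ill-conditioned matrices.
    _, logdet = np.linalg.slogdet(sigma)
    # Solve sigma @ z = diff rather than forming the explicit inverse.
    maha_sq = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + maha_sq)


# Hypothetical 2-D example: standard normal evaluated at its mean.
log_pdf = gaussian_log_density(np.zeros(2), np.zeros(2), np.eye(2))
```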

Quadratic Discriminant Analysis#

According to the model above, the log of the posterior probability is:

\[\begin{split}\begin{eqnarray} \log \mathbb{P}\left(y=k|x\right) & = & \log \mathbb{P}\left(x|y=k\right)+\log \mathbb{P}\left(y=k\right) - \log \mathbb{P}(x)\\ & = & - \dfrac{p}{2}\log\left(2\pi\right)-\dfrac{1}{2}\log \lvert \Sigma_{k} \rvert -\dfrac{1}{2}\left(x-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(x-\mu_{k}\right) +\log \mathbb{P}\left(y=k\right) - \log \mathbb{P}(x) \\ & = & -\dfrac{1}{2}\log \lvert \Sigma_{k} \rvert -\dfrac{1}{2}\left(x-\mu_{k}\right)^{T} \Sigma_{k}^{-1}\left(x-\mu_{k}\right) +\log \mathbb{P}\left(y=k\right) + Cst \end{eqnarray}\end{split}\]

The predicted class is the one that maximises this log-posterior probability.
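One possible way to compute these class scores, assuming the means, covariance matrices and priors are already known, is sketched below. All parameter values are hypothetical, and the class-independent constant is dropped because it does not affect the argmax.

```python
import numpy as np


def qda_scores(x, means, covariances, priors):
    """Log-posterior of each class at x, up to the class-independent constant."""
    scores = []
    for mu, sigma, prior in zip(means, covariances, priors):
        diff = x - mu
        _, logdet = np.linalg.slogdet(sigma)
        maha_sq = diff @ np.linalg.solve(sigma, diff)
        scores.append(-0.5 * logdet - 0.5 * maha_sq + np.log(prior))
    return np.array(scores)


# Hypothetical two-class parameters in two dimensions.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covariances = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.6, 0.4]

scores = qda_scores(np.array([0.2, -0.1]), means, covariances, priors)
predicted_class = int(np.argmax(scores))
```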

Linear Discriminant Analysis#

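LDA is a special case of QDA, where the Gaussians for all classes are assumed to share the same covariance matrix: \(\Sigma_{k}=\Sigma\) for all \(k\). Because \(\lvert \Sigma \rvert\) then no longer depends on \(k\), the log-determinant term is absorbed into the constant, which reduces the log posterior probability to: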

\[\log \mathbb{P}\left(y=k|x\right) = -\dfrac{1}{2}\left(x-\mu_{k}\right)^{T} \Sigma^{-1}\left(x-\mu_{k}\right) +\log \mathbb{P}\left(y=k\right) + Cst\]

The term \(\left(x-\mu_{k}\right)^{T} \Sigma^{-1}\left(x-\mu_{k}\right)\) is the squared Mahalanobis distance between the sample \(x\) and the mean \(\mu_{k}\).

Note

The Mahalanobis distance tells how close \(x\) is to \(\mu_{k}\), while also accounting for the variance of each feature. We can thus interpret LDA as assigning \(x\) to the class whose mean is the closest in terms of Mahalanobis distance, while also accounting for the class prior probabilities.
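To illustrate how the variance of each feature enters, the sketch below compares two points at the same Euclidean distance from the mean. The helper `mahalanobis_sq` and the diagonal covariance are assumptions made for this example only.

```python
import numpy as np


def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance between x and mu under covariance sigma."""
    diff = x - mu
    return float(diff @ np.linalg.solve(sigma, diff))


mu = np.zeros(2)
# Hypothetical covariance: feature 0 varies much more than feature 1.
sigma = np.diag([4.0, 0.25])

# Both points lie at Euclidean distance 1 from mu.
d_along_wide = mahalanobis_sq(np.array([1.0, 0.0]), mu, sigma)    # 1 / 4
d_along_narrow = mahalanobis_sq(np.array([0.0, 1.0]), mu, sigma)  # 1 / 0.25
```

A unit step along the high-variance feature counts as a smaller distance than the same step along the low-variance feature.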

The log posterior probability of LDA can also be written [1] as:

\[\log \mathbb{P}\left(y=k|x\right) = \beta_{k0} + \beta_{k}^{T}x + Cst\]

where \(\beta_{k} = \Sigma^{-1}\mu_{k}\) and \(\beta_{k0} = \log \mathbb{P}\left(y=k\right) - \dfrac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}\).
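A short worked expansion shows where these coefficients come from. Since \(\Sigma^{-1}\) is symmetric, the quadratic form expands as:

\[-\dfrac{1}{2}\left(x-\mu_{k}\right)^{T}\Sigma^{-1}\left(x-\mu_{k}\right) = -\dfrac{1}{2}x^{T}\Sigma^{-1}x + \mu_{k}^{T}\Sigma^{-1}x - \dfrac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}\]

The first term does not depend on \(k\) and can be absorbed into the constant. The second term is \(\beta_{k}^{T}x\), and the third, together with \(\log \mathbb{P}\left(y=k\right)\), gives \(\beta_{k0}\).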

Note

From the above formula, it is clear that linear discriminant analysis has a linear decision surface. In the case of quadratic discriminant analysis, there are no assumptions on the covariance matrices \(\Sigma_{k}\) of the Gaussians, leading to quadratic decision surfaces. See [2] for more details.

Estimating Parameters of Normal Distributions#

In practice, the parameters of the model are unknown and must be estimated from the data:

  • Prior probabilities: \(\widehat{\pi}_{k}\)

  • Mean vectors: \(\widehat{\mu}_{k}\)

  • Variance-covariance matrices: \(\widehat{\Sigma}_{k}\)

Priors#

Estimating \(\pi_{k}\) is relatively intuitive:

\[\widehat{\pi}_{k} = \dfrac{n_{k}}{n}\]

where \(n_{k} = \lvert \mathcal{C}_{k}\rvert\) denotes the size of class \(k\) and \(n\) denotes the total number of data points.
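A minimal sketch of this estimate, using only the standard library, is given below. The function name `estimate_priors` and the labels are hypothetical.

```python
from collections import Counter


def estimate_priors(labels):
    """Estimate pi_k = n_k / n from a sequence of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return {k: n_k / n for k, n_k in counts.items()}


# Hypothetical labels for n = 10 observations.
priors = estimate_priors(["a", "a", "b", "a", "c", "b", "a", "c", "b", "a"])
```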

Mean vectors#

For \(\widehat{\mu}_{k}\), we can use the centroid \(g_{k}\) of \(\mathcal{C}_{k}\), i.e. the average of the observations belonging to class \(k\):

\[\widehat{\mu}_{k} = g_{k} = \dfrac{1}{n_{k}} \displaystyle \sum_{i \in \mathcal{C}_{k}} x_{i}\]
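The centroids can be computed with a per-class mean, as in this NumPy sketch. The function `class_means` and the small data set are illustrative assumptions.

```python
import numpy as np


def class_means(X, y):
    """Return {class label: centroid of the rows of X with that label}."""
    return {k: X[y == k].mean(axis=0) for k in np.unique(y)}


# Hypothetical data: 4 observations, 2 variables, 2 classes.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0]])
y = np.array([0, 0, 1, 1])
means = class_means(X, y)
```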

Variance-Covariance matrices#

For \(\widehat{\Sigma}_{k}\), we can use the within-class covariance matrix:

\[\widehat{\Sigma}_{k} = \dfrac{1}{n_{k}-1} X_{k}^{T}X_{k}\]

where \(X_{k}\) is the mean-centered data matrix for objects of class \(\mathcal{C}_{k}\).
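The estimator can be sketched in a few lines of NumPy. The helper `class_covariance` is hypothetical, and under the \(n_{k}-1\) normalisation it should agree with `np.cov(..., rowvar=False)`.

```python
import numpy as np


def class_covariance(X_class):
    """Unbiased covariance of one class: X_k^T X_k / (n_k - 1), X_k centred."""
    n_k = X_class.shape[0]
    X_centred = X_class - X_class.mean(axis=0)
    return X_centred.T @ X_centred / (n_k - 1)


# Hypothetical class with 4 observations of 2 variables.
X_class = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])
sigma_k = class_covariance(X_class)
```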

Given all of the above estimates, we can estimate the posterior probability.

Quadratic Discriminant Analysis#

For quadratic discriminant analysis, plugging the estimates into the log posterior and dropping the terms that do not depend on \(k\) gives the discriminant function:

\[\widehat{\delta}_{k}\left(x\right) = \underbrace{-\dfrac{1}{2}x^{T}\widehat{\Sigma}_{k}^{-1}x}_{\text{quadratic}} + \underbrace{x^{T}\widehat{\Sigma}_{k}^{-1}\widehat{\mu}_{k}}_{\text{linear}} + \underbrace{\log \left(\widehat{\pi}_{k}\right) - \dfrac{1}{2}\log \lvert \widehat{\Sigma}_{k} \rvert - \dfrac{1}{2}\widehat{\mu}_{k}^{T}\widehat{\Sigma}_{k}^{-1}\widehat{\mu}_{k}}_{\text{constant}}\]

Having a quadratic discriminant function causes the decision boundaries in quadratic discriminant analysis to be quadratic surfaces.
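As a sanity check, the sketch below evaluates the discriminant as the sum of its quadratic, linear and constant parts and compares it with the compact log-posterior form. The linear part enters with a positive sign, as the expansion of \(-\frac{1}{2}\left(x-\mu_{k}\right)^{T}\Sigma_{k}^{-1}\left(x-\mu_{k}\right)\) requires. The function name and all numeric values are hypothetical.

```python
import numpy as np


def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) as the sum of its quadratic, linear and constant parts."""
    sigma_inv = np.linalg.inv(sigma)
    quadratic = -0.5 * x @ sigma_inv @ x
    linear = x @ sigma_inv @ mu
    _, logdet = np.linalg.slogdet(sigma)
    constant = np.log(prior) - 0.5 * logdet - 0.5 * mu @ sigma_inv @ mu
    return quadratic + linear + constant


# Hypothetical estimates for one class.
x = np.array([1.0, 2.0])
mu = np.array([0.5, 1.5])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
prior = 0.4

delta = qda_discriminant(x, mu, sigma, prior)

# Compact form of the log posterior (up to the same constant), for comparison.
diff = x - mu
compact = (-0.5 * np.linalg.slogdet(sigma)[1]
           - 0.5 * diff @ np.linalg.solve(sigma, diff)
           + np.log(prior))
```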

Linear Discriminant Analysis#

For linear discriminant analysis, all classes are assumed to share the same covariance matrix, which is estimated by the pooled within-class covariance matrix (with \(K\) the number of classes):

\[\widehat{\Sigma} = \dfrac{1}{n-K} \displaystyle \sum_{k} \left(n_{k}-1\right)\widehat{\Sigma}_{k}\]
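A sketch of the pooled estimate follows. It accumulates the per-class scatter matrices, each equal to \(\left(n_{k}-1\right)\widehat{\Sigma}_{k}\), and divides by \(n-K\). The function name `pooled_covariance` and the data are assumptions for illustration.

```python
import numpy as np


def pooled_covariance(X, y):
    """Pooled within-class covariance: sum_k (n_k - 1) Sigma_k / (n - K)."""
    classes = np.unique(y)
    n, p = X.shape
    scatter = np.zeros((p, p))
    for k in classes:
        X_k = X[y == k]
        centred = X_k - X_k.mean(axis=0)
        scatter += centred.T @ centred  # equals (n_k - 1) * Sigma_k
    return scatter / (n - len(classes))


# Hypothetical data: 6 observations, 2 variables, 2 classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
sigma_pooled = pooled_covariance(X, y)
```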

Our expression for \(\widehat{\delta}_{k}\left(x\right)\) becomes (after ignoring terms that do not depend on \(k\)):

\[\widehat{\delta}_{k}\left(x\right) = \underbrace{x^{T}\widehat{\Sigma}^{-1}\widehat{\mu}_{k}}_{\text{linear}} + \underbrace{\log \left(\widehat{\pi}_{k}\right) - \dfrac{1}{2}\widehat{\mu}_{k}^{T}\widehat{\Sigma}^{-1}\widehat{\mu}_{k}}_{\text{constant}}\]
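The resulting classifier can be sketched with the coefficients \(\beta_{k} = \Sigma^{-1}\mu_{k}\) and \(\beta_{k0} = \log \pi_{k} - \frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}\) defined earlier. The function `lda_predict` and the parameter values are hypothetical. With equal priors and an identity covariance, the boundary between the two classes lies halfway between the means.

```python
import numpy as np


def lda_predict(x, means, sigma, priors):
    """Index of the class with the largest linear discriminant score at x."""
    scores = []
    for mu, prior in zip(means, priors):
        beta = np.linalg.solve(sigma, mu)        # Sigma^{-1} mu_k
        beta0 = np.log(prior) - 0.5 * mu @ beta  # log pi_k - mu_k^T Sigma^{-1} mu_k / 2
        scores.append(beta0 + beta @ x)
    return int(np.argmax(scores))


# Hypothetical estimates: two classes sharing one covariance matrix.
means = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
sigma = np.eye(2)
priors = [0.5, 0.5]

# The decision boundary is the vertical line where the first coordinate equals 2.
near_first = lda_predict(np.array([1.0, 0.5]), means, sigma, priors)
near_second = lda_predict(np.array([3.0, 0.5]), means, sigma, priors)
```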

For more details about linear and quadratic discriminant analysis, see SAS DISCRIM Procedure.

References