CANDISC#

Overview#

Canonical discriminant analysis is a dimension-reduction technique related to principal component analysis and canonical correlation. The methodology that is used in deriving the canonical coefficients parallels that of a one-way multivariate analysis of variance (MANOVA). MANOVA tests for equality of the mean vector across class levels. Canonical discriminant analysis finds linear combinations of the quantitative variables that provide maximal separation between classes or groups.

CANDISC has a somewhat hybrid learning nature, with two aspects:

Semi-supervised: On the one hand, CANDISC has an unsupervised or descriptive aspect that addresses the following question: how can we find a representation of the objects that provides the best separation between classes?
Supervised: On the other hand, CANDISC also has a decidedly supervised aspect that addresses the question: how can we find the rules for assigning a class to a given object?

Semi-supervised Aspect#

Mathematical formula of the CANDISC classifier#

In the semi-supervised aspect, the goal is to find a low-dimensional representation of the objects that provides the best separation between classes. We can look for an axis \(\Delta_{a}\), spanned by some column vector \(a=(a_{1},a_{2},\dots,a_{p})\), such that the linear combination [1]:

\[z = a_{1}(x_{1} - \overline{x}_{1})+a_{2}(x_{2} - \overline{x}_{2})+\cdots+a_{p}(x_{p} - \overline{x}_{p})\]

separates all \(K\) groups in an adequate way.

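Algebraically, with \(g = (\overline{x}_{1},\dots,\overline{x}_{p})^{T}\) the vector of global means and \(\mathbb{1}\) the vector of ones, the idea is to look for a linear combination of the predictors: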

\[Z= (X - \mathbb{1}g^{T})a\]

that ideally could achieve the following two goals [2]:

Minimize (pooled) within-class dispersion (wss): \(\underbrace{\min}_{a}\left\{a^{T}W_{b}a\right\}\)
Maximize between-class dispersion (bss): \(\underbrace{\max}_{a}\left\{a^{T}B_{b}a\right\}\)

On the one hand, it would be nice to have \(a\) such that the between-class dispersion is maximized. This corresponds to a situation in which the class centroids are well separated. On the other hand, it would also make sense to have \(a\) such that the within-class dispersion is minimized. This implies having classes in which, on average, the “inner” variation is small (i.e. concentrated local dispersion).

Can we find such a mythical vector \(a\)?

Looking for a Compromise Criterion#

So far we have an impossible simultaneity involving a minimization criterion, as well as a maximization criterion:

\[\begin{split}\begin{eqnarray} \underbrace{\min}_{a}\left\{a^{T}W_{b}a\right\} & \Longrightarrow & W_{b}a = \lambda a \\ & \text{and} & \\ \underbrace{\max}_{a}\left\{a^{T}B_{b}a\right\} & \Longrightarrow & B_{b}a = \rho a \end{eqnarray}\end{split}\]

What can we do to find a compromise? Using Huygens' theorem, the total variance can be decomposed as:

\[V_{b} = W_{b} + B_{b}\]

Pre-multiplying by \(a^{T}\) and post-multiplying by \(a\), the quadratic form \(a^{T}V_{b}a\) decomposes in the same way:

\[a^{T}V_{b}a = a^{T}W_{b}a + a^{T}B_{b}a\]
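The decomposition can be checked numerically. The sketch below is only an illustration: it assumes that the subscript \(b\) denotes the biased estimators (division by \(n\)), and the toy data, the helper name `scatter_matrices` and the direction `a` are invented for the example.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return the biased total, within-class and between-class covariances."""
    n, p = X.shape
    g = X.mean(axis=0)                                # global centroid
    V = (X - g).T @ (X - g) / n                       # total covariance V_b
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for k in np.unique(y):
        Xk = X[y == k]
        gk = Xk.mean(axis=0)                          # class centroid g_k
        W += (Xk - gk).T @ (Xk - gk) / n              # within-class part
        B += len(Xk) / n * np.outer(gk - g, gk - g)   # between-class part
    return V, W, B

# Toy data, invented for illustration: 3 classes, 2 quantitative variables
rng = np.random.default_rng(0)
means = ([0, 0], [3, 1], [1, 4])
X = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in means])
y = np.repeat([0, 1, 2], 20)

V, W, B = scatter_matrices(X, y)

# Quadratic forms along an arbitrary candidate direction a
a = np.array([1.0, -0.5])
total, within, between = a @ V @ a, a @ W @ a, a @ B @ a

print(np.allclose(V, W + B), np.isclose(total, within + between))
```

Any direction `a` satisfies the identity, so the decomposition alone does not single out a best axis; that is the role of the compromise criteria.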

Again, we are pursuing a dual goal that is, in general, hard to accomplish:

\[a^{T}V_{b}a = \underbrace{a^{T}W_{b}a}_{\text{minimize}} + \underbrace{a^{T}B_{b}a}_{\text{maximize}}\]

We have two options for the compromise:

\[\underbrace{\max}_{a}\left\{\dfrac{a^{T}B_{b}a}{a^{T}V_{b}a}\right\} \quad \text{or} \quad \underbrace{\max}_{a}\left\{\dfrac{a^{T}B_{b}a}{a^{T}W_{b}a}\right\}\]

which are associated with the following ratios:

\[\eta^{2} = \dfrac{a^{T}B_{b}a}{a^{T}V_{b}a} \quad \text{and} \quad F = \dfrac{a^{T}B_{b}a}{a^{T}W_{b}a}\]

where \(\eta^{2}\) is the correlation ratio and \(F\) the \(F\)-ratio.

Correlation Ratio#

If we decide to work with the first criterion, we look for \(a\) such that:

\[\underbrace{\max}_{a}\left\{\dfrac{a^{T}B_{b}a}{a^{T}V_{b}a}\right\}\]

This criterion is scale invariant: any rescaling \(\alpha a\) of \(a\), with \(\alpha \neq 0\), gives the same value. For convenience, we can therefore impose the normalizing restriction \(a^{T}V_{b}a=1\). Consequently:

\[\underbrace{\max}_{a}\left\{\dfrac{a^{T}B_{b}a}{a^{T}V_{b}a}\right\} \quad \Longleftrightarrow \quad \underbrace{\max}_{a}\left\{a^{T}B_{b}a\right\}\quad \text{s.t.}\quad a^{T}V_{b}a=1\]

Using the method of Lagrange multipliers:

\[\mathcal{L}(a,\lambda) = a^{T}B_{b}a - \lambda(a^{T}V_{b}a - 1)\]

Differentiating with respect to \(a\) and setting the result to zero:

\[\dfrac{\partial \mathcal{L}}{\partial a} = 2B_{b}a - 2\lambda V_{b}a= 0\]

The optimal vector \(a\) is such that:

\[B_{b}a = \lambda V_{b}a\]

If the matrix \(V_{b}\) is invertible, which it generally is, then:

\[V_{b}^{-1}B_{b}a = \lambda a\]

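that is, the optimal vector \(a\) is an eigenvector of \(V_{b}^{-1}B_{b}\). Since \(a^{T}B_{b}a = \lambda a^{T}V_{b}a = \lambda\) at such a point, the maximum is reached for the eigenvector associated with the largest eigenvalue. Keep in mind that, in general, \(V_{b}^{-1}B_{b}\) is not symmetric.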
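As a rough numerical illustration of this eigenproblem, the sketch below solves it with NumPy. It is not the package's implementation: the toy data are invented, and the biased (divide by \(n\)) estimators are an assumption.

```python
import numpy as np

# Toy data, invented for illustration: 3 classes, 2 quantitative variables
rng = np.random.default_rng(0)
means = ([0, 0], [3, 1], [1, 4])
X = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in means])
y = np.repeat([0, 1, 2], 20)

n = len(X)
g = X.mean(axis=0)
V = (X - g).T @ (X - g) / n                   # total covariance V_b
B = np.zeros_like(V)
for k in np.unique(y):
    d = X[y == k].mean(axis=0) - g
    B += (y == k).mean() * np.outer(d, d)     # between-class covariance B_b

# V_b^{-1} B_b is not symmetric, so the general solver np.linalg.eig is used
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(V, B))
order = np.argsort(eigvals.real)[::-1]        # largest eigenvalue first
lam = eigvals.real[order]
A = eigvecs.real[:, order]

# Rescale each eigenvector (column) so that a^T V_b a = 1
A = A / np.sqrt(np.diag(A.T @ V @ A))

print(lam)
```

The columns of `A` satisfy \(B_{b}a = \lambda V_{b}a\) together with the normalizing restriction, and the first column maximizes the correlation ratio.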

\(F\)-ratio Criterion#

Now, if we decide to work with the criterion associated with the \(F\)-ratio, then the criterion to be maximized is:

\[\underbrace{\max}_{a}\left\{\dfrac{a^{T}B_{b}a}{a^{T}W_{b}a}\right\} \quad \Longleftrightarrow \quad \underbrace{\max}_{a}\left\{a^{T}B_{b}a\right\} \quad \text{s.t.}\quad a^{T}W_{b}a=1\]

Applying the same Lagrangian procedure, with a multiplier \(\rho\), we find that \(a\) satisfies:

\[B_{b}a = \rho W_{b}a\]

and if \(W_{b}\) is invertible, which it is in most cases, then \(a\) is also an eigenvector of \(W_{b}^{-1}B_{b}\), associated with the eigenvalue \(\rho\):

\[W_{b}^{-1}B_{b}a = \rho a\]

Relationship between eigenvalues#

The relationship between the eigenvalues \(\lambda\) and \(\rho\) is:

\[\rho = \dfrac{\lambda}{1 - \lambda}\]
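This relationship follows in a few lines from the two eigen-equations and the decomposition \(V_{b} = W_{b} + B_{b}\), assuming \(\lambda < 1\). Starting from \(B_{b}a = \lambda V_{b}a\):

\[B_{b}a = \lambda\left(W_{b} + B_{b}\right)a \quad \Longrightarrow \quad (1 - \lambda)B_{b}a = \lambda W_{b}a \quad \Longrightarrow \quad B_{b}a = \dfrac{\lambda}{1 - \lambda}W_{b}a\]

Comparing with \(B_{b}a = \rho W_{b}a\) gives the stated value of \(\rho\), and shows that both criteria share the same eigenvectors \(a\).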

Note

\(\lambda = \eta^{2}\) corresponds to the correlation ratio and ranges between \(0\) and \(1\) (\(0 \leq \lambda \leq 1\)).
\(\sqrt{\lambda} = \eta\) corresponds to the canonical correlation.
The eigenvalues \(\lambda\) are not additive from one factor to another.
The eigenvalues \(\rho\) are additive from one factor to another.

Raw Canonical coefficients#

Coefficients \(a_{h} (h=1,\dots,H)\) are obtained using the following formula:

\[a_{h} = \left(V_{b}^{-1}C\right)b_{h}\sqrt{\dfrac{n-K}{n\times \lambda_{h} \times \left( 1 - \lambda_{h}\right)}}\]

where \(b_{h}\) is the eigenvector of \(C^{T}V_{b}^{-1}C\) associated with the eigenvalue \(\lambda_{h}\), and \(C\) is a matrix of shape \((p,K)\) such that \(B_{b}=CC^{T}\), with entries

\[c_{jk} = \sqrt{\dfrac{n_{k}}{n}}\left(\overline{x}_{kj} - \overline{x}_{j}\right)\]

The intercept, \(a_{h0}\), is:

\[a_{h0} = - \displaystyle \sum_{j=1}^{j=p}a_{hj}\overline{x}_{j}\]
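The formulas for the raw coefficients and intercepts can be sketched in NumPy as follows. This is a hedged illustration, not the package's code: the toy data are invented, the eigenvectors \(b_{h}\) are assumed to have unit norm, and \(H\) is taken as \(\min(p, K-1)\).

```python
import numpy as np

# Toy data, invented for illustration: 3 classes, 2 quantitative variables
rng = np.random.default_rng(0)
means = ([0, 0], [3, 1], [1, 4])
X = np.vstack([rng.normal(m, 1.0, size=(20, 2)) for m in means])
y = np.repeat([0, 1, 2], 20)

n, p = X.shape
classes = np.unique(y)
K = len(classes)
H = min(p, K - 1)                              # number of canonical axes (assumption)
g = X.mean(axis=0)
V = (X - g).T @ (X - g) / n                    # biased total covariance V_b

# One column per class: c_jk = sqrt(n_k / n) * (xbar_kj - xbar_j)
C = np.column_stack(
    [np.sqrt((y == k).mean()) * (X[y == k].mean(axis=0) - g) for k in classes]
)

# Symmetric K x K eigenproblem on C^T V_b^{-1} C
VinvC = np.linalg.solve(V, C)
lam, b = np.linalg.eigh(C.T @ VinvC)
order = np.argsort(lam)[::-1][:H]              # keep the H largest eigenvalues
lam, b = lam[order], b[:, order]

# Raw coefficients (one column per axis) and intercepts
scale = np.sqrt((n - K) / (n * lam * (1 - lam)))
A = (VinvC @ b) * scale                        # A[j, h] = a_hj
a0 = -(g @ A)                                  # a_h0 = -sum_j a_hj * xbar_j

Z = a0 + X @ A                                 # canonical scores z_h(i)
```

With this scaling the canonical scores are centered, and their pooled within-class variance (computed with the divisor \(n-K\)) equals one on each axis, which gives a convenient sanity check.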

Supervised Aspect#

The supervised learning aspect of CANDISC has to do with the question: how do we use it for classification purposes? This involves establishing a decision rule that lets us predict the class of an object. CANDISC proposes a geometric rule of classification.

Distance behind CANDISC#

The squared Euclidean distance between two vectors \(x_{i}\) and \(x_{i^{'}}\) is defined as:

\[d_{E}^{2}\left(i,i^{'}\right) = \left(x_{i}-x_{i^{'}}\right)^{T}\left(x_{i}-x_{i^{'}}\right)\]

The squared Euclidean distance between the vector \(x_{i}\) and the coordinates of the centroids \(g_{k}\) is defined as:

\[d_{E}^{2}\left(i,\mathcal{C}_{k}\right) = \left(x_{i} - g_{k}\right)^{T}\left(x_{i} - g_{k}\right)\]

The generalized squared distance between the vector \(x_{i}\) and the coordinates of the centroids \(g_{k}\) is defined as:

\[d_{M}^{2}\left(i,\mathcal{C}_{k}\right) = d_{E}^{2}\left(i,\mathcal{C}_{k}\right) - 2 \times \log \left(\widehat{\pi}_{k}\right)\]

where \(\widehat{\pi}_{k}\) is the prior probability of class \(\mathcal{C}_{k}\).
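A minimal sketch of the generalized squared distance is given below. It assumes that the distance is evaluated on canonical scores, as the classification-function derivation later in this section does; the function name and the numeric values are hypothetical.

```python
import numpy as np

def generalized_sq_distance(Z, centroids, priors):
    """Return d_M^2(i, C_k) for every observation i (rows) and class k (columns)."""
    diff = Z[:, None, :] - centroids[None, :, :]    # shape (n, K, H)
    d2_euclid = np.sum(diff ** 2, axis=2)           # squared Euclidean distances
    return d2_euclid - 2.0 * np.log(priors)         # prior correction per class

# Hypothetical scores, class centroids and priors
Z = np.array([[0.0, 0.0], [2.0, 1.0]])
centroids = np.array([[0.0, 0.0], [3.0, 0.0]])
priors = np.array([0.5, 0.5])

D = generalized_sq_distance(Z, centroids, priors)
nearest = D.argmin(axis=1)                          # index of the closest class
```

With equal priors the correction term is the same for every class, so the ranking of classes is the one given by the Euclidean distance alone.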

Predictive idea#

The classification rule used in CANDISC consists of assigning each individual \(x_{i}\) to the class \(\mathcal{C}_{k}\) for which the generalized distance to the centroid is minimal. To obtain estimated probabilities, we apply a variant of the softmax transformation to these distances:

\[\mathbb{P}\left(Y=y_{k} \mid X = x_{i}\right) = \dfrac{e^{-0.5\times d_{M}^{2}\left(i,\mathcal{C}_{k}\right)}}{\displaystyle \sum_{c=1}^{c=K}e^{-0.5\times d_{M}^{2}\left(i,\mathcal{C}_{c}\right)}}\]
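The transformation can be sketched as a row-wise softmax of \(-0.5 \times d_{M}^{2}\). The function name and the distance values below are hypothetical, and subtracting the row maximum before exponentiating is a numerical-stability choice made for this example, not something stated in the source.

```python
import numpy as np

def posterior_from_distances(D):
    """Row-wise softmax of -0.5 * d_M^2 (rows: observations, columns: classes)."""
    scores = -0.5 * D
    scores = scores - scores.max(axis=1, keepdims=True)   # avoid overflow in exp
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical generalized squared distances for 2 observations and 3 classes
D = np.array([[1.0, 9.0, 4.0],
              [6.0, 2.0, 2.0]])

P = posterior_from_distances(D)
pred = P.argmax(axis=1)        # the most probable class is also the closest one
```

Because the exponential is monotone, the class with the highest estimated probability is always the one with the smallest generalized distance.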

Classification Functions Coefficients#

From the generalized squared distance, it is possible to deduce the classification function coefficients:

\[\begin{split} \begin{eqnarray} -\dfrac{1}{2}d_{M}^{2}\left(i,\mathcal{C}_{k}\right) & = & - \dfrac{1}{2}\displaystyle \sum_{h=1}^{h=H}\left(z_{h}(i) - \overline{z}_{kh}\right)^{2}+\log \left(\widehat{\pi}_{k}\right)\\ & = & -\dfrac{1}{2}\displaystyle \sum_{h=1}^{h=H}z_{h}(i)^{2} -\dfrac{1}{2}\displaystyle \sum_{h=1}^{h=H}\overline{z}_{kh}^{2} + \displaystyle \sum_{h=1}^{h=H}z_{h}(i)\overline{z}_{kh} + \log \left(\widehat{\pi}_{k}\right) \end{eqnarray}\end{split}\]

Since each canonical discriminant function is a linear function of the original variables:

\[z_{h}(x) = a_{h0}+a_{h1}x_{1}+a_{h2}x_{2}+\cdots+a_{hp}x_{p}\]

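and since the term \(-\dfrac{1}{2}\displaystyle \sum_{h=1}^{h=H}z_{h}(i)^{2}\) does not depend on the class \(k\) and can therefore be dropped, we can deduce the linear expression of the classification function for CANDISC: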

\[S\left(y_{k},i\right) = \beta_{k0} + \beta_{k1}x_{1} + \beta_{k2}x_{2} + \cdots + \beta_{kp}x_{p}\]

where \(\beta_{k0} = \log \left(\widehat{\pi}_{k}\right) + \displaystyle \sum_{h=1}^{h=H}a_{h0}\overline{z}_{kh} -\dfrac{1}{2}\displaystyle \sum_{h=1}^{h=H}\overline{z}_{kh}^{2}\) and \(\beta_{kj} = \displaystyle \sum_{h=1}^{h=H}a_{hj}\overline{z}_{kh}\).
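These expressions translate directly into matrix products. The sketch below uses a hypothetical function name and made-up coefficients, centroids and priors; it then cross-checks the scores against \(-\dfrac{1}{2}d_{M}^{2}\), from which they should differ only by a term that is constant across classes.

```python
import numpy as np

def classification_coefficients(A, a0, zbar, priors):
    """Return (beta0, beta): beta0 has shape (K,), beta has shape (K, p).

    A[j, h] holds a_hj, a0[h] holds a_h0 and zbar[k, h] holds zbar_kh.
    """
    beta = zbar @ A.T                                   # beta_kj = sum_h a_hj * zbar_kh
    beta0 = np.log(priors) + zbar @ a0 - 0.5 * np.sum(zbar ** 2, axis=1)
    return beta0, beta

# Hypothetical values: p = 2 variables, H = 2 axes, K = 3 classes
A = np.array([[0.8, -0.3], [0.2, 0.9]])
a0 = np.array([-1.0, 0.5])
zbar = np.array([[-1.5, 0.2], [0.4, -1.0], [1.1, 0.8]])
priors = np.array([0.3, 0.3, 0.4])
X = np.array([[1.0, 2.0], [0.0, -1.0]])

beta0, beta = classification_coefficients(A, a0, zbar, priors)
S = beta0 + X @ beta.T                                  # scores S(y_k, i), shape (n, K)

# Cross-check against the generalized squared distance on canonical scores
Z = a0 + X @ A
D = ((Z[:, None, :] - zbar[None, :, :]) ** 2).sum(axis=2) - 2.0 * np.log(priors)
gap = S + 0.5 * D                                       # equals 0.5 * sum_h z_h(i)^2
```

Since `gap` is the same for every class of a given observation, choosing the class with the largest score `S` is equivalent to choosing the class with the smallest generalized distance.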

For more details about canonical discriminant analysis, see SAS CANDISC Procedure.

Footnotes

References