CPLS#

Overview#

Partial least squares for classification (CPLS) is dedicated to binary problems, i.e. the target variable must take exactly two values.

Description of the method#

The target variable is replaced by a continuous variable using a specific coding [1], and PLS regression (PLSR) is then applied to this recoded variable.

Let \(y\) be the target variable, with \(y \in \left\{+,-\right\}\). Let \(n_{+}\) (resp. \(n_{-}\)) be the number of positive (resp. negative) observations in the sample, and \(n = n_{+} + n_{-}\). The variable \(Z\) is defined as follows:

\[\begin{split}Z = \begin{cases} \dfrac{n_{-}}{n} & \text{if}\quad y = + \\ -\dfrac{n_{+}}{n} & \text{if}\quad y = -\end{cases}\end{split}\]
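As an illustration, the coding above can be sketched with NumPy. The function name `encode_target` and its `positive_label` argument are hypothetical and are not taken from the CPLS procedure itself.

```python
import numpy as np

def encode_target(y, positive_label):
    """Replace a binary target by the continuous code Z (illustrative helper)."""
    y = np.asarray(y)
    is_pos = (y == positive_label)
    n = y.size
    n_pos = int(is_pos.sum())
    n_neg = n - n_pos
    # Z = n_- / n for positive observations, -n_+ / n for negative ones
    return np.where(is_pos, n_neg / n, -n_pos / n)

y = np.array(["+", "-", "-", "+", "-"])
z = encode_target(y, positive_label="+")
# With 2 positives and 3 negatives, positives are coded 3/5 and negatives -2/5
```

A consequence of this coding is that \(Z\) has mean zero, since \(n_{+}\dfrac{n_{-}}{n} - n_{-}\dfrac{n_{+}}{n} = 0\).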

This approach produces one discriminant function \(D(X)\) such that:

\[D(X) = \beta^{T}X\]

Predictive idea#

The classification rule used in CPLS assigns each individual \(i\) to a class in \(\{+,-\}\) as follows:

\[\begin{split}\widehat{y} = \begin{cases} + & \text{if} \quad D_{i}(X) \geq 0 \\ - & \text{if} \quad D_{i}(X) < 0\end{cases}\end{split}\]
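The rule can be sketched as a thresholding of the discriminant scores at zero. The helper `predict_labels` is hypothetical, and its optional `intercept` argument is an assumption added for predictors that are not centered; the formula above has no intercept.

```python
import numpy as np

def predict_labels(X, beta, intercept=0.0, positive_label="+", negative_label="-"):
    """Assign each row of X to a class from the sign of D(X) (illustrative helper)."""
    # D_i(X) = beta^T x_i, plus an optional intercept (assumption)
    scores = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float) + intercept
    # Scores >= 0 go to the positive class, scores < 0 to the negative class
    return np.where(scores >= 0, positive_label, negative_label)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]])
beta = np.array([0.5, -0.25])
labels = predict_labels(X, beta)
```

Note that a score of exactly zero is assigned to the positive class, as in the rule above.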

Number of components#

In the CPLS procedure, the number of components can be specified explicitly with the parameter n_components for the NIPALS [2] algorithm.
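To show what the number of components controls, here is a minimal sketch of NIPALS for a single response (PLS1). It is not the CPLS implementation: the function name `nipals_pls1` is hypothetical, and the sketch assumes that \(X\) and \(Z\) are only centered (no scaling).

```python
import numpy as np

def nipals_pls1(X, z, n_components):
    """Minimal PLS1 via NIPALS (sketch; centers X and z, no scaling)."""
    Xk = np.asarray(X, dtype=float)
    zk = np.asarray(z, dtype=float)
    Xk = Xk - Xk.mean(axis=0)
    zk = zk - zk.mean()
    n, p = Xk.shape
    W = np.zeros((p, n_components))  # x-weights
    P = np.zeros((p, n_components))  # x-loadings
    T = np.zeros((n, n_components))  # scores
    q = np.zeros(n_components)       # z-loadings
    for h in range(n_components):
        w = Xk.T @ zk                # direction of maximal covariance with z
        w = w / np.linalg.norm(w)
        t = Xk @ w                   # score vector t_h
        tt = t @ t
        p_h = Xk.T @ t / tt          # x-loading p_h
        q[h] = zk @ t / tt           # z-loading q_h
        Xk = Xk - np.outer(t, p_h)   # deflate X
        zk = zk - q[h] * t           # deflate z
        W[:, h], P[:, h], T[:, h] = w, p_h, t
    return W, P, T, q

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
z = rng.normal(size=30)
W, P, T, q = nipals_pls1(X, z, n_components=2)
```

Each additional component extracts one more score vector from the deflated data, so the score vectors are mutually orthogonal.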

VIP#

You can use VIP (variable importance in the projection) to select predictor variables when multicollinearity exists among variables. The VIP coefficients reflect the relative importance of each predictor for the selected components.

Description#

The VIP for a feature \(j\) in a CPLS model with \(H\) components is given by:

\[VIP_{j} = \sqrt{\dfrac{p}{\displaystyle \sum_{h=1}^{h=H}R^{2}\left(y,t_{h}\right)}\displaystyle \sum_{h=1}^{h=H}R^{2}\left(y,t_{h}\right) w_{j,h}^{2}}\]

where \(R^{2}\left(y,t_{h}\right)\) is the squared correlation coefficient between \(y\) and the score vector \(t_{h}\), \(w_{j,h}\) is the \(x\)-weight of feature \(j\) on component \(h\), and \(p\) is the number of features.

Variables with a VIP score greater than \(1\) (the default threshold in the CPLS procedure) are considered important for the projection of the PLS regression.

Note

These selection rules must be used with caution because the VIP reflects only the importance of the input variables relative to each other. A low VIP does not mean that a variable is irrelevant for the classification.
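The VIP formula can be sketched as follows. The function name `vip_scores` is hypothetical, and the sketch assumes that each weight vector \(w_{h}\) is normalised to unit length, which the formula does not state explicitly.

```python
import numpy as np

def vip_scores(T, W, y):
    """VIP of each feature from scores T (n, H), x-weights W (p, H) and target y (n,)."""
    T = np.asarray(T, dtype=float)
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    p, H = W.shape
    # R^2(y, t_h): squared Pearson correlation between y and each score vector
    r2 = np.array([np.corrcoef(y, T[:, h])[0, 1] ** 2 for h in range(H)])
    # Assumption: weight vectors are rescaled to unit norm before use
    Wn = W / np.linalg.norm(W, axis=0)
    return np.sqrt(p * ((Wn ** 2) @ r2) / r2.sum())

rng = np.random.default_rng(0)
T = rng.normal(size=(20, 2))
W = rng.normal(size=(4, 2))
y = rng.normal(size=20)
vip = vip_scores(T, W, y)
important = vip > 1.0  # threshold of 1 used for selection
```

Under the unit-norm assumption, the squared VIP values sum to \(p\), so their average is \(1\). This is one way to read the threshold of \(1\): it separates features with above-average importance from the others.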

Coefficients#

Coefficients are the parameters in a regression equation. The estimated coefficients are used with the predictors to calculate the fitted value of the response variable and the predicted response of new observations. In contrast to least squares, the PLS coefficients are nonlinear estimators. Standardized coefficients indicate the importance of each predictor in the model and correspond to the standardized \(x\)- and \(Z\)-variables. In PLS, the coefficient vector of shape \((p,)\) is calculated from the weights and loadings.

The formula for standardized coefficients is:

\[\beta^{std} = W\left(P^{T}W\right)^{-1}Q^{T}\]

To calculate the nonstandardized coefficients and intercept, use these formulas:

\[\begin{split}\beta_{j} & = \beta_{j}^{std} \dfrac{\sigma_{Z}}{\sigma_{j}} \\ \beta_{0} & = \mu_{Z} - \displaystyle \sum_{j} \mu_{j}\beta_{j}\end{split}\]

where:

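Terms	Description
\(W\)	the \(x\)-weight matrix
\(P\)	the \(x\)-loading matrix
\(Q\)	the \(Z\)-loading matrix
\(\sigma_{Z}\), \(\sigma_{j}\)	the standard deviations of \(Z\) and of feature \(j\)
\(\mu_{Z}\), \(\mu_{j}\)	the means of \(Z\) and of feature \(j\)
\(j\)	the index of a feature, \(j = 1, \dots, p\)
\(p\)	the number of features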
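The coefficient formulas above can be sketched with NumPy. The function `pls_coefficients` and its argument names are hypothetical; the sketch assumes a single response, so that \(Q\) has one \(Z\)-loading per component.

```python
import numpy as np

def pls_coefficients(W, P, Q, x_mean, x_std, z_mean, z_std):
    """Standardized and nonstandardized coefficients from weights and loadings."""
    W = np.asarray(W, dtype=float)
    P = np.asarray(P, dtype=float)
    q = np.asarray(Q, dtype=float).reshape(-1)  # Q^T as a vector of length H
    # beta_std = W (P^T W)^{-1} Q^T, solved without forming the inverse
    beta_std = W @ np.linalg.solve(P.T @ W, q)
    # beta_j = beta_std_j * sigma_Z / sigma_j
    beta = beta_std * z_std / np.asarray(x_std, dtype=float)
    # beta_0 = mu_Z - sum_j mu_j * beta_j
    intercept = z_mean - np.sum(np.asarray(x_mean, dtype=float) * beta)
    return beta_std, beta, intercept

# Toy inputs chosen so the result is easy to check by hand
W = np.eye(2)
P = np.eye(2)
Q = np.array([[1.0, 2.0]])
beta_std, beta, b0 = pls_coefficients(
    W, P, Q,
    x_mean=np.array([1.0, 3.0]), x_std=np.array([2.0, 4.0]),
    z_mean=5.0, z_std=2.0,
)
```

With these toy values, the standardized coefficients are \((1, 2)\), the nonstandardized coefficients are \((1, 1)\) and the intercept is \(5 - (1 + 3) = 1\).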

Explained variance of \(X\)#

The explained variance ratio is defined by the following formula:

\[\text{Explained variance ratio} = \dfrac{\text{Variance explained by component}}{\text{Total variance}}\]

For component \(h\), with score vector \(t_{h}\) and \(x\)-loading vector \(p_{h}\), this is equal to:

\[\text{Explained variance ratio}(h) = \dfrac{\lvert \lvert t_{h}p_{h}^{T}\rvert \rvert_{F}^{2}}{\lvert \lvert X\rvert \rvert_{F}^{2}}\]
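A short sketch of this ratio, using the identity \(\lvert \lvert t_{h}p_{h}^{T}\rvert \rvert_{F}^{2} = \lvert \lvert t_{h}\rvert \rvert^{2}\,\lvert \lvert p_{h}\rvert \rvert^{2}\) for a rank-one matrix. The function name `explained_variance_ratio` is hypothetical, and \(X\) is assumed to be the preprocessed (for example centered) matrix on which the model was fitted.

```python
import numpy as np

def explained_variance_ratio(X, T, P):
    """Share of ||X||_F^2 reproduced by each component t_h p_h^T."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    P = np.asarray(P, dtype=float)
    total = np.linalg.norm(X, "fro") ** 2
    # ||t_h p_h^T||_F^2 = ||t_h||^2 * ||p_h||^2, computed column by column
    per_component = (T ** 2).sum(axis=0) * (P ** 2).sum(axis=0)
    return per_component / total

# Toy example: X is built from two components with orthogonal scores
T = np.array([[1.0, 1.0], [1.0, -1.0]])
P = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
X = T @ P.T
ratio = explained_variance_ratio(X, T, P)
```

In this toy example the two components account for 80% and 20% of the total variation of \(X\).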