STEPDISC CANDISC - oliveoil dataset#

[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

oliveoil dataset#

[2]:
#oliveoil dataset
from discrimintools.datasets import load_oliveoil
D = load_oliveoil("train")
D.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   CLASSE       569 non-null    object
 1   palmitic     569 non-null    int64
 2   palmitoleic  569 non-null    int64
 3   stearic      569 non-null    int64
 4   oleic        569 non-null    int64
 5   linoleic     569 non-null    int64
 6   linolenic    569 non-null    int64
 7   arachidic    569 non-null    int64
 8   eicosenoic   569 non-null    int64
dtypes: int64(8), object(1)
memory usage: 40.1+ KB

Forward selection#

[3]:
from discrimintools import CANDISC, STEPDISC
#split into X and y
y, X = D["CLASSE"], D.drop(columns=["CLASSE"])
clf = CANDISC(n_components=2).fit(X,y)
clf2 = STEPDISC(method="forward",alpha=0.01,verbose=True)
clf2.fit(clf)

====================== Step 1 forward selection results =======================
             Wilks' Lambda  Partial R-Square      F Value  Num DF  Den DF  \
palmitic          0.538509          0.461491   242.524854       2     566
palmitoleic       0.604905          0.395095   184.841942       2     566
stearic           0.998272          0.001728     0.489942       2     566
oleic             0.473479          0.526521   314.703134       2     566
linoleic          0.550371          0.449629   231.198312       2     566
linolenic         0.687722          0.312278   128.503464       2     566
arachidic         0.662890          0.337110   143.918675       2     566
eicosenoic        0.202071          0.797929  1117.498522       2     566

                      Pr>F
palmitic      8.465810e-77
palmitoleic   1.650063e-62
stearic       6.129213e-01
oleic         1.288711e-92
linoleic      4.032628e-74
linolenic     9.724383e-47
arachidic     2.936859e-51
eicosenoic   2.867939e-197

Variable eicosenoic will enter


====================== Step 2 forward selection results =======================
             Wilks' Lambda  Partial R-Square     F Value  Num DF  Den DF  \
palmitic          0.130129          0.356025  156.181332       2     565
palmitoleic       0.123582          0.388421  179.418881       2     565
stearic           0.184593          0.086494   26.748148       2     565
oleic             0.102388          0.493307  275.036355       2     565
linoleic          0.094108          0.534283  324.091029       2     565
linolenic         0.195821          0.030927    9.015798       2     565
arachidic         0.139761          0.308355  125.946400       2     565

                     Pr>F
palmitic     1.012924e-54
palmitoleic  4.708670e-61
stearic      7.960766e-12
oleic        3.895257e-84
linoleic     1.756295e-94
linolenic    1.398527e-04
arachidic    5.848639e-46

Variable linoleic will enter


====================== Step 3 forward selection results =======================
             Wilks' Lambda  Partial R-Square     F Value  Num DF  Den DF  \
palmitic          0.064624          0.313297  128.658103       2     564
palmitoleic       0.054167          0.424414  207.935167       2     564
stearic           0.088816          0.056230   16.801620       2     564
oleic             0.070818          0.247485   92.743401       2     564
linolenic         0.078805          0.162609   54.760308       2     564
arachidic         0.064452          0.315126  129.754781       2     564

                     Pr>F
palmitic     9.306291e-47
palmitoleic  2.244706e-68
stearic      8.170662e-08
oleic        1.504050e-35
linolenic    1.843977e-22
arachidic    4.386814e-47

Variable palmitoleic will enter


====================== Step 4 forward selection results =======================
           Wilks' Lambda  Partial R-Square     F Value  Num DF  Den DF  \
palmitic        0.051660          0.046294   13.664275       2     563
stearic         0.053220          0.017479    5.007993       2     563
oleic           0.051666          0.046172   13.626478       2     563
linolenic       0.045867          0.153235   50.941613       2     563
arachidic       0.039318          0.274139  106.315330       2     563

                   Pr>F
palmitic   1.604028e-06
stearic    6.985163e-03
oleic      1.662907e-06
linolenic  4.626894e-21
arachidic  6.764564e-40

Variable arachidic will enter


====================== Step 5 forward selection results =======================
           Wilks' Lambda  Partial R-Square    F Value  Num DF  Den DF  \
palmitic        0.036354          0.075390  22.912027       2     562
stearic         0.038407          0.023180   6.668090       2     562
oleic           0.037676          0.041772  12.249610       2     562
linolenic       0.034623          0.119397  38.099399       2     562

                   Pr>F
palmitic   2.718440e-10
stearic    1.373760e-03
oleic      6.205177e-06
linolenic  3.042803e-16

Variable linolenic will enter


====================== Step 6 forward selection results =======================
          Wilks' Lambda  Partial R-Square    F Value  Num DF  Den DF      Pr>F
palmitic       0.032986          0.047282  13.920669       2     561  0.000001
stearic        0.034059          0.016309   4.650441       2     561  0.009929
oleic          0.034059          0.016292   4.645700       2     561  0.009975

Variable palmitic will enter


====================== Step 7 forward selection results =======================
         Wilks' Lambda  Partial R-Square   F Value  Num DF  Den DF      Pr>F
stearic       0.032434          0.016756  4.771705       2     560  0.008813
oleic         0.032173          0.024665  7.080984       2     560  0.000918

Variable oleic will enter


====================== Step 8 forward selection results =======================
         Wilks' Lambda  Partial R-Square   F Value  Num DF  Den DF     Pr>F
stearic        0.03196           0.00662  1.862669       2     559  0.15622

No variable can enter

[3]:
STEPDISC()
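
The Partial R-Square and F Value columns in the step tables above can be checked by hand from the Wilks' Lambda column. The following is a minimal sketch, assuming the usual stepwise-discriminant formulas: partial R² = 1 − Λ_new / Λ_prev, Num DF = K − 1 and Den DF = n − K − p, where n is the sample size, K the number of classes and p the number of variables already in the model. The function name `f_to_enter` is hypothetical and is not part of discrimintools.

```python
def f_to_enter(lambda_prev, lambda_new, n_samples, n_classes, n_selected):
    """Partial R-square and F-to-enter for one candidate variable.

    lambda_prev: Wilks' Lambda of the current model (1.0 when it is empty)
    lambda_new:  Wilks' Lambda after adding the candidate variable
    n_selected:  number of variables already in the model
    """
    num_df = n_classes - 1
    den_df = n_samples - n_classes - n_selected
    # Share of the remaining within-class variation explained by the candidate
    partial_r2 = 1.0 - lambda_new / lambda_prev
    f_value = (partial_r2 / (1.0 - partial_r2)) * den_df / num_df
    return partial_r2, f_value, num_df, den_df


# Step 1: eicosenoic enters an empty model (values copied from the tables above)
step1 = f_to_enter(1.0, 0.202071, n_samples=569, n_classes=3, n_selected=0)
# Step 2: linoleic enters a model that already contains eicosenoic
step2 = f_to_enter(0.202071, 0.094108, n_samples=569, n_classes=3, n_selected=1)
print(step1)
print(step2)
```

Both calls agree with the printed tables up to the rounding of the Wilks' Lambda inputs. The Pr>F column is then the upper tail of an F distribution with (Num DF, Den DF) degrees of freedom, which could be obtained with `scipy.stats.f.sf(f_value, num_df, den_df)`.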

Selected variables#

[4]:
#selected variables
print(clf2.summary_.selected)
['eicosenoic', 'linoleic', 'palmitoleic', 'arachidic', 'linolenic', 'palmitic', 'oleic']

Summary#

[5]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)
                     Stepwise Discriminant Analysis - Results

====================== Before forward selection  =======================

                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic         21.7467    27.0103    38.0373
arachidic         37.5467    73.0000    63.1025
eicosenoic         1.9733     1.9278    27.3323

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      8.4718      6.1684     78.6232     78.6232
Can2      2.3034         NaN     21.3768    100.0000

Raw Canonical and Classification Functions Coefficients:
                Can1     Can2  Centre_North  Sardinia    South
Constant    -13.0646 -56.9169      -70.0899  194.6549 -37.1812
palmitic      0.0028   0.0089        0.0072   -0.0344   0.0070
palmitoleic   0.0131   0.0184       -0.0095   -0.0959   0.0333
stearic      -0.0028   0.0043        0.0171   -0.0029  -0.0071
oleic         0.0006   0.0062        0.0094   -0.0199   0.0016
linoleic      0.0011  -0.0013       -0.0061   -0.0001   0.0029
linolenic     0.0411   0.0058       -0.1257   -0.1523   0.1045
arachidic    -0.0173  -0.0347       -0.0063    0.1565  -0.0442
eicosenoic    0.1631   0.0101       -0.5231   -0.5673   0.4146

====================== After forward selection  =======================

                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      7   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
eicosenoic         1.9733     1.9278    27.3323
linoleic         727.8800  1197.3608  1034.0093
palmitoleic       83.8933    96.3505   154.8882
arachidic         37.5467    73.0000    63.1025
linolenic         21.7467    27.0103    38.0373
palmitic        1094.8333  1112.0619  1332.3696
oleic           7791.9733  7266.9072  7099.5311

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      8.4496      6.1603     78.6823     78.6823
Can2      2.2893         NaN     21.3177    100.0000

Raw Canonical and Classification Functions Coefficients:
                Can1     Can2  Centre_North  Sardinia    South
Constant    -32.1378 -27.9172       46.3443  174.6732 -85.3694
eicosenoic    0.1649   0.0067       -0.5330   -0.5656   0.4187
linoleic      0.0029  -0.0040       -0.0171    0.0018   0.0074
palmitoleic   0.0150   0.0155       -0.0210   -0.0940   0.0381
arachidic    -0.0157  -0.0374       -0.0170    0.1583  -0.0398
linolenic     0.0435   0.0020       -0.1403   -0.1498   0.1105
palmitic      0.0048   0.0058       -0.0050   -0.0324   0.0121
oleic         0.0025   0.0034       -0.0021   -0.0179   0.0064
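
The "Importance of components" tables can be read as simple arithmetic on the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues, expressed as a percentage. The sketch below reproduces the Difference, Proportion and Cumulative columns from the printed eigenvalues. The helper name `importance` is hypothetical and is not a discrimintools function.

```python
def importance(eigenvalues):
    """Return (differences, proportions, cumulative) for a list of eigenvalues."""
    total = sum(eigenvalues)
    # Percentage of the total discriminating power carried by each axis
    proportions = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    # Gap to the next eigenvalue; undefined (NaN) for the last one
    differences = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
    differences.append(float("nan"))
    return differences, proportions, cumulative


# Eigenvalues copied from the summaries above
before = importance([8.4718, 2.3034])  # all 8 variables
after = importance([8.4496, 2.2893])   # 7 selected variables
print(before)
print(after)
```

Dropping stearic barely changes the eigenvalues (8.4718 to 8.4496 on the first axis), which is consistent with its non-significant F value at step 8.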

Evaluation of prediction on testing dataset#

[6]:
#testing data
DTest = load_oliveoil("test")
#split into X and y
yTest, XTest = DTest["CLASSE"], DTest.drop(columns=["CLASSE"])
#evaluation of prediction on testing dataset
eval_test = clf2.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations     3     3

Number of Observations Classified into CLASSE:
prediction    Centre_North  Sardinia  South  Total
CLASSE
Centre_North             1         0      0      1
Sardinia                 0         1      0      1
South                    0         0      1      1
Total                    1         1      1      3

Percent Classified into CLASSE:
prediction    Centre_North    Sardinia       South  Total
CLASSE
Centre_North    100.000000    0.000000    0.000000  100.0
Sardinia          0.000000  100.000000    0.000000  100.0
South             0.000000    0.000000  100.000000  100.0
Total            33.333333   33.333333   33.333333  100.0
Priors            0.263620    0.170475    0.565905    NaN

Error Count Estimates for CLASSE:
        Centre_North  Sardinia     South  Total
Rate         0.00000  0.000000  0.000000    0.0
Priors       0.26362  0.170475  0.565905    NaN

Classification Report for CLASSE:
              precision  recall  f1-score  support
Centre_North        1.0     1.0       1.0      1.0
Sardinia            1.0     1.0       1.0      1.0
South               1.0     1.0       1.0      1.0
accuracy            1.0     1.0       1.0      1.0
macro avg           1.0     1.0       1.0      3.0
weighted avg        1.0     1.0       1.0      3.0
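
The Priors row and the error-rate table above can be reproduced from the class frequencies and the confusion matrix. This is a rough sketch only: it assumes that priors are the training-set class proportions (as the Class Level Information table suggests) and that the Total error rate is the prior-weighted sum of the per-class rates, which is the SAS PROC DISCRIM convention and may not be exactly how discrimintools computes it. The helper names are hypothetical.

```python
def priors_from_counts(counts):
    """Class proportions from a {class: frequency} mapping."""
    n = sum(counts.values())
    return {label: c / n for label, c in counts.items()}


def error_rates(confusion, priors):
    """Per-class error rates and a prior-weighted total (assumed convention).

    confusion[true_label][predicted_label] holds observation counts.
    """
    rates = {}
    for true_label, row in confusion.items():
        row_total = sum(row.values())
        rates[true_label] = 1.0 - row.get(true_label, 0) / row_total
    total = sum(priors[label] * rate for label, rate in rates.items())
    return rates, total


# Training frequencies and test confusion matrix copied from the output above
priors = priors_from_counts({"Centre_North": 150, "Sardinia": 97, "South": 322})
confusion = {
    "Centre_North": {"Centre_North": 1, "Sardinia": 0, "South": 0},
    "Sardinia": {"Centre_North": 0, "Sardinia": 1, "South": 0},
    "South": {"Centre_North": 0, "Sardinia": 0, "South": 1},
}
rates, total = error_rates(confusion, priors)
print(priors)
print(rates, total)
```

Because the test split holds only 3 observations, one per class, the zero error rate here says little about how the classifier generalizes.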

Backward selection#

[7]:
#backward selection
clf2 = STEPDISC(method="backward",alpha=0.01,verbose=True)
clf2.fit(clf)

====================== Step 1 backward selection results =======================
         Wilks' Lambda  Partial R-Square      F Value  Num DF  Den DF  Pr>F
stearic            1.0           0.96804  8465.862059       2     559   0.0

No variable can be removed


Since only one feature is selected, CANDISC procedure cannot be updated.
[7]:
STEPDISC(method='backward')

Selected variables#

[8]:
#selected variables
print(clf2.summary_.selected)
['stearic']

Summary#

[9]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)
                     Stepwise Discriminant Analysis - Results

====================== Before backward selection  =======================

                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic         21.7467    27.0103    38.0373
arachidic         37.5467    73.0000    63.1025
eicosenoic         1.9733     1.9278    27.3323

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      8.4718      6.1684     78.6232     78.6232
Can2      2.3034         NaN     21.3768    100.0000

Raw Canonical and Classification Functions Coefficients:
                Can1     Can2  Centre_North  Sardinia    South
Constant    -13.0646 -56.9169      -70.0899  194.6549 -37.1812
palmitic      0.0028   0.0089        0.0072   -0.0344   0.0070
palmitoleic   0.0131   0.0184       -0.0095   -0.0959   0.0333
stearic      -0.0028   0.0043        0.0171   -0.0029  -0.0071
oleic         0.0006   0.0062        0.0094   -0.0199   0.0016
linoleic      0.0011  -0.0013       -0.0061   -0.0001   0.0029
linolenic     0.0411   0.0058       -0.1257   -0.1523   0.1045
arachidic    -0.0173  -0.0347       -0.0063    0.1565  -0.0442
eicosenoic    0.1631   0.0101       -0.5231   -0.5673   0.4146

No model has been updated.