STEPDISC CANDISC - oliveoil dataset#
[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")
oliveoil dataset#
[2]:
#oliveoil dataset
from discrimintools.datasets import load_oliveoil
D = load_oliveoil("train")
print(D.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CLASSE 569 non-null object
1 palmitic 569 non-null int64
2 palmitoleic 569 non-null int64
3 stearic 569 non-null int64
4 oleic 569 non-null int64
5 linoleic 569 non-null int64
6 linolenic 569 non-null int64
7 arachidic 569 non-null int64
8 eicosenoic 569 non-null int64
dtypes: int64(8), object(1)
memory usage: 40.1+ KB
None
Forward selection#
[3]:
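from discrimintools import CANDISC, STEPDISC
#split into X and y
y, X = D["CLASSE"], D.drop(columns=["CLASSE"])
#fit the canonical discriminant analysis on all 8 variables
clf = CANDISC(n_components=2).fit(X,y)
#run forward stepwise selection on the fitted CANDISC model
clf2 = STEPDISC(method="forward",alpha=0.01,verbose=True)
clf2.fit(clf)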
====================== Step 1 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF \
palmitic 0.538509 0.461491 242.524854 2 566
palmitoleic 0.604905 0.395095 184.841942 2 566
stearic 0.998272 0.001728 0.489942 2 566
oleic 0.473479 0.526521 314.703134 2 566
linoleic 0.550371 0.449629 231.198312 2 566
linolenic 0.687722 0.312278 128.503464 2 566
arachidic 0.662890 0.337110 143.918675 2 566
eicosenoic 0.202071 0.797929 1117.498522 2 566
Pr>F
palmitic 8.465810e-77
palmitoleic 1.650063e-62
stearic 6.129213e-01
oleic 1.288711e-92
linoleic 4.032628e-74
linolenic 9.724383e-47
arachidic 2.936859e-51
eicosenoic 2.867939e-197
Variable eicosenoic will enter
====================== Step 2 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF \
palmitic 0.130129 0.356025 156.181332 2 565
palmitoleic 0.123582 0.388421 179.418881 2 565
stearic 0.184593 0.086494 26.748148 2 565
oleic 0.102388 0.493307 275.036355 2 565
linoleic 0.094108 0.534283 324.091029 2 565
linolenic 0.195821 0.030927 9.015798 2 565
arachidic 0.139761 0.308355 125.946400 2 565
Pr>F
palmitic 1.012924e-54
palmitoleic 4.708670e-61
stearic 7.960766e-12
oleic 3.895257e-84
linoleic 1.756295e-94
linolenic 1.398527e-04
arachidic 5.848639e-46
Variable linoleic will enter
====================== Step 3 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF \
palmitic 0.064624 0.313297 128.658103 2 564
palmitoleic 0.054167 0.424414 207.935167 2 564
stearic 0.088816 0.056230 16.801620 2 564
oleic 0.070818 0.247485 92.743401 2 564
linolenic 0.078805 0.162609 54.760308 2 564
arachidic 0.064452 0.315126 129.754781 2 564
Pr>F
palmitic 9.306291e-47
palmitoleic 2.244706e-68
stearic 8.170662e-08
oleic 1.504050e-35
linolenic 1.843977e-22
arachidic 4.386814e-47
Variable palmitoleic will enter
====================== Step 4 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF \
palmitic 0.051660 0.046294 13.664275 2 563
stearic 0.053220 0.017479 5.007993 2 563
oleic 0.051666 0.046172 13.626478 2 563
linolenic 0.045867 0.153235 50.941613 2 563
arachidic 0.039318 0.274139 106.315330 2 563
Pr>F
palmitic 1.604028e-06
stearic 6.985163e-03
oleic 1.662907e-06
linolenic 4.626894e-21
arachidic 6.764564e-40
Variable arachidic will enter
====================== Step 5 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF \
palmitic 0.036354 0.075390 22.912027 2 562
stearic 0.038407 0.023180 6.668090 2 562
oleic 0.037676 0.041772 12.249610 2 562
linolenic 0.034623 0.119397 38.099399 2 562
Pr>F
palmitic 2.718440e-10
stearic 1.373760e-03
oleic 6.205177e-06
linolenic 3.042803e-16
Variable linolenic will enter
====================== Step 6 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF Pr>F
palmitic 0.032986 0.047282 13.920669 2 561 0.000001
stearic 0.034059 0.016309 4.650441 2 561 0.009929
oleic 0.034059 0.016292 4.645700 2 561 0.009975
Variable palmitic will enter
====================== Step 7 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF Pr>F
stearic 0.032434 0.016756 4.771705 2 560 0.008813
oleic 0.032173 0.024665 7.080984 2 560 0.000918
Variable oleic will enter
====================== Step 8 forward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF Pr>F
stearic 0.03196 0.00662 1.862669 2 559 0.15622
No variable can enter
[3]:
STEPDISC()
Parameters
| method | 'forward' |
| alpha | 0.01 |
| lambda_init | None |
| verbose | True |
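The columns of the step tables above are linked by simple identities that can be checked by hand. The sketch below is not discrimintools code: it recomputes Partial R-Square and F Value from the printed Wilks' Lambda values, assuming the usual partial-lambda relations (partial R-square = 1 - lambda_new / lambda_prev, and F = R2 / (1 - R2) * Den DF / Num DF). The helper name `partial_stats` is hypothetical.

```python
def partial_stats(wilks_new, wilks_prev, num_df, den_df):
    """Recompute partial R-square and F from two Wilks' Lambda values.

    wilks_prev is the lambda of the model before the candidate variable
    enters (1.0 for the empty model at step 1). This is an assumed
    relation, checked only against the printed tables.
    """
    partial_lambda = wilks_new / wilks_prev
    r2 = 1.0 - partial_lambda
    f_value = (r2 / partial_lambda) * (den_df / num_df)
    return r2, f_value

# Step 1, eicosenoic: no variable is in the model yet, so wilks_prev = 1.0
r2_1, f_1 = partial_stats(0.202071, 1.0, 2, 566)

# Step 2, linoleic: eicosenoic is already in the model (lambda = 0.202071)
r2_2, f_2 = partial_stats(0.094108, 0.202071, 2, 565)

print(round(r2_1, 4), round(f_1, 1))
print(round(r2_2, 4), round(f_2, 1))
```

Up to rounding of the printed lambdas, both results agree with the Partial R-Square and F Value columns for the variable that enters at steps 1 and 2.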
Selected variables#
[4]:
#selected variables
print(clf2.summary_.selected)
['eicosenoic', 'linoleic', 'palmitoleic', 'arachidic', 'linolenic', 'palmitic', 'oleic']
Summary#
[5]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)
Stepwise Discriminant Analysis - Results
====================== Before forward selection =======================
Canonical Discriminant Analysis - Results
Summary Information:
infos Value DF DF value
0 Total Sample Size 569 DF Total 568
1 Variables 8 DF Within Classes 566
2 Classes 3 DF Between Classes 2
Class Level Information:
Frequency Proportion Prior Probability
Centre_North 150 0.2636 0.2636
Sardinia 97 0.1705 0.1705
South 322 0.5659 0.5659
Total-Sample Class Means:
Centre_North Sardinia South
palmitic 1094.8333 1112.0619 1332.3696
palmitoleic 83.8933 96.3505 154.8882
stearic 231.0400 226.3505 228.7081
oleic 7791.9733 7266.9072 7099.5311
linoleic 727.8800 1197.3608 1034.0093
linolenic 21.7467 27.0103 38.0373
arachidic 37.5467 73.0000 63.1025
eicosenoic 1.9733 1.9278 27.3323
Importance of components:
Eigenvalue Difference Proportion Cumulative
Can1 8.4718 6.1684 78.6232 78.6232
Can2 2.3034 NaN 21.3768 100.0000
Raw Canonical and Classification Functions Coefficients:
Can1 Can2 Centre_North Sardinia South
Constant -13.0646 -56.9169 -70.0899 194.6549 -37.1812
palmitic 0.0028 0.0089 0.0072 -0.0344 0.0070
palmitoleic 0.0131 0.0184 -0.0095 -0.0959 0.0333
stearic -0.0028 0.0043 0.0171 -0.0029 -0.0071
oleic 0.0006 0.0062 0.0094 -0.0199 0.0016
linoleic 0.0011 -0.0013 -0.0061 -0.0001 0.0029
linolenic 0.0411 0.0058 -0.1257 -0.1523 0.1045
arachidic -0.0173 -0.0347 -0.0063 0.1565 -0.0442
eicosenoic 0.1631 0.0101 -0.5231 -0.5673 0.4146
====================== After forward selection =======================
Canonical Discriminant Analysis - Results
Summary Information:
infos Value DF DF value
0 Total Sample Size 569 DF Total 568
1 Variables 7 DF Within Classes 566
2 Classes 3 DF Between Classes 2
Class Level Information:
Frequency Proportion Prior Probability
Centre_North 150 0.2636 0.2636
Sardinia 97 0.1705 0.1705
South 322 0.5659 0.5659
Total-Sample Class Means:
Centre_North Sardinia South
eicosenoic 1.9733 1.9278 27.3323
linoleic 727.8800 1197.3608 1034.0093
palmitoleic 83.8933 96.3505 154.8882
arachidic 37.5467 73.0000 63.1025
linolenic 21.7467 27.0103 38.0373
palmitic 1094.8333 1112.0619 1332.3696
oleic 7791.9733 7266.9072 7099.5311
Importance of components:
Eigenvalue Difference Proportion Cumulative
Can1 8.4496 6.1603 78.6823 78.6823
Can2 2.2893 NaN 21.3177 100.0000
Raw Canonical and Classification Functions Coefficients:
Can1 Can2 Centre_North Sardinia South
Constant -32.1378 -27.9172 46.3443 174.6732 -85.3694
eicosenoic 0.1649 0.0067 -0.5330 -0.5656 0.4187
linoleic 0.0029 -0.0040 -0.0171 0.0018 0.0074
palmitoleic 0.0150 0.0155 -0.0210 -0.0940 0.0381
arachidic -0.0157 -0.0374 -0.0170 0.1583 -0.0398
linolenic 0.0435 0.0020 -0.1403 -0.1498 0.1105
palmitic 0.0048 0.0058 -0.0050 -0.0324 0.0121
oleic 0.0025 0.0034 -0.0021 -0.0179 0.0064
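The "Importance of components" table can be read as plain arithmetic on the eigenvalues. The following is a minimal sketch, independent of discrimintools, that recomputes the Difference, Proportion and Cumulative columns from the two eigenvalues printed after forward selection; it assumes the proportions are eigenvalue shares expressed in percent.

```python
# Eigenvalues printed in the "After forward selection" table
eigenvalues = {"Can1": 8.4496, "Can2": 2.2893}

total = sum(eigenvalues.values())
cumulative = 0.0
rows = []
for name, value in eigenvalues.items():
    # share of the total discriminating power, in percent
    proportion = 100.0 * value / total
    cumulative += proportion
    rows.append((name, round(proportion, 2), round(cumulative, 2)))

# gap between consecutive eigenvalues (undefined for the last component)
difference = eigenvalues["Can1"] - eigenvalues["Can2"]

for row in rows:
    print(row)
print(round(difference, 4))
```

Because the printed eigenvalues are already rounded to four decimals, the recomputed proportions can differ from the table in the last digit.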
Evaluation of prediction on testing dataset#
[6]:
#testing data
DTest = load_oliveoil("test")
#split into X and y
yTest, XTest = DTest["CLASSE"], DTest.drop(columns=["CLASSE"])
#evaluation of prediction on testing dataset
eval_test = clf2.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
Read Used
Number of Observations 3 3
Number of Observations Classified into CLASSE:
prediction Centre_North Sardinia South Total
CLASSE
Centre_North 1 0 0 1
Sardinia 0 1 0 1
South 0 0 1 1
Total 1 1 1 3
Percent Classified into CLASSE:
prediction Centre_North Sardinia South Total
CLASSE
Centre_North 100.000000 0.000000 0.000000 100.0
Sardinia 0.000000 100.000000 0.000000 100.0
South 0.000000 0.000000 100.000000 100.0
Total 33.333333 33.333333 33.333333 100.0
Priors 0.263620 0.170475 0.565905 NaN
Error Count Estimates for CLASSE:
Centre_North Sardinia South Total
Rate 0.00000 0.000000 0.000000 0.0
Priors 0.26362 0.170475 0.565905 NaN
Classification Report for CLASSE:
precision recall f1-score support
Centre_North 1.0 1.0 1.0 1.0
Sardinia 1.0 1.0 1.0 1.0
South 1.0 1.0 1.0 1.0
accuracy 1.0 1.0 1.0 1.0
macro avg 1.0 1.0 1.0 3.0
weighted avg 1.0 1.0 1.0 3.0
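The error estimates above follow directly from the classification counts. The sketch below is not discrimintools code: it recomputes the per-class error rates and the accuracy from the printed counts, and assumes (as in SAS-style output) that the total error rate is the prior-weighted sum of the per-class rates. Note that the test set here contains only 3 observations, one per class.

```python
# Counts from "Number of Observations Classified into CLASSE"
# (outer key: actual class, inner key: predicted class)
confusion = {
    "Centre_North": {"Centre_North": 1, "Sardinia": 0, "South": 0},
    "Sardinia": {"Centre_North": 0, "Sardinia": 1, "South": 0},
    "South": {"Centre_North": 0, "Sardinia": 0, "South": 1},
}
# Priors as printed in the output (training-set class proportions)
priors = {"Centre_North": 0.263620, "Sardinia": 0.170475, "South": 0.565905}

# per-class error rate: share of each actual class that was misclassified
rates = {}
for actual, row in confusion.items():
    n = sum(row.values())
    rates[actual] = (n - row[actual]) / n

# assumed convention: total error rate weighted by the priors
total_rate = sum(priors[k] * rates[k] for k in rates)

# overall accuracy: correctly classified observations over all observations
n_total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[k][k] for k in confusion) / n_total

print(rates, total_rate, accuracy)
```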
Backward selection#
[7]:
#backward selection
clf2 = STEPDISC(method="backward",alpha=0.01,verbose=True)
clf2.fit(clf)
====================== Step 1 backward selection results =======================
Wilks' Lambda Partial R-Square F Value Num DF Den DF Pr>F
stearic 1.0 0.96804 8465.862059 2 559 0.0
No variable can be removed
Since only one feature is selected, CANDISC procedure cannot be updated.
[7]:
STEPDISC(method='backward')
Parameters
| method | 'backward' |
| alpha | 0.01 |
| lambda_init | None |
| verbose | True |
Selected variables#
[8]:
#selected variables
print(clf2.summary_.selected)
['stearic']
Summary#
[9]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2)
Stepwise Discriminant Analysis - Results
====================== Before backward selection =======================
Canonical Discriminant Analysis - Results
Summary Information:
infos Value DF DF value
0 Total Sample Size 569 DF Total 568
1 Variables 8 DF Within Classes 566
2 Classes 3 DF Between Classes 2
Class Level Information:
Frequency Proportion Prior Probability
Centre_North 150 0.2636 0.2636
Sardinia 97 0.1705 0.1705
South 322 0.5659 0.5659
Total-Sample Class Means:
Centre_North Sardinia South
palmitic 1094.8333 1112.0619 1332.3696
palmitoleic 83.8933 96.3505 154.8882
stearic 231.0400 226.3505 228.7081
oleic 7791.9733 7266.9072 7099.5311
linoleic 727.8800 1197.3608 1034.0093
linolenic 21.7467 27.0103 38.0373
arachidic 37.5467 73.0000 63.1025
eicosenoic 1.9733 1.9278 27.3323
Importance of components:
Eigenvalue Difference Proportion Cumulative
Can1 8.4718 6.1684 78.6232 78.6232
Can2 2.3034 NaN 21.3768 100.0000
Raw Canonical and Classification Functions Coefficients:
Can1 Can2 Centre_North Sardinia South
Constant -13.0646 -56.9169 -70.0899 194.6549 -37.1812
palmitic 0.0028 0.0089 0.0072 -0.0344 0.0070
palmitoleic 0.0131 0.0184 -0.0095 -0.0959 0.0333
stearic -0.0028 0.0043 0.0171 -0.0029 -0.0071
oleic 0.0006 0.0062 0.0094 -0.0199 0.0016
linoleic 0.0011 -0.0013 -0.0061 -0.0001 0.0029
linolenic 0.0411 0.0058 -0.1257 -0.1523 0.1045
arachidic -0.0173 -0.0347 -0.0063 0.1565 -0.0442
eicosenoic 0.1631 0.0101 -0.5231 -0.5673 0.4146
No model has been updated.