CANDISC - oliveoil dataset#

[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

oliveoil dataset#

[2]:
#vins dataset
from discrimintools.datasets import load_oliveoil
D = load_oliveoil()
print(D.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   CLASSE       569 non-null    object
 1   palmitic     569 non-null    int64
 2   palmitoleic  569 non-null    int64
 3   stearic      569 non-null    int64
 4   oleic        569 non-null    int64
 5   linoleic     569 non-null    int64
 6   linolenic    569 non-null    int64
 7   arachidic    569 non-null    int64
 8   eicosenoic   569 non-null    int64
dtypes: int64(8), object(1)
memory usage: 40.1+ KB
None
[3]:
#split into X and y
y, X = D["CLASSE"], D.drop(columns=["CLASSE"])

instanciation and training#

[4]:
from discrimintools import CANDISC
clf = CANDISC(n_components=2)
clf.fit(X,y)
[4]:
CANDISC()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

eval_predict function#

[5]:
#eval_predict function
eval_train = clf.eval_predict(X,y,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations   569   569

Number of Observations Classified into CLASSE:
prediction    Centre_North  Sardinia  South  Total
CLASSE
Centre_North           146         4      0    150
Sardinia                 0        97      0     97
South                    1         0    321    322
Total                  147       101    321    569

Percent Classified into CLASSE:
prediction    Centre_North    Sardinia      South  Total
CLASSE
Centre_North     97.333333    2.666667   0.000000  100.0
Sardinia          0.000000  100.000000   0.000000  100.0
South             0.310559    0.000000  99.689441  100.0
Total            25.834798   17.750439  56.414763  100.0
Priors            0.263620    0.170475   0.565905    NaN

Error Count Estimates for CLASSE:
        Centre_North  Sardinia     South     Total
Rate        0.026667  0.000000  0.003106  0.008787
Priors      0.263620  0.170475  0.565905       NaN

Classification Report for CLASSE:
              precision    recall  f1-score     support
Centre_North   0.993197  0.973333  0.983165  150.000000
Sardinia       0.960396  1.000000  0.979798   97.000000
South          1.000000  0.996894  0.998445  322.000000
accuracy       0.991213  0.991213  0.991213    0.991213
macro avg      0.984531  0.990076  0.987136  569.000000
weighted avg   0.991455  0.991213  0.991238  569.000000

score function#

[6]:
#score function
print("Accuracy : {}%".format(100*round(clf.score(X,y),2)))
Accuracy : 99.0%
[7]:
#error rate
print("Error rate : {}%".format(100-100*round(clf.score(X,y),2)))
Error rate : 1.0%

summary#

[8]:
from discrimintools import summaryCANDISC

Simple summary#

[9]:
#simple summary
summaryCANDISC(clf)
                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic         21.7467    27.0103    38.0373
arachidic         37.5467    73.0000    63.1025
eicosenoic         1.9733     1.9278    27.3323

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      8.4718      6.1684     78.6232     78.6232
Can2      2.3034         NaN     21.3768    100.0000

Raw Canonical and Classification Functions Coefficients:
                Can1     Can2  Centre_North  Sardinia    South
Constant    -13.0646 -56.9169      -70.0899  194.6549 -37.1812
palmitic      0.0028   0.0089        0.0072   -0.0344   0.0070
palmitoleic   0.0131   0.0184       -0.0095   -0.0959   0.0333
stearic      -0.0028   0.0043        0.0171   -0.0029  -0.0071
oleic         0.0006   0.0062        0.0094   -0.0199   0.0016
linoleic      0.0011  -0.0013       -0.0061   -0.0001   0.0029
linolenic     0.0411   0.0058       -0.1257   -0.1523   0.1045
arachidic    -0.0173  -0.0347       -0.0063    0.1565  -0.0442
eicosenoic    0.1631   0.0101       -0.5231   -0.5673   0.4146

Detailed summary#

[10]:
#detailed summary
summaryCANDISC(clf,detailed=True)
                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    569            DF Total       568
1          Variables      8   DF Within Classes       566
2            Classes      3  DF Between Classes         2

Class Level Information:
              Frequency  Proportion  Prior Probability
Centre_North        150      0.2636             0.2636
Sardinia             97      0.1705             0.1705
South               322      0.5659             0.5659

Total-Sample Class Means:
             Centre_North   Sardinia      South
palmitic        1094.8333  1112.0619  1332.3696
palmitoleic       83.8933    96.3505   154.8882
stearic          231.0400   226.3505   228.7081
oleic           7791.9733  7266.9072  7099.5311
linoleic         727.8800  1197.3608  1034.0093
linolenic         21.7467    27.0103    38.0373
arachidic         37.5467    73.0000    63.1025
eicosenoic         1.9733     1.9278    27.3323

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      8.4718      6.1684     78.6232     78.6232
Can2      2.3034         NaN     21.3768    100.0000

Raw Canonical and Classification Functions Coefficients:
                Can1     Can2  Centre_North  Sardinia    South
Constant    -13.0646 -56.9169      -70.0899  194.6549 -37.1812
palmitic      0.0028   0.0089        0.0072   -0.0344   0.0070
palmitoleic   0.0131   0.0184       -0.0095   -0.0959   0.0333
stearic      -0.0028   0.0043        0.0171   -0.0029  -0.0071
oleic         0.0006   0.0062        0.0094   -0.0199   0.0016
linoleic      0.0011  -0.0013       -0.0061   -0.0001   0.0029
linolenic     0.0411   0.0058       -0.1257   -0.1523   0.1045
arachidic    -0.0173  -0.0347       -0.0063    0.1565  -0.0442
eicosenoic    0.1631   0.0101       -0.5231   -0.5673   0.4146

Test of H0: The canonical correlations in the current row and all that follow are zero
   Canonical Correlation  Squared Canonical Correlation  Likelihood Ratio  \
0                 0.9457                         0.8944            0.0320
1                 0.8350                         0.6973            0.3027

   Approximate F value  Num DF  Den DF  Pr>F  Chi-Square  DF  Pr>Chi2
0             320.9837      16    1118   0.0   1936.8430  16      0.0
1             184.2721       7     560   0.0    672.1608   7      0.0

Classification Summary for Calibration Data:

Observation Profile:
                        Read  Used
Number of Observations   569   569

Number of Observations Classified into CLASSE:
prediction    Centre_North  Sardinia  South  Total
CLASSE
Centre_North           146         4      0    150
Sardinia                 0        97      0     97
South                    1         0    321    322
Total                  147       101    321    569

Percent Classified into CLASSE:
prediction    Centre_North  Sardinia    South  Total
CLASSE
Centre_North       97.3333    2.6667   0.0000  100.0
Sardinia            0.0000  100.0000   0.0000  100.0
South               0.3106    0.0000  99.6894  100.0
Total              25.8348   17.7504  56.4148  100.0
Priors              0.2636    0.1705   0.5659    NaN

Error Count Estimates for CLASSE:
        Centre_North  Sardinia   South   Total
Rate          0.0267    0.0000  0.0031  0.0088
Priors        0.2636    0.1705  0.5659     NaN

Classification Report for CLASSE:
              precision  recall  f1-score   support
Centre_North     0.9932  0.9733    0.9832  150.0000
Sardinia         0.9604  1.0000    0.9798   97.0000
South            1.0000  0.9969    0.9984  322.0000
accuracy         0.9912  0.9912    0.9912    0.9912
macro avg        0.9845  0.9901    0.9871  569.0000
weighted avg     0.9915  0.9912    0.9912  569.0000

Evaluation of prediction on testing dataset#

Testing data#

[11]:
#testining data
DTest = load_oliveoil("test")
print(DTest.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   CLASSE       3 non-null      object
 1   palmitic     3 non-null      int64
 2   palmitoleic  3 non-null      int64
 3   stearic      3 non-null      int64
 4   oleic        3 non-null      int64
 5   linoleic     3 non-null      int64
 6   linolenic    3 non-null      int64
 7   arachidic    3 non-null      int64
 8   eicosenoic   3 non-null      int64
dtypes: int64(8), object(1)
memory usage: 344.0+ bytes
None
[12]:
#display
print(DTest)
         CLASSE  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  \
0      Sardinia      1042          135      210   7376      1116         35
1  Centre_North      1090           60      195   7955       600         28
2         South      1306          127      250   7254       869         47

   arachidic  eicosenoic
0         90           3
1         42           2
2         68          24
[13]:
#display into X and y
yTest, XTest = DTest["CLASSE"], DTest.drop(columns=["CLASSE"])
#eval_predict function
eval_test = clf.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations     3     3

Number of Observations Classified into CLASSE:
prediction    Centre_North  Sardinia  South  Total
CLASSE
Centre_North             1         0      0      1
Sardinia                 0         1      0      1
South                    0         0      1      1
Total                    1         1      1      3

Percent Classified into CLASSE:
prediction    Centre_North    Sardinia       South  Total
CLASSE
Centre_North    100.000000    0.000000    0.000000  100.0
Sardinia          0.000000  100.000000    0.000000  100.0
South             0.000000    0.000000  100.000000  100.0
Total            33.333333   33.333333   33.333333  100.0
Priors            0.263620    0.170475    0.565905    NaN

Error Count Estimates for CLASSE:
        Centre_North  Sardinia     South  Total
Rate         0.00000  0.000000  0.000000    0.0
Priors       0.26362  0.170475  0.565905    NaN

Classification Report for CLASSE:
              precision  recall  f1-score  support
Centre_North        1.0     1.0       1.0      1.0
Sardinia            1.0     1.0       1.0      1.0
South               1.0     1.0       1.0      1.0
accuracy            1.0     1.0       1.0      1.0
macro avg           1.0     1.0       1.0      3.0
weighted avg        1.0     1.0       1.0      3.0

Plotting functions#

[14]:
#plotting
from discrimintools import fviz_candisc

Graph of individuals#

[15]:
#graph of individuals
p = fviz_candisc(clf,element="ind",repel=True)
p.show()
../../_images/source_examples_02_candisc_oliveoil_25_0.png

We add supplementary individuals to initial plot.

[16]:
#with supplementary individuals
from discrimintools import add_scatter
p = add_scatter(p,clf.transform(XTest),color="blue",repel=True)
p.show()
../../_images/source_examples_02_candisc_oliveoil_27_0.png

Graph of variables#

[17]:
#graph of variables
fviz_candisc(clf,element="var",repel=True).show()
../../_images/source_examples_02_candisc_oliveoil_29_0.png

Biplot of individuals and variables#

[18]:
#biplot of individuals and variables
p = fviz_candisc(clf,element="biplot",repel=True)
#add supplementary individuals
p = add_scatter(p,clf.transform(XTest),color="blue",repel=True)
p.show()
../../_images/source_examples_02_candisc_oliveoil_31_0.png

Distance between barycenter#

[19]:
#distance between barycenter
fviz_candisc(clf,element="dist",repel=True).show()
../../_images/source_examples_02_candisc_oliveoil_33_0.png