STEPDISC LDA - alcools dataset

[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

alcools dataset

[2]:
#alcools dataset (training sample)
from discrimintools.datasets import load_alcools
D = load_alcools("train")
#info() prints its report itself and returns None, so no print() is needed
D.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   TYPE    52 non-null     object
 1   MEOH    52 non-null     float64
 2   ACET    52 non-null     float64
 3   BU1     52 non-null     float64
 4   BU2     52 non-null     float64
 5   ISOP    52 non-null     int64
 6   MEPR    52 non-null     float64
 7   PRO1    52 non-null     float64
 8   ACAL    52 non-null     float64
dtypes: float64(7), int64(1), object(1)
memory usage: 3.8+ KB

Instantiation and training

[3]:
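from discrimintools import DISCRIM, STEPDISC
#split into X and y
y, X = D["TYPE"], D.drop(columns=["TYPE"])
#fit the full discriminant model on all 8 predictors
clf = DISCRIM().fit(X,y)
#forward selection: in the output below, a variable enters only while its Pr>F is under alpha
clf2 = STEPDISC(method="forward",alpha=0.01,verbose=True)
clf2.fit(clf)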

====================== Step 1 forward selection results =======================
      Wilks' Lambda  Partial R-Square    F Value  Num DF  Den DF          Pr>F
MEOH       0.282629          0.717371  62.186129       2      49  3.587597e-14
ACET       0.971855          0.028145   0.709531       2      49  4.968583e-01
BU1        0.286173          0.713827  61.112585       2      49  4.868527e-14
BU2        0.914588          0.085412   2.288014       2      49  1.122087e-01
ISOP       0.887731          0.112269   3.098457       2      49  5.406192e-02
MEPR       0.691854          0.308146  10.912106       2      49  1.203236e-04
PRO1       0.835465          0.164535   4.824978       2      49  1.222491e-02
ACAL       0.979642          0.020358   0.509127       2      49  6.041644e-01

Variable MEOH will enter


====================== Step 2 forward selection results =======================
      Wilks' Lambda  Partial R-Square    F Value  Num DF  Den DF      Pr>F
ACET       0.253614          0.102660   2.745708       2      48  0.074297
BU1        0.192547          0.318729  11.228252       2      48  0.000100
BU2        0.244101          0.136320   3.788072       2      48  0.029680
ISOP       0.264061          0.065697   1.687604       2      48  0.195751
MEPR       0.221217          0.217287   6.662572       2      48  0.002796
PRO1       0.255676          0.095365   2.530037       2      48  0.090232
ACAL       0.235697          0.166054   4.778852       2      48  0.012803

Variable BU1 will enter


====================== Step 3 forward selection results =======================
      Wilks' Lambda  Partial R-Square   F Value  Num DF  Den DF      Pr>F
ACET       0.178725          0.071786  1.817445       2      47  0.173671
BU2        0.170291          0.115585  3.071236       2      47  0.055772
ISOP       0.174351          0.094502  2.452583       2      47  0.097018
MEPR       0.147786          0.232468  7.117602       2      47  0.001994
PRO1       0.176100          0.085419  2.194821       2      47  0.122666
ACAL       0.173496          0.098943  2.580493       2      47  0.086432

Variable MEPR will enter


====================== Step 4 forward selection results =======================
      Wilks' Lambda  Partial R-Square   F Value  Num DF  Den DF      Pr>F
ACET       0.138022          0.066069  1.627088       2      46  0.207606
BU2        0.131479          0.110340  2.852582       2      46  0.067944
ISOP       0.129820          0.121570  3.183082       2      46  0.050730
PRO1       0.136572          0.075879  1.888507       2      46  0.162842
ACAL       0.127365          0.138180  3.687719       2      46  0.032702

No variable can enter

[3]:
STEPDISC()
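The columns of the step tables above are linked by simple formulas. The sketch below recomputes Partial R-Square and F Value from Wilks' Lambda for two rows of the output. It assumes the usual partial-Lambda relations (partial Lambda is the candidate model's Lambda divided by the current model's Lambda, taken as 1 when no variable has entered yet). `partial_stats` is a hypothetical helper written for this illustration, not part of discrimintools.

```python
def partial_stats(wilks_new, wilks_prev, num_df, den_df):
    """Recompute Partial R-Square and F Value for one candidate variable.

    wilks_new  : Wilks' Lambda of the model that includes the candidate
    wilks_prev : Wilks' Lambda of the current model (1.0 when it is empty)
    num_df     : number of classes - 1
    den_df     : Den DF column of the step table
    """
    partial_lambda = wilks_new / wilks_prev
    partial_r2 = 1.0 - partial_lambda
    f_value = (partial_r2 / partial_lambda) * (den_df / num_df)
    return partial_r2, f_value


# Step 1, MEOH: no variable in the model yet, so the previous Lambda is 1
r2_meoh, f_meoh = partial_stats(0.282629, 1.0, num_df=2, den_df=49)

# Step 2, BU1: the current model contains MEOH (Lambda = 0.282629)
r2_bu1, f_bu1 = partial_stats(0.192547, 0.282629, num_df=2, den_df=48)

print(f"MEOH: partial R2 = {r2_meoh:.6f}, F = {f_meoh:.3f}")
print(f"BU1 : partial R2 = {r2_bu1:.6f}, F = {f_bu1:.3f}")
```

Up to rounding of the displayed Lambda values, these figures match the MEOH row of Step 1 and the BU1 row of Step 2.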

Selected variables

[4]:
#selected variables
print(clf2.summary_.selected)
['MEOH', 'BU1', 'MEPR']
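The selected list can be reproduced by hand from the Pr>F columns printed during fitting. The sketch below assumes the entry rule is "take the candidate with the smallest Pr>F and stop as soon as that value is not below alpha". This is consistent with the verbose output but is not taken from the library source. The p-values are copied from the four step tables.

```python
ALPHA = 0.01  # same value as the alpha passed to STEPDISC above

# Pr>F of every candidate at each step, copied from the verbose output
steps = [
    {"MEOH": 3.587597e-14, "ACET": 4.968583e-01, "BU1": 4.868527e-14,
     "BU2": 1.122087e-01, "ISOP": 5.406192e-02, "MEPR": 1.203236e-04,
     "PRO1": 1.222491e-02, "ACAL": 6.041644e-01},
    {"ACET": 0.074297, "BU1": 0.000100, "BU2": 0.029680, "ISOP": 0.195751,
     "MEPR": 0.002796, "PRO1": 0.090232, "ACAL": 0.012803},
    {"ACET": 0.173671, "BU2": 0.055772, "ISOP": 0.097018, "MEPR": 0.001994,
     "PRO1": 0.122666, "ACAL": 0.086432},
    {"ACET": 0.207606, "BU2": 0.067944, "ISOP": 0.050730, "PRO1": 0.162842,
     "ACAL": 0.032702},
]

selected = []
for pvalues in steps:
    best = min(pvalues, key=pvalues.get)  # most significant candidate
    if pvalues[best] >= ALPHA:            # assumed stopping rule
        break
    selected.append(best)

print(selected)  # ['MEOH', 'BU1', 'MEPR']
```

At step 4 the best candidate is ACAL with Pr>F = 0.032702, which is above 0.01, so nothing else enters.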

Summary

[5]:
from discrimintools import summarySTEPDISC
summarySTEPDISC(clf2,detailed=True)
                     Stepwise Discriminant Analysis - Results

====================== Before forward selection  =======================

                     Discriminant Analysis - Results

Summary Information:
               Infos  Value                  DF  DF value
0  Total Sample Size     52            DF Total        51
1          Variables      8   DF Within Classes        49
2            Classes      3  DF Between Classes         2

Class Level Information:
        Frequency  Proportion  Prior Probability
KIRSCH         17      0.3269             0.3269
MIRAB          15      0.2885             0.2885
POIRE          20      0.3846             0.3846

Pooled Covariance Matrix Information:
        Rank  Natural Log of the Determinant
Pooled     8                         58.3267

Linear Discriminant Function for TYPE:
          KIRSCH    MIRAB    POIRE
Constant -5.0165 -18.8407 -24.7649
MEOH      0.0034   0.0290   0.0334
ACET      0.0064   0.0164   0.0075
BU1      -0.0637   0.4054   0.3180
BU2      -0.0009   0.0714   0.1150
ISOP      0.0231   0.0298  -0.0085
MEPR      0.0375  -0.1289   0.0618
PRO1      0.0020  -0.0054  -0.0083
ACAL      0.0662  -0.2264  -0.1303

Classification Summary for Calibration Data:

Observation Profile:
                        Read  Used
Number of Observations    52    52

Number of Observations Classified into TYPE:
prediction  KIRSCH  MIRAB  POIRE  Total
TYPE
KIRSCH          17      0      0     17
MIRAB            0     14      1     15
POIRE            0      2     18     20
Total           17     16     19     52

Percent Classified into TYPE:
prediction    KIRSCH    MIRAB    POIRE  Total
TYPE
KIRSCH      100.0000   0.0000   0.0000  100.0
MIRAB         0.0000  93.3333   6.6667  100.0
POIRE         0.0000  10.0000  90.0000  100.0
Total        32.6923  30.7692  36.5385  100.0
Priors        0.3269   0.2885   0.3846    NaN

Error Count Estimates for TYPE:
        KIRSCH   MIRAB   POIRE   Total
Rate    0.0000  0.0667  0.1000  0.0577
Priors  0.3269  0.2885  0.3846     NaN

Classification Report for TYPE:
              precision  recall  f1-score  support
KIRSCH           1.0000  1.0000    1.0000  17.0000
MIRAB            0.8750  0.9333    0.9032  15.0000
POIRE            0.9474  0.9000    0.9231  20.0000
accuracy         0.9423  0.9423    0.9423   0.9423
macro avg        0.9408  0.9444    0.9421  52.0000
weighted avg     0.9437  0.9423    0.9425  52.0000

====================== After forward selection  =======================

                     Discriminant Analysis - Results

Summary Information:
               Infos  Value                  DF  DF value
0  Total Sample Size     52            DF Total        51
1          Variables      3   DF Within Classes        49
2            Classes      3  DF Between Classes         2

Class Level Information:
        Frequency  Proportion  Prior Probability
KIRSCH         17      0.3269             0.3269
MIRAB          15      0.2885             0.2885
POIRE          20      0.3846             0.3846

Pooled Covariance Matrix Information:
        Rank  Natural Log of the Determinant
Pooled     3                         19.4106

Linear Discriminant Function for TYPE:
          KIRSCH    MIRAB    POIRE
Constant -3.6107 -14.7754 -18.3711
MEOH      0.0069   0.0213   0.0226
BU1      -0.0766   0.4010   0.3735
MEPR      0.0867  -0.0325   0.0467

Classification Summary for Calibration Data:

Observation Profile:
                        Read  Used
Number of Observations    52    52

Number of Observations Classified into TYPE:
prediction  KIRSCH  MIRAB  POIRE  Total
TYPE
KIRSCH          17      0      0     17
MIRAB            0     12      3     15
POIRE            0      4     16     20
Total           17     16     19     52

Percent Classified into TYPE:
prediction    KIRSCH    MIRAB    POIRE  Total
TYPE
KIRSCH      100.0000   0.0000   0.0000  100.0
MIRAB         0.0000  80.0000  20.0000  100.0
POIRE         0.0000  20.0000  80.0000  100.0
Total        32.6923  30.7692  36.5385  100.0
Priors        0.3269   0.2885   0.3846    NaN

Error Count Estimates for TYPE:
        KIRSCH   MIRAB   POIRE   Total
Rate    0.0000  0.2000  0.2000  0.1346
Priors  0.3269  0.2885  0.3846     NaN

Classification Report for TYPE:
              precision  recall  f1-score  support
KIRSCH           1.0000  1.0000    1.0000  17.0000
MIRAB            0.7500  0.8000    0.7742  15.0000
POIRE            0.8421  0.8000    0.8205  20.0000
accuracy         0.8654  0.8654    0.8654   0.8654
macro avg        0.8640  0.8667    0.8649  52.0000
weighted avg     0.8672  0.8654    0.8658  52.0000
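The "Error Count Estimates" block can be checked from the confusion matrix. The sketch below uses the matrix reported after forward selection and assumes the Total rate is the per-class error rates weighted by the prior probabilities, which is what the printed numbers suggest. `error_count_estimates` is a hypothetical helper for this illustration.

```python
labels = ["KIRSCH", "MIRAB", "POIRE"]

# Rows = true class, columns = predicted class (after forward selection)
confusion = [
    [17, 0, 0],
    [0, 12, 3],
    [0, 4, 16],
]


def error_count_estimates(confusion, priors):
    """Per-class error rates and their prior-weighted total."""
    rates = [1.0 - row[i] / sum(row) for i, row in enumerate(confusion)]
    total = sum(rate * prior for rate, prior in zip(rates, priors))
    return rates, total


# Priors proportional to the class frequencies, as in the summary
n_obs = sum(sum(row) for row in confusion)
priors = [sum(row) / n_obs for row in confusion]

rates, total = error_count_estimates(confusion, priors)
for label, rate in zip(labels, rates):
    print(f"{label}: {rate:.4f}")
print(f"Total: {total:.4f}")
```

With the three selected variables the total error rate on the calibration data is 0.1346, against 0.0577 for the full eight-variable model.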

Evaluation of prediction on testing dataset

Testing data

[6]:
#testing data
DTest = load_alcools("test")
DTest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   TYPE    50 non-null     object
 1   MEOH    50 non-null     int64
 2   ACET    50 non-null     int64
 3   BU1     50 non-null     float64
 4   BU2     50 non-null     float64
 5   ISOP    50 non-null     int64
 6   MEPR    50 non-null     int64
 7   PRO1    50 non-null     int64
 8   ACAL    50 non-null     float64
dtypes: float64(3), int64(5), object(1)
memory usage: 3.6+ KB

[7]:
#split into X and y
yTest, XTest = DTest["TYPE"], DTest.drop(columns=["TYPE"])
eval_test = clf2.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations    50    50

Number of Observations Classified into TYPE:
prediction  KIRSCH  MIRAB  POIRE  Total
TYPE
KIRSCH          14      0      0     14
MIRAB            0     12      5     17
POIRE            2      8      9     19
Total           16     20     14     50

Percent Classified into TYPE:
prediction      KIRSCH      MIRAB      POIRE  Total
TYPE
KIRSCH      100.000000   0.000000   0.000000  100.0
MIRAB         0.000000  70.588235  29.411765  100.0
POIRE        10.526316  42.105263  47.368421  100.0
Total        32.000000  40.000000  28.000000  100.0
Priors        0.326923   0.288462   0.384615    NaN

Error Count Estimates for TYPE:
          KIRSCH     MIRAB     POIRE     Total
Rate    0.000000  0.294118  0.526316  0.287271
Priors  0.326923  0.288462  0.384615       NaN

Classification Report for TYPE:
              precision    recall  f1-score  support
KIRSCH         0.875000  1.000000  0.933333     14.0
MIRAB          0.600000  0.705882  0.648649     17.0
POIRE          0.642857  0.473684  0.545455     19.0
accuracy       0.700000  0.700000  0.700000      0.7
macro avg      0.705952  0.726522  0.709146     50.0
weighted avg   0.693286  0.700000  0.689147     50.0
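The classification report follows directly from the test confusion matrix. The sketch below recomputes precision, recall, F1 and accuracy in plain Python. It also shows why the Total error rate (0.287271) differs from 1 - accuracy (0.30): the printed total appears to weight the per-class error rates by the priors estimated on the training sample rather than by the test class frequencies. That weighting is inferred from the numbers, not from the library source, and `per_class_metrics` is a hypothetical helper.

```python
labels = ["KIRSCH", "MIRAB", "POIRE"]

# Rows = true class, columns = predicted class (test sample)
confusion = [
    [14, 0, 0],
    [0, 12, 5],
    [2, 8, 9],
]


def per_class_metrics(confusion):
    """Return one (precision, recall, f1) tuple per class."""
    k = len(confusion)
    metrics = []
    for i in range(k):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(k))  # column total
        actual = sum(confusion[i])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


metrics = per_class_metrics(confusion)
n_obs = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / n_obs

# Priors from the training sample: 17, 15 and 20 observations out of 52
train_priors = [17 / 52, 15 / 52, 20 / 52]
weighted_error = sum(
    (1.0 - recall) * prior
    for (_, recall, _), prior in zip(metrics, train_priors)
)

for label, (p, r, f1) in zip(labels, metrics):
    print(f"{label}: precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
print(f"accuracy={accuracy:.2f} prior-weighted error={weighted_error:.6f}")
```

KIRSCH is still recognised perfectly on the test sample, while POIRE recall drops to 0.473684, so most of the errors come from confusion between MIRAB and POIRE.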