DISCRIM (LDA) - heart dataset#

[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

Heart dataset#

[2]:
#heart dataset (training split)
from discrimintools.datasets import load_heart
D = load_heart("train")
print(D.info())
<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 0 to 149
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   disease         150 non-null    object
 1   age             150 non-null    int64
 2   sex             150 non-null    object
 3   chestpain       150 non-null    object
 4   restbpress      150 non-null    int64
 5   cholesteral     150 non-null    int64
 6   sugar           150 non-null    object
 7   electro         150 non-null    object
 8   maxHeartRate    150 non-null    int64
 9   ExerciseAngina  150 non-null    object
 10  oldpeak         150 non-null    float64
 11  slope           150 non-null    object
 12  vesselsColored  150 non-null    int64
 13  thal            150 non-null    object
dtypes: float64(1), int64(5), object(8)
memory usage: 17.6+ KB
None
[3]:
#split into X and y
y, X = D["disease"], D.drop(columns=["disease"])

Instantiation and training#

[4]:
from discrimintools import DISCRIM
clf = DISCRIM()
clf.fit(X,y)

Categorical features have been encoded into binary variables.

[4]:
DISCRIM(priors='prop')
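
The message above says the categorical columns were turned into binary variables, and the coefficient names reported later (sexmale, slopeflat, slopeupsloping, ...) suggest one indicator per level with the first level dropped. The sketch below reproduces that naming with pandas on a toy frame. It is only an assumption about what the encoding looks like, not a description of the internals of discrimintools; the toy values are invented for illustration.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (hypothetical values).
toy = pd.DataFrame({
    "age": [63, 45, 58],
    "sex": ["male", "female", "male"],
    "slope": ["flat", "upsloping", "downsloping"],
})

# drop_first=True drops the first level (in sorted order) of each categorical
# column; prefix_sep="" glues the column name and the level together.
encoded = pd.get_dummies(toy, prefix_sep="", drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['age', 'sexmale', 'slopeflat', 'slopeupsloping']
```

Under this convention the dropped level (here female and downsloping) acts as the reference category, which is why it has no coefficient of its own.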

Evaluation of prediction on training dataset#

[5]:
#eval_predict function
eval_train = clf.eval_predict(X,y,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations   150   150

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          75         7     82
presence         12        56     68
Total            87        63    150

Percent Classified into disease:
prediction    absence   presence  Total
disease
absence     91.463415   8.536585  100.0
presence    17.647059  82.352941  100.0
Total       58.000000  42.000000  100.0
Priors       0.546667   0.453333    NaN

Error Count Estimates for disease:
         absence  presence     Total
Rate    0.085366  0.176471  0.126667
Priors  0.546667  0.453333       NaN

Classification Report for disease:
              precision    recall  f1-score     support
absence        0.862069  0.914634  0.887574   82.000000
presence       0.888889  0.823529  0.854962   68.000000
accuracy       0.873333  0.873333  0.873333    0.873333
macro avg      0.875479  0.869082  0.871268  150.000000
weighted avg   0.874227  0.873333  0.872790  150.000000
[6]:
#score function
print("Accuracy : {}%".format(100*round(clf.score(X,y),2)))
Accuracy : 87.0%
[7]:
#error rate
print("Error rate : {}%".format(100-100*round(clf.score(X,y),2)))
Error rate : 13.0%
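
The error count estimates and the accuracy above can be recomputed by hand from the training confusion matrix. The sketch below does this with NumPy, assuming the Total error is the prior-weighted sum of the per-class error rates (which matches the printed values). The variable names are illustrative and not part of discrimintools.

```python
import numpy as np

# Training confusion matrix from the output above (rows: true, columns: predicted).
cm = np.array([[75, 7],
               [12, 56]])

n_per_class = cm.sum(axis=1)                 # 82 absence, 68 presence
priors = n_per_class / cm.sum()              # proportional priors
class_error = 1 - np.diag(cm) / n_per_class  # per-class misclassification rate
total_error = float(priors @ class_error)    # prior-weighted total error
accuracy = np.trace(cm) / cm.sum()

print(class_error.round(6), round(total_error, 6), round(accuracy, 6))
```

With proportional priors the weighted total reduces to 19/150, about 12.67%, and the accuracy is 131/150, about 87.33%. The 87.0% and 13.0% printed by the two cells above come from rounding the score to two decimals before multiplying by 100.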

Linear Discriminant Function#

[8]:
#Linear Discriminant Function
print(clf.coef_)
                              absence    presence
Constant                  -124.354638 -127.386302
age                          1.183624    1.191365
sexmale                     14.265904   15.912318
chestpainatypicalAngina      0.566767   -1.883935
chestpainnonAnginal          3.487199    1.665227
chestpaintypicalAngina      -2.808094   -7.422257
restbpress                   0.346029    0.369027
cholesteral                  0.040710    0.037329
sugarlow                    10.405711   11.714556
electrosttAbnormality      -15.811833  -12.821336
electroventricHypertrophy   -1.855260   -1.241053
maxHeartRate                 0.485582    0.457931
ExerciseAnginayes            4.917031    5.961214
oldpeak                      3.065199    3.775126
slopeflat                   18.642871   19.821491
slopeupsloping              14.746685   15.252601
vesselsColored              -2.508721   -1.030825
thalnormal                  21.152513   20.121112
thalreversableEffect        14.853989   16.163566
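
Each column of this table defines one linear score: the constant plus the sum of coefficient times feature value. An observation is assigned to the class with the larger score. The sketch below applies the printed coefficients to a made-up, already-encoded observation. It assumes the constant already absorbs the log prior (as in SAS-style output), so that a softmax of the scores gives posterior probabilities; that assumption and the patient values are hypothetical.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above.
coef = pd.DataFrame(
    {
        "absence": [-124.354638, 1.183624, 14.265904, 0.566767, 3.487199,
                    -2.808094, 0.346029, 0.040710, 10.405711, -15.811833,
                    -1.855260, 0.485582, 4.917031, 3.065199, 18.642871,
                    14.746685, -2.508721, 21.152513, 14.853989],
        "presence": [-127.386302, 1.191365, 15.912318, -1.883935, 1.665227,
                     -7.422257, 0.369027, 0.037329, 11.714556, -12.821336,
                     -1.241053, 0.457931, 5.961214, 3.775126, 19.821491,
                     15.252601, -1.030825, 20.121112, 16.163566],
    },
    index=["Constant", "age", "sexmale", "chestpainatypicalAngina",
           "chestpainnonAnginal", "chestpaintypicalAngina", "restbpress",
           "cholesteral", "sugarlow", "electrosttAbnormality",
           "electroventricHypertrophy", "maxHeartRate", "ExerciseAnginayes",
           "oldpeak", "slopeflat", "slopeupsloping", "vesselsColored",
           "thalnormal", "thalreversableEffect"],
)

# Hypothetical encoded observation (values invented for illustration only).
x = pd.Series({
    "age": 55.0, "sexmale": 1.0, "chestpainatypicalAngina": 0.0,
    "chestpainnonAnginal": 0.0, "chestpaintypicalAngina": 0.0,
    "restbpress": 130.0, "cholesteral": 250.0, "sugarlow": 1.0,
    "electrosttAbnormality": 0.0, "electroventricHypertrophy": 0.0,
    "maxHeartRate": 150.0, "ExerciseAnginayes": 0.0, "oldpeak": 1.0,
    "slopeflat": 1.0, "slopeupsloping": 0.0, "vesselsColored": 0.0,
    "thalnormal": 1.0, "thalreversableEffect": 0.0,
})

features = coef.index.drop("Constant")
# Score per class: constant + sum over features of coefficient * value.
scores = coef.loc["Constant"] + coef.loc[features].mul(x[features], axis=0).sum()

# Softmax of the scores (assumes the log prior is already in the constant).
expo = np.exp(scores - scores.max())
posterior = expo / expo.sum()
predicted = scores.idxmax()
print(scores, posterior, predicted, sep="\n")
```

Only the difference between the two columns matters for the decision, so features whose coefficients are nearly equal across classes (age, cholesteral) contribute little to separating the groups.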

Summary#

[9]:
from discrimintools import summaryDISCRIM
summaryDISCRIM(clf,detailed=True)
                     Discriminant Analysis - Results

Summary Information:
               Infos  Value                  DF  DF value
0  Total Sample Size    150            DF Total       149
1          Variables     18   DF Within Classes       148
2            Classes      2  DF Between Classes         1

Class Level Information:
          Frequency  Proportion  Prior Probability
absence          82      0.5467             0.5467
presence         68      0.4533             0.4533

Pooled Covariance Matrix Information:
        Rank  Natural Log of the Determinant
Pooled    18                         -5.6739

Linear Discriminant Function for disease:
                            absence  presence
Constant                  -124.3546 -127.3863
age                          1.1836    1.1914
sexmale                     14.2659   15.9123
chestpainatypicalAngina      0.5668   -1.8839
chestpainnonAnginal          3.4872    1.6652
chestpaintypicalAngina      -2.8081   -7.4223
restbpress                   0.3460    0.3690
cholesteral                  0.0407    0.0373
sugarlow                    10.4057   11.7146
electrosttAbnormality      -15.8118  -12.8213
electroventricHypertrophy   -1.8553   -1.2411
maxHeartRate                 0.4856    0.4579
ExerciseAnginayes            4.9170    5.9612
oldpeak                      3.0652    3.7751
slopeflat                   18.6429   19.8215
slopeupsloping              14.7467   15.2526
vesselsColored              -2.5087   -1.0308
thalnormal                  21.1525   20.1211
thalreversableEffect        14.8540   16.1636

Classification Summary for Calibration Data:

Observation Profile:
                        Read  Used
Number of Observations   150   150

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          75         7     82
presence         12        56     68
Total            87        63    150

Percent Classified into disease:
prediction  absence  presence  Total
disease
absence     91.4634    8.5366  100.0
presence    17.6471   82.3529  100.0
Total       58.0000   42.0000  100.0
Priors       0.5467    0.4533    NaN

Error Count Estimates for disease:
        absence  presence   Total
Rate     0.0854    0.1765  0.1267
Priors   0.5467    0.4533     NaN

Classification Report for disease:
              precision  recall  f1-score   support
absence          0.8621  0.9146    0.8876   82.0000
presence         0.8889  0.8235    0.8550   68.0000
accuracy         0.8733  0.8733    0.8733    0.8733
macro avg        0.8755  0.8691    0.8713  150.0000
weighted avg     0.8742  0.8733    0.8728  150.0000

Evaluation of prediction on testing dataset#

Testing data#

[10]:
#testing data
DTest = load_heart("test")
DTest.info()
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 150 to 269
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   disease         120 non-null    object
 1   age             120 non-null    int64
 2   sex             120 non-null    object
 3   chestpain       120 non-null    object
 4   restbpress      120 non-null    int64
 5   cholesteral     120 non-null    int64
 6   sugar           120 non-null    object
 7   electro         120 non-null    object
 8   maxHeartRate    120 non-null    int64
 9   ExerciseAngina  120 non-null    object
 10  oldpeak         120 non-null    float64
 11  slope           120 non-null    object
 12  vesselsColored  120 non-null    int64
 13  thal            120 non-null    object
dtypes: float64(1), int64(5), object(8)
memory usage: 14.1+ KB
[11]:
#split into X and y
yTest, XTest = DTest["disease"], DTest.drop(columns=["disease"])
#evaluate the fitted model on the testing data
eval_test = clf.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations   120   120

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          59         9     68
presence         11        41     52
Total            70        50    120

Percent Classified into disease:
prediction    absence   presence  Total
disease
absence     86.764706  13.235294  100.0
presence    21.153846  78.846154  100.0
Total       58.333333  41.666667  100.0
Priors       0.546667   0.453333    NaN

Error Count Estimates for disease:
         absence  presence    Total
Rate    0.132353  0.211538  0.16825
Priors  0.546667  0.453333      NaN

Classification Report for disease:
              precision    recall  f1-score     support
absence        0.842857  0.867647  0.855072   68.000000
presence       0.820000  0.788462  0.803922   52.000000
accuracy       0.833333  0.833333  0.833333    0.833333
macro avg      0.831429  0.828054  0.829497  120.000000
weighted avg   0.832952  0.833333  0.832907  120.000000
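
As a cross-check, the per-class figures in the report above can be rebuilt from the testing confusion matrix. The sketch below does this with NumPy and also recomputes the Total error count estimate, assuming it is weighted by the priors learned on the training data (82/150 and 68/150), which is what the Priors row suggests. Variable names are illustrative.

```python
import numpy as np

labels = ["absence", "presence"]
# Testing confusion matrix from the output above (rows: true, columns: predicted).
cm = np.array([[59, 9],
               [11, 41]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # divide by predicted counts (column totals)
recall = tp / cm.sum(axis=1)     # divide by true counts (row totals)
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:<9} precision={p:.6f} recall={r:.6f} f1={f:.6f}")

# Raw test error versus the error weighted by the training priors.
raw_error = 1 - tp.sum() / cm.sum()
train_priors = np.array([82, 68]) / 150
weighted_error = float(train_priors @ (1 - recall))
print(round(raw_error, 6), round(weighted_error, 6))
```

The raw test error is 20/120, about 0.1667, while the prior-weighted estimate is 0.16825. The two differ slightly because the class proportions in the testing data (68/120 and 52/120) are not the same as the training priors.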