DISCRIM (LDA) - heart dataset#
[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")
heart dataset#
[2]:
#heart dataset
from discrimintools.datasets import load_heart
D = load_heart("train")
D.info()
<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 0 to 149
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 disease 150 non-null object
1 age 150 non-null int64
2 sex 150 non-null object
3 chestpain 150 non-null object
4 restbpress 150 non-null int64
5 cholesteral 150 non-null int64
6 sugar 150 non-null object
7 electro 150 non-null object
8 maxHeartRate 150 non-null int64
9 ExerciseAngina 150 non-null object
10 oldpeak 150 non-null float64
11 slope 150 non-null object
12 vesselsColored 150 non-null int64
13 thal 150 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 17.6+ KB
[3]:
#split into X and y
y, X = D["disease"], D.drop(columns=["disease"])
instantiation and training#
[4]:
from discrimintools import DISCRIM
clf = DISCRIM()
clf.fit(X,y)
Categorical features have been encoded into binary variables.
[4]:
DISCRIM(priors='prop')
Parameters
| Parameter | Value |
| --- | --- |
| method | 'linear' |
| priors | 'prop' |
| classes | None |
| var_select | False |
| level | None |
| tol | None |
| warn_message | True |
Evaluation of prediction on training dataset#
[5]:
#eval_predict function
eval_train = clf.eval_predict(X,y,verbose=True)
Observation Profile:
Read Used
Number of Observations 150 150
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 75 7 82
presence 12 56 68
Total 87 63 150
Percent Classified into disease:
prediction absence presence Total
disease
absence 91.463415 8.536585 100.0
presence 17.647059 82.352941 100.0
Total 58.000000 42.000000 100.0
Priors 0.546667 0.453333 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.085366 0.176471 0.126667
Priors 0.546667 0.453333 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.862069 0.914634 0.887574 82.000000
presence 0.888889 0.823529 0.854962 68.000000
accuracy 0.873333 0.873333 0.873333 0.873333
macro avg 0.875479 0.869082 0.871268 150.000000
weighted avg 0.874227 0.873333 0.872790 150.000000
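The per-class rows of the classification report can be reproduced by hand from the confusion matrix printed above. The sketch below is a plain-Python illustration, not the library's implementation: the helper name `class_metrics` is hypothetical, and the counts are copied from the training output.

```python
# Confusion counts from the training output: rows = actual, columns = predicted.
confusion = {
    "absence": {"absence": 75, "presence": 7},
    "presence": {"absence": 12, "presence": 56},
}

def class_metrics(confusion, label):
    """Return (precision, recall, f1) for one class (hypothetical helper)."""
    true_pos = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())  # column total
    actual = sum(confusion[label].values())                    # row total
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in confusion:
    p, r, f1 = class_metrics(confusion, label)
    print(f"{label:<9} precision={p:.6f} recall={r:.6f} f1={f1:.6f}")
```

Precision divides the correct predictions by the column total (everything predicted as that class), while recall divides by the row total (everything that actually belongs to it).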
[6]:
#score function
print("Accuracy : {}%".format(100*round(clf.score(X,y),2)))
Accuracy : 87.0%
[7]:
#error rate
print("Error rate : {}%".format(100-100*round(clf.score(X,y),2)))
Error rate : 13.0%
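The two cells above round the score to two decimals before scaling, which is why they show 87.0% and 13.0%. Assuming `clf.score` returns plain accuracy (the printed value is consistent with that), the same figures can be recovered from the confusion matrix with one more digit. This is a sketch using counts copied from the training output, not a call into the library.

```python
# Counts copied from the training confusion matrix.
correct = 75 + 56        # diagonal: correctly classified observations
total = 150              # all observations used

accuracy = correct / total
error_rate = 1 - accuracy

# The .1% format multiplies by 100 and keeps one decimal place.
print(f"Accuracy   : {accuracy:.1%}")
print(f"Error rate : {error_rate:.1%}")
```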
Linear Discriminant Function#
[8]:
#Linear Discriminant Function
print(clf.coef_)
absence presence
Constant -124.354638 -127.386302
age 1.183624 1.191365
sexmale 14.265904 15.912318
chestpainatypicalAngina 0.566767 -1.883935
chestpainnonAnginal 3.487199 1.665227
chestpaintypicalAngina -2.808094 -7.422257
restbpress 0.346029 0.369027
cholesteral 0.040710 0.037329
sugarlow 10.405711 11.714556
electrosttAbnormality -15.811833 -12.821336
electroventricHypertrophy -1.855260 -1.241053
maxHeartRate 0.485582 0.457931
ExerciseAnginayes 4.917031 5.961214
oldpeak 3.065199 3.775126
slopeflat 18.642871 19.821491
slopeupsloping 14.746685 15.252601
vesselsColored -2.508721 -1.030825
thalnormal 21.152513 20.121112
thalreversableEffect 14.853989 16.163566
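Each column of `coef_` defines one linear function of the encoded features. The usual linear discriminant rule computes, for every class, the constant plus the weighted sum of the feature values, then assigns the observation to the class with the highest score. The sketch below illustrates that rule in plain Python. The helper names and the two-feature coefficients are made up for illustration; they are neither the fitted values above nor the library's own code.

```python
def discriminant_scores(coef, x):
    """Score each class as Constant + sum(weight * feature value).

    coef: {class_label: {"Constant": c, feature_name: weight, ...}}
    x:    {feature_name: value} for one already-encoded observation
    """
    scores = {}
    for label, weights in coef.items():
        linear_part = sum(
            w * x[name] for name, w in weights.items() if name != "Constant"
        )
        scores[label] = weights["Constant"] + linear_part
    return scores

def predict(coef, x):
    """Return the class whose discriminant score is largest."""
    scores = discriminant_scores(coef, x)
    return max(scores, key=scores.get)

# Toy coefficients (hypothetical), laid out like the columns of coef_.
toy_coef = {
    "absence": {"Constant": -1.0, "oldpeak": 0.5, "vesselsColored": -0.2},
    "presence": {"Constant": -2.0, "oldpeak": 1.0, "vesselsColored": 0.6},
}
observation = {"oldpeak": 2.0, "vesselsColored": 1.0}

print(discriminant_scores(toy_coef, observation))
print(predict(toy_coef, observation))
```

Only the difference between the two columns matters for the decision, so a feature such as `age`, whose weights are nearly equal in both classes, contributes little to separating them.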
summary#
[9]:
from discrimintools import summaryDISCRIM
summaryDISCRIM(clf,detailed=True)
Discriminant Analysis - Results
Summary Information:
Infos Value DF DF value
0 Total Sample Size 150 DF Total 149
1 Variables 18 DF Within Classes 148
2 Classes 2 DF Between Classes 1
Class Level Information:
Frequency Proportion Prior Probability
absence 82 0.5467 0.5467
presence 68 0.4533 0.4533
Pooled Covariance Matrix Information:
Rank Natural Log of the Determinant
Pooled 18 -5.6739
Linear Discriminant Function for disease:
absence presence
Constant -124.3546 -127.3863
age 1.1836 1.1914
sexmale 14.2659 15.9123
chestpainatypicalAngina 0.5668 -1.8839
chestpainnonAnginal 3.4872 1.6652
chestpaintypicalAngina -2.8081 -7.4223
restbpress 0.3460 0.3690
cholesteral 0.0407 0.0373
sugarlow 10.4057 11.7146
electrosttAbnormality -15.8118 -12.8213
electroventricHypertrophy -1.8553 -1.2411
maxHeartRate 0.4856 0.4579
ExerciseAnginayes 4.9170 5.9612
oldpeak 3.0652 3.7751
slopeflat 18.6429 19.8215
slopeupsloping 14.7467 15.2526
vesselsColored -2.5087 -1.0308
thalnormal 21.1525 20.1211
thalreversableEffect 14.8540 16.1636
Classification Summary for Calibration Data:
Observation Profile:
Read Used
Number of Observations 150 150
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 75 7 82
presence 12 56 68
Total 87 63 150
Percent Classified into disease:
prediction absence presence Total
disease
absence 91.4634 8.5366 100.0
presence 17.6471 82.3529 100.0
Total 58.0000 42.0000 100.0
Priors 0.5467 0.4533 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.0854 0.1765 0.1267
Priors 0.5467 0.4533 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.8621 0.9146 0.8876 82.0000
presence 0.8889 0.8235 0.8550 68.0000
accuracy 0.8733 0.8733 0.8733 0.8733
macro avg 0.8755 0.8691 0.8713 150.0000
weighted avg 0.8742 0.8733 0.8728 150.0000
Evaluation of prediction on testing dataset#
Testing data#
[10]:
#testing data
DTest = load_heart("test")
DTest.info()
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 150 to 269
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 disease 120 non-null object
1 age 120 non-null int64
2 sex 120 non-null object
3 chestpain 120 non-null object
4 restbpress 120 non-null int64
5 cholesteral 120 non-null int64
6 sugar 120 non-null object
7 electro 120 non-null object
8 maxHeartRate 120 non-null int64
9 ExerciseAngina 120 non-null object
10 oldpeak 120 non-null float64
11 slope 120 non-null object
12 vesselsColored 120 non-null int64
13 thal 120 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 14.1+ KB
[11]:
#split into X and y
yTest, XTest = DTest["disease"], DTest.drop(columns=["disease"])
eval_test = clf.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
Read Used
Number of Observations 120 120
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 59 9 68
presence 11 41 52
Total 70 50 120
Percent Classified into disease:
prediction absence presence Total
disease
absence 86.764706 13.235294 100.0
presence 21.153846 78.846154 100.0
Total 58.333333 41.666667 100.0
Priors 0.546667 0.453333 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.132353 0.211538 0.16825
Priors 0.546667 0.453333 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.842857 0.867647 0.855072 68.000000
presence 0.820000 0.788462 0.803922 52.000000
accuracy 0.833333 0.833333 0.833333 0.833333
macro avg 0.831429 0.828054 0.829497 120.000000
weighted avg 0.832952 0.833333 0.832907 120.000000
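On the test set, the Total error estimate (0.16825) is not simply one minus the accuracy (0.166667). The printed numbers are consistent with a prior-weighted sum of the per-class error rates, using the priors estimated on the training data. This is inferred from the output rather than confirmed from the library source, so treat the sketch below as an arithmetic check only.

```python
# Per-class test error rates from the confusion matrix above.
error_rates = {"absence": 9 / 68, "presence": 11 / 52}
# Priors estimated on the training set (82 and 68 of 150 observations).
priors = {"absence": 82 / 150, "presence": 68 / 150}

# Weight each class error rate by its prior and add them up.
weighted_error = sum(priors[c] * error_rates[c] for c in error_rates)
# Misclassified observations over all test observations, i.e. 1 - accuracy.
plain_error = (9 + 11) / 120

print(f"prior-weighted error: {weighted_error:.5f}")
print(f"1 - accuracy        : {plain_error:.5f}")
```

On the training set the two quantities coincide (both 0.126667) because the priors equal the class proportions there; on the test set the class proportions differ slightly from the training priors, hence the small gap.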