CANDISC - heart dataset#
[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")
heart dataset#
[2]:
#heart dataset
from discrimintools.datasets import load_heart
D = load_heart("train")
print(D.info())
<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 0 to 149
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 disease 150 non-null object
1 age 150 non-null int64
2 sex 150 non-null object
3 chestpain 150 non-null object
4 restbpress 150 non-null int64
5 cholesteral 150 non-null int64
6 sugar 150 non-null object
7 electro 150 non-null object
8 maxHeartRate 150 non-null int64
9 ExerciseAngina 150 non-null object
10 oldpeak 150 non-null float64
11 slope 150 non-null object
12 vesselsColored 150 non-null int64
13 thal 150 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 17.6+ KB
None
[3]:
#split into X and y
y, X = D["disease"], D.drop(columns=["disease"])
instantiation & training#
[4]:
from discrimintools import CANDISC
clf = CANDISC(n_components=2)
clf.fit(X,y)
Categorical features have been encoded into binary variables.
[4]:
CANDISC()
Parameters
| n_components | 2 |
| classes | None |
| warn_message | True |
Evaluation of prediction on training data#
[5]:
#eval_predict function
eval_train = clf.eval_predict(X,y,verbose=True)
Observation Profile:
Read Used
Number of Observations 150 150
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 75 7 82
presence 12 56 68
Total 87 63 150
Percent Classified into disease:
prediction absence presence Total
disease
absence 91.463415 8.536585 100.0
presence 17.647059 82.352941 100.0
Total 58.000000 42.000000 100.0
Priors 0.546667 0.453333 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.085366 0.176471 0.126667
Priors 0.546667 0.453333 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.862069 0.914634 0.887574 82.000000
presence 0.888889 0.823529 0.854962 68.000000
accuracy 0.873333 0.873333 0.873333 0.873333
macro avg 0.875479 0.869082 0.871268 150.000000
weighted avg 0.874227 0.873333 0.872790 150.000000
[6]:
#score function
print("Accuracy : {}%".format(100*round(clf.score(X,y),2)))
Accuracy : 87.0%
[7]:
#error rate
print("Error rate : {}%".format(100-100*round(clf.score(X,y),2)))
Error rate : 13.0%
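The figures in the classification report above can be recomputed by hand from the confusion matrix. The sketch below uses only NumPy and the counts printed by `eval_predict`; the variable names are illustrative and are not part of discrimintools.

```python
import numpy as np

# Confusion matrix copied from the training output above
# (rows = true class, columns = predicted class; order: absence, presence)
cm = np.array([[75, 7],
               [12, 56]])

true_positives = np.diag(cm)
support = cm.sum(axis=1)      # observations per true class
predicted = cm.sum(axis=0)    # observations per predicted class

recall = true_positives / support
precision = true_positives / predicted
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for i, label in enumerate(["absence", "presence"]):
    print(f"{label:9s} precision={precision[i]:.4f} "
          f"recall={recall[i]:.4f} f1={f1[i]:.4f}")
print(f"accuracy={accuracy:.4f}")
```

The unrounded accuracy is 131/150, which is why the report shows 0.873333 while the `score` cells, which round to two decimals first, print 87.0% and 13.0%.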
summary#
[8]:
from discrimintools import summaryCANDISC
summaryCANDISC(clf,detailed=True)
Canonical Discriminant Analysis - Results
Summary Information:
infos Value DF DF value
0 Total Sample Size 150 DF Total 149
1 Variables 18 DF Within Classes 148
2 Classes 2 DF Between Classes 1
Class Level Information:
Frequency Proportion Prior Probability
absence 82 0.5467 0.5467
presence 68 0.4533 0.4533
Total-Sample Class Means:
absence presence
age 53.0244 56.3824
sexmale 0.5488 0.8529
chestpainatypicalAngina 0.2073 0.0441
chestpainnonAnginal 0.4390 0.1324
chestpaintypicalAngina 0.1098 0.0294
restbpress 129.3902 135.6471
cholesteral 243.6951 249.1912
sugarlow 0.8293 0.8529
electrosttAbnormality 0.0000 0.0147
electroventricHypertrophy 0.3780 0.6029
maxHeartRate 159.3049 139.2794
ExerciseAnginayes 0.1220 0.5000
oldpeak 0.6415 1.6279
slopeflat 0.3049 0.5882
slopeupsloping 0.6220 0.2941
vesselsColored 0.3171 1.1029
thalnormal 0.7683 0.3088
thalreversableEffect 0.1829 0.6176
Importance of components:
Eigenvalue Difference Proportion Cumulative
Can1 1.5613 NaN 100.0 100.0
Raw Canonical and Classification Functions Coefficients:
Can1 absence presence
Constant 1.0245 -0.0847 -3.1163
age -0.0031 -0.0035 0.0042
sexmale -0.6604 -0.7464 0.9000
chestpainatypicalAngina 0.9829 1.1110 -1.3397
chestpainnonAnginal 0.7308 0.8260 -0.9960
chestpaintypicalAngina 1.8507 2.0918 -2.5224
restbpress -0.0092 -0.0104 0.0126
cholesteral 0.0014 0.0015 -0.0018
sugarlow -0.5250 -0.5933 0.7155
electrosttAbnormality -1.1995 -1.3557 1.6348
electroventricHypertrophy -0.2464 -0.2784 0.3358
maxHeartRate 0.0111 0.0125 -0.0151
ExerciseAnginayes -0.4188 -0.4734 0.5708
oldpeak -0.2847 -0.3218 0.3881
slopeflat -0.4727 -0.5343 0.6443
slopeupsloping -0.2029 -0.2293 0.2766
vesselsColored -0.5928 -0.6700 0.8079
thalnormal 0.4137 0.4676 -0.5638
thalreversableEffect -0.5253 -0.5937 0.7159
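In the table above, each classification column looks like a rescaled copy of the Can1 column. The sketch below checks one reading of that pattern: each class score is a linear function of the canonical score, with the class centroid on Can1 as slope and a constant that includes the log prior. This is an assumption inferred from the printed numbers, not a statement about how discrimintools computes them. All names are illustrative.

```python
import math

# Values copied from the printed table (rounded to 4 decimals)
can1_const = 1.0245
can1_coef = 1.8507                                    # chestpaintypicalAngina, Can1
clf_coef = {"absence": 2.0918, "presence": -2.5224}   # same row, class columns
clf_const = {"absence": -0.0847, "presence": -3.1163}
priors = {"absence": 82 / 150, "presence": 68 / 150}

centroid = {}
rebuilt_const = {}
for label in ("absence", "presence"):
    # Assumed: classification coefficient = centroid * canonical coefficient
    centroid[label] = clf_coef[label] / can1_coef
    # Assumed: constant = centroid * Can1 constant - centroid^2 / 2 + log(prior)
    rebuilt_const[label] = (can1_const * centroid[label]
                            - 0.5 * centroid[label] ** 2
                            + math.log(priors[label]))
    print(label, round(centroid[label], 4),
          round(rebuilt_const[label], 4), clf_const[label])
```

Under this reading the centroids come out at roughly 1.13 for absence and -1.36 for presence, and the rebuilt constants agree with the printed ones up to rounding.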
Test of H0: The canonical correlations in the current row and all that follow are zero
Canonical Correlation Squared Canonical Correlation Likelihood Ratio \
0 0.7808 0.6096 0.3904
Approximate F value Num DF Den DF Pr>F Chi-Square DF Pr>Chi2
0 11.3629 18 131 0.0 130.7323 18 0.0
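With a single canonical axis, the test statistics above follow from the eigenvalue alone. The sketch below is a rough check using the standard textbook formulas: Wilks' lambda, the F statistic (exact for two classes), and Bartlett's chi-square approximation. It starts from the rounded eigenvalue 1.5613, so it matches the printed values only to about three decimals. That discrimintools uses these same formulas is an assumption.

```python
import math

n, p, g = 150, 18, 2      # sample size, encoded variables, classes
eigenvalue = 1.5613       # Can1 eigenvalue from "Importance of components"

squared_corr = eigenvalue / (1 + eigenvalue)
canonical_corr = math.sqrt(squared_corr)
wilks_lambda = 1 / (1 + eigenvalue)    # likelihood ratio for one axis

num_df = p * (g - 1)
den_df = n - p - 1                     # valid for the two-class case
f_value = (1 - wilks_lambda) / wilks_lambda * den_df / num_df

# Bartlett's approximation
chi_square = -(n - 1 - (p + g) / 2) * math.log(wilks_lambda)

print(f"canonical correlation = {canonical_corr:.4f}")
print(f"squared correlation   = {squared_corr:.4f}")
print(f"likelihood ratio      = {wilks_lambda:.4f}")
print(f"F({num_df}, {den_df}) = {f_value:.4f}")
print(f"chi-square({num_df})  = {chi_square:.4f}")
```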
Classification Summary for Calibration Data:
Observation Profile:
Read Used
Number of Observations 150 150
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 75 7 82
presence 12 56 68
Total 87 63 150
Percent Classified into disease:
prediction absence presence Total
disease
absence 91.4634 8.5366 100.0
presence 17.6471 82.3529 100.0
Total 58.0000 42.0000 100.0
Priors 0.5467 0.4533 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.0854 0.1765 0.1267
Priors 0.5467 0.4533 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.8621 0.9146 0.8876 82.0000
presence 0.8889 0.8235 0.8550 68.0000
accuracy 0.8733 0.8733 0.8733 0.8733
macro avg 0.8755 0.8691 0.8713 150.0000
weighted avg 0.8742 0.8733 0.8728 150.0000
Evaluation of prediction on testing dataset#
Testing data#
[9]:
#testing data
DTest = load_heart("test")
print(DTest.info())
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 150 to 269
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 disease 120 non-null object
1 age 120 non-null int64
2 sex 120 non-null object
3 chestpain 120 non-null object
4 restbpress 120 non-null int64
5 cholesteral 120 non-null int64
6 sugar 120 non-null object
7 electro 120 non-null object
8 maxHeartRate 120 non-null int64
9 ExerciseAngina 120 non-null object
10 oldpeak 120 non-null float64
11 slope 120 non-null object
12 vesselsColored 120 non-null int64
13 thal 120 non-null object
dtypes: float64(1), int64(5), object(8)
memory usage: 14.1+ KB
None
[10]:
#split into X and y
yTest, XTest = DTest["disease"], DTest.drop(columns=["disease"])
eval_test = clf.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
Read Used
Number of Observations 120 120
Number of Observations Classified into disease:
prediction absence presence Total
disease
absence 59 9 68
presence 11 41 52
Total 70 50 120
Percent Classified into disease:
prediction absence presence Total
disease
absence 86.764706 13.235294 100.0
presence 21.153846 78.846154 100.0
Total 58.333333 41.666667 100.0
Priors 0.546667 0.453333 NaN
Error Count Estimates for disease:
absence presence Total
Rate 0.132353 0.211538 0.16825
Priors 0.546667 0.453333 NaN
Classification Report for disease:
precision recall f1-score support
absence 0.842857 0.867647 0.855072 68.000000
presence 0.820000 0.788462 0.803922 52.000000
accuracy 0.833333 0.833333 0.833333 0.833333
macro avg 0.831429 0.828054 0.829497 120.000000
weighted avg 0.832952 0.833333 0.832907 120.000000
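On the test set the Total error rate (0.16825) is not simply one minus the accuracy (0.166667). The printed numbers are consistent with the per-class error rates being weighted by the training priors rather than by the test class frequencies. The sketch below reproduces both figures with NumPy. The variable names are illustrative, and the weighting scheme is inferred from the output rather than from the library source.

```python
import numpy as np

# Test confusion matrix copied from the output above
# (rows = true class, columns = predicted class; order: absence, presence)
cm_test = np.array([[59, 9],
                    [11, 41]])

# Priors estimated on the training set (82 and 68 out of 150)
priors = np.array([82, 68]) / 150

per_class_error = 1 - np.diag(cm_test) / cm_test.sum(axis=1)
weighted_error = float(per_class_error @ priors)     # prior-weighted total
raw_error = 1 - np.trace(cm_test) / cm_test.sum()    # misclassified share

print("per-class error rates:", per_class_error.round(6))
print(f"prior-weighted error : {weighted_error:.5f}")
print(f"raw error (1 - acc)  : {raw_error:.6f}")
```

On the training set the two figures coincide because the priors equal the training class proportions.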