CANDISC - heart dataset#

[1]:
#disable warnings
from warnings import simplefilter, filterwarnings
simplefilter(action='ignore', category=FutureWarning)
filterwarnings("ignore")

heart dataset#

[2]:
#heart dataset
from discrimintools.datasets import load_heart
D = load_heart("train")
print(D.info())
<class 'pandas.core.frame.DataFrame'>
Index: 150 entries, 0 to 149
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   disease         150 non-null    object
 1   age             150 non-null    int64
 2   sex             150 non-null    object
 3   chestpain       150 non-null    object
 4   restbpress      150 non-null    int64
 5   cholesteral     150 non-null    int64
 6   sugar           150 non-null    object
 7   electro         150 non-null    object
 8   maxHeartRate    150 non-null    int64
 9   ExerciseAngina  150 non-null    object
 10  oldpeak         150 non-null    float64
 11  slope           150 non-null    object
 12  vesselsColored  150 non-null    int64
 13  thal            150 non-null    object
dtypes: float64(1), int64(5), object(8)
memory usage: 17.6+ KB
None
[3]:
#split into X and y
y, X = D["disease"], D.drop(columns=["disease"])

Instantiation & training#

[4]:
from discrimintools import CANDISC
clf = CANDISC(n_components=2)
clf.fit(X,y)

Categorical features have been encoded into binary variables.

[4]:
CANDISC()
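The fit message says that categorical features were encoded into binary variables. The names that appear later in the summary (for example `slopeflat` and `slopeupsloping`) are consistent with an indicator encoding that drops one reference level per feature and joins the column name to the level name. The sketch below illustrates that idea in plain Python. It is an assumption about the encoding, not the library's actual code; `dummy_encode` is a hypothetical helper and the level `downsloping` is invented for the toy data.

```python
def dummy_encode(rows, column):
    """Encode one categorical column as 0/1 indicators.

    Assumption: the alphabetically first level is the reference level
    and gets no indicator column of its own.
    """
    levels = sorted({row[column] for row in rows})
    kept = levels[1:]  # drop the reference level
    return [
        {f"{column}{level}": int(row[column] == level) for level in kept}
        for row in rows
    ]


# Toy rows for illustration only; "downsloping" is a hypothetical level name.
rows = [{"slope": "flat"}, {"slope": "upsloping"}, {"slope": "downsloping"}]
encoded = dummy_encode(rows, "slope")

for original, indicators in zip(rows, encoded):
    print(original["slope"], indicators)
```

A feature with k levels yields k - 1 indicator columns under this scheme, which would explain how 13 raw features become the 18 variables reported in the summary.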

Evaluation of prediction on training data#

[5]:
#eval_predict function
eval_train = clf.eval_predict(X,y,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations   150   150

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          75         7     82
presence         12        56     68
Total            87        63    150

Percent Classified into disease:
prediction    absence   presence  Total
disease
absence     91.463415   8.536585  100.0
presence    17.647059  82.352941  100.0
Total       58.000000  42.000000  100.0
Priors       0.546667   0.453333    NaN

Error Count Estimates for disease:
         absence  presence     Total
Rate    0.085366  0.176471  0.126667
Priors  0.546667  0.453333       NaN

Classification Report for disease:
              precision    recall  f1-score     support
absence        0.862069  0.914634  0.887574   82.000000
presence       0.888889  0.823529  0.854962   68.000000
accuracy       0.873333  0.873333  0.873333    0.873333
macro avg      0.875479  0.869082  0.871268  150.000000
weighted avg   0.874227  0.873333  0.872790  150.000000
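The classification report above can be recomputed by hand from the confusion matrix, which helps when checking how each figure is defined. The sketch below uses only the counts printed in the output; the variable names are illustrative and nothing here calls the library.

```python
# Confusion counts copied from the training output above:
# outer key = actual class, inner key = predicted class.
confusion = {
    "absence": {"absence": 75, "presence": 7},
    "presence": {"absence": 12, "presence": 56},
}
classes = list(confusion)

report = {}
for c in classes:
    true_pos = confusion[c][c]
    predicted_c = sum(confusion[actual][c] for actual in classes)  # column total
    actual_c = sum(confusion[c].values())                          # row total
    precision = true_pos / predicted_c
    recall = true_pos / actual_c
    f1 = 2 * precision * recall / (precision + recall)
    report[c] = {"precision": precision, "recall": recall, "f1": f1}

# Unweighted mean of the per-class precisions ("macro avg" row).
macro_precision = sum(report[c]["precision"] for c in classes) / len(classes)

for c in classes:
    print(c, {k: round(v, 6) for k, v in report[c].items()})
print("macro precision", round(macro_precision, 6))
```

Precision divides by the column total (everything predicted as the class), while recall divides by the row total (everything that truly belongs to the class).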
[6]:
#score function
print("Accuracy : {}%".format(100*round(clf.score(X,y),2)))
Accuracy : 87.0%
[7]:
#error rate
print("Error rate : {}%".format(100-100*round(clf.score(X,y),2)))
Error rate : 13.0%
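Rounding the score to two decimals before multiplying by 100 hides part of the value: the report gives an accuracy of 0.873333, which prints as 87.0%. A minimal alternative, sketched here with the counts from the confusion matrix rather than a call to `clf.score`, is to let the format specification do the rounding.

```python
# Counts copied from the training confusion matrix above.
correct = 75 + 56
total = 150

accuracy = correct / total
error_rate = 1 - accuracy

# The "%" format type multiplies by 100 and rounds only for display.
accuracy_text = f"Accuracy : {accuracy:.2%}"
error_text = f"Error rate : {error_rate:.2%}"
print(accuracy_text)
print(error_text)
```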

summary#

[8]:
from discrimintools import summaryCANDISC
summaryCANDISC(clf,detailed=True)
                     Canonical Discriminant Analysis - Results

Summary Information:
               infos  Value                  DF  DF value
0  Total Sample Size    150            DF Total       149
1          Variables     18   DF Within Classes       148
2            Classes      2  DF Between Classes         1

Class Level Information:
          Frequency  Proportion  Prior Probability
absence          82      0.5467             0.5467
presence         68      0.4533             0.4533

Total-Sample Class Means:
                            absence  presence
age                         53.0244   56.3824
sexmale                      0.5488    0.8529
chestpainatypicalAngina      0.2073    0.0441
chestpainnonAnginal          0.4390    0.1324
chestpaintypicalAngina       0.1098    0.0294
restbpress                 129.3902  135.6471
cholesteral                243.6951  249.1912
sugarlow                     0.8293    0.8529
electrosttAbnormality        0.0000    0.0147
electroventricHypertrophy    0.3780    0.6029
maxHeartRate               159.3049  139.2794
ExerciseAnginayes            0.1220    0.5000
oldpeak                      0.6415    1.6279
slopeflat                    0.3049    0.5882
slopeupsloping               0.6220    0.2941
vesselsColored               0.3171    1.1029
thalnormal                   0.7683    0.3088
thalreversableEffect         0.1829    0.6176

Importance of components:
      Eigenvalue  Difference  Proportion  Cumulative
Can1      1.5613         NaN       100.0       100.0

Raw Canonical and Classification Functions Coefficients:
                             Can1  absence  presence
Constant                   1.0245  -0.0847   -3.1163
age                       -0.0031  -0.0035    0.0042
sexmale                   -0.6604  -0.7464    0.9000
chestpainatypicalAngina    0.9829   1.1110   -1.3397
chestpainnonAnginal        0.7308   0.8260   -0.9960
chestpaintypicalAngina     1.8507   2.0918   -2.5224
restbpress                -0.0092  -0.0104    0.0126
cholesteral                0.0014   0.0015   -0.0018
sugarlow                  -0.5250  -0.5933    0.7155
electrosttAbnormality     -1.1995  -1.3557    1.6348
electroventricHypertrophy -0.2464  -0.2784    0.3358
maxHeartRate               0.0111   0.0125   -0.0151
ExerciseAnginayes         -0.4188  -0.4734    0.5708
oldpeak                   -0.2847  -0.3218    0.3881
slopeflat                 -0.4727  -0.5343    0.6443
slopeupsloping            -0.2029  -0.2293    0.2766
vesselsColored            -0.5928  -0.6700    0.8079
thalnormal                 0.4137   0.4676   -0.5638
thalreversableEffect      -0.5253  -0.5937    0.7159

Test of H0: The canonical correlations in the current row and all that follow are zero
   Canonical Correlation  Squared Canonical Correlation  Likelihood Ratio  \
0                 0.7808                         0.6096            0.3904

   Approximate F value  Num DF  Den DF  Pr>F  Chi-Square  DF  Pr>Chi2
0              11.3629      18     131   0.0    130.7323  18      0.0
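With two classes there is a single canonical axis, so the statistics in the test table can all be derived from the one eigenvalue. The sketch below applies the standard textbook formulas (squared canonical correlation, Wilks' lambda, Rao's F, Bartlett's chi-square) to the rounded values printed above. It assumes the library uses the same formulas, which the source does not state; because the eigenvalue is rounded to four decimals, the results agree with the table only approximately.

```python
import math

# Values copied from the summary output above.
eigenvalue = 1.5613   # Can1 eigenvalue
n_obs = 150           # total sample size
n_vars = 18           # variables after encoding
n_classes = 2

# Squared canonical correlation and its square root.
sq_can_corr = eigenvalue / (1.0 + eigenvalue)
can_corr = math.sqrt(sq_can_corr)

# With a single canonical axis, the likelihood ratio (Wilks' lambda)
# is 1 / (1 + eigenvalue), i.e. 1 - squared canonical correlation.
wilks = 1.0 / (1.0 + eigenvalue)

# F statistic for the two-class case.
num_df = n_vars * (n_classes - 1)
den_df = n_obs - n_vars - 1
f_value = (1.0 - wilks) / wilks * den_df / num_df

# Bartlett's chi-square approximation, same numerator degrees of freedom.
chi_square = -(n_obs - 1 - (n_vars + n_classes) / 2) * math.log(wilks)

print(round(sq_can_corr, 4), round(wilks, 4), num_df, den_df)
print(round(f_value, 2), round(chi_square, 1))
```

Both tests reject the hypothesis of a zero canonical correlation here, in line with the p-values of 0.0 shown in the table.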

Classification Summary for Calibration Data:

Observation Profile:
                        Read  Used
Number of Observations   150   150

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          75         7     82
presence         12        56     68
Total            87        63    150

Percent Classified into disease:
prediction  absence  presence  Total
disease
absence     91.4634    8.5366  100.0
presence    17.6471   82.3529  100.0
Total       58.0000   42.0000  100.0
Priors       0.5467    0.4533    NaN

Error Count Estimates for disease:
        absence  presence   Total
Rate     0.0854    0.1765  0.1267
Priors   0.5467    0.4533     NaN

Classification Report for disease:
              precision  recall  f1-score   support
absence          0.8621  0.9146    0.8876   82.0000
presence         0.8889  0.8235    0.8550   68.0000
accuracy         0.8733  0.8733    0.8733    0.8733
macro avg        0.8755  0.8691    0.8713  150.0000
weighted avg     0.8742  0.8733    0.8728  150.0000

Evaluation of prediction on testing dataset#

Testing data#

[9]:
#testing data
DTest = load_heart("test")
print(DTest.info())
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 150 to 269
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   disease         120 non-null    object
 1   age             120 non-null    int64
 2   sex             120 non-null    object
 3   chestpain       120 non-null    object
 4   restbpress      120 non-null    int64
 5   cholesteral     120 non-null    int64
 6   sugar           120 non-null    object
 7   electro         120 non-null    object
 8   maxHeartRate    120 non-null    int64
 9   ExerciseAngina  120 non-null    object
 10  oldpeak         120 non-null    float64
 11  slope           120 non-null    object
 12  vesselsColored  120 non-null    int64
 13  thal            120 non-null    object
dtypes: float64(1), int64(5), object(8)
memory usage: 14.1+ KB
None
[10]:
#split into X and y
yTest, XTest = DTest["disease"], DTest.drop(columns=["disease"])
eval_test = clf.eval_predict(XTest,yTest,verbose=True)
Observation Profile:
                        Read  Used
Number of Observations   120   120

Number of Observations Classified into disease:
prediction  absence  presence  Total
disease
absence          59         9     68
presence         11        41     52
Total            70        50    120

Percent Classified into disease:
prediction    absence   presence  Total
disease
absence     86.764706  13.235294  100.0
presence    21.153846  78.846154  100.0
Total       58.333333  41.666667  100.0
Priors       0.546667   0.453333    NaN

Error Count Estimates for disease:
         absence  presence    Total
Rate    0.132353  0.211538  0.16825
Priors  0.546667  0.453333      NaN

Classification Report for disease:
              precision    recall  f1-score     support
absence        0.842857  0.867647  0.855072   68.000000
presence       0.820000  0.788462  0.803922   52.000000
accuracy       0.833333  0.833333  0.833333    0.833333
macro avg      0.831429  0.828054  0.829497  120.000000
weighted avg   0.832952  0.833333  0.832907  120.000000
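The total error rate reported for the test set (0.16825) is not simply one minus the accuracy (1 - 0.833333 = 0.166667). The numbers are consistent with a prior-weighted average of the per-class error rates, using the priors estimated on the training data. The sketch below reproduces both figures from the printed counts; the weighting rule is inferred from the output, not confirmed by the library's documentation.

```python
# Test confusion counts copied from the output above:
# outer key = actual class, inner key = predicted class.
confusion = {
    "absence": {"absence": 59, "presence": 9},
    "presence": {"absence": 11, "presence": 41},
}

# Priors shown in the output: the training class proportions.
priors = {"absence": 82 / 150, "presence": 68 / 150}

# Per-class error rate: share of each actual class that was misclassified.
error_rate = {}
for actual, row in confusion.items():
    n_actual = sum(row.values())
    error_rate[actual] = (n_actual - row[actual]) / n_actual

# Total error as a prior-weighted average of the per-class rates.
weighted_error = sum(priors[c] * error_rate[c] for c in confusion)

# Plain misclassification rate on the test set, for comparison.
n_total = sum(sum(row.values()) for row in confusion.values())
n_wrong = sum(row[p] for a, row in confusion.items() for p in row if p != a)
raw_error = n_wrong / n_total

print({c: round(r, 6) for c, r in error_rate.items()})
print(round(weighted_error, 5), round(raw_error, 6))
```

The two totals coincide on the training data because the priors there equal the class proportions of the sample being scored. They differ on the test set, where the class mix (68 and 52 out of 120) is not the same as the training priors.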