Getting started#
discrimintools is an open source python library dedicated to Discriminant Analysis (DA) distributed under the MIT Licence.
The purpose of this guide is to illustrate some of the main features of discrimintools. It assumes basic working knowledge of scikit-learn practices.
Fitting and predicting: estimator basics#
As scikit-learn, discrimintools provides models called estimators. Each estimator can be fitted to some data using its fit method.
Here is a simple example where we fit a DISCRIM to load_alcools data:
>>> from discrimintools.datasets import load_alcools
>>> from discrimintools import DISCRIM
>>> D = load_alcools("train") # load training data
>>> y, X = D["TYPE"], D.drop(columns=["TYPE"]) # split into X and y
>>> clf = DISCRIM()
>>> clf.fit(X, y)
DISCRIM(priors='prop')
The fit method generally accepts 2 inputs:
The samples DataFrame (or design matrix)
X. The size ofXis typically(n_samples, n_features), which means that samples are represented as rows and features are represented as columns.The target values
ywhich are true lables forX.yis a pandas Series with categorical terms. For unsupervised learning tasks,ydoes not need to be specified, such asGFAandMPCA.
Both X and y are expected to be pandas objects : DataFrame for X and Categorical Series for y.
Once the estimator is fitted, it can be used for predicting target values of new data. You don’t need to re-train the estimator:
>>> DTest = load_alcools("test") # load testing data
>>> yTest, XTest = DTest["TYPE"], DTest.drop(columns=["TYPE"])
>>> clf.predict(X).head() # predict classes of the training data
...
>>> clf.predict(XTest).head() # predict classes of new (test) data
Pipelines: chaining pre-processors and estimators#
Transformers and estimators (predictors) can be combined together into a
single unifying object: a Pipeline. The sklearn pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with fit and predict. As we will see later, using a
pipeline will also prevent you from data leakage, i.e. disclosing some
testing data in your training data.
In the following example, we use the load_alcools dataset to run a linear discriminant analysis on principal components:
>>> from discrimintools.datasets import load_alcools
>>> from sklearn.pipeline import Pipeline
>>> from discrimintools import GFA, DISCRIM
>>> D = load_alcools("train")
>>> y, X = D["TYPE"], D.drop(columns=["TYPE"]) # split into X and y
...
>>> #create a pipeline object
>>> clf = Pipeline([("gfa",GFA(n_components=2)),
... ("lda",DISCRIM(method="linear",warn_message=False))])
>>> clf.fit(X,y)
For more about machine learning terms, see sklearn glossary.