.. _getting_started: =============== Getting started =============== `discrimintools `_ is an open source python library dedicated to Discriminant Analysis (DA) distributed under the MIT Licence. The purpose of this guide is to illustrate some of the main features of ``discrimintools``. It assumes basic working knowledge of `scikit-learn `_ practices. Fitting and predicting: estimator basics ---------------------------------------- As `scikit-learn `_, `discrimintools `_ provides models called `estimators `_. Each estimator can be fitted to some data using its `fit `_ method. Here is a simple example where we fit a :class:`~discrimintools.DISCRIM` to :class:`~discrimintools.datasets.load_alcools` data: >>> from discrimintools.datasets import load_alcools >>> from discrimintools import DISCRIM >>> D = load_alcools("train") # load training data >>> y, X = D["TYPE"], D.drop(columns=["TYPE"]) # split into X and y >>> clf = DISCRIM() >>> clf.fit(X, y) DISCRIM(priors='prop') The ``fit`` method generally accepts 2 inputs: - The samples DataFrame (or design matrix) ``X``. The size of ``X`` is typically ``(n_samples, n_features)``, which means that samples are represented as rows and features are represented as columns. - The target values ``y`` which are true lables for ``X``. ``y`` is a pandas Series with categorical terms. For unsupervised learning tasks, ``y`` does not need to be specified, such as :class:`~discrimintools.GFA` and :class:`~discrimintools.MPCA`. Both ``X`` and ``y`` are expected to be pandas objects : `DataFrame `_ for ``X`` and `Categorical Series `_ for ``y``. Once the estimator is fitted, it can be used for predicting target values of new data. You don't need to re-train the estimator:: >>> DTest = load_alcools("test") # load testing data >>> yTest, XTest = DTest["TYPE"], DTest.drop(columns=["TYPE"]) >>> clf.predict(X).head() # predict classes of the training data ... >>> clf.predict(XTest).head() # predict classes of new (test) data Pipelines: chaining pre-processors and estimators ------------------------------------------------- Transformers and estimators (predictors) can be combined together into a single unifying object: a `Pipeline `_. The sklearn pipeline offers the same API as a regular estimator: it can be fitted and used for prediction with ``fit`` and ``predict``. As we will see later, using a pipeline will also prevent you from data leakage, i.e. disclosing some testing data in your training data. In the following example, we use the :class:`~discrimintools.datasets.load_alcools` dataset to run a linear discriminant analysis on principal components: >>> from discrimintools.datasets import load_alcools >>> from sklearn.pipeline import Pipeline >>> from discrimintools import GFA, DISCRIM >>> D = load_alcools("train") >>> y, X = D["TYPE"], D.drop(columns=["TYPE"]) # split into X and y ... >>> #create a pipeline object >>> clf = Pipeline([("gfa",GFA(n_components=2)), ... ("lda",DISCRIM(method="linear",warn_message=False))]) >>> clf.fit(X,y) For more about machine learning terms, see `sklearn glossary `_.