Origins
In the late 1990's, three researchers wrote some code in MATLAB to classify data using mixture models. Initially named XEM for "EM-algorithms on miXture models", it was quickly renamed into mixmod, and rewritten in C++ from 2001. Since then, mixmod has been extended in several directions including:- supervised classification
- categorical data handling
- heterogeneous data handling
Summary
Mixstore is a website gathering libraries dedicated to data modeling as a mixture of probabilistic components. The computed mixture can be used for various purposes including- density estimation
- clustering (unsupervised classification)
- (supervised) classification
- regression, ...
Example
To start using any of the softwares present in the store, we need a dataset. We choose here an old classic: the Iris dataset introduced by Ronald Fisher in 1936. Despite its classicity this dataset is not so easy to analyze, as we will see in the following.
The Iris dataset contains 150 rows, each of them composed of 4 continuous attributes which corresponds to some flowers measurements. 3 species are equally represented : (Iris) Setosa, Versicolor and Virginica.
As the figure suggests the goal on this dataset is to discriminate Iris species. That is to say, our goal is to find a way to answer these questions: "are two given elements in the same group ?", "which group does a given element belongs to ?".
The mixstore packages take a more general approach: they (try to) learn the data generation process, and then deduce the groups compositions. Thus, the two above questions can easily be answered by using the mathematical formulas describing the classes. Although this approach has several advantages (low sensitivity to outliers, likelihood to rank models...), finding the adequate model is challenging. We will not dive into such model selection details. {# This is a more general and harder problem. #}
Density for 2 groups:
££f^{(2)}(x) = \pi_1^{(2)} g_1^{(2)}(x) + \pi_2^{(2)} g_2^{(2)}(x)££
where £g_i^{(2)} = (2 \pi)^{-d/2} \left| \Sigma_i^{(2)} \right|^{-1/2} \mbox{exp}\left( -\frac{1}{2} \, {}^T(x - \mu_i^{(2)}) (\Sigma_i^{(2)})^{-1} (x - \mu_i^{(2)}) \right)£.
£x = (x_1,x_2,x_3,x_4)£ with the following correspondances.
- £x_1£: sepal length;
- £x_2£: sepal width;
- £x_3£: petal length;
- £x_4£: petal width.
Density for 3 groups:
££f^{(3)}(x) = \pi_1^{(3)} g_1^{(3)}(x) + \pi_2^{(3)} g_2^{(3)}(x) + \pi_3^{(3)} g_3^{(3)}(x)££
(Same parameterizations for cluster densities £g_i^{(3)}£).
As initially stated, the dataset is difficult to cluster because although we know there are 3 species, 2 of them are almost undinstinguishable. That is why log-likelihood values are very close. We usually consider that a method is good on Iris dataset when it finds 3 clusters, but 2 is also a correct answer.