{# mixstore.git — src/Mixstore/StaticBundle/Resources/views/about.html.twig #}
{% extends "::base.html.twig" %}

{% block title %}{{ parent() }}about{% endblock %}

{% block header %}
{{ parent() }}
<link rel="stylesheet" href="{{ asset('mixstore/css/static/about.css') }}">
{% endblock %}

{% block content %}

<div id="maintext" class="row">

<div class="col-xs-12 borderbottom">

<h2>Origins</h2>

In the late 1990s, three researchers wrote MATLAB code to classify data using
mixture models. Initially named XEM for "EM-algorithms on miXture models",
it was quickly renamed mixmod, and rewritten in C++ starting in 2001.
Since then, mixmod has been extended in several directions, including:
<ul>
 <li>supervised classification</li>
 <li>categorical data handling</li>
 <li>heterogeneous data handling</li>
</ul>
...and the code is constantly evolving. {# still in constant evolution #}
More details can be found on the <a href="http://www.mixmod.org">dedicated website</a>.

Many packages related to mixture models now exist, each of them specialized in
some domain. Although mixmod can (arguably) be considered one of the first of its kind,
it would be rather arbitrary to give it a central position.
That is why mixmod is "only" part of the mix-store.

{# (mixmod can do more things: point to the website + documentation...) #}

<h2>Summary</h2>

Mixstore is a website gathering libraries dedicated to modeling data as
a mixture of probabilistic components. The computed mixture can be used
for various purposes, including:
<ul>
 <li>density estimation</li>
 <li>clustering (unsupervised classification)</li>
 <li>(supervised) classification</li>
 <li>regression, ...</li>
</ul>

<h2>Example</h2>

<p>
To start using any of the software packages in the store, we need a dataset.
We choose here an old classic: the Iris dataset, introduced by Ronald Fisher in 1936.
Despite its classic status, this dataset is not so easy to analyze, as we will see below.
</p>

<p>
The <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">Iris dataset</a>
contains 150 rows, each composed of 4 continuous attributes corresponding
to flower measurements. Three species are equally represented: (Iris)
Setosa, Versicolor and Virginica.
</p>

<p>
 <figure>
 <img src="{{ asset('mixstore/images/iris_pca.png') }}" alt="PCA components of iris dataset"/><br/>
 <figcaption>The first two PCA components of the Iris dataset (image found
 <a href="http://www.wanderinformatiker.at/unipages/general/img/iris_pca1.png">here</a>)</figcaption>
 </figure>
</p>

<p>
As the figure suggests, the goal on this dataset is to discriminate between Iris species.
That is to say, our goal is to find a way to answer these questions:
"are two given elements in the same group?", "which group does a given element belong to?".
</p>

<p>
The mixstore packages take a more general approach: they (try to) learn the data generation
process, and then deduce the group compositions. Thus, the two questions above can easily
be answered using the mathematical formulas describing the classes.
Although this approach has several advantages (low sensitivity to outliers, a likelihood
to rank models...), finding an adequate model is challenging.
We will not dive into such model selection details here.
{# This is a more general and harder problem. #}
</p>
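For instance, once a mixture £f(x) = \sum_k \pi_k g_k(x)£ has been estimated, the question "which group does £x£ belong to?" is answered through the posterior probabilities £\pi_k g_k(x) / f(x)£ (Bayes' rule). A minimal numpy sketch, using made-up toy parameters rather than any actual Iris estimates:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density in dimension d."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

def posterior(x, weights, mus, sigmas):
    """P(cluster k | x) = pi_k g_k(x) / sum_j pi_j g_j(x)."""
    joint = np.array([w * gaussian_pdf(x, m, s)
                      for w, m, s in zip(weights, mus, sigmas)])
    return joint / joint.sum()

# Toy 2-component mixture in 2 dimensions (illustrative parameters only):
weights = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.eye(2)]

# A point close to the first mean gets almost all its posterior mass
# on cluster 0.
p = posterior(np.array([0.1, -0.2]), weights, mus, sigmas)
```

The same posterior vector also answers "are two elements in the same group?": compare the clusters each element is most likely assigned to.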

</div>

<div class="col-xs-12">

<p>
Density for 2 groups:
££f^{(2)}(x) = \pi_1^{(2)} g_1^{(2)}(x) + \pi_2^{(2)} g_2^{(2)}(x)££
where £g_i^{(2)}(x) = (2 \pi)^{-d/2} \left| \Sigma_i^{(2)} \right|^{-1/2} \mbox{exp}\left( -\frac{1}{2} \, {}^T(x - \mu_i^{(2)}) (\Sigma_i^{(2)})^{-1} (x - \mu_i^{(2)}) \right)£.<br/>
£x = (x_1,x_2,x_3,x_4)£ with the following correspondences.
<ul>
 <li>£x_1£: sepal length;</li>
 <li>£x_2£: sepal width;</li>
 <li>£x_3£: petal length;</li>
 <li>£x_4£: petal width.</li>
</ul>
</p>

</div>

<div class="col-xs-12 col-sm-6">
\begin{align*}
\pi_1^{(2)} &= 0.33\\
\mu_1^{(2)} &= (5.01,\ 3.43,\ 1.46,\ 0.25)\\
\Sigma_1^{(2)} &=
 \begin{pmatrix}
 0.15&0.13&0.02&0.01\\
 0.13&0.18&0.02&0.01\\
 0.02&0.02&0.03&0.01\\
 0.01&0.01&0.01&0.01
 \end{pmatrix}
\end{align*}
</div>

<div class="col-xs-12 col-sm-6">
\begin{align*}
\pi_2^{(2)} &= 0.67\\
\mu_2^{(2)} &= (6.26,\ 2.87,\ 4.91,\ 1.68)\\
\Sigma_2^{(2)} &=
 \begin{pmatrix}
 0.40&0.11&0.40&0.14\\
 0.11&0.11&0.12&0.07\\
 0.40&0.12&0.61&0.26\\
 0.14&0.07&0.26&0.17
 \end{pmatrix}
\end{align*}
</div>

<div class="col-xs-12 borderbottom">
Penalized log-likelihood (BIC): <b>-561.73</b>
</div>
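As a sanity check, the two-component density can be evaluated numerically from the rounded parameters printed above. A numpy sketch (the rounding means the values are only approximate):

```python
import numpy as np

# Rounded parameters of the 2-group fit, as displayed above.
pi = [0.33, 0.67]
mu = [np.array([5.01, 3.43, 1.46, 0.25]),
      np.array([6.26, 2.87, 4.91, 1.68])]
sigma = [np.array([[0.15, 0.13, 0.02, 0.01],
                   [0.13, 0.18, 0.02, 0.01],
                   [0.02, 0.02, 0.03, 0.01],
                   [0.01, 0.01, 0.01, 0.01]]),
         np.array([[0.40, 0.11, 0.40, 0.14],
                   [0.11, 0.11, 0.12, 0.07],
                   [0.40, 0.12, 0.61, 0.26],
                   [0.14, 0.07, 0.26, 0.17]])]

def g(x, m, s):
    """Gaussian component density g_i^{(2)}(x) from the formula above."""
    d = len(m)
    diff = x - m
    return ((2 * np.pi) ** (-d / 2) * np.linalg.det(s) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.inv(s) @ diff))

def f2(x):
    """Mixture density f^{(2)}(x) = pi_1 g_1(x) + pi_2 g_2(x)."""
    return pi[0] * g(x, mu[0], sigma[0]) + pi[1] * g(x, mu[1], sigma[1])

# The density is high at the first component mean (a Setosa-like
# measurement) and much lower far away from both means.
density = f2(mu[0])
```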

<div class="col-xs-12">

<p>
Density for 3 groups:
££f^{(3)}(x) = \pi_1^{(3)} g_1^{(3)}(x) + \pi_2^{(3)} g_2^{(3)}(x) + \pi_3^{(3)} g_3^{(3)}(x)££
(same parameterization as above for the cluster densities £g_i^{(3)}£).<br/>
</p>

</div>

<div class="col-xs-12 col-md-4">
\begin{align*}
\pi_1^{(3)} &= 0.33\\
\mu_1^{(3)} &= (5.01,\ 3.43,\ 1.46,\ 0.25)\\
\Sigma_1^{(3)} &=
 \begin{pmatrix}
 0.13&0.11&0.02&0.01\\
 0.11&0.15&0.01&0.01\\
 0.02&0.01&0.03&0.01\\
 0.01&0.01&0.01&0.01
 \end{pmatrix}
\end{align*}
</div>

<div class="col-xs-12 col-md-4">
\begin{align*}
\pi_2^{(3)} &= 0.30\\
\mu_2^{(3)} &= (5.91,\ 2.78,\ 4.20,\ 1.30)\\
\Sigma_2^{(3)} &=
 \begin{pmatrix}
 0.23&0.08&0.15&0.04\\
 0.08&0.08&0.07&0.03\\
 0.15&0.07&0.17&0.05\\
 0.04&0.03&0.05&0.03
 \end{pmatrix}
\end{align*}
</div>

<div class="col-xs-12 col-md-4">
\begin{align*}
\pi_3^{(3)} &= 0.37\\
\mu_3^{(3)} &= (6.55,\ 2.95,\ 5.48,\ 1.96)\\
\Sigma_3^{(3)} &=
 \begin{pmatrix}
 0.43&0.11&0.33&0.07\\
 0.11&0.12&0.09&0.06\\
 0.33&0.09&0.36&0.09\\
 0.07&0.06&0.09&0.09
 \end{pmatrix}
\end{align*}
</div>

<div class="col-xs-12 borderbottom">
Penalized log-likelihood (BIC): <b>-562.55</b>
</div>
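Part of the explanation is the BIC penalty itself: a K-component Gaussian mixture in dimension d has (K-1) free mixing weights, Kd mean coordinates and Kd(d+1)/2 covariance entries, so the 3-group model must gain enough log-likelihood to pay for its extra parameters. A quick count (the exact BIC convention is the one used by the fitting package):

```python
def n_free_params(K, d):
    """Free parameters of a K-component Gaussian mixture in dimension d
    with full covariance matrices:
    (K-1) weights + K*d means + K*d*(d+1)/2 covariance entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

two_groups = n_free_params(2, 4)    # 2-group model on the 4 Iris attributes
three_groups = n_free_params(3, 4)  # 3-group model: 15 more parameters
```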

<div class="col-xs-12">

<p>
As stated initially, this dataset is difficult to cluster: although we know there are
3 species, 2 of them are almost indistinguishable. That is why the log-likelihood values are very close.
A method is usually considered to perform well on the Iris dataset when it finds 3 clusters,
but 2 is also a correct answer.
</p>

</div>

</div>

{% endblock %}

{% block javascripts %}
{{ parent() }}
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
 tex2jax: {
  inlineMath: [['£','£']],
  displayMath: [['££','££']],
  skipTags: ["script","noscript","style"]//,"textarea","pre","code"]
 }
});
</script>
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
{% endblock %}