Commit | Line | Data |
---|---|---|
929ca066 BA |
1 | {% extends "::base.html.twig" %} |
2 | ||
3 | {% block title %}{{ parent() }}about{% endblock %} | |
4 | ||
5 | {% block header %} | |
6 | {{ parent() }} | |
7 | <link rel="stylesheet" href="{{ asset('mixstore/css/static/about.css') }}"> | |
8 | {% endblock %} | |
9 | ||
10 | {% block content %} | |
11 | ||
12 | <div id="maintext" class="row"> | |
13 | ||
14 | <div class="col-xs-12 borderbottom"> | |
15 | ||
16 | <h2>Origins</h2> | |
17 | ||
18 | In the late 1990s, three researchers wrote some MATLAB code to classify data using | |
19 | mixture models. Initially named XEM for "EM-algorithms on miXture models", | |
20 | it was quickly renamed mixmod and rewritten in C++ starting in 2001. | |
21 | Since then, mixmod has been extended in several directions including: | |
22 | <ul> | |
23 | <li>supervised classification</li> | |
24 | <li>categorical data handling</li> | |
25 | <li>heterogeneous data handling</li> | |
26 | </ul> | |
27 | ...and the code is constantly evolving. {# still in constant evolution #} | |
28 | More details can be found on the <a href="http://www.mixmod.org">dedicated website</a>. | |
29 | ||
30 | Many packages related to mixture models now exist, each specialized in | |
31 | some domain. Although mixmod can (arguably) be considered one of the first of its kind, | |
32 | it would be rather arbitrary to give it a central position. | |
33 | That is why mixmod is "only" part of the mix-store. | |
34 | ||
35 | {# (mixmod can do more things: point to the website + docs...) #} | |
36 | ||
37 | <h2>Summary</h2> | |
38 | ||
39 | Mixstore is a website gathering libraries dedicated to modeling data as | |
40 | a mixture of probabilistic components. The computed mixture can be used | |
41 | for various purposes, including: | |
42 | <ul> | |
43 | <li>density estimation</li> | |
44 | <li>clustering (unsupervised classification)</li> | |
45 | <li>(supervised) classification</li> | |
46 | <li>regression, ...</li> | |
47 | </ul> | |
48 | ||
49 | <h2>Example</h2> | |
50 | ||
51 | <p> | |
52 | To start using any of the software packages in the store, we need a dataset. | |
53 | We choose here an old classic: the Iris dataset introduced by Ronald Fisher in 1936. | |
54 | Despite being a classic, this dataset is not so easy to analyze, as we will see below. | |
55 | </p> | |
56 | ||
57 | <p> | |
58 | The <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">Iris dataset</a> | |
59 | contains 150 rows, each composed of 4 continuous attributes which | |
60 | correspond to flower measurements. Three species are equally represented: (Iris) | |
61 | Setosa, Versicolor and Virginica. | |
62 | </p> | |
63 | ||
64 | <p> | |
65 | <figure> | |
66 | <img src="{{ asset('mixstore/images/iris_pca.png') }}" alt="PCA components of iris dataset"/><br/> | |
67 | <figcaption>The first two principal components of the Iris dataset (image found | |
68 | <a href="http://www.wanderinformatiker.at/unipages/general/img/iris_pca1.png">here</a>)</figcaption> | |
69 | </figure> | |
70 | </p> | |
71 | ||
72 | <p> | |
73 | As the figure suggests, the goal with this dataset is to discriminate between Iris species. | |
74 | That is to say, our goal is to find a way to answer these questions: | |
75 | "are two given elements in the same group?", "which group does a given element belong to?". | |
76 | </p> | |
77 | ||
78 | <p> | |
79 | The mixstore packages take a more general approach: they (try to) learn the data generation | |
80 | process, and then deduce the composition of the groups. Thus, the two questions above can easily | |
81 | be answered by using the mathematical formulas describing the classes. | |
82 | Although this approach has several advantages (low sensitivity to outliers, a likelihood | |
83 | with which to rank models...), finding an adequate model is challenging. | |
84 | We will not dive into such model selection details. | |
85 | {# This is a more general and harder problem. #} | |
86 | </p> | |
87 | ||
88 | </div> | |
89 | ||
90 | <div class="col-xs-12"> | |
91 | ||
92 | <p> | |
93 | Density for 2 groups: | |
94 | ££f^{(2)}(x) = \pi_1^{(2)} g_1^{(2)}(x) + \pi_2^{(2)} g_2^{(2)}(x)££ | |
95 | where £g_i^{(2)}(x) = (2 \pi)^{-d/2} \left| \Sigma_i^{(2)} \right|^{-1/2} \exp\left( -\frac{1}{2} \, {}^T(x - \mu_i^{(2)}) (\Sigma_i^{(2)})^{-1} (x - \mu_i^{(2)}) \right)£ and £d = 4£ is the number of attributes.<br/> | |
96 | £x = (x_1,x_2,x_3,x_4)£ with the following correspondences: | |
97 | <ul> | |
98 | <li>£x_1£: sepal length;</li> | |
99 | <li>£x_2£: sepal width;</li> | |
100 | <li>£x_3£: petal length;</li> | |
101 | <li>£x_4£: petal width.</li> | |
102 | </ul> | |
103 | </p> | |
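| <p> | |
| As a sketch of how such a density answers the two questions above, here is the standard | |
| maximum a posteriori rule for mixture models. This page does not spell the rule out, and the symbol £t_k£ | |
| is introduced here for illustration only. Dropping the superscript, the probability that a flower £x£ belongs to group £k£ is | |
| ££t_k(x) = \frac{\pi_k \, g_k(x)}{\sum_{j} \pi_j \, g_j(x)}££ | |
| and £x£ is assigned to the group with the largest £t_k(x)£. Under this rule, two flowers are in the same group | |
| when they receive the same assignment. | |
| </p> | |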
104 | ||
105 | </div> | |
106 | ||
107 | <div class="col-xs-12 col-sm-6"> | |
108 | \begin{align*} | |
109 | \pi_1^{(2)} &= 0.33\\ | |
110 | \mu_1^{(2)} &= (5.01,\ 3.43,\ 1.46,\ 0.25)\\ | |
111 | \Sigma_1^{(2)} &= | |
112 | \begin{pmatrix} | |
113 | 0.15&0.13&0.02&0.01\\ | |
114 | 0.13&0.18&0.02&0.01\\ | |
115 | 0.02&0.02&0.03&0.01\\ | |
116 | 0.01&0.01&0.01&0.01 | |
117 | \end{pmatrix} | |
118 | \end{align*} | |
119 | </div> | |
120 | ||
121 | <div class="col-xs-12 col-sm-6"> | |
122 | \begin{align*} | |
123 | \pi_2^{(2)} &= 0.67\\ | |
124 | \mu_2^{(2)} &= (6.26,\ 2.87,\ 4.91,\ 1.68)\\ | |
125 | \Sigma_2^{(2)} &= | |
126 | \begin{pmatrix} | |
127 | 0.40&0.11&0.40&0.14\\ | |
128 | 0.11&0.11&0.12&0.07\\ | |
129 | 0.40&0.12&0.61&0.26\\ | |
130 | 0.14&0.07&0.26&0.17 | |
131 | \end{pmatrix} | |
132 | \end{align*} | |
133 | </div> | |
134 | ||
135 | <div class="col-xs-12 borderbottom"> | |
136 | Penalized log-likelihood (BIC): <b>-561.73</b> | |
137 | </div> | |
138 | ||
139 | <div class="col-xs-12"> | |
140 | ||
141 | <p> | |
142 | Density for 3 groups: | |
143 | ££f^{(3)}(x) = \pi_1^{(3)} g_1^{(3)}(x) + \pi_2^{(3)} g_2^{(3)}(x) + \pi_3^{(3)} g_3^{(3)}(x)££ | |
144 | (The cluster densities £g_i^{(3)}£ are parameterized in the same way.)<br/> | |
145 | </p> | |
146 | ||
147 | </div> | |
148 | ||
149 | <div class="col-xs-12 col-md-4"> | |
150 | \begin{align*} | |
151 | \pi_1^{(3)} &= 0.33\\ | |
152 | \mu_1^{(3)} &= (5.01,\ 3.43,\ 1.46,\ 0.25)\\ | |
153 | \Sigma_1^{(3)} &= | |
154 | \begin{pmatrix} | |
155 | 0.13&0.11&0.02&0.01\\ | |
156 | 0.11&0.15&0.01&0.01\\ | |
157 | 0.02&0.01&0.03&0.01\\ | |
158 | 0.01&0.01&0.01&0.01 | |
159 | \end{pmatrix} | |
160 | \end{align*} | |
161 | </div> | |
162 | ||
163 | <div class="col-xs-12 col-md-4"> | |
164 | \begin{align*} | |
165 | \pi_2^{(3)} &= 0.30\\ | |
166 | \mu_2^{(3)} &= (5.91,\ 2.78,\ 4.20,\ 1.30)\\ | |
167 | \Sigma_2^{(3)} &= | |
168 | \begin{pmatrix} | |
169 | 0.23&0.08&0.15&0.04\\ | |
170 | 0.08&0.08&0.07&0.03\\ | |
171 | 0.15&0.07&0.17&0.05\\ | |
172 | 0.04&0.03&0.05&0.03 | |
173 | \end{pmatrix} | |
174 | \end{align*} | |
175 | </div> | |
176 | ||
177 | <div class="col-xs-12 col-md-4"> | |
178 | \begin{align*} | |
179 | \pi_3^{(3)} &= 0.37\\ | |
180 | \mu_3^{(3)} &= (6.55,\ 2.95,\ 5.48,\ 1.96)\\ | |
181 | \Sigma_3^{(3)} &= | |
182 | \begin{pmatrix} | |
183 | 0.43&0.11&0.33&0.07\\ | |
184 | 0.11&0.12&0.09&0.06\\ | |
185 | 0.33&0.09&0.36&0.09\\ | |
186 | 0.07&0.06&0.09&0.09 | |
187 | \end{pmatrix} | |
188 | \end{align*} | |
189 | </div> | |
190 | ||
191 | <div class="col-xs-12 borderbottom"> | |
192 | Penalized log-likelihood (BIC): <b>-562.55</b> | |
193 | </div> | |
194 | ||
195 | <div class="col-xs-12"> | |
196 | ||
197 | <p> | |
198 | As initially stated, the dataset is difficult to cluster: although we know there are | |
199 | 3 species, 2 of them are almost indistinguishable. That is why the penalized log-likelihood (BIC) values are very close. | |
200 | A method is usually considered good on the Iris dataset when it finds 3 clusters, | |
201 | but 2 is also a correct answer. | |
202 | </p> | |
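| <p> | |
| For readers unfamiliar with the criterion, here is a sketch under assumptions that this page does not confirm. | |
| One common convention, in which larger values are better, is | |
| ££\mathrm{BIC} = 2 \ln \hat{L} - \nu \ln n££ | |
| where £\hat{L}£ is the maximized likelihood, £n = 150£ the number of rows and £\nu£ the number of free parameters. | |
| If every covariance matrix is left unconstrained, a mixture of £K£ Gaussian components in dimension £d£ has | |
| ££\nu = (K - 1) + K d + K \, \frac{d(d+1)}{2}££ | |
| free parameters, that is £\nu = 29£ for £K = 2£ and £\nu = 44£ for £K = 3£ when £d = 4£. | |
| Packages may use another sign or scaling convention, or constrained covariance matrices with fewer parameters, | |
| so these counts are illustrative and are not necessarily those behind the values reported above. | |
| </p> | |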
203 | ||
204 | </div> | |
205 | ||
206 | </div> | |
207 | ||
208 | {% endblock %} | |
209 | ||
210 | {% block javascripts %} | |
211 | {{ parent() }} | |
212 | <script type="text/x-mathjax-config"> | |
213 | MathJax.Hub.Config({ | |
214 | tex2jax: { | |
215 | inlineMath: [['£','£']], | |
216 | displayMath: [['££','££']], | |
217 | skipTags: ["script","noscript","style"]//,"textarea","pre","code"] | |
218 | } | |
219 | }); | |
220 | </script> | |
221 | <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/MathJax.js?config=TeX-AMS_HTML"></script> | |
222 | {% endblock %} |