| 1 | {% extends "::base.html.twig" %} |
| 2 | |
| 3 | {% block title %}{{ parent() }}about{% endblock %} |
| 4 | |
| 5 | {% block header %} |
| 6 | {{ parent() }} |
| 7 | <link rel="stylesheet" href="{{ asset('mixstore/css/static/about.css') }}"> |
| 8 | {% endblock %} |
| 9 | |
| 10 | {% block content %} |
| 11 | |
| 12 | <div id="maintext" class="row"> |
| 13 | |
| 14 | <div class="col-xs-12 borderbottom"> |
| 15 | |
| 16 | <h2>Origins</h2> |
| 17 | |
In the late 1990s, three researchers wrote some MATLAB code to classify data using
mixture models. Initially named XEM, for "EM-algorithms on miXture models",
it was quickly renamed mixmod and rewritten in C++ starting in 2001.
Since then, mixmod has been extended in several directions, including:
| 22 | <ul> |
| 23 | <li>supervised classification</li> |
| 24 | <li>categorical data handling</li> |
| 25 | <li>heterogeneous data handling</li> |
| 26 | </ul> |
| 27 | ...and the code is constantly evolving. {# still in constant evolution #} |
| 28 | More details can be found on the <a href="http://www.mixmod.org">dedicated website</a>. |
| 29 | |
There now exist many packages related to mixture models, each of them specialized in
some domain. Although mixmod can (arguably) be considered one of the first of its kind,
it would be rather arbitrary to give it a central position.
That is why mixmod is "only" one part of Mixstore.
| 34 | |
{# (mixmod allows doing more things: point to the website + documentation...) #}
| 36 | |
| 37 | <h2>Summary</h2> |
| 38 | |
Mixstore is a website gathering libraries dedicated to modeling data as
a mixture of probabilistic components. The computed mixture can be used
for various purposes, including:
| 42 | <ul> |
| 43 | <li>density estimation</li> |
| 44 | <li>clustering (unsupervised classification)</li> |
| 45 | <li>(supervised) classification</li> |
| 46 | <li>regression, ...</li> |
| 47 | </ul> |
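In all cases, the fitted object is a mixture density of the general form
££f(x) = \sum_{k=1}^{K} \pi_k \, g_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1,££
where £K£ is the number of components, the £\pi_k£ are the mixing proportions and the £g_k£
are the component densities (Gaussian densities in the example below).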
| 48 | |
| 49 | <h2>Example</h2> |
| 50 | |
| 51 | <p> |
To start using any of the software packages present in the store, we need a dataset.
We choose here an old classic: the Iris dataset, introduced by Ronald Fisher in 1936.
Despite being such a classic, this dataset is not so easy to analyze, as we will see below.
| 55 | </p> |
| 56 | |
| 57 | <p> |
| 58 | The <a href="http://en.wikipedia.org/wiki/Iris_flower_data_set">Iris dataset</a> |
contains 150 rows, each composed of 4 continuous attributes corresponding to
flower measurements. Three species are equally represented: (Iris)
Setosa, Versicolor and Virginica.
| 62 | </p> |
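<p>
For readers who want to follow along, here is one quick way to load these measurements, assuming a
Python environment with scikit-learn installed (any copy of the Iris data file works just as well):
</p>
<pre><code>
# Load the Iris data: 150 rows, 4 continuous attributes, 3 equally represented species.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data            # shape (150, 4): sepal length/width, petal length/width
species = iris.target    # labels 0, 1, 2 for Setosa, Versicolor, Virginica
print(X.shape, list(iris.target_names))
</code></pre>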
| 63 | |
<figure>
<img src="{{ asset('mixstore/images/iris_pca.png') }}" alt="PCA components of iris dataset"/>
<figcaption>The first two PCA components of the Iris dataset (image found
<a href="http://www.wanderinformatiker.at/unipages/general/img/iris_pca1.png">here</a>)</figcaption>
</figure>
| 71 | |
| 72 | <p> |
As the figure suggests, the goal with this dataset is to discriminate between Iris species.
That is to say, our goal is to find a way to answer these questions:
"are two given elements in the same group?" and "which group does a given element belong to?".
| 76 | </p> |
| 77 | |
| 78 | <p> |
The Mixstore packages take a more general approach: they (try to) learn the data generation
process, and then deduce the composition of each group. The two questions above can then easily
be answered using the mathematical formulas describing the classes.
Although this approach has several advantages (low sensitivity to outliers, a likelihood
with which to rank models, ...), finding an adequate model is challenging.
We will not dive into such model selection details; a rough sketch of the typical workflow is given below.
| 85 | {# This is a more general and harder problem. #} |
| 86 | </p> |
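<p>
As an illustration of this fit-then-select workflow, here is a minimal sketch that assumes Python with
scikit-learn's GaussianMixture (a generic stand-in, not one of the Mixstore packages): it fits
full-covariance Gaussian mixtures with 2 and 3 components and compares them with BIC. Note that
scikit-learn's bic() is on a "-2 log-likelihood + penalty" scale where lower is better, whereas the
values reported below are penalized log-likelihoods, where higher is better.
</p>
<pre><code>
# Generic illustration: fit Gaussian mixtures on Iris with 2 and 3 components
# and compare them with the BIC criterion.
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data  # 150 observations x 4 measurements

for k in (2, 3):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    hard = gmm.predict(X)        # hard cluster assignment of each flower
    soft = gmm.predict_proba(X)  # posterior membership probabilities
    print(k, "components, sklearn BIC =", round(gmm.bic(X), 2))
</code></pre>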
| 87 | |
| 88 | </div> |
| 89 | |
| 90 | <div class="col-xs-12"> |
| 91 | |
| 92 | <p> |
Density for 2 groups:
££f^{(2)}(x) = \pi_1^{(2)} g_1^{(2)}(x) + \pi_2^{(2)} g_2^{(2)}(x)££
where £g_i^{(2)}(x) = (2 \pi)^{-d/2} \left| \Sigma_i^{(2)} \right|^{-1/2} \exp\left( -\frac{1}{2} \, (x - \mu_i^{(2)})^{\top} (\Sigma_i^{(2)})^{-1} (x - \mu_i^{(2)}) \right)£
is a Gaussian density of dimension £d = 4£.<br/>
£x = (x_1,x_2,x_3,x_4)£ with the following correspondence:
</p>
<ul>
| 98 | <li>£x_1£: sepal length;</li> |
| 99 | <li>£x_2£: sepal width;</li> |
| 100 | <li>£x_3£: petal length;</li> |
| 101 | <li>£x_4£: petal width.</li> |
| 102 | </ul> |
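<p>
Once such a density is available, the two questions raised earlier have a direct answer: an element £x£
is assigned to the group £i£ with the largest posterior probability
££P(\text{group } i \mid x) = \frac{\pi_i^{(2)} \, g_i^{(2)}(x)}{f^{(2)}(x)},££
and two elements belong to the same group when they are assigned to the same component.
</p>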
| 104 | |
| 105 | </div> |
| 106 | |
| 107 | <div class="col-xs-12 col-sm-6"> |
| 108 | \begin{align*} |
| 109 | \pi_1^{(2)} &= 0.33\\ |
\mu_1^{(2)} &= (5.01,\, 3.43,\, 1.46,\, 0.25)\\
| 111 | \Sigma_1^{(2)} &= |
| 112 | \begin{pmatrix} |
| 113 | 0.15&0.13&0.02&0.01\\ |
| 114 | 0.13&0.18&0.02&0.01\\ |
| 115 | 0.02&0.02&0.03&0.01\\ |
| 116 | 0.01&0.01&0.01&0.01 |
| 117 | \end{pmatrix} |
| 118 | \end{align*} |
| 119 | </div> |
| 120 | |
| 121 | <div class="col-xs-12 col-sm-6"> |
| 122 | \begin{align*} |
| 123 | \pi_2^{(2)} &= 0.67\\ |
\mu_2^{(2)} &= (6.26,\, 2.87,\, 4.91,\, 1.68)\\
| 125 | \Sigma_2^{(2)} &= |
| 126 | \begin{pmatrix} |
| 127 | 0.40&0.11&0.40&0.14\\ |
| 128 | 0.11&0.11&0.12&0.07\\ |
| 129 | 0.40&0.12&0.61&0.26\\ |
| 130 | 0.14&0.07&0.26&0.17 |
| 131 | \end{pmatrix} |
| 132 | \end{align*} |
| 133 | </div> |
| 134 | |
| 135 | <div class="col-xs-12 borderbottom"> |
| 136 | Penalized log-likelihood (BIC): <b>-561.73</b> |
| 137 | </div> |
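<div class="col-xs-12 borderbottom">
<p>
To make these numbers concrete, here is a small sketch that plugs the rounded 2-group parameters above
into the density formula; it assumes Python with numpy and scipy available, and since the parameters are
rounded to two decimals the results are only approximate.
</p>
<pre><code>
# Evaluate the 2-component Gaussian mixture density and the posterior group
# memberships at one observation, using the rounded parameters reported above.
import numpy as np
from scipy.stats import multivariate_normal

pi = [0.33, 0.67]
mu = [np.array([5.01, 3.43, 1.46, 0.25]),
      np.array([6.26, 2.87, 4.91, 1.68])]
sigma = [np.array([[0.15, 0.13, 0.02, 0.01],
                   [0.13, 0.18, 0.02, 0.01],
                   [0.02, 0.02, 0.03, 0.01],
                   [0.01, 0.01, 0.01, 0.01]]),
         np.array([[0.40, 0.11, 0.40, 0.14],
                   [0.11, 0.11, 0.12, 0.07],
                   [0.40, 0.12, 0.61, 0.26],
                   [0.14, 0.07, 0.26, 0.17]])]

x = np.array([5.1, 3.5, 1.4, 0.2])  # first row of the Iris dataset (a Setosa flower)
weighted = [p * multivariate_normal(m, s).pdf(x) for p, m, s in zip(pi, mu, sigma)]
density = sum(weighted)                       # f^(2)(x)
posterior = [w / density for w in weighted]   # P(group i | x)
print("density:", density, "posterior memberships:", posterior)
</code></pre>
</div>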
| 138 | |
| 139 | <div class="col-xs-12"> |
| 140 | |
| 141 | <p> |
| 142 | Density for 3 groups: |
| 143 | ££f^{(3)}(x) = \pi_1^{(3)} g_1^{(3)}(x) + \pi_2^{(3)} g_2^{(3)}(x) + \pi_3^{(3)} g_3^{(3)}(x)££ |
(The component densities £g_i^{(3)}(x)£ have the same Gaussian parameterization as above.)<br/>
| 145 | </p> |
| 146 | |
| 147 | </div> |
| 148 | |
| 149 | <div class="col-xs-12 col-md-4"> |
| 150 | \begin{align*} |
| 151 | \pi_1^{(3)} &= 0.33\\ |
\mu_1^{(3)} &= (5.01,\, 3.43,\, 1.46,\, 0.25)\\
| 153 | \Sigma_1^{(3)} &= |
| 154 | \begin{pmatrix} |
| 155 | 0.13&0.11&0.02&0.01\\ |
| 156 | 0.11&0.15&0.01&0.01\\ |
| 157 | 0.02&0.01&0.03&0.01\\ |
| 158 | 0.01&0.01&0.01&0.01 |
| 159 | \end{pmatrix} |
| 160 | \end{align*} |
| 161 | </div> |
| 162 | |
| 163 | <div class="col-xs-12 col-md-4"> |
| 164 | \begin{align*} |
| 165 | \pi_2^{(3)} &= 0.30\\ |
\mu_2^{(3)} &= (5.91,\, 2.78,\, 4.20,\, 1.30)\\
| 167 | \Sigma_2^{(3)} &= |
| 168 | \begin{pmatrix} |
| 169 | 0.23&0.08&0.15&0.04\\ |
| 170 | 0.08&0.08&0.07&0.03\\ |
| 171 | 0.15&0.07&0.17&0.05\\ |
| 172 | 0.04&0.03&0.05&0.03 |
| 173 | \end{pmatrix} |
| 174 | \end{align*} |
| 175 | </div> |
| 176 | |
| 177 | <div class="col-xs-12 col-md-4"> |
| 178 | \begin{align*} |
| 179 | \pi_3^{(3)} &= 0.37\\ |
\mu_3^{(3)} &= (6.55,\, 2.95,\, 5.48,\, 1.96)\\
| 181 | \Sigma_3^{(3)} &= |
| 182 | \begin{pmatrix} |
| 183 | 0.43&0.11&0.33&0.07\\ |
| 184 | 0.11&0.12&0.09&0.06\\ |
| 185 | 0.33&0.09&0.36&0.09\\ |
| 186 | 0.07&0.06&0.09&0.09 |
| 187 | \end{pmatrix} |
| 188 | \end{align*} |
| 189 | </div> |
| 190 | |
| 191 | <div class="col-xs-12 borderbottom"> |
| 192 | Penalized log-likelihood (BIC): <b>-562.55</b> |
| 193 | </div> |
| 194 | |
| 195 | <div class="col-xs-12"> |
| 196 | |
| 197 | <p> |
As initially stated, the dataset is difficult to cluster: although we know there are
3 species, 2 of them (Versicolor and Virginica) are almost indistinguishable. That is why the two
penalized log-likelihood values above are so close (they differ by less than one unit).
A method is usually considered good on the Iris dataset when it finds 3 clusters, but 2 is also a reasonable answer.
| 202 | </p> |
| 203 | |
| 204 | </div> |
| 205 | |
| 206 | </div> |
| 207 | |
| 208 | {% endblock %} |
| 209 | |
| 210 | {% block javascripts %} |
| 211 | {{ parent() }} |
| 212 | <script type="text/x-mathjax-config"> |
| 213 | MathJax.Hub.Config({ |
| 214 | tex2jax: { |
| 215 | inlineMath: [['£','£']], |
| 216 | displayMath: [['££','££']], |
| 217 | skipTags: ["script","noscript","style"]//,"textarea","pre","code"] |
| 218 | } |
| 219 | }); |
| 220 | </script> |
| 221 | <script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script> |
| 222 | {% endblock %} |