From: Benjamin Auder <benjamin.auder@somewhere>
Date: Tue, 11 Dec 2018 11:04:07 +0000 (+0100)
Subject: Saving draft vignette state
X-Git-Url: https://git.auder.net/js/pieces/%7B%7B%20asset%28%27mixstore/doc/assets/rpsls.css?a=commitdiff_plain;h=e36b104629f3afa95b5947f28911e276c46aa79f;p=morpheus.git

Saving draft vignette state
---

diff --git a/vignettes/intro.tex b/vignettes/intro.tex
new file mode 100644
index 0000000..2e1ec54
--- /dev/null
+++ b/vignettes/intro.tex
@@ -0,0 +1,74 @@
+\section{Introduction}
+
+
+Logistic models, or  more generally multinomial regression models that fit covariates to discrete responses through a link function, are very popular for use in various application fields. When the data under study come from several groups that have different characteristics, using mixture models is also a very popular way to handle heterogeneity. Thus, many algorithms were developed to deal with various mixtures models, see for instance the book \cite{MR2265601}. Most of them use likelihood methods or Bayesian methods that are likelihood dependent. Indeed, the now well known  expectation-maximization (EM) methodology or its randomized versions makes it often  easy to build algorithms.
+However one problem of such methods is that they can converge to local  spurious maxima so that it is necessary to explore many enough initial points. Recently, spectral methods were developed to bypass EM algorithms and they were proved able to recover the directions of the regression parameter in models with known link function and random covariates, see \cite{AK2014b}. \\
+
+One aim of this paper is to extend such moment methods using least squares to get estimators of the whole parameters, and to provide theoretical guarantees of this estimation method. The setting is that of regression models with binary outputs, random covariates and known link function, detailed in Section 2. We first prove that cross moments up to order $3$  between the output and the regression variables are enough to recover all the parameters of the model, see Theorem \ref{thm1} for the probit link function and Theorem \ref{thm2} for general link functions. We then obtain consistency and asymptotic normality of our least squares estimators as usual, see Theorem \ref{thm3}. The algorithm is described at the end of Section 3, and to apply it, we developed  the R-package \verb"morpheus" available on the CRAN (\cite{Loum_Auder}).
+We then compare experimentally our method to the maximum likelihood estimator computed using the R-package flexmix (\cite{bg-papers:Gruen+Leisch:2007a}). We show that our estimator may be better for the probit link function with finite samples when the dimension increases,  though keeping very small computation times  when that of flexmix increases with dimension. The experiments are presented in Section 4.\\
+
+Another aim of this paper is to investigate identifiability in various mixture of non linear regression models with binary outputs. 
+Indeed,  identifiability results for such models are still few and  not enough to give theoretical guarantees of available algorithms. Let us review what is known up to our knowledge. In \cite{MR1108557}, the identifiability is proved for  finite mixtures of logistic regression models where only the intercept varies with the population 
+\cite{MR3244553}. In \cite{MR2476114}, finite mixtures of multinomial logit models with varying and fixed effects are investigated, the proofs of identifability results use the explicit form of the logit function. In \cite{MR3244553}, further non parametric identifiability of the link function is proved, but only for models where the base exponential models are identifiable for mixtures, which does not apply to binary data (Bernoulli models).\\
+
+We provide in Section 5 several identifiability results, that for example are useful to get theoretical guarantees in applications such as the one in \cite{MR3086415}.
+We prove that with known  smooth enough link function, the directions of the covariates may be recovered under the only assumption that they are distinct, see Theorem \ref{theobis1}. Then, under the strengthened assumption that they are linearly independent, we prove that the link function may be non parametrically recovered, see Theorem \ref{theobis2}. We then study the simultaneous use of continuous and categorical covariates and further give assumptions under which parameters and link function may be recovered, see Theorem \ref{theobis3}. We finally prove that, with longitudinal data having at least $3$ repetitions for each individual, the whole model is identifiable under the weakest assumption that the regression directions are distinct, see 
+Theorem \ref{theobis4}.
+
+
+\section{Model and notations}
+
+Let us denote $[n]$ the set $\lbrace 1,2,\ldots,n\rbrace$ and $e_i\in\mathbb{R}^d,$ the i-th canonical basis vector of $\mathbb{R}^d.$ Denote also $I_d\in\mathbb{R}^{d\times d}$ the identity matrix in $\mathbb{R}^{d}$. The tensor product of $p$ euclidean spaces $\mathbb{R}^{d_i},\,\,i\in [p]$ is noted $\bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ $T$ is called a real p-th order tensor if $T\in \bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ For $p=1,$ $T$ is a vector in $\mathbb{R}^d$ and for $p=2$, $T$ is a $d\times d$ real matrix. The $(i_1,i_2,\ldots,i_p)$-th coordinate of $T$ with respect the canonical basis is denoted   $T[i_1,i_2,\ldots,i_p]$, $ i_1,i_2,\ldots,i_p\in [d].$\\
+
+\noindent
+Let $X\in \R^{d}$ be the vector of covariates and $Y\in \{0,1\}$ be the binary output. \\
+
+\noindent
+A binary regression model assumes that for some link function $g$, the probability that $Y=1$ conditionally to $X=x$ is given by $g(\langle \beta , x \rangle +b)$, where $\beta\in \R^{d}$ is the vector of regression coefficients and $b\in\R$ is the intercept. Popular examples of link functions are the logit link function where for any real $z$,  $g(z)=e^z/(1+e^z)$ and the probit link function where $g(z)=\Phi(z),$  with $\Phi$  the cumulative distribution function of the standard normal ${\cal N}(0,1)$. \\
+If now we want to modelise heterogeneous populations, let $K$ be the number of populations and $\omega=(\omega_1,\cdots,\omega_K)$ their weights such that $\omega_{j}\geq 0$, $j=1,\ldots,K$ and $\sum_{j=1}^{K}\omega{j}=1$. Define, for $j=1,\ldots,K$, the regression coefficients in the $j$-th population by $\beta_{j}\in\R^{d}$ and the intercept in the $j$-th population by $b_{j}\in\R$. Let $\omega =(\omega_{1},\ldots,\omega_{K})$,   $b=(b_1,\cdots,b_K)$, $\beta=[\beta_{1} \vert \cdots,\vert \beta_K]$ the $d\times K$ matrix of regression coefficients and denote $\theta=(\omega,\beta,b)$.
+The model of population mixture of binary regressions is given by:
+\begin{equation}
+\label{mixturemodel1}
+\PP_{\theta}(Y=1\vert X=x)=\sum^{K}_{k=1}\omega_k g(<\beta_k,x>+b_k).
+\end{equation}
+
+\noindent
+We assume that the random variable $X$ has a Gaussian distribution. We now focus on the situation where $X\sim \mathcal{N}(0,I_d)$, $I_d$ being the identity $d\times d$ matrix. All results may be easily extended to the situation where $X\sim \mathcal{N}(m,\Sigma)$, $m\in \R^{d}$, $\Sigma$ a positive and symetric $d\times d$ matrix. \\
+
+\noindent
+Define the cross moments between the response $Y$ and the covariable $X$, up to order $3$:
+
+\begin{itemize}
+\item[--] $M_1(\theta):=\mathbb{E}_{\theta}[Y.X],$ first-order  moment,
+\item[--] $M_2(\theta):=\mathbb{E}_{\theta}\Big[Y.\big(X\otimes X-\sum_{j\in[d]}Y.e_j\otimes e_j\big)\Big],$ second-order moment and
+\item[--] $M_3(\theta):=\mathbb{E}_{\theta}\Big[Y\big(X\otimes X\otimes X-\sum_{j\in[d]}\big[X\otimes e_j\otimes e_j+e_j\otimes X\otimes e_j+e_j\otimes e_j\otimes X\big]\big)\Big]
+ $ third-order moment.
+ \end{itemize}
+
+\noindent
+Let, for $k=1,\ldots,K$, $\lambda_k = \|\beta_k \| $ and $\mu_{k}=\beta_{k}/\|\beta_k \| $.
+Using Stein's identity, Anandkumar et al. (\cite{AK2014b}) prove the following lemma: 
+\begin{lemma}[\cite{AK2014b}]
+\label{lem_rewrite}
+%\begin{eqnarray*}
+%		M_1(\theta)&=&\sum_{k=1}^Kr^{(1)}_k\mu_k%\label{eq01}
+%		\\
+%		M_2(\theta)&=&\sum_{k=1}^Kr^{(2)}_k.\mu_k\otimes \mu_k%\label{eq02}
+%		\\
+%		M_3(\theta)&=&\sum_{k=1}^Kr^{(3)}_k.\mu_k\otimes \mu_k\otimes \mu_k%\label{eq03}
+%\end{eqnarray*}
+%where $r^{(1)}_k=\omega_k\lambda_k\mathbb{E}[g'\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)],$ $r^{(2)}_k=\omega_k\lambda_k^2\mathbb{E}[g''\big(\lambda_k<X,\mu_k>+b_k\big)],$ and $r^{(3)}_k=\omega_k\lambda_k^3\mathbb{E}[g^{(3)}\big(\lambda_k<X,\mu_k>+b_k\big)].$
+%
+%\end{lemma}
+Under enough smoothness and integrability of the link function (which hold for the logit and probit link functions, or under our assumption (H3) below) the moments can be rewritten:
+\begin{eqnarray*}
+		M_1(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k\mathbb{E}[g'\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)]\;\mu_k ,%\label{eq01}
+		\\
+		M_2(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k^2\mathbb{E}[g''\big(\lambda_k<X,\mu_k>+b_k\big)]\;\mu_k\otimes \mu_k ,%\label{eq02}
+		\\
+		M_3(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k^3\mathbb{E}[g^{(3)}\big(\lambda_k<X,\mu_k>+b_k\big)]\;\mu_k\otimes \mu_k\otimes \mu_k .%\label{eq03}
+\end{eqnarray*}
+\end{lemma}
+It is proved in \cite{AK2014b} that the knowledge of $M_{3}(\theta)$ leads to the knowledge of $\mu_{1},\ldots,\mu_{K}$ up to their sign as soon as they are linearly independent. In the next section, we prove that the  knowledge of all cross moments till order 3 allows to recover all parameters for the probit link function under the same assumption on the regression coefficients. We also prove that for a general link function satisfying some weak assumption,  the  knowledge of all cross moments till order 3 allows to recover all parameters provided they are not too far from $0$.
+
+
diff --git a/vignettes/report.Rmd b/vignettes/report.Rmd
index ab00501..2f4a218 100644
--- a/vignettes/report.Rmd
+++ b/vignettes/report.Rmd
@@ -29,10 +29,45 @@ Currently it can handle only binary output $-$ which is a common case.
 
 ## Model
 
+TODO: adapt
 
+Let us denote $[n]$ the set $\lbrace 1,2,\ldots,n\rbrace$ and $e_i\in\mathbb{R}^d,$ the i-th canonical basis vector of $\mathbb{R}^d.$ Denote also $I_d\in\mathbb{R}^{d\times d}$ the identity matrix in $\mathbb{R}^{d}$. The tensor product of $p$ euclidean spaces $\mathbb{R}^{d_i},\,\,i\in [p]$ is noted $\bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ $T$ is called a real p-th order tensor if $T\in \bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ For $p=1,$ $T$ is a vector in $\mathbb{R}^d$ and for $p=2$, $T$ is a $d\times d$ real matrix. The $(i_1,i_2,\ldots,i_p)$-th coordinate of $T$ with respect the canonical basis is denoted   $T[i_1,i_2,\ldots,i_p]$, $ i_1,i_2,\ldots,i_p\in [d].$\\
 
-TODO: retrouver mon texte initial + article.
+\noindent
+Let $X\in \R^{d}$ be the vector of covariates and $Y\in \{0,1\}$ be the binary output. \\
+
+\noindent
+A binary regression model assumes that for some link function $g$, the probability that $Y=1$ conditionally to $X=x$ is given by $g(\langle \beta , x \rangle +b)$, where $\beta\in \R^{d}$ is the vector of regression coefficients and $b\in\R$ is the intercept. Popular examples of link functions are the logit link function where for any real $z$,  $g(z)=e^z/(1+e^z)$ and the probit link function where $g(z)=\Phi(z),$  with $\Phi$  the cumulative distribution function of the standard normal ${\cal N}(0,1)$. \\
+If now we want to modelise heterogeneous populations, let $K$ be the number of populations and $\omega=(\omega_1,\cdots,\omega_K)$ their weights such that $\omega_{j}\geq 0$, $j=1,\ldots,K$ and $\sum_{j=1}^{K}\omega{j}=1$. Define, for $j=1,\ldots,K$, the regression coefficients in the $j$-th population by $\beta_{j}\in\R^{d}$ and the intercept in the $j$-th population by $b_{j}\in\R$. Let $\omega =(\omega_{1},\ldots,\omega_{K})$,   $b=(b_1,\cdots,b_K)$, $\beta=[\beta_{1} \vert \cdots,\vert \beta_K]$ the $d\times K$ matrix of regression coefficients and denote $\theta=(\omega,\beta,b)$.
+The model of population mixture of binary regressions is given by:
+\begin{equation}
+\label{mixturemodel1}
+\PP_{\theta}(Y=1\vert X=x)=\sum^{K}_{k=1}\omega_k g(<\beta_k,x>+b_k).
+\end{equation}
+
+\noindent
+We assume that the random variable $X$ has a Gaussian distribution. We now focus on the situation where $X\sim \mathcal{N}(0,I_d)$, $I_d$ being the identity $d\times d$ matrix. All results may be easily extended to the situation where $X\sim \mathcal{N}(m,\Sigma)$, $m\in \R^{d}$, $\Sigma$ a positive and symetric $d\times d$ matrix. \\
+
+\noindent
 
 2) Algorithm (as in article)
 
+TODO: find it...
+
+The developed R-package is called \verb"morpheus" \cite{Loum_Auder} and divided into two main parts:
+\begin{enumerate}
+	\item the computation of the directions matrix $\mu$, based on the empirical
+		cross-moments as described in the previous sections;
+	\item the optimization of all parameters (including $\mu$), using the initially estimated
+		directions as a starting point.
+\end{enumerate}
+The former is a straightforward translation of the mathematical formulas (file R/computeMu.R),
+while the latter calls R constrOptim() method on the objective function expression and its
+derivative (file R/optimParams.R). For usage examples, please refer to the package help.
+
 3) Experiments: show package usage
+
+\subsection{Experiments}
+In this section, we evaluate our algorithm in a first step using mean squared error (MSE). In a second step, we compare experimentally our moments method (morpheus package \cite{Loum_Auder}) and the likelihood method (with felxmix package \cite{bg-papers:Gruen+Leisch:2007a}). 
+
+TODO.........
diff --git a/vignettes/simul.tex b/vignettes/simul.tex
new file mode 100644
index 0000000..7a36f1b
--- /dev/null
+++ b/vignettes/simul.tex
@@ -0,0 +1,218 @@
+\section{Simulations}
+
+\subsection{R package}
+
+The developed R-package is called \verb"morpheus" \cite{Loum_Auder} and divided into two main parts:
+\begin{enumerate}
+	\item the computation of the directions matrix $\mu$, based on the empirical
+		cross-moments as described in the previous sections;
+	\item the optimization of all parameters (including $\mu$), using the initially estimated
+		directions as a starting point.
+\end{enumerate}
+The former is a straightforward translation of the mathematical formulas (file R/computeMu.R),
+while the latter calls R constrOptim() method on the objective function expression and its
+derivative (file R/optimParams.R). For usage examples, please refer to the package help.
+
+\subsection{Experiments}
+In this section, we evaluate our algorithm in a first step using mean squared error (MSE). In a second step, we compare experimentally our moments method (morpheus package \cite{Loum_Auder}) and the likelihood method (with felxmix package \cite{bg-papers:Gruen+Leisch:2007a}). 
+We arbitrarily choose the parameters for the simulations, which should be discovered by the algorithms (ours, and the likelihood algorithm).
+
+Experiment 1 (dimension 2):
+\begin{align*}
+	K &= 2\\
+	p &= (0.5, 0.5)\\
+	b &= (-0.2, 0.5)\\
+	\beta &=
+		\begin{pmatrix}
+		1 & 3\\
+		-2 & 1
+		\end{pmatrix}
+\end{align*}
+
+Experiment 2 (dimension 5):
+\begin{align*}
+	K &= 2\\
+	p &= (0.5, 0.5)\\
+	b &= (-0.2, 0.5)\\
+	\beta &=
+		\begin{pmatrix}
+			1 & 2\\
+			2 & -3\\
+			-1 & 0\\
+			0 & 1\\
+			3 & 0
+		\end{pmatrix}
+\end{align*}
+
+Experiment 3 (dimension 10):
+\begin{align*}
+	K &= 3\\
+	p &= (0.3, 0.3, 0.4)\\
+	b &= (-0.2, 0, 0.5)\\
+	\beta &=
+		\begin{pmatrix}
+			1 & 2 & -1\\
+			2 & -3 & 1\\
+			-1 & 0 & 3\\
+			0 & 1 & -1\\
+			3 & 0 & 0\\
+			4 & -1 & 0\\
+			-1 & -4 & 2\\
+			-3 & 3 & 0\\
+			0 & 2 & 1\\
+			2 & 0 & -2
+		\end{pmatrix}
+\end{align*}
+
+For all three experiments we use both logit and probit links.
+Computations are always run on the same data both for our package and flexmix $-$ which is a reference for this kind of estimation, using an iterative algorithm to maximize the log-likelihood. Results are aggregated over $N=1000$ Monte-Carlo runs.\\
+
+
+\noindent \textbf{Mean squared error (MSE)}.
+Graphical representations of the MSE versus the sample size ($n$) are given in figures~\ref{mse_2}, ~\ref{mse_5} and \ref{mse_10}. In each figure, we represent the MSE associated with each parameter vector versus the sample size. We can see that the goodness of the estimation depends on  the sample size and that enough observations is needed to properly estimate the parameters. We have from figure~\ref{mse_2} and figure~\ref{mse_5} that for our moment method, with dimension less or equal to $5,$ the necessary sample size is around of $10^5.$ For large dimension, figures~\ref{mse_10} show that $10^6$ is not enough to estimate the parameters vectors $\beta.$ Indeed our estimation method use the spectral estimator, which need more enough data, as starting point.\\
+
+
+
+
+
+
+\begin{figure}[!!ht!!]
+\centering
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{logit_2.png}\\
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{probit_2.png}
+	\caption{Experiment 1; Top: \textit{logit} link function $\Big($(a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$)$\Big)$, bottom: \textit{probit} link function: $\Big($(c) MSE$(\widehat{\beta})$, (d) MSE($(\widehat{b}$,$\widehat{p}$)$\Big)$.}
+	\label{mse_2}
+\end{figure}
+
+\begin{figure}[!!ht!!]
+\centering
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{logit_5.png}\\
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{probit_5.png}
+	\caption{Experiment 2; Top: \textit{logit} link function $\Big($(a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$)$\Big)$, bottom: \textit{probit} link function: $\Big($ (c) MSE$(\widehat{\beta})$, (d) MSE($(\widehat{b}$,$\widehat{p}$)$\Big)$.}
+	\label{mse_5}
+\end{figure}
+
+
+\begin{figure}[!!ht!!]
+\centering
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{logit_10.png}\\
+	\includegraphics[keepaspectratio=true,height=3.5cm,width=8cm]{probit_10.png}
+	\caption{Experiment 3; Top: \textit{logit} link function $\Big($(a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$)$\Big)$, bottom: \textit{probit} link function: $\Big($ (c) MSE$(\widehat{\beta})$, (d) MSE($(\widehat{b}$,$\widehat{p}$)$\Big)$.}
+	\label{mse_10}
+\end{figure}
+
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/logit_2.png}
+%	\caption{Experiment 1: \textit{logit} link function for our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_logit_2}
+%\end{figure}
+%
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/logit_5.png}
+%	\caption{Experiment 2: \textit{logit} link functionfor our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_logit_5}
+%\end{figure}
+%
+%
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/logit_10.png}
+%	\caption{Experiment 3: \textit{logit} link function for our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_logit_10}
+%\end{figure}
+%
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/probit_2.png}
+%	\caption{Experiment 1: \textit{probit} link function for our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_probit_2}
+%\end{figure}
+%
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/probit_5.png}
+%	\caption{Experiment 2: \textit{probit} link function for our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_probit_5}
+%\end{figure}
+%
+%\begin{figure}[!!ht!!]
+%\centering
+%	\includegraphics[keepaspectratio=true,height=4cm,width=8cm]{Figures/probit_10.png}
+%	\caption{Experiment 3: \textit{probit} link function for our algorithm; (a) MSE($\widehat{\beta}$), (b) MSE($\widehat{b}$,$\widehat{p}$).}
+%	\label{mse_probit_10}
+%\end{figure}
+
+
+
+
+
+\noindent \textbf{Algorithms performance}.
+To evaluate algorithms performance, the total number of sample points is fixed to $n = 10^5.$ This value still enough to observe correct performances, yet small enough to remain realistic.
+
+On the figures~\ref{res_logit} and \ref{res_probit}, all (true) parameters ($p, b, \beta$) are re-ordered in a real vector, of size $K \times (d+2) - 1$. This vector is plotted as a line to improve visualization experience, but it must be noted that this does not represent any curve data. Dotted lines corresponds to the computed values plus or minus one standard deviation, and computed values themselves are represented with long dashed lines. The leftmost column corresponds to experiment 1 ($d=2$), the middle one to experiment 2 ($d=5$), and the rightmost column corresponds to experiment 3 ($d=10$).
+
+
+
+
+\noindent \textit{Figure \ref{res_logit}: logit link.}
+While most of the times flexmix finds a better solution than our proposed algorithm (smaller variance), both methods are good on average for $d \leq 5$.
+The case $d=10$ is not handled well neither by the flexmix package nor by our package (the latter showing even poorer accuracy).
+Indeed in this relatively high dimension the number of observations should be much higher. 
+
+\begin{figure}[!!ht!!]
+\centering
+	\includegraphics[keepaspectratio=true,height=8cm,width=12cm]{res_logit.png}
+	\caption{\textit{Logit} link function. Top: our package, bottom: flexmix. From left to right: experiment 1, 2 and 3 respectively}
+	\label{res_logit}
+\end{figure}
+\noindent \textit{Figure \ref{res_probit}: probit link.}
+In this case our algorithm performs slightly better than its flexmix counterpart for $d \leq 5$.
+However, again, the variance in the case $d=10$ is way too high $-$ in fact even the average value is generally wrong, when coefficients are non-zero.
+We can increase $n$ by a factor $100$ to obtain more accurate results.\\
+\begin{figure}[!!ht!!]
+\centering
+	\includegraphics[keepaspectratio=true,height=8cm,width=12cm]{res_probit.png}
+	\caption{\textit{Probit} link function. Top: our package, bottom: flexmix. From left to right: experiment 1, 2 and 3 respectively}
+	\label{res_probit}
+\end{figure}
+
+
+
+% TODO: improve placement: see https://tex.stackexchange.com/questions/8625/force-figure-placement-in-text
+
+Considering both links with $d \leq 5$, the algorithms performances are comparable with a global small advantage to our method.
+The case $d=10$ is handled better by the flexmix algorithm, although clearly not well.
+Finally, concerning our method we observe a tendancy to overestimate large parameters while underestimating small ones.
+This observation is inversed for flexmix.\\
+
+\noindent \textbf{Computational time}.
+It must be noted that our algorithm timing does not depend on $n$, since it operates on matrices of size $O(d \times K)$.
+Thus it can be more suitable for very large datasets, where the variability is clearly reduced. Figure~\ref{timing} (time by seconds versus $\log_{10}n$) illustrate this fact: the timings clearly favor our package $-$ because increasing $n$
+has almost no impact on the running time. However, flexmix timings are not that high:
+just a few minutes for the longest run, on average, on one million sample points.
+To obtain the data shown on the figure we averaged $1000$ runs on random parameters.\\
+
+\begin{figure}
+\centering
+	\includegraphics[height=8cm,width=10cm]{timings.png}
+	\caption{Timings versus sample size. Dotted lines: flexmix; others: our algorithm. Black for $d=2,$ red for $d=5$ and blue for $d=10.$}
+	\label{timing}
+\end{figure}
+
+
+%\begin{figure}[h!]
+%\centering
+%	\includegraphics[height=6cm,width=14cm]{Figures/timings_logit.png}
+%	\caption{\textit{Logit} link function. Timings versus sample size. Our algorithm on the left, flexmix on the right. $d=2$ in long-dashed line, $d=5$ in dotted line, $d=10$ in solid line.}
+%	\label{timing_logit}
+%\end{figure}
+%
+%
+%\begin{figure}[h!]
+%\centering
+%	\includegraphics[height=6cm,width=14cm]{Figures/timings_probit.png}
+%	\caption{\textit{Probit} link function. Timings versus sample size. Our algorithm on the left, flexmix on the right. $d=2$ in long-dashed line, $d=5$ in dotted line, $d=10$ in solid line.}
+%	\label{timing_probit}
+%\end{figure}
\ No newline at end of file