vignettes/intro.tex

   1 \section{Introduction}
   2
   3
   4 Logistic models, or more generally multinomial regression models that relate covariates to discrete responses through a link function, are widely used in many application fields. When the data under study come from several groups with different characteristics, mixture models are also a very popular way to handle heterogeneity. Thus, many algorithms have been developed to deal with various mixture models, see for instance the book \cite{MR2265601}. Most of them use likelihood methods or Bayesian methods that are likelihood dependent. Indeed, the now well-known expectation-maximization (EM) methodology and its randomized versions often make it easy to build algorithms.
   5 However, one problem with such methods is that they can converge to spurious local maxima, so that many initial points must be explored. Recently, spectral methods were developed to bypass EM algorithms, and they were proved able to recover the directions of the regression parameters in models with known link function and random covariates, see \cite{AK2014b}. \\
   6
   7 One aim of this paper is to extend such moment methods using least squares to get estimators of all the parameters, and to provide theoretical guarantees for this estimation method. The setting is that of regression models with binary outputs, random covariates and known link function, detailed in Section 2. We first prove that cross moments up to order $3$ between the output and the regression variables are enough to recover all the parameters of the model, see Theorem \ref{thm1} for the probit link function and Theorem \ref{thm2} for general link functions. We then obtain consistency and asymptotic normality of our least squares estimators as usual, see Theorem \ref{thm3}. The algorithm is described at the end of Section 3, and to apply it, we developed the R package \verb"morpheus", available on CRAN (\cite{Loum_Auder}).
   8 We then compare experimentally our method to the maximum likelihood estimator computed using the R package flexmix (\cite{bg-papers:Gruen+Leisch:2007a}). We show that our estimator may be better for the probit link function with finite samples when the dimension increases, while keeping very small computation times, whereas those of flexmix increase with the dimension. The experiments are presented in Section 4.\\
   9
  10 Another aim of this paper is to investigate identifiability in various mixtures of nonlinear regression models with binary outputs.
  11 Indeed, identifiability results for such models are still few and not sufficient to give theoretical guarantees for the available algorithms. Let us review what is known, to the best of our knowledge. In \cite{MR1108557}, identifiability is proved for finite mixtures of logistic regression models where only the intercept varies with the population
  12 \cite{MR3244553}. In \cite{MR2476114}, finite mixtures of multinomial logit models with varying and fixed effects are investigated; the proofs of the identifiability results use the explicit form of the logit function. In \cite{MR3244553}, further nonparametric identifiability of the link function is proved, but only for models where the base exponential models are identifiable for mixtures, which does not apply to binary data (Bernoulli models).\\
  13
  14 We provide in Section 5 several identifiability results, which are useful, for example, to get theoretical guarantees in applications such as the one in \cite{MR3086415}.
  15 We prove that, with a known and sufficiently smooth link function, the directions of the covariates may be recovered under the sole assumption that they are distinct, see Theorem \ref{theobis1}. Then, under the strengthened assumption that they are linearly independent, we prove that the link function may be nonparametrically recovered, see Theorem \ref{theobis2}. We then study the simultaneous use of continuous and categorical covariates and further give assumptions under which the parameters and the link function may be recovered, see Theorem \ref{theobis3}. We finally prove that, with longitudinal data having at least $3$ repetitions for each individual, the whole model is identifiable under the weaker assumption that the regression directions are distinct, see
  16 Theorem \ref{theobis4}.
  17
  18
  19 \section{Model and notations}
  20
  21 Let us denote by $[n]$ the set $\lbrace 1,2,\ldots,n\rbrace$ and by $e_i\in\mathbb{R}^d$ the $i$-th canonical basis vector of $\mathbb{R}^d.$ Denote also by $I_d\in\mathbb{R}^{d\times d}$ the identity matrix in $\mathbb{R}^{d}$. The tensor product of $p$ Euclidean spaces $\mathbb{R}^{d_i},\,\,i\in [p]$, is denoted by $\bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ $T$ is called a real $p$-th order tensor if $T\in \bigotimes_{i=1}^p\mathbb{R}^{d_i}.$ For $p=1,$ $T$ is a vector in $\mathbb{R}^d$ and for $p=2$, $T$ is a $d\times d$ real matrix. The $(i_1,i_2,\ldots,i_p)$-th coordinate of $T$ with respect to the canonical basis is denoted by $T[i_1,i_2,\ldots,i_p]$, $ i_1,i_2,\ldots,i_p\in [d].$\\
  22
  23 \noindent
  24 Let $X\in \R^{d}$ be the vector of covariates and $Y\in \{0,1\}$ be the binary output. \\
  25
  26 \noindent
  27 A binary regression model assumes that for some link function $g$, the probability that $Y=1$ conditionally on $X=x$ is given by $g(\langle \beta , x \rangle +b)$, where $\beta\in \R^{d}$ is the vector of regression coefficients and $b\in\R$ is the intercept. Popular examples of link functions are the logit link function, where for any real $z$, $g(z)=e^z/(1+e^z)$, and the probit link function, where $g(z)=\Phi(z),$ with $\Phi$ the cumulative distribution function of the standard normal ${\cal N}(0,1)$. \\
  28 If we now want to model heterogeneous populations, let $K$ be the number of populations and $\omega=(\omega_1,\ldots,\omega_K)$ their weights, such that $\omega_{j}\geq 0$, $j=1,\ldots,K$, and $\sum_{j=1}^{K}\omega_{j}=1$. Define, for $j=1,\ldots,K$, the regression coefficients in the $j$-th population by $\beta_{j}\in\R^{d}$ and the intercept in the $j$-th population by $b_{j}\in\R$. Let $b=(b_1,\ldots,b_K)$, let $\beta=[\beta_{1} \vert \cdots \vert \beta_K]$ be the $d\times K$ matrix of regression coefficients, and denote $\theta=(\omega,\beta,b)$.
  29 The model of population mixture of binary regressions is given by:
  30 \begin{equation}
  31 \label{mixturemodel1}
  32 \PP_{\theta}(Y=1\vert X=x)=\sum^{K}_{k=1}\omega_k g(\langle\beta_k,x\rangle+b_k).
  33 \end{equation}
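
\noindent
As a reading aid only, model (\ref{mixturemodel1}) can be seen as a two-step sampling scheme. The latent label $Z$ below is a notation introduced for this sketch and is not used elsewhere in the text. A population label $Z\in[K]$ is drawn independently of $X$ with $\PP_{\theta}(Z=k)=\omega_k$, and then
\[
\PP_{\theta}(Y=1\vert X=x,\,Z=k)=g(\langle \beta_k,x\rangle+b_k),\qquad k\in[K].
\]
Summing over $k$ with the weights $\omega_k$ gives back (\ref{mixturemodel1}).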
  34
  35 \noindent
  36 We assume that the random variable $X$ has a Gaussian distribution. We now focus on the situation where $X\sim \mathcal{N}(0,I_d)$, $I_d$ being the $d\times d$ identity matrix. All results may be easily extended to the situation where $X\sim \mathcal{N}(m,\Sigma)$, with $m\in \R^{d}$ and $\Sigma$ a symmetric positive definite $d\times d$ matrix. \\
  37
  38 \noindent
  39 Define the cross moments between the response $Y$ and the covariate $X$, up to order $3$:
  40
  41 \begin{itemize}
  42 \item[--] $M_1(\theta):=\mathbb{E}_{\theta}[Y X],$ first-order moment,
  43 \item[--] $M_2(\theta):=\mathbb{E}_{\theta}\Big[Y\big(X\otimes X-\sum_{j\in[d]}e_j\otimes e_j\big)\Big],$ second-order moment and
  44 \item[--] $M_3(\theta):=\mathbb{E}_{\theta}\Big[Y\big(X\otimes X\otimes X-\sum_{j\in[d]}\big[X\otimes e_j\otimes e_j+e_j\otimes X\otimes e_j+e_j\otimes e_j\otimes X\big]\big)\Big]
  45  $ third-order moment.
  46  \end{itemize}
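
\noindent
As an illustration, these definitions can be written coordinatewise. Writing $X_i$ for the $i$-th coordinate of $X$ and $\delta_{ab}$ for the Kronecker symbol (both notations are introduced here for this sketch), one gets, for $i_1,i_2,i_3\in[d]$:
\begin{eqnarray*}
        M_1(\theta)[i_1]&=&\mathbb{E}_{\theta}\big[Y X_{i_1}\big],
        \\
        M_2(\theta)[i_1,i_2]&=&\mathbb{E}_{\theta}\big[Y\big(X_{i_1}X_{i_2}-\delta_{i_1 i_2}\big)\big],
        \\
        M_3(\theta)[i_1,i_2,i_3]&=&\mathbb{E}_{\theta}\big[Y\big(X_{i_1}X_{i_2}X_{i_3}-X_{i_1}\delta_{i_2 i_3}-X_{i_2}\delta_{i_1 i_3}-X_{i_3}\delta_{i_1 i_2}\big)\big].
\end{eqnarray*}
Assuming an i.i.d. sample $(X^{(\ell)},Y^{(\ell)})_{\ell\in[n]}$ (a hypothetical notation, not taken from the text), a natural plug-in estimate replaces each expectation by an empirical mean, for instance $\widehat{M}_1=\frac{1}{n}\sum_{\ell=1}^{n}Y^{(\ell)}X^{(\ell)}$.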
  47
  48 \noindent
  49 Let, for $k=1,\ldots,K$, $\lambda_k = \|\beta_k \| $ and $\mu_{k}=\beta_{k}/\|\beta_k \| $.
  50 Using Stein's identity, Anandkumar et al. (\cite{AK2014b}) prove the following lemma:
  51 \begin{lemma}[\cite{AK2014b}]
  52 \label{lem_rewrite}
  53 %\begin{eqnarray*}
  54 %       M_1(\theta)&=&\sum_{k=1}^Kr^{(1)}_k\mu_k%\label{eq01}
  55 %       \\
  56 %       M_2(\theta)&=&\sum_{k=1}^Kr^{(2)}_k.\mu_k\otimes \mu_k%\label{eq02}
  57 %       \\
  58 %       M_3(\theta)&=&\sum_{k=1}^Kr^{(3)}_k.\mu_k\otimes \mu_k\otimes \mu_k%\label{eq03}
  59 %\end{eqnarray*}
  60 %where $r^{(1)}_k=\omega_k\lambda_k\mathbb{E}[g'\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)],$ $r^{(2)}_k=\omega_k\lambda_k^2\mathbb{E}[g''\big(\lambda_k<X,\mu_k>+b_k\big)],$ and $r^{(3)}_k=\omega_k\lambda_k^3\mathbb{E}[g^{(3)}\big(\lambda_k<X,\mu_k>+b_k\big)].$
  61 %
  62 %\end{lemma}
  63 Under sufficient smoothness and integrability conditions on the link function (which hold for the logit and probit link functions, or under our assumption (H3) below), the moments can be rewritten as:
  64 \begin{eqnarray*}
  65         M_1(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k\mathbb{E}[g'\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)]\;\mu_k ,%\label{eq01}
  66         \\
  67         M_2(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k^2\mathbb{E}[g''\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)]\;\mu_k\otimes \mu_k ,%\label{eq02}
  68         \\
  69         M_3(\theta)&=&\sum_{k=1}^K\omega_k\lambda_k^3\mathbb{E}[g^{(3)}\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)]\;\mu_k\otimes \mu_k\otimes \mu_k .%\label{eq03}
  70 \end{eqnarray*}
  71 \end{lemma}
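
\noindent
To see where the first identity comes from, here is a sketch under the assumptions of the lemma, with $X\sim \mathcal{N}(0,I_d)$. Stein's identity states that $\mathbb{E}[f(X)X]=\mathbb{E}[\nabla f(X)]$ for any sufficiently smooth and integrable $f:\R^{d}\to\R$. The functions $f_k(x)=g(\langle \beta_k,x\rangle+b_k)$ are a notation introduced here for illustration. Their gradient is $\nabla f_k(x)=g'(\langle \beta_k,x\rangle+b_k)\,\beta_k$, so that, using $\beta_k=\lambda_k\mu_k$:
\begin{eqnarray*}
        M_1(\theta)&=&\mathbb{E}_{\theta}\big[\mathbb{E}_{\theta}[Y\vert X]\,X\big]=\sum_{k=1}^K\omega_k\,\mathbb{E}\big[f_k(X)\,X\big]
        \\
        &=&\sum_{k=1}^K\omega_k\,\mathbb{E}\big[g'\big(\langle \beta_k,X\rangle+b_k\big)\big]\;\beta_k
        \\
        &=&\sum_{k=1}^K\omega_k\lambda_k\,\mathbb{E}\big[g'\big(\lambda_k\langle X,\mu_k\rangle+b_k\big)\big]\;\mu_k .
\end{eqnarray*}
The second and third identities follow in the same way from the second-order and third-order versions of Stein's identity, which is where the correction terms in the definitions of $M_2(\theta)$ and $M_3(\theta)$ come from.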
  72 It is proved in \cite{AK2014b} that the knowledge of $M_{3}(\theta)$ leads to the knowledge of $\mu_{1},\ldots,\mu_{K}$ up to their sign as soon as they are linearly independent. In the next section, we prove that the knowledge of all cross moments up to order $3$ allows one to recover all the parameters for the probit link function, under the same assumption on the regression coefficients. We also prove that, for a general link function satisfying some weak assumption, the knowledge of all cross moments up to order $3$ allows one to recover all the parameters provided they are not too far from $0$.
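
\noindent
As a sketch of why the probit link is convenient, the expectation appearing in the first identity of Lemma \ref{lem_rewrite} can be computed in closed form when $g=\Phi$. This computation is added here for illustration, with $\varphi$ denoting the standard normal density and $U$, $W$ auxiliary variables, none of which are notations used above. Since $\|\mu_k\|=1$ and $X\sim \mathcal{N}(0,I_d)$, the variable $U=\langle X,\mu_k\rangle$ is standard normal. Taking $W\sim \mathcal{N}(0,1)$ independent of $U$, the variable $W-\lambda_k U$ has distribution $\mathcal{N}(0,1+\lambda_k^2)$, hence
\[
\mathbb{E}\big[\Phi(\lambda_k U+b_k)\big]=\PP\big(W-\lambda_k U\leq b_k\big)=\Phi\Big(\frac{b_k}{\sqrt{1+\lambda_k^2}}\Big).
\]
Differentiating both sides with respect to $b_k$ gives
\[
\mathbb{E}\big[\Phi'(\lambda_k U+b_k)\big]=\frac{1}{\sqrt{1+\lambda_k^2}}\;\varphi\Big(\frac{b_k}{\sqrt{1+\lambda_k^2}}\Big),
\]
so that, for the probit link, the first identity reads
\[
M_1(\theta)=\sum_{k=1}^K\omega_k\,\frac{\lambda_k}{\sqrt{1+\lambda_k^2}}\;\varphi\Big(\frac{b_k}{\sqrt{1+\lambda_k^2}}\Big)\;\mu_k .
\]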
  73
  74