Commit | Line | Data |
---|---|---|
3d5b5060 | 1 | --- |
c83df166 | 2 | title: Use morpheus package |
3d5b5060 BA |
3 | |
4 | output: | |
5 | pdf_document: | |
6 | number_sections: true | |
7 | toc_depth: 1 | |
8 | --- | |
9 | ||
10 | ```{r setup, results="hide", include=FALSE} | |
11 | knitr::opts_chunk$set(echo = TRUE, include = TRUE, | |
12 | cache = TRUE, comment="", cache.lazy = FALSE, | |
13 | out.width = "100%", fig.align = "center") | |
14 | ``` | |
15 | ||
c83df166 BA |
16 | ## Introduction |
17 | <!--Tell that we try to learn classification parameters in a non-EM way, using algebric manipulations.--> | |
3d5b5060 | 18 | |
dad25cd2 BA |
19 | *morpheus* is a contributed R package which attempts to find the parameters of a |
20 | mixture of logistic classifiers. | |
21 | When the data under study come from several groups that have different characteristics, | |
22 | using mixture models is a very popular way to handle heterogeneity. | |
23 | Thus, many algorithms were developed to deal with various mixtures models. | |
24 | Most of them use likelihood methods or Bayesian methods that are likelihood dependent. | |
cff1083b | 25 | *flexmix* is an R package which implements these kinds of algorithms. |
3d5b5060 | 26 | |
dad25cd2 BA |
27 | However, one problem of such methods is that they can converge to local maxima, |
28 | so several starting points must be explored. | |
29 | Recently, spectral methods were developed to bypass EM algorithms and they were proved | |
30 | able to recover the directions of the regression parameter | |
c83df166 | 31 | in models with known link function and random covariates (see [XX]). |
dad25cd2 BA |
32 | Our package extends such moment methods using least squares to get estimators of the |
33 | whole parameters (with theoretical garantees, see [XX]). | |
cff1083b | 34 | Currently it can handle only binary output $-$ which is a common case. |
3d5b5060 | 35 | |
c83df166 BA |
36 | ## Model |
37 | ||
dad25cd2 BA |
38 | Let $X\in \R^{d}$ be the vector of covariates and $Y\in \{0,1\}$ be the binary output. |
39 | A binary regression model assumes that for some link function $g$, the probability that | |
40 | $Y=1$ conditionally to $X=x$ is given by $g(\langle \beta, x \rangle +b)$, where | |
41 | $\beta\in \R^{d}$ is the vector of regression coefficients and $b\in\R$ is the intercept. | |
42 | Popular examples of link functions are the logit link function where for any real $z$, | |
43 | $g(z)=e^z/(1+e^z)$ and the probit link function where $g(z)=\Phi(z),$ with $\Phi$ | |
44 | the cumulative distribution function of the standard normal ${\cal N}(0,1)$. | |
45 | Both are implemented in the package. | |
46 | ||
47 | If now we want to modelise heterogeneous populations, let $K$ be the number of | |
48 | populations and $\omega=(\omega_1,\cdots,\omega_K)$ their weights such that | |
49 | $\omega_{j}\geq 0$, $j=1,\ldots,K$ and $\sum_{j=1}^{K}\omega{j}=1$. | |
50 | Define, for $j=1,\ldots,K$, the regression coefficients in the $j$-th population | |
51 | by $\beta_{j}\in\R^{d}$ and the intercept in the $j$-th population by | |
52 | $b_{j}\in\R$. Let $\omega =(\omega_{1},\ldots,\omega_{K})$, | |
53 | $b=(b_1,\cdots,b_K)$, $\beta=[\beta_{1} \vert \cdots,\vert \beta_K]$ the $d\times K$ | |
54 | matrix of regression coefficients and denote $\theta=(\omega,\beta,b)$. | |
e36b1046 | 55 | The model of population mixture of binary regressions is given by: |
dad25cd2 | 56 | |
e36b1046 BA |
57 | \begin{equation} |
58 | \label{mixturemodel1} | |
59 | \PP_{\theta}(Y=1\vert X=x)=\sum^{K}_{k=1}\omega_k g(<\beta_k,x>+b_k). | |
60 | \end{equation} | |
61 | ||
dad25cd2 | 62 | ## Algorithm, theoretical garantees |
e36b1046 | 63 | |
dad25cd2 BA |
64 | The algorithm uses spectral properties of some tensor matrices to estimate the model |
65 | parameters $\Theta = (\omega, \beta, b)$. Under rather mild conditions it can be | |
66 | proved that the algorithm converges to the correct values (its speed is known too). | |
67 | For more informations on that subject, however, please refer to our article [XX]. | |
68 | In this vignette let's rather focus on package usage. | |
3d5b5060 | 69 | |
dad25cd2 | 70 | ## Usage |
85e0343a BA |
71 | <!--We assume that the random variable $X$ has a Gaussian distribution. |
72 | We now focus on the situation where $X\sim \mathcal{N}(0,I_d)$, $I_d$ being the | |
73 | identity $d\times d$ matrix. All results may be easily extended to the situation | |
74 | where $X\sim \mathcal{N}(m,\Sigma)$, $m\in \R^{d}$, $\Sigma$ a positive and | |
75 | symetric $d\times d$ matrix. ***** TODO: take this into account? --> | |
e36b1046 | 76 | |
5859426b BA |
77 | The two main functions are: |
78 | * computeMu(), which estimates the parameters directions, and | |
79 | * optimParams(), which builds an object \code{o} to estimate all other parameters | |
80 | when calling \code{o$run()}, starting from the directions obtained by the | |
81 | previous function. | |
82 | A third function is useful to run Monte-Carlo or bootstrap estimations using | |
83 | different models in various contexts: multiRun(). We'll show example for all of them. | |
e36b1046 | 84 | |
5859426b | 85 | ### Estimation of directions |
e36b1046 | 86 | |
5859426b BA |
87 | In a real situation you would have (maybe after some pre-processing) the matrices |
88 | X and Y which contain vector inputs and binary output. | |
89 | However, a function is provided in the package to generate such data following a | |
90 | pre-defined law: | |
e36b1046 | 91 | |
d294ece1 BA |
92 | ```{r, results="show", include=TRUE, echo=TRUE} |
93 | library(morpheus) | |
5859426b | 94 | io <- generateSampleIO(n=10000, p=1/2, beta=matrix(c(1,0,0,1),ncol=2), b=c(0,0), link="probit") |
d294ece1 BA |
95 | # io$X and io$Y contain the sample data |
96 | ``` | |
5859426b | 97 | |
d294ece1 BA |
98 | $n$ is the total number of samples (lines in X, number of elements in Y) |
99 | $p$ is a vector of proportions, of size $d-1$ (because the last proportion is deduced | |
100 | from the others: $p$ elements sums to 1) [TODO: omega or p?] | |
101 | $\beta$ is the matrix of linear coefficients, as written above in the model. | |
102 | $b$ is the vector of intercepts (as in linear regression, and as in the model above) | |
5859426b BA |
103 | link can be either "logit" or "probit", as mentioned earlier. |
104 | ||
105 | This function outputs a list containing in particular the matrices X and Y, allowing to | |
106 | use the other functions (which all require either these, or the moments). | |
107 | ||
d294ece1 BA |
108 | ```{r, results="show", include=TRUE, echo=TRUE} |
109 | mu <- computeMu(io$X, io$Y, optargs=list(K=2)) | |
110 | ``` | |
5859426b | 111 | |
d294ece1 BA |
112 | The optional argument, "optargs", is a list which can provide |
113 | * the number of clusters $K$, | |
114 | * the moments matrix $M$ (computed with the "computeMoments()" function), | |
115 | * the joint-diagonalisation method ("uwedge" or "jedi"), | |
116 | * the number of random vectors for joint-diagonalization. | |
117 | See ?computeMu and the code for more details. | |
118 | ||
119 | ### Estimation of the other parameters | |
120 | ||
121 | The other parameters are estimated by solving an optimization problem. | |
122 | The following function builds and return an optimization algorithm object: | |
123 | ||
124 | ```{r, results="show", include=TRUE, echo=TRUE} | |
125 | M <- computeMoments(io$X, io$Y) | |
126 | # X and Y must be provided if the moments matrix is not given | |
127 | algopt <- optimParams(K=2, link="probit", optargs=list(M=M)) | |
128 | # Optimization starts at beta = mu, b = 0 and p = uniform distribution | |
129 | x0 <- list(beta = mu) | |
130 | theta <- algopt$run(x0) | |
131 | ``` | |
5859426b | 132 | |
d294ece1 BA |
133 | Now theta is a list with three slots: |
134 | * $p$: estimated proportions, | |
135 | * $\beta$: estimated regression matrix, | |
136 | * $b$: estimated bias. | |
5859426b BA |
137 | |
138 | ### Monte-Carlo and bootstrap | |
139 | ||
d294ece1 BA |
140 | The package provides a function to compare methods on several computations on random data. |
141 | It takes in input a list of parameters, then a list of functions which output some quantities | |
142 | (on the first example, our "computeMu()" method versus flexmix way of estimating directions), | |
143 | and finally a method to prepare the arguments which will be given to the functions in the | |
144 | list just mentioned; this allows to run Monte-Carlo estimations with the exact same samples | |
145 | for each compared method. The two last arguments to "multiRun()" control the number of runs, | |
146 | and the number of cores (using the package parallel). | |
147 | ||
148 | ```{r, results="show", include=TRUE, echo=TRUE} | |
149 | beta <- matrix(c(1,-2,3,1), ncol=2) | |
150 | io <- generateSampleIO(n=1000, p=1/2, beta=beta, b=c(0,0), "logit") | |
151 | mu <- normalize(beta) | |
152 | ||
153 | # Example 1: bootstrap + computeMu, morpheus VS flexmix; assumes fargs first 3 elts X,Y,K | |
154 | mr1 <- multiRun(list(X=io$X,Y=io$Y,optargs=list(K=2,jd_nvects=0)), list( | |
155 | # morpheus | |
156 | function(fargs) { | |
157 | library(morpheus) | |
158 | ind <- fargs$ind | |
159 | computeMu(fargs$X[ind,],fargs$Y[ind],fargs$optargs) | |
160 | }, | |
161 | # flexmix | |
162 | function(fargs) { | |
163 | library(flexmix) | |
164 | source("../patch_Bettina/FLXMRglm.R") | |
165 | ind <- fargs$ind | |
166 | K <- fargs$optargs$K | |
167 | dat = as.data.frame( cbind(fargs$Y[ind],fargs$X[ind,]) ) | |
168 | out = refit( flexmix( cbind(V1, 1 - V1) ~ 0+., data=dat, k=K, | |
169 | model=FLXMRglm(family="binomial") ) ) | |
170 | normalize( matrix(out@coef[1:(ncol(fargs$X)*K)], ncol=K) ) | |
171 | } ), | |
172 | prepareArgs = function(fargs,index) { | |
173 | # Always include the non-shuffled dataset | |
174 | if (index == 1) | |
175 | fargs$ind <- 1:nrow(fargs$X) | |
176 | else | |
177 | fargs$ind <- sample(1:nrow(fargs$X),replace=TRUE) | |
178 | fargs | |
179 | }, N=10, ncores=3) | |
180 | # The result is correct up to matrices columns permutations; align them: | |
181 | for (i in 1:2) | |
182 | mr1[[i]] <- alignMatrices(mr1[[i]], ref=mu, ls_mode="exact") | |
183 | ``` | |
184 | ||
185 | Several plots are available: histograms, boxplots, or curves of coefficients. | |
186 | We illustrate boxplots and curves here (histograms function uses the same arguments, | |
187 | see ?plotHist). | |
188 | ||
189 | ``` | |
190 | # Second row, first column; morpheus on the left, flexmix on the right | |
191 | plotBox(mr1, 2, 1, "Target value: -1") | |
192 | ``` | |
193 | ||
194 | ```{r, results="show", include=TRUE, echo=TRUE} | |
195 | # Example 2: Monte-Carlo + optimParams from X,Y, morpheus VS flexmix; first args n,p,beta,b | |
196 | mr2 <- multiRun(list(n=1000,p=1/2,beta=beta,b=c(0,0),optargs=list(link="logit")), list( | |
197 | # morpheus | |
198 | function(fargs) { | |
199 | library(morpheus) | |
200 | mu <- computeMu(fargs$X, fargs$Y, fargs$optargs) | |
201 | optimParams(fargs$K,fargs$link,fargs$optargs)$run(list(beta=mu))$beta | |
202 | }, | |
203 | # flexmix | |
204 | function(fargs) { | |
205 | library(flexmix) | |
206 | source("../patch_Bettina/FLXMRglm.R") | |
207 | dat <- as.data.frame( cbind(fargs$Y,fargs$X) ) | |
208 | out <- refit( flexmix( cbind(V1, 1 - V1) ~ 0+., data=dat, k=fargs$K, | |
209 | model=FLXMRglm(family="binomial") ) ) | |
210 | sapply( seq_len(fargs$K), function(i) as.double( out@components[[1]][[i]][,1] ) ) | |
211 | } ), | |
212 | prepareArgs = function(fargs,index) { | |
213 | library(morpheus) | |
214 | io = generateSampleIO(fargs$n, fargs$p, fargs$beta, fargs$b, fargs$optargs$link) | |
215 | fargs$X = io$X | |
216 | fargs$Y = io$Y | |
217 | fargs$K = ncol(fargs$beta) | |
218 | fargs$link = fargs$optargs$link | |
219 | fargs$optargs$M = computeMoments(io$X,io$Y) | |
220 | fargs | |
221 | }, N=10, ncores=3) | |
222 | # As in example 1, align results: | |
223 | for (i in 1:2) | |
224 | mr2[[i]] <- alignMatrices(mr2[[i]], ref=beta, ls_mode="exact") | |
225 | ``` | |
226 | ||
227 | ``` | |
228 | # Second argument = true parameters matrix; third arg = index of method (here "morpheus") | |
229 | plotCoefs(mr2, beta, 1) | |
230 | # Real params are on the continous line; estimations = dotted line | |
231 | ``` |