\documentclass[12pt, a4paper]{article}

\usepackage[margin=2.5cm]{geometry}
\usepackage[utf8]{inputenc} % in encoding
\usepackage[T1]{fontenc} % out-encoding
\usepackage{eurosym}
\usepackage{lmodern, microtype} % goes OK with T1 fontenc
%\usepackage[authoryear, round]{natbib}
\usepackage{natbib}
\usepackage{color, tikz, graphicx, subfig}
\usepackage{amssymb, amsmath, amsthm}
\usepackage{setspace, lineno, url, xcolor}
\usepackage{savetrees}

\newcommand{\todo}[1]{\textcolor{blue}{TODO: #1}} % macro for todo entries

% Style options
\renewcommand\familydefault{\sfdefault} % Use with sans serif font
\setlength{\bibsep}{0.0pt} % Compact bibliography (natbib)

\title{Disaggregated Electricity Forecasting using Clustering of Individual Consumers \\
{\normalsize \color{gray} IRSDI - RESEARCH INITIATIVE IN INDUSTRIAL DATA SCIENCE}}

\author{Benjamin Auder \and
Jairo Cugliari \and
Yannig Goude \and
Jean-Michel Poggi
}
\date{\normalsize\today
\vspace{-1.2\baselineskip}}



\begin{document}
\maketitle

%\begin{abstract}

%\end{abstract}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Context}

\subsection{Industrial}

Electricity load forecasting is crucial for utilities, for production
planning as well as for marketing offers. The increasing deployment of
smart grid infrastructure calls for more flexible, data-driven
forecasting methods that adapt largely automatically to new data sets.
New metering infrastructures such as smart meters
provide new and potentially massive information about individual (household,
small and medium enterprise) consumption. As an example, in France,
ERDF (Electricite Reseau Distribution de France, the French manager of
the public electricity distribution network) deployed 250 000 smart meters,
covering a rural and an urban territory and providing half-hourly household
energy consumption each day. ERDF plans to install 35 million of them over the
French territory by the end of 2020, and exploiting such an amount of data
is an exciting but challenging task (see \url{http://www.erdf.fr/Linky}).

We propose to build clustering tools useful for forecasting the load
consumption. The idea is to disaggregate the global signal in such a way that
the sum of disaggregated forecasts significantly improves the prediction of the
whole global signal. The strategy is in three steps: first we cluster curves
defining super-consumers, then we build a hierarchy of partitions, within which
the best one is finally selected with respect to a disaggregated forecast
criterion. The proposed strategy is applied to a dataset of individual
consumers from the French electricity provider EDF. Disaggregation yields a
substantial gain of 16\% in forecast accuracy compared to the 1-cluster approach,
while preserving meaningful classes of consumers.

\subsection{Academic}

When working with a seasonal univariate continuous-time series, it is often
natural to segment it in time into consecutive curves, for example days, which
are then treated as a discrete time series of functions. In particular, in the
electrical context, the shape of the curves carries rich information about the
calendar day type, the meteorological conditions or the existence of special
electricity tariffs. Using the information contained in the shape of the load
curves leads to a very elegant formulation of functional forecasting.


%Electricity load experts naturally look at daily demand data as time functions
%called load curves. In a recent paper, \cite{shang2013} uses a functional time
%series approach for forecasting short-term electricity demand. This paper is
%illustrated by the half-hourly electricity demand from Monday to Sunday in South
%Australia. The strategy is also to consider a seasonal univariate time series as
%a time series of curves, then to reduce the dimensionality of curves by applying
%a functional principal component analysis and finally, following
%\cite{shang2011}, the principal component scores are forecasted using a
%univariate ARIMA models. In addition, since data points in the daily electricity
%demand are sequentially observed, a forecast updating method based on
%nonparametric bootstrap approach is proposed to improve the accuracy of point
%forecasts. With respect to this strategy, the scheme we propose handles the
%forecasting problem in a functional way avoiding the hour by hour processing and
%considers a more flexible way to construct the distribution leading to the
%prediction interval.

Building on the information contained in the shape of the load curves,
\cite{antoniadis2012prevision} proposed a flexible nonparametric function-valued
forecast model called KWF (\textit{Kernel + Wavelet + Functional}), well suited
to handle nonstationary series. The predictor can be seen as a weighted average
of futures of past situations, where the weights increase with the similarity
between the past situations and the current one. In addition, this strategy
provides simultaneous multiple-horizon predictions for a global forecast.

However, there is a need for local electricity load forecasting at different levels of the grid.
Bottom-up approaches, based on a two-stage process combining clustering and forecasting
methods, are a promising perspective. The first stage
consists in building classes in a population such that each class can be
forecast sufficiently well while the classes correspond to different load shapes or react
differently to exogenous variables like temperature or prices (see e.g.
\cite{labeeuw} in the context of demand response). The second stage consists in
aggregating forecasts to forecast the total or any subtotal of the population
consumption. For example, identifying and forecasting the consumption of a
sub-population reactive to an incentive is an important need when optimizing a
demand response program.

\section{Past work}

Few papers consider the problem of clustering individual consumption for
forecasting (e.g. \cite{iwafune2014short, Alzate, carevic2010applications, MisitiElec}).
Recently, \cite{energycon} proposed to build clustering tools useful for the two
tasks simultaneously: clustering individual customers and forecasting the load
consumption. The idea is to disaggregate the global signal in such a way that the
sum of disaggregated forecasts significantly improves the prediction of the whole
global signal. The general strategy is in three steps: first we cluster individual
curves defining super-consumers, then we build a hierarchy of partitions, within
which a best one is finally selected with respect to a disaggregated forecast
criterion. The predictions are made with the KWF model, which can be used as an
off-the-shelf tool.
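
The selection step can be sketched in notation that is introduced here purely for
illustration and is not taken from \cite{energycon}. Suppose the partition with $K$
classes splits the customers into $K$ super-consumers with aggregated load curves
$Z^{(1)}, \dots, Z^{(K)}$, so that the global signal is $Z = \sum_{k=1}^{K} Z^{(k)}$.
The disaggregated forecast of the next day and the retained partition could then be
written as
\[
  \widehat{Z}^{[K]}_{n+1} = \sum_{k=1}^{K} \widehat{Z}^{(k)}_{n+1},
  \qquad
  K^{\star} \in \arg\min_{K} \; \mathrm{err}\bigl(\widehat{Z}^{[K]}\bigr),
\]
where $\widehat{Z}^{(k)}_{n+1}$ is the forecast for super-consumer $k$ and
$\mathrm{err}$ stands for whichever forecast error criterion is retained (assumed
here to be evaluated on a test period). The case $K = 1$ corresponds to forecasting
the global signal directly.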

While this work resulted in the specification of an algorithm, what is currently
needed is a genuine demonstration that it scales up. A first step in this direction
was made in \cite{auder2014}.


% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Aims}

The method proposed in \cite{energycon} has been successfully tested on a small
data set of EDF clients. With the current deployment of smart meters in France, the
available volume of individual data is increasing day after day. There is therefore
a genuine need to measure how well existing methods scale up.

The aim of this project is twofold. First, we will evaluate the capacity of the
strategy developed in \cite{energycon} to scale up and cope with the growing volume
of data. Second, we will study how to adapt the KWF prediction method to take an
exogenous variable into account. In our particular problem, the exogenous variables
can be any meteorological measurement that affects the load demand and is available
at the moment of the prediction.


% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%

\section{Means considered}

\subsection{Methods}
\paragraph{Clustering analysis.} In general, clustering methods look for groups of
individuals in data in such a way that those belonging to the same group are more
similar to each other than to those from other groups. Many methods exist to cluster data:
hierarchical, center-based, probabilistic, etc. Almost all of them depend heavily
on the choice of a similarity measure between individuals. For this project we plan
to compare individuals in terms of their wavelet spectrum signature. Thanks to this
strategy, nonstationary signals may be fairly compared. Moreover, the signals need not be
measured on the same temporal grid. However, to obtain relevant results,
the wavelet signatures should be corrected by exogenous information (e.g. the one
provided as client characteristics).

\paragraph{Wavelet analysis.} Since the objects to analyze (load curves) can be viewed
as functions of time, functional data analysis techniques are one possible choice to
represent these objects. From a stochastic point of view, the functions are realizations
of a nonstationary random process. The wavelet transform can be used to extract
relevant information about the functions in both time and frequency. With an
appropriate representation of the objects, it is then possible to construct
a meaningful distance between load curves.
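
As an illustration only, with notation introduced here and a construction that is
one possible choice among others: consider a load curve $z$ sampled on $2^J$ points
and its discrete wavelet transform, made of one approximation coefficient $c_0$ and
detail coefficients $d_{j,k}$ at scales $j = 0, \dots, J-1$. If the transform is
orthonormal, the energy of the curve splits across scales,
\[
  \|z\|_2^2 = c_0^2 + \sum_{j=0}^{J-1} \|d_j\|_2^2,
  \qquad
  \|d_j\|_2^2 = \sum_{k=0}^{2^j-1} d_{j,k}^2 ,
\]
and a simple dissimilarity between two curves $z$ and $z'$, decomposed over the
same $J$ scales, compares their scale-wise energy signatures:
\[
  D(z, z') = \Bigl( \sum_{j=0}^{J-1} \bigl( \|d_j\|_2^2 - \|d'_j\|_2^2 \bigr)^2 \Bigr)^{1/2}.
\]
Since $c_0$ is left out, such a dissimilarity is driven by the shape of the curves
rather than by their mean level.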

\paragraph{Forecasting with KWF.}
The basic idea of nonparametric forecasting is that similar cases in the past
have similar future consequences. For example, the electricity consumption is
divided into blocks of one day. Then, using a dissimilarity measure, the
blocks similar to the last observed block are searched for in the past and a vector
of weights is built. Finally, the forecast of the next day is obtained as a
weighted average of the days that followed the most similar past blocks, using this
vector of weights. From a statistical point of view, the model is a kernel estimate
of the regression function relating each block to the one that precedes it.
In \cite{antoniadis2006functional} this basic model is
extended to the case of stationary functional random variables. But in the
context of electrical power demand, the hypothesis of stationarity generally
fails: an evolving mean level and the existence of groups that may be seen as
classes of stationarity have to be considered. Corrections to take into
account these two main nonstationary features are considered in
\cite{antoniadis2012prevision}, defining a flexible nonparametric function-valued
forecast model called KWF (\textit{Kernel + Wavelet + Functional}), well suited
to handle nonstationary series. The predictor can be seen as a weighted average
of futures of past situations, where the weights increase with the similarity
between the past situations and the current one. Again, the similarity is defined
through the wavelet decompositions of the two segments.
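
For concreteness, the basic (stationary) form of such a predictor can be sketched
as follows. The symbols $Z_m$, $K$, $h$ and $D$ are notation introduced here, and
the corrections for nonstationarity are deliberately left out. With
$Z_1, \dots, Z_n$ the observed daily blocks,
\[
  \widehat{Z}_{n+1}(t) = \sum_{m=1}^{n-1} w_{m,n} \, Z_{m+1}(t),
  \qquad
  w_{m,n} = \frac{K\bigl(D(Z_m, Z_n)/h\bigr)}{\sum_{l=1}^{n-1} K\bigl(D(Z_l, Z_n)/h\bigr)},
\]
where $K$ is a kernel, $h > 0$ a bandwidth and $D$ a dissimilarity computed from
the wavelet coefficients of the two blocks. For a nonnegative kernel the weights
are nonnegative and sum to one, so the forecast is a convex combination of the days
that followed past blocks resembling the last observed one.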


\subsection{Technology} % to be employed (hardware and software)


The volume of data to deal with in this project can be handled with standard
but recent tools for data analysis.
The specific software tools will be a statistical programming language such as \texttt{R}, with some popular
libraries (\texttt{data.table}, \texttt{dplyr}) and specific packages to cope with wavelet analysis. All these elements are open source.

When the computational burden grows, we have direct access to larger computation capacities.

All the tools developed in the project will be made available under open source software licences.

\subsection{Research team}

The proposed team for developing this project is composed of three
academic members:
\begin{itemize}
\item Benjamin Auder, LMO, Univ Paris Saclay
\item Jairo Cugliari, ERIC, Univ Lyon
\item Jean-Michel Poggi, LMO, Univ Paris Saclay, Univ Paris Descartes
\end{itemize}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Data description}
\begin{itemize}
\item A first dataset, already used in \cite{energycon}, could be used, at least as a first step, to calibrate the method.
\item Simulated data could be obtained from EDF following \cite{bondu15}, or any simulation method preserving the confidentiality
of individual consumers. Any amount of such data could then be produced to benchmark the scalability of our approach.
\item Irish data provided by the Irish Commission for Energy Regulation, consisting of 2000 individual consumption curves (small and
medium enterprises and residential customers) at a half-hourly resolution, together with pre- and post-experiment surveys (see \cite{Cer_a, Cer_b}).
\end{itemize}



% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Budget}
The expected overall budget for the project is 15 000 \euro, which includes a one-day workshop.

\paragraph{Internal budget} The members of the research team are based in the Paris area and in Lyon.
Our way of working includes video and audio conferences on a regular basis as well as several in-person meetings.

We plan to present the work at international conferences, at both data science and energy-oriented meetings.

Last, a stress test of how well the proposed method scales up will require renting computing time on a specialized platform. We have access to
the Centre de Calcul de l'Institut National de Physique Nucléaire et de Physique des Particules (\url{http://cc.in2p3.fr/}) through the laboratory ERIC, Lyon 2.

\paragraph{Workshop organization on Individual Electricity Consumers}
A one-day workshop dedicated to Individual Electricity Consumers, including
sessions on data, packages and methods, could be organized in September
2017, and could be proposed to the French Statistical Society (SFdS) as a
satellite meeting of the Journées de Statistique 2018, which will be held on
the campus of EDF Lab in May 2018.


\begin{center}
\begin{tabular}{lr} \hline
\textbf{Internal budget} & \textbf{10 000 \euro}\\
\; Travels & 3 000 \euro\\
\; Conference fees & 3 000 \euro\\
\; Internal meetings & 2 000 \euro\\
\; Hiring of high performance computing time & 2 000 \euro\\
\textbf{Workshop organization} & \textbf{5 000 \euro} \\
\; Invitations of researchers & 3 000 \euro\\
\; Organization of the workshop & 2 000 \euro\\ \hline
\textbf{Global budget} & \textbf{15 000 \euro} \\ \hline
\end{tabular}
\end{center}


% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Vitae}

\paragraph{Benjamin Auder} is a CNRS Research Engineer at LMO, University Paris-Sud Orsay in France.
He obtained his PhD in statistics in 2011 at Université Pierre et Marie Curie, Paris.
His main research areas are clustering, dimensionality reduction, manifold learning and machine learning,
in addition to software development and implementation issues of algorithmic solutions.

(\url{http://auder.net/page-upsud/})

\paragraph{Jairo Cugliari} is an Assistant Professor of Statistics at University of Lyon in France. He obtained his PhD in statistics
in 2011 at University Paris-Sud 11, Orsay. His main research areas are functional data analysis methods
for classification and prediction in applied statistical problems.

(\url{http://eric.univ-lyon2.fr/~jcugliari/})



\paragraph{Jean-Michel Poggi} is a Professor of Statistics at University of Paris Descartes
and at University Paris-Sud Orsay in France. His main research areas are
tree-based methods for classification and regression, nonparametric time
series forecasting, wavelet methods and applied statistical modeling in the energy
and environment fields. His publications combine theoretical and practical
contributions together with industrial applications and software development.

\noindent
He is an elected member of the ISI, was President of the French Statistical
Society (SFdS), and is Vice-President of FENStatS, Vice-President of ENBIS and President of ECAS.

(\url{http://www.math.u-psud.fr/~poggi/})

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% S E C T I O N
%
\section{Associated industrial company} % And members


\paragraph{Yannig Goude} is a research engineer and project manager at EDF R\&D and an associate
professor at University Paris-Sud Orsay, France. He obtained his PhD in statistics and probability
in 2008 at University Paris-Sud 11, Orsay. His research interests are electricity load forecasting
and, more generally, time series analysis and forecasting, nonparametric models and expert aggregation.

(\url{https://fr.linkedin.com/in/yannig-goude-768b3980})

\bibliographystyle{plain}
\bibliography{biblio_irsdi} %,predintervals,rapportfinal}

\end{document}



\bibitem{Alzate}
C.~Alzate and M.~Sinn,
Improved electricity load forecasting via kernel spectral clustering of
smartmeter,
\emph{International Conference on Data Mining}, vol. 948, pp. 943--948,
2013.

\bibitem{antoniadis2006functional}
A.~Antoniadis, E.~Paparoditis, and T.~Sapatinas,
A functional wavelet-kernel approach for time series prediction,
\emph{Journal of the Royal Statistical Society, Series B},
vol. 68(5), pp. 837--857, 2006.

\bibitem{antoniadis2013clustering}
A.~Antoniadis, X.~Brossat, J.~Cugliari, and J.-M.~Poggi,
Clustering functional data using wavelets,
\emph{International Journal of Wavelets, Multiresolution and Information
Processing},
vol. 11(1), 2013.

\bibitem{antoniadis2012prevision}
A.~Antoniadis, X.~Brossat, J.~Cugliari, and J.-M.~Poggi,
Pr\'{e}vision d'un processus \`{a} valeurs fonctionnelles en pr\'{e}sence de
non stationnarit\'{e}s. Application \`{a} la consommation
d'\'{e}lectricit\'{e},
\emph{Journal de la Soci\'{e}t\'{e} Fran\c{c}aise de Statistique},
vol. 153(2), pp. 52--78, 2012.

\bibitem{brabec2015statistical}
M.~Brabec, O.~Kon{\'a}r, M.~Mal{\`y}, I.~Kasanick{\`y}, and E.~Pelik{\'a}n,
Statistical models for disaggregation and reaggregation of natural gas
consumption data,
\emph{Journal of Applied Statistics}, vol. 42(5), pp. 921--937, 2015.

\bibitem{carevic2010applications}
S.~Carevi{\'c}, T.~Capuder, and M.~Delimar,
Applications of clustering algorithms in long-term load forecasting,
\emph{Proceedings Energy Conference and Exhibition (EnergyCon),
2010 IEEE International}, pp. 688--693, 2010.

\bibitem{Chicco}
G.~Chicco,
Overview and performance assessment of the clustering methods for electrical
load pattern grouping,
\emph{Energy}, vol. 42, pp. 68--80, 2012.

\bibitem{Figueiredo}
V.~Figueiredo, F.~Rodrigues, Z.~Vale, and J.~B.~Gouveia,
An electric energy consumer characterization framework based on data mining
techniques,
\emph{IEEE Transactions on Power Systems}, vol. 20(2), pp. 596--602, 2005.

\bibitem{iwafune2014short}
Y.~Iwafune, Y.~Yagita, T.~Ikegami, and K.~Ogimoto,
Short-term forecasting of residential building load for distributed energy
management,
\emph{Proceedings Energy Conference (ENERGYCON), 2014 IEEE International},
pp. 1197--1204, 2014.

\bibitem{kaufmanpj}
L.~Kaufman and P.~Rousseeuw,
\emph{Finding groups in data: An introduction to cluster analysis},
John Wiley \& Sons, Hoboken, NJ, 1990.

\bibitem{Kwac}
J.~Kwac, J.~Flora, and R.~Rajagopal,
Household Energy Consumption Segmentation Using Hourly Data,
\emph{IEEE Transactions on Smart Grid}, vol. 5, pp. 420--430, 2014.

\bibitem{labeeuw}
W.~Labeeuw, J.~Stragier, and G.~Deconinck,
Potential of active demand reduction with residential wet appliances:
A case study for Belgium,
\emph{IEEE Transactions on Smart Grid}, vol. 6(1), pp. 315--323, 2015.

\bibitem{Liao}
T.~Warren Liao,
Clustering of time series data--a survey,
\emph{Pattern Recognition}, vol. 38(11), pp. 1857--1874, 2005.

\bibitem{MisitiElec}
M.~Misiti, Y.~Misiti, G.~Oppenheim, and J.-M.~Poggi,
Optimized Clusters for Disaggregated Electricity Load Forecasting,
\emph{REVSTAT -- Statistical Journal}, vol. 8(2), pp. 105--124, 2010.

\bibitem{Mutanen}
A.~Mutanen, M.~Ruska, S.~Repo, and P.~Jarventausta,
Customer classification and load profiling method for distribution systems,
\emph{IEEE Transactions on Power Delivery}, vol. 26(3), pp. 1755--1763, 2011.

%\bibitem{Piao}
%Piao, M., Lee, H. G., Park, J. H., Ryu, K. H.
% Application of Classification Methods for Forecasting Mid-Term
% Power Load Patterns.
% In Advanced Intelligent Computing Theories and Applications. Springer, 2008

\bibitem{Rasanen}
T.~R\"{a}s\"{a}nen, D.~Voukantsis, H.~Niska, K.~Karatzas, and M.~Kolehmainen,
Data-based method for creating electricity use load profiles using large
amount of customer-specific hourly measured electricity use data,
\emph{Applied Energy}, vol. 87(11), pp. 3538--3545, 2010.

\bibitem{Rhodes}
J.~D.~Rhodes, W.~J.~Cole, C.~R.~Upshaw, T.~F.~Edgar, and M.~E.~Webber,
Clustering analysis of residential electricity demand profiles,
Preprint submitted to \emph{Applied Energy}, March 18, 2014.

\bibitem{steinley2008new}
D.~Steinley and M.~Brusco,
A new variable weighting and selection procedure for k-means cluster analysis,
\emph{Multivariate Behavioral Research}, 43:32, 2008.

\bibitem{wijaya2015forecasting}
T.~K.~Wijaya, M.~Sinn, and B.~Chen,
Forecasting Uncertainty in Electricity Demand,
\emph{AAAI-15 Workshop on Computational Sustainability, EPFL-CONF-203769},
2015.

\bibitem{Zhou}
K.~Zhou, S.~Yang, and C.~Shen,
A review of electric load classification in smart grid environment,
\emph{Renewable and Sustainable Energy Reviews}, vol. 24, pp. 103--110, 2013.
