Miniconference
​
Statistical Advances for Real Data Problems
8 February 2018, Paris
​A. Guilloux
Sorted-L1 norm for outliers detection and robust regression for large datasets
​
The problem of outlier detection will be introduced via a biological example concerning some genomic aspects of tumors. We will show that the mean-shift outlier model in generalized linear regression, where each outlier is encoded by an individual intercept, makes it possible to address this biological problem. In the simple linear model, we will then propose to robustly estimate the parameters of the model and to detect outliers at the same time. To that end, we will introduce a SLOPE penalty on the mean-shift parameters. We will then give theoretical guarantees under sparsity assumptions on the vector of individual intercepts.
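As a rough illustration of the penalty named in the title, the sorted-L1 norm of a vector pairs its largest absolute entry with the largest weight, the second largest with the second weight, and so on. The sketch below is not the speaker's implementation; the function name `sorted_l1_norm` and the example weights are assumptions chosen for illustration.

```python
def sorted_l1_norm(values, weights):
    """Sorted-L1 (SLOPE) norm: sum_i weights[i] * |values|_(i),
    where |values|_(1) >= |values|_(2) >= ... are the sorted magnitudes.

    `weights` must be non-negative and non-increasing (illustrative check only).
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if any(w1 < w2 for w1, w2 in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")
    magnitudes = sorted((abs(v) for v in values), reverse=True)
    return sum(w * m for w, m in zip(weights, magnitudes))


# Hypothetical vector of individual intercepts: only one entry is far from zero.
intercepts = [0.0, -3.0, 0.5, 0.0]
weights = [2.0, 1.5, 1.0, 0.5]
print(sorted_l1_norm(intercepts, weights))  # 2.0*3.0 + 1.5*0.5 = 6.75
```

With constant weights the function reduces to a multiple of the ordinary L1 norm, which is one way to see the sorted-L1 norm as a generalization of the lasso penalty.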
​
ABSTRACTS
Emilie Devijver
​
Network inference for "omic" data
​
In this talk, I will discuss some recent developments concerning the network structure of genes. First, I will introduce a network inference method that decomposes the network into several independent clusters. This structure turns out to be well suited to gene regulatory networks, giving biologists an easy tool for detecting interesting groups of genes. For example, it has been used to predict olfactory behavior in Drosophila, and key genes (some of them already known) have been detected. However, as with every network inference method, stability is a central concern. We will discuss some topological considerations that may help address it.
E. Kuhn
Testing variance components in nonlinear mixed effects models. Application to plant growth modeling
​
Mixed effects models are widely used to describe inter- and intra-individual variability in a population. A fundamental question when fitting such a model is to identify which parameters carry each type of variability, i.e. those that can be considered constant in the population, referred to as fixed effects, and those that vary among individuals, referred to as random effects. In this talk, we propose a test procedure based on the likelihood ratio for testing whether the variances of a subset of the random effects are equal to zero. The standard theoretical results on the asymptotic distribution of the likelihood ratio test cannot be applied in our context: the required assumptions are not fulfilled, since the tested parameter values lie on the boundary of the parameter space. Variance components testing has been addressed by several authors in the context of linear mixed effects models and, in nonlinear mixed effects models, in the particular case of testing the variance of a single random effect. We address the case of testing that the variances of a subset of the random effects are equal to zero. We prove that the asymptotic distribution of the test statistic is a chi-bar-square distribution, that is, a mixture of chi-square distributions, and we identify the weights of the mixture. We highlight that the limit distribution depends on whether or not the random effects are correlated. We present numerical tools to compute the corresponding quantiles. Finally, we illustrate the finite-sample properties of the test procedure through simulation studies and on real data.
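To make the notion of a chi-bar-square quantile concrete, here is a minimal sketch that evaluates the tail probability and quantiles of a mixture of chi-square distributions with given weights. It is not the numerical tool presented in the talk. The weights used in the example (an equal mix of 0 and 1 degrees of freedom) are an assumption: they correspond to the classical case of testing a single variance component, whereas the weights for a subset of random effects depend on the model and are not given in this abstract. The helper names `chi2_sf`, `chi_bar_sf` and `chi_bar_quantile` are hypothetical.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 0."""
    if df == 0:
        # Zero degrees of freedom: point mass at 0.
        return 1.0 if x < 0 else 0.0
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from Q(1, y) (even df) or Q(1/2, y) (odd df), then use the
    # recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi_bar_sf(x, weights):
    """Tail probability of the mixture; `weights` maps df -> mixture weight."""
    return sum(w * chi2_sf(x, df) for df, w in weights.items())


def chi_bar_quantile(level, weights, upper=1e3, tol=1e-10):
    """Quantile of the mixture at `level`, found by bisection on [0, upper]."""
    target = 1.0 - level
    if chi_bar_sf(0.0, weights) <= target:
        return 0.0  # the mass at zero already covers the requested level
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi_bar_sf(mid, weights) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Assumed weights: the classical 50:50 mixture of chi2(0) and chi2(1).
weights = {0: 0.5, 1: 0.5}
print(chi_bar_quantile(0.95, weights))  # about 2.71, versus 3.84 for chi2(1)
```

The example shows why ignoring the boundary issue is conservative: comparing the statistic with the usual chi-square(1) threshold of 3.84 instead of about 2.71 rejects less often than the nominal level allows.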
P. Latouche
The stochastic topic block model
​
Due to the significant increase in communications between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) over the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account the contents of the exchanges. This talk will introduce the stochastic topic block model (STBM), a probabilistic model for networks with textual edges. We will address the problem of discovering meaningful clusters of vertices that are coherent with respect to both the network interactions and the text contents. A classification variational expectation-maximization (C-VEM) algorithm will be proposed to perform inference. Finally, we will use this methodology to study the Enron political and financial scandals.
C. Lévy-Leduc
Statistical methods for analyzing data coming from molecular biology
​
In this talk, I will present new statistical approaches for analyzing different kinds of data coming from molecular biology: Hi-C data and "-omic" data. More precisely, in the first part, I will propose some two-dimensional segmentation approaches for processing large data matrices containing Hi-C data. In the second part, I will present variable selection approaches in the multivariate linear model for analyzing "-omic" data, taking into account the dependence that may exist between the columns of the observation matrices. In both cases, I will explain the statistical methodology and the results obtained from the theoretical, numerical and practical points of view.
M. Mougeot
Statistical and machine learning methods to model and forecast energy consumption or production
​
Since electricity can hardly be stored, forecasting tools are essential to appropriately balance consumption and production of energy, including renewable energies. Analyzing historical data shows that time series of energy production and of energy consumption may be radically different. Consequently, statistical tools and methods adapted to each case should be used to model or forecast energy. Based on a sparse learning process for functional regression, a “prediction box” has been introduced to forecast energy consumption. This model makes it possible to forecast, in a high-dimensional framework, the intraday load curves of French national consumption. On the other hand, for modeling and forecasting wind energy, machine learning and aggregation techniques have proved more appropriate.