# Special issue on Big Data and the Statistical Sciences: Guest Editor's Introduction

The era of Big Data is well underway. It presents, on the one hand, clear challenges to the discipline of statistics and to statisticians and, on the other hand, many opportunities for statistical scientists. We statisticians are challenged to show our leadership in what is clearly our traditional domain: data. At the same time Big Data offers many opportunities for statistical scientists to push science, technology, and engineering forward and to show that the basic ideas of our field remain relevant, nay critically important, in this new era.

Current chatter around the meaning of Big Data shows that the terms “statistician” and “data scientist” are widely used, outside the community to which I belong, to apply to a much wider group than that community. Essentially every discipline has data; with those data come discipline-specific data scientists. Techniques and jargon develop independently of work in other fields. More importantly for the readership of The Canadian Journal of Statistics, those techniques often ignore the more encompassing, general work of statisticians such as we. We understand well that the basic ideas underlying our study of data analytic techniques apply in all sorts of contexts and that lessons learned in one context have important value in other contexts in entirely different disciplines.

This issue of The Canadian Journal of Statistics is therefore dedicated to Big Data and the Statistical Sciences and to highlighting both the value of classical statistical thought in approaching novel large-scale data problems and the challenges facing the professional statistical community. In this issue you will see that these classical statistical ideas continue to have a crucial role to play in keeping data analysis honest, efficient, and effective. You will see opportunities for new statistical methodology built on old statistical ideas across a wide spectrum of applications. You will see that huge new computing resources do not put an end to the need for careful modelling, for honest assessment of uncertainty, or for good experimental design.

We have here both review articles and methodological proposals. Some are Bayesian in view, some are frequentist, and some are clearly both. We cover experimental design, Official Statistics, modern genetics, on-line methods, Markov Chain Monte Carlo, functional data, graphical models, dimension reduction, local methods, model selection, post-selection inference, high-dimensional limit theory, and many more ideas. I want in the remainder of this introduction to highlight a few of those ideas, draw some connections to the challenges I have mentioned, and perhaps point out places where our community has particular obligations.

Mary Thompson looks at the impact of Big Data on Official Statistics in a wide-ranging review. She articulates many ways in which our ability to collect more data with much greater complexity and to fit much larger models will change the way agencies like Statistics Canada do their work. For instance, some concepts are traditionally defined in terms which suit the way they are measured more than the underlying idea; access to larger and timelier data sources may change this balance. As another example, most statistical agencies are moving rapidly to augment, or replace, traditional survey data with administrative data and to use the paradata gathered automatically as part of electronic data collection methods; statisticians will need to cope with data quality issues, in the administrative data at least, because those data were not gathered for the statistical agency’s purpose. Thompson considers carefully the impact of continuous or rolling data collection and discusses the future use of visualization in Official Statistics before concluding with an important list of research topics needing attention from statisticians. The topics show clearly that one impact of Big Data is positive for statisticians: there are many new and serious problems squarely situated within our field.

We are in the midst of a biological revolution in which gene sequencing and related techniques have transformed the way we seek to understand diseases and other biological processes. Shelley Bull, Irene Andrulis, and Andrew Paterson consider Molecular and Genetic Epidemiology and use two multidisciplinary collaborations (one in breast cancer and one in diabetes) to illustrate the interplay between multiple studies and multiple techniques, both experimental and statistical, in building an understanding of a particular disease. The authors demonstrate that there are continued roles for classical statistical ideas once these are stepped up to work in more complex situations. At the same time they highlight the need for statistical ideas which cope with model misspecification, high-dimensional parameter spaces, and the associated impact of model selection on inference.

One approach to this last problem is illustrated by Jonathan Taylor and Robert Tibshirani who consider post-selection inference for penalized likelihood models. In the Gaussian case there is now a body of work by Taylor, Tibshirani, and co-authors showing how to make exact, conditional inferences. The approach competes with the high-dimensional limit theory built on modern empirical process techniques which provides unconditional but approximate inferences. In the paper at hand, Taylor and Tibshirani extend these conditional inference ideas to general likelihood contexts with LASSO penalty structures.
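The need for such corrections is easy to see in a toy simulation. The sketch below is my own illustration of the problem, not the Taylor–Tibshirani method: under a global null, if we report naive normal p-values only for the coordinates that survive a selection step (here, |z| > 2), those p-values are badly miscalibrated.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
p, reps, thresh = 50, 2000, 2.0

selected_pvals = []
for _ in range(reps):
    z = rng.standard_normal(p)            # global null: every true effect is zero
    for zj in z[np.abs(z) > thresh]:      # selection: keep only "significant-looking" coordinates
        # naive two-sided normal p-value, ignoring the selection step
        selected_pvals.append(math.erfc(abs(zj) / math.sqrt(2)))

false_rate = np.mean(np.array(selected_pvals) < 0.05)
print(f"naive rejection rate among selected nulls: {false_rate:.2f}")  # → 1.00
```

Because the selection threshold of 2 already exceeds the 5% critical value, every selected null coordinate looks "significant"; valid inference must condition on the selection event, which is exactly what the exact conditional approach delivers.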

The competing high-dimensional limit theory approach occurs in many forms. One of these forms appears in the contribution of Dennis Cook and Liliana Forzani. The focus here is on dimension reduction via partial least squares regression. The paper illustrates the recently established view of asymptotic analysis in which a sequence of models of varying dimensions and therefore varying true parameter values is considered. Assumptions necessarily apply to the true parameter values, and the resulting approximations can be good only in parts of the parameter space. In the Big Data context this is the only way forward.

The technical component of Cook and Forzani is preceded by an introduction which I hope will be widely read. It throws down an important list of challenges to our community and highlights some very negative views of our discipline from beyond our community. We need to face up to these criticisms and seek the sort of self-awareness which will let us understand their source.

Dimension reduction is also a component of Bing Li’s paper, which sets out a vision of a unified paradigm for the statistical analysis of Big Data. Studying a variety of contexts including both multivariate and functional settings using both linear and nonlinear models, Li explores the role of linear operators in statistical analysis. Five particular operators on Hilbert spaces are considered carefully, and the ideas are illustrated through the case of sufficient dimension reduction. An important feature of the paper is the structured discussion of “functional data analysis” on the one hand and “kernel learning” on the other.

The competition among ideas for data analysis is sharply highlighted by the changing jargon of our discipline. Beyond the changing names for the practitioners highlighted above we have the change in names for techniques, with older statistical jargon often being replaced or modified by machine learning jargon. Rui Nie, Douglas Wiens, and Zhichun Zhai consider active learning and explore the relation of this idea to the traditional statistical field of optimal experimental design. They demonstrate clearly that classical statistical ideas remain important in this Big Data context. The goal is regression: modelling the impact of predictors on a response. In this paper the predictors used in training are drawn from a different density than the one governing the test data. On the basis of the training data some parametric regression model will be estimated and then used to predict responses at unobserved values of the covariates. The paper focuses on the impact of bias arising from misspecification of the parametric model and shows how optimal design ideas can be used to control the impact of that bias. The paper highlights, I think, a crucial issue in the Big Data era. Historically it has been common to feel comfortable assuming that bias is small compared to sampling variability. In huge data sets, however, where sampling variability is negligible, bias will not be negligible, at least by comparison, and often, I would argue, will not be negligible at all.
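A minimal simulation makes the bias mechanism concrete. This sketch is not the authors' minimax design construction; it swaps in the simpler device of importance weighting by the density ratio q(x)/p(x), with a hypothetical quadratic truth fit by a misspecified linear model, to show how a fit tuned to the training density can go badly wrong at the test density.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical setup: training density p = N(0, 1), test density q = N(2, 0.5^2),
# true curve y = x^2 plus noise, working model a + b*x (misspecified).
x_tr = rng.normal(0.0, 1.0, n)
y_tr = x_tr**2 + rng.normal(0.0, 0.1, n)
x_te = rng.normal(2.0, 0.5, n)
y_te = x_te**2 + rng.normal(0.0, 0.1, n)

def fit_line(x, y, w):
    """Weighted least squares for y ~ a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def log_norm_pdf(x, mu, sd):
    # constant term omitted: it cancels in the density ratio
    return -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)

# importance weights w(x) = q(x) / p(x), computed on the log scale
w_shift = np.exp(log_norm_pdf(x_tr, 2.0, 0.5) - log_norm_pdf(x_tr, 0.0, 1.0))

a_u, b_u = fit_line(x_tr, y_tr, np.ones_like(x_tr))   # plain OLS
a_w, b_w = fit_line(x_tr, y_tr, w_shift)              # weighted toward the test density
mse_u = np.mean((y_te - (a_u + b_u * x_te)) ** 2)
mse_w = np.mean((y_te - (a_w + b_w * x_te)) ** 2)
print(f"unweighted test MSE: {mse_u:.2f}, weighted test MSE: {mse_w:.2f}")
```

The unweighted fit minimizes error where the training density puts its mass and is severely biased at test inputs; reweighting targets the best linear approximation under the test density instead. The sampling noise here (sd 0.1) is tiny, so essentially all of the unweighted error is bias, which is precisely the regime the paragraph above describes.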

Statisticians have long understood that bias is a concept relative to a data structure and a model. In an iid sampling context the addition of a new predictor to a regression changes the model and the parameters. Chun Wang, Ming-Hui Chen, Jing Wu, Jun Yan, Yuping Zhang, and Elizabeth Schifano study an on-line inference problem. They consider data which arrive in a stream but where the set of predictor variables available may grow from time to time. With new predictors in hand one wants to use the new data without abandoning what one has learned from the old data. The inferential target is dynamic but the approach here shows that there is plenty of scope for classical statistical ideas to increase the efficiency with which data are used in dynamic contexts.
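The streaming side of this problem has a familiar classical core, which the following sketch illustrates with a fixed predictor set (it does not implement the authors' machinery for predictors that arrive mid-stream): for least squares, accumulating the sufficient statistics X'X and X'y batch by batch reproduces the full-data fit exactly, without storing old data.

```python
import numpy as np

rng = np.random.default_rng(2)

class StreamingOLS:
    """Accumulate the least-squares sufficient statistics X'X and X'y batch by batch."""
    def __init__(self, d):
        self.xtx = np.zeros((d, d))
        self.xty = np.zeros(d)

    def update(self, X, y):
        self.xtx += X.T @ X
        self.xty += X.T @ y

    def coef(self):
        return np.linalg.solve(self.xtx, self.xty)

d = 3
beta_true = np.array([1.0, -2.0, 0.5])
model = StreamingOLS(d)
X_all, y_all = [], []
for _ in range(10):                       # ten batches arriving in a stream
    X = rng.standard_normal((100, d))
    y = X @ beta_true + rng.normal(0.0, 0.1, 100)
    model.update(X, y)
    X_all.append(X)
    y_all.append(y)

# the streaming estimate equals the full-data OLS fit
beta_stream = model.coef()
beta_full, *_ = np.linalg.lstsq(np.vstack(X_all), np.concatenate(y_all), rcond=None)
print(np.allclose(beta_stream, beta_full))   # → True
```

The hard part, addressed in the paper, is what to do when a new predictor appears partway through the stream: the accumulated statistics refer to the old design, so the old information must be carried over to an expanded model rather than discarded.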

The competition between Bayesian and frequentist approaches to statistics does not seem to have been settled by the advent of the Big Data era. On the one hand, high-dimensional regression and network modelling have attracted a very considerable amount of attention in the form of frequency theory approaches. On the other hand, many have argued that only Bayesian methods can really work in complex settings. It seems possible that a pragmatic consensus is emerging in which many are willing to try whatever method seems most convenient for the problem at hand.

On the Bayesian end of things in this number is the paper by Reihaneh Entezari, Radu Craiu, and Jeffrey Rosenthal, which looks at the problem of Markov Chain Monte Carlo in settings where the computation needs to be parallelized to make statistical inference feasible. The authors show us how to run separate chains on each of several portions of the data, inflating the likelihoods for each portion. The resulting set of approximate posteriors is then combined to form a single approximation to the full posterior. A binomial example shows that careful partitioning of the full data set can sometimes make the approximation very good indeed, and a Bayesian regression example shows that the method improves usefully on earlier partitioning efforts. Finally the method is applied to Bayesian Regression Trees, a central Big Data tool.
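The effect of inflating the shard likelihoods can be seen analytically in a toy conjugate example of my own (a Gaussian mean with known variance and a flat prior, not the paper's examples): raising a shard's likelihood to the K-th power acts like replicating its points K times, so each sub-posterior already has the full-posterior spread, and only the combination of locations remains.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma, N, K = 2.0, 1200, 4                 # known sd, total sample size, number of shards
data = rng.normal(5.0, sigma, N)
shards = np.array_split(data, K)           # equal-sized portions of the full data set

# Full-data posterior for the mean under a flat prior: N(x̄, sigma^2 / N)
full_mean, full_var = data.mean(), sigma**2 / N

# Inflated-likelihood sub-posteriors: shard j with n_j points, likelihood raised
# to the power K, gives N(x̄_j, sigma^2 / (K * n_j)) — the same variance as the
# full posterior, so each machine's chain explores a posterior of the right scale.
sub_means = np.array([s.mean() for s in shards])
sub_vars = np.array([sigma**2 / (K * len(s)) for s in shards])

print(np.allclose(sub_vars, full_var))             # → True: spreads match the full posterior
print(abs(sub_means.mean() - full_mean) < 1e-9)    # → True: shard means average to x̄
```

In the conjugate case the calibration is exact; the contribution of the paper is to show how well combinations of such inflated sub-posteriors, each sampled by its own MCMC chain, approximate the full posterior in genuinely non-conjugate settings.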

Another Bayesian view which highlights the interplay between the two schools of inference is provided by the paper by Qiong Li, Xin Gao, and Hélène Massam studying “coloured graphical Gaussian models.” Here we have a high-dimensional multivariate normal sample and are interested in structured inference for the precision matrix of this multivariate normal law; the adjectives “coloured” and “graphical” describe particular structure imposed on this matrix. The high-dimensional setting makes computation hard; the authors present a method for local analysis which uses the graph structure to distribute the computational problem. They go on to provide a frequency theory analysis of the behaviour of the resulting Bayesian estimators in both the fixed dimension and growing dimension regimes.

I believe that this issue of The Canadian Journal of Statistics highlights the kinds of contributions statisticians are making to Big Data. We are showing people that statistical ideas remain relevant in the face of massive, complex, dynamic data. But we are also seeing that we need to make progress quickly to adapt those ideas to these new contexts before other, more ad hoc, techniques fully occupy the field.

Richard Lockhart, (2018) 'Special issue on Big Data and the Statistical Sciences: Guest Editor's Introduction', Canadian Journal of Statistics, 46(1), March 2018, doi:10.1002/cjs.11350