Small Area Estimation
May 23, 9:00 am – 12:00 pm, 1:30 – 4:00 pm
Pascal Ardilly, Institut national de la statistique et des études économiques (INSEE), France
When survey sample data are used to estimate parameters defined on small populations (unemployment rate, proportion of the poor, average income, etc.), classical methods yield estimates of poor quality. This is a direct consequence of the small size of the sample that falls within these populations (called domains), which can be small areas, such as population aggregates, or sub-populations defined by crossing sufficiently detailed socio-demographic criteria (for example, working women under 30 years old with 2 children). To improve the precision of the estimates, it is therefore necessary to use auxiliary information obtained from comprehensive sources or, failing this, from very large survey samples. A collection of so-called small area estimation methods is available for this purpose; they can be organized by the way in which auxiliary information is exploited, and are summarized as follows.
A first approach involves an adjustment on a set of auxiliary variables whose values are known for the group of individuals in a small domain. By weighting the sampled units so as to recover the exact structures known at the level of the small domain, we can substantially reduce the sampling error without requiring any particular hypothesis on the behaviour of individuals.
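As an illustration of this first approach, the following is a minimal sketch of one simple form of adjustment, post-stratification within a small domain. It assumes that the population counts of a few categories are known exactly for the domain; the function names, the categories and the data are hypothetical and serve only to show the mechanics of the reweighting.

```python
from collections import defaultdict


def poststratify(domain_sample, known_counts):
    """Rescale weights so each category's weight total matches its known count.

    domain_sample: list of (category, weight, y) for the sampled units in the domain.
    known_counts:  dict mapping category -> known population count in the domain.
    """
    weight_totals = defaultdict(float)
    for category, weight, _ in domain_sample:
        weight_totals[category] += weight
    return [
        (category, weight * known_counts[category] / weight_totals[category], y)
        for category, weight, y in domain_sample
    ]


def weighted_mean(units):
    """Weighted mean of y over a list of (category, weight, y) units."""
    total_weight = sum(weight for _, weight, _ in units)
    return sum(weight * y for _, weight, y in units) / total_weight


# Hypothetical domain sample: y = 1 if the person is unemployed, 0 otherwise.
domain_sample = [
    ("under_30", 10.0, 1.0),
    ("under_30", 10.0, 0.0),
    ("30_plus", 10.0, 0.0),
]
# Assumed known structure of the small domain (e.g. from a census).
known_counts = {"under_30": 30, "30_plus": 70}

adjusted = poststratify(domain_sample, known_counts)
print("unadjusted mean:", weighted_mean(domain_sample))
print("adjusted mean:  ", weighted_mean(adjusted))
```

After the adjustment, the weights in each category sum to the known count for the domain, so the sample reproduces the known structure exactly. No hypothesis on individual behaviour is involved; only the weights change.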
A second class of methods relies on hypotheses of a descriptive nature focused on certain components of the parameter to be estimated. These hypotheses equate a local average behaviour (in the small domain) with an overall average behaviour (in the complete population). For example, to estimate an average in a small domain, we decompose the entire population into ad hoc sub-populations and we suppose that, for each of these sub-populations, the true average restricted to the small domain is equal to the true average over the entire population. We can also make hypotheses of a similar nature on regression coefficients rather than on averages; that is, we consider that a relationship between variables in the small domain is identical to that in the entire population. This type of hypothesis allows the calculation of a local estimate that draws on the entire set of sampled units, which stabilizes the estimate and therefore reduces the overall sampling error, admittedly at the cost of a bias whose size depends on the suitability of the hypotheses.
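The averages example above can be sketched as follows. This is a minimal illustration, assuming that the population counts of each sub-population within the small domain are known; the function name, the group labels and the data are hypothetical. Each group average is estimated from the whole sample, then the group averages are combined using the domain's own group counts.

```python
def synthetic_mean(full_sample, domain_counts):
    """Synthetic estimate of a domain mean.

    full_sample:   list of (group, weight, y) for ALL sampled units,
                   inside and outside the small domain.
    domain_counts: dict mapping group -> known population count of that
                   group inside the small domain.
    """
    weighted_sums = {}
    weight_totals = {}
    for group, weight, y in full_sample:
        weighted_sums[group] = weighted_sums.get(group, 0.0) + weight * y
        weight_totals[group] = weight_totals.get(group, 0.0) + weight

    # Hypothesis: within each group, the domain mean equals the overall mean.
    group_means = {g: weighted_sums[g] / weight_totals[g] for g in weighted_sums}

    domain_size = sum(domain_counts.values())
    return sum(domain_counts[g] * group_means[g] for g in domain_counts) / domain_size


# Hypothetical data: two groups, equal weights, y is an income-like variable.
full_sample = [("a", 1.0, 10.0), ("a", 1.0, 20.0), ("b", 1.0, 30.0), ("b", 1.0, 50.0)]
domain_counts = {"a": 80, "b": 20}

print("synthetic estimate:", synthetic_mean(full_sample, domain_counts))
```

The estimate is stable because every sampled unit contributes to the group averages, but it is biased whenever the domain's group averages differ from those of the whole population.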
A third approach, certainly the most common and the most varied, depends on a stochastic modelling of behaviours, the modelled unit being either the individual of the population or the parameter of interest defined on the small domain. There exist numerous, more or less competing, models to produce estimates (linear mixed models, generalized linear mixed models, Bayesian techniques, …), but in all cases the underlying principle is the following: starting from auxiliary information with strong explanatory power, available for everyone in the population, we estimate the parameters of the model using the entire sample, and afterwards we calculate a local estimate that depends on these parameters. In this way, the local estimates gain a great deal of stability, since they integrate the set of sampled units both inside and outside the small domain. Obviously, the appropriateness of this approach depends on the validity of the model, but in theory we have quality indicators with which to judge it.
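To make the principle concrete, here is a minimal sketch for one particular case, an area-level linear mixed model of the Fay-Herriot type, where the modelled unit is the parameter of the small domain. It assumes that the model-based prediction for the domain and both variance components are already known, whereas in practice they are estimated from the entire sample. The function name and the numbers are hypothetical.

```python
def composite_estimate(direct, sampling_var, model_prediction, model_var):
    """Shrinkage estimate for one small domain under an area-level model.

    direct:           direct survey estimate for the domain.
    sampling_var:     sampling variance of the direct estimate.
    model_prediction: prediction from the auxiliary information (x' beta).
    model_var:        variance of the domain random effect (assumed known here).
    """
    # Weight given to the direct estimate: close to 1 when the direct
    # estimate is precise, close to 0 when it is very noisy.
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * model_prediction


# Hypothetical domain: noisy direct estimate, prediction from the model.
estimate = composite_estimate(
    direct=10.0, sampling_var=4.0, model_prediction=20.0, model_var=4.0
)
print("composite estimate:", estimate)
```

The smaller the sample in the domain, the larger the sampling variance and the more the estimate leans on the model prediction, which is where the stability comes from, and also why the validity of the model matters.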
The course will review these approaches through theory and examples, and will strive to detail their contributions and limitations.
The workshop will be presented in French with French slides.