Mixture modelling concerns modelling a statistical distribution by a mixture (or weighted sum) of other distributions. Mixture modelling is also known as "unsupervised concept learning" in Artificial Intelligence, "latent class analysis" in Psychology, "intrinsic classification" in Philosophy, and "classification", "clustering" or "numerical taxonomy" in other contexts. All of these applications involve sampling from a heterogeneous population. The variable of interest follows a different distribution in each component population but it is not possible to observe directly which component a sampled individual comes from.
In its purest form, a mixture problem involves making inferences about the weights and the component distributions given only a sample from the mixture. The closer the component distributions are to each other, the more difficult this will be. We are trying to infer a joint distribution (of the variable of interest and the unobserved component label) from one of its margins, so we should expect problems of identifiability and computational instability, and any solution may be heavily model-dependent.
The situation is much better if there is even a little additional information, and that is what these two examples are about. The challenge is to find an appropriate model and a good way to fit it and make statements about the quantities of interest, whether by maximum likelihood (with your choice of direct search, quasi-Newton, EM algorithm or some other optimization method), nonparametric methods or Bayesian analysis. Anything I can think of doing would be easy to program in Splus; you don't need to buy custom mixtures software or program in C++.
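To illustrate the EM option mentioned above, here is a minimal sketch in Python for a two-component normal mixture fitted to ungrouped data. The function name em_two_normals, the simulated sample and the starting values are all assumptions made for illustration. Neither data set in this case study has this simple form, so treat it as a picture of the algorithm rather than a recommended analysis.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_normals(xs, mu1, mu2, sigma1, sigma2, weight1=0.5, n_iter=100):
    """EM for weight1 * N(mu1, sigma1^2) + (1 - weight1) * N(mu2, sigma2^2)."""
    n = len(xs)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation is from component 1.
        resp = []
        for x in xs:
            a = weight1 * normal_pdf(x, mu1, sigma1)
            b = (1.0 - weight1) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b))
        # M-step: weighted proportions, means and standard deviations.
        r1 = sum(resp)
        r2 = n - r1
        weight1 = r1 / n
        mu1 = sum(r * x for r, x in zip(resp, xs)) / r1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / r2
        sigma1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / r1)
        sigma2 = math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / r2)
    return weight1, mu1, mu2, sigma1, sigma2


# Simulated sample (an assumption, not case-study data):
# 30% from N(0, 1) and 70% from N(4, 1).
random.seed(1)
sample = [
    random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(4.0, 1.0)
    for _ in range(2000)
]
fit = em_two_normals(sample, mu1=-1.0, mu2=5.0, sigma1=1.5, sigma2=1.5)
print(fit)
```

With components this well separated the iteration settles quickly. Move the two means closer together and you will see the slow convergence and sensitivity to starting values described above.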
Data - Nerve Fibre Diameter
Ronald H. Stead, PhD, FRCPath, President and Scientific Director, Holburn Biomedical Corporation, and Associate Professor of Pathology and Molecular Medicine, McMaster University, has studied the effect of nematode infection on nerve fibres in the gut.
Infection with the nematode Nippostrongylus brasiliensis is known to cause degranulation of intestinal mucosal mast cells and subsequent mast cell hyperplasia, as part of the inflammatory response to this organism. Dr Stead had shown that mast cells in the gut mucosa are innervated, and further noted that, during the phase of mast cell activation following Nippostrongylus brasiliensis infection, the ultrastructure of the nerves was consistent with degeneration. He therefore hypothesized that nerve regeneration should accompany the ensuing mast cell hyperplasia and wished to test this hypothesis by looking at mucosal nerve fibre cross-sectional area measurements. This was in accordance with earlier work, showing that regenerating axons have smaller diameters than established nerve fibres. Pieces of affected bowel from rats infected with the nematode were studied histologically, using an immunocytochemical method to localize a nerve-related protein called GAP-43 (or B-50). Microscopical image analysis was then performed using a Quantimet, which allowed measurement of the cross-sectional areas of the nerve profiles (in square microns), as well as the density of nerve profiles in the mucosa. The outcome of this study was that there was significant evidence to support nerve degeneration during the acute phase of the inflammatory response to the nematode, followed by subsequent re-innervation, and an eventual 30% increase in mucosal nerve fibres several weeks after infection. Details of this study can be found in J. Neurosci., 1991;11:3809.
The distribution of fibre diameter was recorded at 0, 10 and 14 days following infection. You can get the data as the Worksheet "nerve gut" in the Excel 97/98 Workbook mixtures.xls (14 K), or as a text file. The data have been binned or grouped; the column "Nerve fibre diameter" gives right cut-points for the bins. The final bin is open-ended so the cut-point is shown as "*". It appears that there are small (regenerating) fibres, large (degenerate) fibres and fibres of intermediate size, and that the mean sizes of regenerating and degenerate fibres remain constant while the relative proportions change in response to the infection.
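Because the data are grouped, a likelihood has to be built from bin probabilities rather than densities. The following Python sketch shows one way to do that for right cut-points with an open-ended final bin, and how samples from several days could share component parameters while having their own mixing weights. The function names, the choice of lognormal components, the assumption that the first bin starts at zero, and the toy numbers are all mine for illustration; none of them come from the worksheet.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal whose logarithm has mean mu and standard deviation sigma."""
    if x <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def mixture_cdf(x, weights, mus, sigmas):
    """CDF of a weighted sum of lognormal components."""
    return sum(w * lognormal_cdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))


def bin_probabilities(right_cuts, weights, mus, sigmas):
    """Probability of each bin.

    right_cuts holds the finite right cut-points in increasing order.
    The first bin is assumed to run from 0 to right_cuts[0], and one
    open-ended bin is added above the last cut-point (the "*" bin).
    """
    cdf = [mixture_cdf(c, weights, mus, sigmas) for c in right_cuts]
    probs = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]
    probs.append(1.0 - cdf[-1])
    return probs


def grouped_loglik(counts, right_cuts, weights, mus, sigmas):
    """Multinomial log-likelihood of binned counts, up to an additive constant.

    Raises ValueError if an occupied bin has probability zero.
    """
    probs = bin_probabilities(right_cuts, weights, mus, sigmas)
    return sum(n * math.log(p) for n, p in zip(counts, probs) if n > 0)


def joint_loglik(counts_by_day, right_cuts, weights_by_day, mus, sigmas):
    """Sum over days: shared component parameters, separate weights per day."""
    return sum(
        grouped_loglik(counts, right_cuts, weights, mus, sigmas)
        for counts, weights in zip(counts_by_day, weights_by_day)
    )


# Made-up toy inputs (NOT the nerve fibre data): three cut-points, so four bins.
cuts = [1.0, 2.0, 4.0]
mus, sigmas = [0.0, 1.0], [0.5, 0.5]
counts_by_day = [[30, 40, 20, 10], [10, 30, 40, 20]]
weights_by_day = [[0.6, 0.4], [0.3, 0.7]]
print(joint_loglik(counts_by_day, cuts, weights_by_day, mus, sigmas))
```

A function like joint_loglik could be handed to any general-purpose optimizer. Swapping lognormal_cdf for another component CDF changes the model without touching the rest.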
Data - Yellowtail Length and Age
Consent notification: The raw data are not to be included in any table or appendix of any scientific journal but can appear, in summary form only, with the expressed written permission from the Department of Fisheries and Oceans.
Yellowtail flounder (Pleuronectes ferrugineus) is a right-eyed, small-mouthed flounder of the Family Pleuronectidae. It is distributed in the western North Atlantic from Chesapeake Bay to the coast of Labrador. In the United States, it is sometimes called rusty dab. It gets its name from the unique yellow colour found along the ventral caudal fin and the margins of the two long fins on the blind (white) side. Yellowtail flounder live on sand or sandy mud and feed primarily on polychaete worms and crustaceans, such as amphipods. They are found at depths of 37 to 91 m.
Spawning occurs in late spring or early summer, but may last into September. Yellowtail flounder are not known to migrate to spawn and they spawn near the bottom. This flounder produces a large number of small eggs and larvae, which are pelagic for the first few months of life.
Yellowtail flounder is a commercial species on the Grand Bank (Northwest Atlantic Fisheries Organization, Divisions 3LNO), and there has been a fishery for this flounder since the 1960s. The fishery is mainly by otter trawl. The fish start showing up in the catches at age 4 and are fully recruited to the fishery at age 7. Males reach first maturity at age 5 and females at age 6. Yellowtail flounder are a relatively short-lived species, reaching a maximum age of 13 years, in comparison to some other flounders whose ages have been recorded in the 20s.
The length frequency data of yellowtail flounder were obtained during the 1987 annual fall multi-species bottom trawl survey, carried out by the Department of Fisheries and Oceans. The gear used in these surveys is a small-mesh shrimp trawl capable of sampling all age classes. These surveys covered the entire Grand Bank each year.
The 1987 data can be found in the Worksheet "yellowtail 1987" in the Excel 97/98 Workbook mixtures.xls (14 K), or as a text file. The sample included 9936 fish. Each fish was measured for length and the data were binned into length classes. The column "length class" gives the right cut-point of each bin in cm. Some fish were subsampled from each length class and aged by counting rings on their otoliths.
Karen would like to know the mean (and standard deviation) of length at each age and the relative proportion of each age group in the population. Some measure of the accuracy of the estimates would be useful. She would also like to know if the aging is valid, at least to be able to say whether or not the length data and age-at-length data are consistent with each other.
Using only the length data, could you fit, say, a mixture of normal components and estimate the number of age groups in the sample?
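One possible way to compare fits with different numbers of normal components is an information criterion such as BIC. This is my suggestion for illustration, not a method prescribed by this case study, and the function names are hypothetical. Bear in mind that the usual large-sample theory for likelihood ratio tests does not hold when testing the number of components in a mixture, so any such criterion is only a rough guide.

```python
import math


def n_free_parameters(k):
    """A k-component normal mixture has k means, k standard deviations
    and k - 1 free mixing weights."""
    return 3 * k - 1


def bic(loglik, k, n):
    """Bayesian information criterion; smaller values are preferred.

    loglik is the maximized log-likelihood of the k-component fit
    and n is the number of fish in the sample.
    """
    return -2.0 * loglik + n_free_parameters(k) * math.log(n)
```

You would fit the mixture for k = 1, 2, 3 and so on, evaluate bic(loglik, k, 9936) for each fit, and see whether the preferred k agrees with the number of age groups found by otolith reading.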
Since we have the distribution of length and the distribution of age-at-length, is it possible to find a nonparametric estimate of the distribution of length-at-age and hence estimate the mean length at each age without making any distributional assumptions?
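One simple answer is the age-length key calculation familiar from fisheries work: apply the age proportions observed in each length class to the full count in that class, then regroup by age. The Python sketch below does this. The function name, the use of one representative length per class (for example a bin midpoint), and the toy numbers are assumptions for illustration. It also assumes the aged fish are a random subsample within each length class, and it gives no measure of accuracy.

```python
import math


def length_at_age(length_counts, age_counts_by_length, lengths):
    """Estimate proportion, mean length and SD of length for each age.

    length_counts[i]        : number of fish measured in length class i
    age_counts_by_length[i] : dict mapping age -> number of aged fish in class i
    lengths[i]              : representative length for class i (an assumption)
    Length classes with no aged fish are skipped, which is a limitation.
    """
    estimated = {}  # age -> estimated number of fish in each length class
    for i, (n_i, aged) in enumerate(zip(length_counts, age_counts_by_length)):
        n_aged = sum(aged.values())
        if n_aged == 0:
            continue
        for age, count in aged.items():
            row = estimated.setdefault(age, [0.0] * len(length_counts))
            row[i] += n_i * count / n_aged  # N_l * P(age | length)

    grand_total = sum(sum(row) for row in estimated.values())
    summary = {}
    for age, row in sorted(estimated.items()):
        n_age = sum(row)
        if n_age == 0:
            continue
        mean = sum(n * x for n, x in zip(row, lengths)) / n_age
        var = sum(n * (x - mean) ** 2 for n, x in zip(row, lengths)) / n_age
        summary[age] = (n_age / grand_total, mean, math.sqrt(var))
    return summary


# Made-up toy inputs (NOT the yellowtail data): three length classes, two ages.
toy_counts = [10, 20, 10]
toy_aged = [{1: 2}, {1: 1, 2: 1}, {2: 3}]
toy_lengths = [10.0, 20.0, 30.0]
for age, (prop, mean, sd) in length_at_age(toy_counts, toy_aged, toy_lengths).items():
    print(age, prop, mean, sd)
```

Comparing these estimates with the components found from the length data alone is one way to check whether the two sources of information are consistent.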
Feel free to ask me (Peter Macdonald) any questions on mixtures analysis and I will try to help or direct you to an appropriate expert.
There is an extensive collection of internet resources for mixtures on David Dowe's mixture modelling page.
A comprehensive historical bibliography of work up to about 1985 can be found in
Titterington, D.M., A.F.M. Smith and U.E. Makov (1985). Statistical Analysis of Finite Mixture Distributions, Wiley, New York. x+243 pp.
For an introduction to length-frequency analysis and fish aging methodology, see
Summerfelt, R.C. and G.E. Hall [eds.] (1987). Age and Growth of Fish. Iowa State University Press, Ames.
My MIX web site has a number of examples of fitting finite mixtures of normal, lognormal, gamma and Weibull distributions, and a short bibliography which needs to be brought up to date. I don't recommend using MIX for this Case Study because it can't do the analyses that I think are the most interesting. For example, I used MIX to analyse the nerve fibre diameters as mixtures of lognormal distributions in
Macdonald, P.D.M. (1991). Practical aspects of fitting mixture distributions to data, 1991 Proceedings of the Biopharmaceutical Section, American Statistical Association, 78-83.
but as I explained in that paper, the samples at all three times need to be analysed together and MIX won't do that. Also, I can't justify using a lognormal distribution for the intermediate fibres.