Case Study #2: Can Gene Expression Data Identify Patients with Inflammatory Bowel Disease?


Data Source: 

Global gene expression data, IBD candidate genes


Pingzhao Hu


Background: Inflammatory bowel disease (IBD), which comprises two disease entities, Crohn’s disease (CD) and ulcerative colitis (UC), is an incurable gastrointestinal illness that results in chronic inflammation. IBD greatly affects patients’ quality of life. Approximately 1.5 million people have IBD in the United States and Canada, where the rates are among the highest in the world. There are currently no biomarkers for IBD. Such biomarkers could help to identify better treatments and individualize patient care, and could also facilitate the development of clinical trials of new medications. Recently, genome-wide association studies (GWAS) have significantly advanced our understanding of the importance of genetic susceptibility in IBD. Studies have identified a total of 201 IBD loci (Liu et al. 2015). However, these loci have yielded only a handful of candidate genes, which often have small contributory effects in IBD.


Research Question: 


The aim of this case study is to construct classifiers for IBD using global gene expression data based on these candidate genes. The research questions are:

  1. Can data features (i.e., variables, probesets, or genes) be used to cluster individuals into three biological groups (i.e., healthy individuals, CD patients, UC patients)?
  2. Can data features (i.e., variables, probesets, or genes) predict the disease state of individuals from three biological groups (i.e., healthy individuals, CD patients, UC patients)?
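
As a rough illustration of how the two questions differ in practice, the sketch below runs an unsupervised k-means clustering (question 1) and a leave-one-out cross-validated nearest-centroid classifier (question 2) on a tiny invented dataset. The data values, the group labels ("Normal", "CD", "UC"), the function names, and the choice of methods are all assumptions made for illustration; the case study does not prescribe any particular algorithm.

```python
from collections import defaultdict
from math import dist
from statistics import fmean


def centroid(rows):
    """Feature-wise mean of a list of equal-length expression vectors."""
    return [fmean(col) for col in zip(*rows)]


def kmeans(X, k, n_iter=20):
    """Plain k-means (question 1). Naive seeding: the first k samples."""
    centers = [list(x) for x in X[:k]]
    assign = [0] * len(X)
    for _ in range(n_iter):
        # Assign every sample to its nearest current centroid.
        assign = [min(range(k), key=lambda j: dist(x, centers[j])) for x in X]
        for j in range(k):
            members = [x for x, a in zip(X, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centers[j] = centroid(members)
    return assign


def loocv_nearest_centroid(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (question 2)."""
    correct = 0
    for i, (x_test, y_test) in enumerate(zip(X, y)):
        by_class = defaultdict(list)
        for j, (x, label) in enumerate(zip(X, y)):
            if j != i:  # the held-out sample never influences the centroids
                by_class[label].append(x)
        centers = {label: centroid(rows) for label, rows in by_class.items()}
        predicted = min(centers, key=lambda label: dist(x_test, centers[label]))
        correct += predicted == y_test
    return correct / len(X)


# Invented toy data: 2 features, 3 individuals per group (not real expression values).
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["Normal"] * 3 + ["CD"] * 3 + ["UC"] * 3

clusters = kmeans(X, k=3)                 # cluster numbers are arbitrary
accuracy = loocv_nearest_centroid(X, y)   # fraction of held-out samples predicted correctly
```

Clustering ignores the group labels and is judged afterwards by how well the clusters line up with the known groups, whereas classification uses the labels for training and is judged on samples held out from training. Any feature selection step would need to be repeated inside each cross-validation fold for the accuracy estimate to remain honest.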



See below for a description of the study variables.

Data Sources:

Global gene expression data: Burczynski et al. (2006) generated genome-wide gene expression profiles for 41 healthy individuals (the original study included 42, but the processed data include only 41), 59 CD patients, and 26 UC patients using the Affymetrix HG-U133A human GeneChip array. The GeneChip includes approximately 22,000 probesets (a gene may be represented by multiple probesets). The expression level of each probeset in each individual was quantified using MAS 5.0 software (we downloaded the processed data from ArrayExpress: E-GEOD-3365).


IBD candidate genes: IBD candidate genes implicated in the 201 IBD-associated loci were evaluated using the GRAIL (Gene Relationships across Implicated Loci) and DAPPLE (Disease Association Protein-Protein Link Evaluator) software tools. A total of 225 unique genes (see Supplementary Table 9 of Liu et al. 2015) were identified, and 185 of these 225 genes are represented on the Affymetrix HG-U133A human GeneChip array. These 185 candidate genes are represented by 309 probesets.



Data Access: 


Two data files will be used for this case study:

IBDMatchedGenes (Sheet 2): The first column (Probe.Set.ID) contains the names of the 309 probesets (i.e., features). The second column (Gene.Symbol) contains the gene symbols for the 185 unique genes. Some genes are represented by two or more probesets. The analysis can be performed at either the probeset or gene level.
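
For a gene-level analysis, the probesets that map to the same gene have to be combined into one value per individual. The sketch below averages them, which is only one possible choice (taking the maximum or the most variable probeset are common alternatives). The probeset names, gene symbols, expression values, and the `collapse_to_genes` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical annotation in the spirit of IBDMatchedGenes: Probe.Set.ID -> Gene.Symbol.
probe_to_gene = {"p1_at": "GENE_A", "p2_at": "GENE_A", "p3_at": "GENE_B"}

# Hypothetical expression values: probeset -> one value per individual.
probe_expr = {
    "p1_at": [10.0, 20.0],
    "p2_at": [30.0, 40.0],
    "p3_at": [5.0, 7.0],
}


def collapse_to_genes(probe_expr, probe_to_gene):
    """Average all probesets of a gene, individual by individual."""
    grouped = defaultdict(list)
    for probe, values in probe_expr.items():
        grouped[probe_to_gene[probe]].append(values)
    # zip(*rows) walks the individuals; fmean averages across probesets.
    return {gene: [fmean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


gene_expr = collapse_to_genes(probe_expr, probe_to_gene)
```

With the real files, this step would reduce the 309 probeset rows to 185 gene rows.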

IBDGeneExpression (Sheet 1): In this dataset, the rows correspond to probesets and the columns represent the 126 individuals. The first column contains the probeset names and the first row contains the individual IDs. The biological group of each of the 126 individuals is encoded in the individual's ID.
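
A minimal sketch of preparing the data for analysis follows. It assumes the two sheets have been exported to CSV and that each individual ID begins with a group label followed by an underscore (e.g. "CD_1"). That ID format, the probeset names, and the values are invented for illustration, so the parsing line should be adjusted to the IDs actually found in the file.

```python
import csv
import io

# Invented stand-in for a CSV export of IBDGeneExpression (Sheet 1).
expression_csv = """Probe.Set.ID,Normal_1,CD_1,UC_1
p1_at,10.5,20.1,15.2
p2_at,3.3,4.4,5.5
p3_at,7.0,8.0,9.0
"""

# Stand-in for the 309 candidate probesets listed in IBDMatchedGenes (Sheet 2).
candidate_probesets = {"p1_at", "p3_at"}

reader = csv.reader(io.StringIO(expression_csv))
header = next(reader)
sample_ids = header[1:]  # first column holds probeset names

# ASSUMPTION: the group label is the part of the ID before the first underscore.
groups = [sample_id.split("_")[0] for sample_id in sample_ids]

# Keep only the candidate probesets: probeset -> values per individual.
expr = {row[0]: [float(v) for v in row[1:]]
        for row in reader if row[0] in candidate_probesets}

# Transpose to the usual layout: one row per individual, one column per feature.
probes = sorted(expr)
X = [[expr[p][i] for p in probes] for i in range(len(sample_ids))]
```

After this step, `X` holds one feature vector per individual and `groups` holds the matching group labels, which is the layout most clustering and classification routines expect.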


Data Files: 



Liu JZ, van Sommeren S, Huang H, et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat Genet. 47(9):979-86 (2015).

Burczynski ME, Peterson RL, Twine NC, Zuberek KA, et al. Molecular classification of Crohn's disease and ulcerative colitis patients using transcriptional profiles in peripheral blood mononuclear cells. J Mol Diagn. 8(1):51-61 (2006).

Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA. 99(10):6562-6 (2002).

Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 99(2):147-57 (2007).