Proteomic Biomarkers for Disease Status


Data Source: 

Rob Balshaw of Syreon Corporation


Alison Gibbs, Department of Statistics, University of Toronto


A biomarker consists of one or more biological parameters associated with the presence and severity of specific disease states. Biomarkers are detectable and measurable by a variety of methods including physical examination, laboratory assays, and medical imaging.

Proteomics is the large-scale study of proteins. In this case study, we want to learn whether we can correctly identify the disease state of a patient using a proteomic biomarker. That is, can an invasive and expensive medical test be replaced by a blood test?

We have collected several hundred blood samples from patients with a chronic medical condition which has two states: active and inactive. You can imagine the disease to be analogous to multiple sclerosis or some recurrent forms of cancer. Patients are known to have a very serious condition, but it is usually in a reasonably mild state with occasional flare-ups. Patients in the active state require urgent treatment with quite aggressive medications with potentially serious side effects (e.g., serious viral and bacterial infections, kidney or liver damage, and possibly even cancer).

Unfortunately, the current method for determining when these flare-ups are occurring requires a pathologist to review a carefully processed sample taken from the patient’s internal organs using a large needle. This evaluation method is both invasive and expensive, leading to discomfort for the patients and increased burden on our health care system.

Our Data

Our dataset includes 11 samples from active patients, 21 samples from inactive patients and an additional 15 unidentified samples (we have not indicated whether they are from active or inactive patients). All samples may be treated as independent (i.e., taken from independent patients). Though our dataset has a ratio of approximately 2 inactive patients to 1 active patient, approximately 10% to 30% of real patients are active.
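Because the class balance among the labelled samples (11 active out of 32, about 34%) differs from the 10% to 30% expected among real patients, any predicted probability of active disease from a model fitted to these data would need rescaling before it is interpreted for a real patient. The sketch below shows one standard correction, a prior-shift adjustment of the odds. The function name and the example probability of 0.5 are illustrative assumptions and not part of the case study.

```python
def adjust_for_prevalence(p_train, train_prev, target_prev):
    """Rescale a predicted probability from the training prevalence
    to a different target prevalence (prior-shift correction).

    Assumes 0 < p_train < 1 and both prevalences strictly between 0 and 1.
    """
    # Convert each probability to odds.
    odds = p_train / (1.0 - p_train)
    train_odds = train_prev / (1.0 - train_prev)
    target_odds = target_prev / (1.0 - target_prev)
    # Swap the training prior odds for the target prior odds.
    adjusted_odds = odds * target_odds / train_odds
    return adjusted_odds / (1.0 + adjusted_odds)


# Labelled samples in this dataset: 11 active, 21 inactive.
train_prev = 11 / (11 + 21)

# Hypothetical model output of 0.5, re-expressed at the two ends
# of the 10% to 30% range quoted above.
for target in (0.10, 0.30):
    print(target, round(adjust_for_prevalence(0.5, train_prev, target), 3))
```

The adjustment leaves a probability unchanged when the training and target prevalences agree, and pulls it downward when active disease is rarer in practice than in the sample.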

Protein Level Determination

The abundance of a protein in a blood sample is measured by its quantity relative to the quantity of the corresponding protein in a reference sample. A multiplex proteomic technology, called iTRAQ, was used to measure the relative protein abundances.

The reference samples were taken from a homogeneous batch of blood pooled from samples from 16 healthy volunteers. At the time we processed our samples, four iTRAQ reagents were available, allowing us to process three experimental samples from patients and one reference sample in each run. Each run detects and measures up to several hundred proteins, and the iTRAQ data are expressed as ratios between the experimental sample and the reference sample for each protein identified in that run. Since the same reference sample is used in all runs, this method gives a measure of relative abundance that is comparable across all experimental runs.

Here is a brief description of the sample preparation, which may provide some insight into the data. Blood samples were taken from the different subjects. Plasma was obtained from each whole blood sample through centrifugation, separated into aliquots, and stored until the proteomic analysis. Proteins in the plasma have abundance levels that range over approximately 6 orders of magnitude, but the iTRAQ machine has only a 100-fold dynamic range. Hence, our plasma samples were depleted of the 14 most abundant plasma proteins to reduce the dynamic range and enhance our ability to quantify the more interesting but less abundant proteins. After this depletion process, the remaining proteins in each sample were digested (i.e., chopped into protein fragments called peptides) and the peptides were labelled with one of the four iTRAQ reagents (i.e., chemical tags with distinct molecular weights but otherwise identical chemical properties) to identify the sample from which they came. Labelled samples were then pooled and processed using MALDI TOF/TOF technology. Peptide identification and quantitation were carried out by ProteinPilot™ Software v2.0 and assembled into a comprehensive summary of the proteins in the sample. Relative protein levels were estimated for each identified protein using a summary of the corresponding peptide levels.

The software is not perfect at determining the protein level information from the peptide level information. The process will sometimes fail to identify a protein in a sample even though the protein is present. When this happens, the protein’s abundance level will be missing. If the sample where the protein is not identified happens to be the reference sample, then relative levels for this protein cannot be estimated for any of the three experimental samples in that run.
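The consequence of a missing reference measurement can be made concrete with a small sketch. The function below is hypothetical (it is not how ProteinPilot computes ratios) and assumes each run supplies one abundance value per channel for a given protein, with `None` marking a failed identification.

```python
def relative_abundances(experimental, reference):
    """Return experimental/reference ratios for one protein in one run.

    `experimental` holds the three patient-sample values and `reference`
    the pooled reference value; None marks a protein that was not
    identified in that channel.
    """
    if reference is None:
        # No denominator: every ratio in this run is missing.
        return [None] * len(experimental)
    # A missing experimental value only affects its own ratio.
    return [None if x is None else x / reference for x in experimental]


# Made-up abundance values, for illustration only.
print(relative_abundances([120.0, None, 95.0], 100.0))
print(relative_abundances([120.0, 140.0, 95.0], None))
```

In the first call only the second sample loses its ratio; in the second call the failed reference removes all three.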

Interpreting Our Relative Abundance Data

Suppose the data for two active patients showed relative abundance levels of 1.2 and 1.4 for protein X (i.e., these samples had 20% and 40% more of protein X than the reference sample), and suppose values of 1.1 and 1.3 were observed for samples from two patients who were not active (i.e., they had 10% and 30% more of protein X than the reference sample). The group means are 1.3 and 1.2, which would suggest that patients with active disease had protein X levels approximately 8% higher than patients who are not currently active (1.3/1.2 ≈ 1.08).
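The arithmetic behind this comparison can be checked directly. The sketch below uses only the four illustrative values from the paragraph above; summarising each group by its arithmetic mean is an assumption made for simplicity (ratios are often averaged on a log scale instead).

```python
from statistics import mean

active = [1.2, 1.4]      # relative abundances, active patients
inactive = [1.1, 1.3]    # relative abundances, inactive patients

mean_active = mean(active)        # average level versus the reference
mean_inactive = mean(inactive)

# Fold difference between the two groups.
fold = mean_active / mean_inactive
percent_higher = 100 * (fold - 1)

print(f"active mean   = {mean_active:.2f}")
print(f"inactive mean = {mean_inactive:.2f}")
print(f"active is {percent_higher:.1f}% higher than inactive")
```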

Missing Values

NA has been used to indicate samples where the relative protein abundance is not available. Note that this may or may not indicate low relative abundance values. In each case, the failure to detect may reflect an identification problem rather than an “absolute abundance” value below some (unknown) limit of quantification.
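Before any modelling, it can help to tabulate how much of each protein's data is missing. The sketch below assumes the data have already been read into a list of per-sample dictionaries with `None` in place of NA. The 50% cut-off and the toy values are arbitrary illustrations, not recommendations from the study.

```python
def missing_fraction(samples, protein):
    """Fraction of samples in which `protein` has no value."""
    missing = sum(1 for s in samples if s.get(protein) is None)
    return missing / len(samples)


def proteins_to_keep(samples, proteins, max_missing=0.5):
    """Keep proteins whose missing fraction is at most `max_missing`."""
    return [p for p in proteins
            if missing_fraction(samples, p) <= max_missing]


# Toy data: three samples, two proteins (values are made up).
samples = [
    {"BPG0001": 1.10, "BPG0002": None},
    {"BPG0001": 0.95, "BPG0002": None},
    {"BPG0001": None, "BPG0002": 1.30},
]
print(proteins_to_keep(samples, ["BPG0001", "BPG0002"]))
```

Since NA does not necessarily mean low abundance, replacing missing values with a small number would build in an assumption the data may not support; filtering or explicit missing-data methods avoid that.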

Obfuscation of the Data

As this dataset is from a larger study with industrial partners, we are required to respect certain intellectual property considerations. We have obfuscated the dataset by assigning arbitrary protein identifiers (BPG0001 through BPG0460) and by including synthetic samples created to have properties similar to the real samples.

We hope to be able to discuss additional context in the spring of 2009 and at the SSC meeting, but for now, we’ll even have to keep quiet about the source of the data, including the actual disease.

Research Question: 

Our long term goal is to replace the current invasive and expensive evaluation method with a biomarker based on protein levels measured in a simple blood sample. We are interested in methods to classify a new sample as coming from an active or inactive patient.
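As one possible starting point, and not the method used by the study team, the sketch below evaluates a nearest-centroid classifier by leave-one-out cross-validation and reports sensitivity and specificity for the active class. The function names and the toy values for two hypothetical proteins are assumptions made purely for illustration, and the code expects complete cases (no missing values).

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        # Squared Euclidean distance from x to this class centroid.
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out(xs, ys):
    """Hold out each sample in turn and predict it from the rest."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        preds.append(nearest_centroid_predict(train_x, train_y, xs[i]))
    return preds


def sensitivity_specificity(ys, preds, positive="active"):
    """Sensitivity and specificity, treating `positive` as the target class."""
    pairs = list(zip(ys, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up log-ratios for two hypothetical proteins.
xs = [[0.30, 0.10], [0.25, 0.05], [0.35, 0.20],
      [-0.10, 0.00], [0.00, -0.05], [-0.05, 0.10], [0.05, 0.02]]
ys = ["active"] * 3 + ["inactive"] * 4

preds = leave_one_out(xs, ys)
sens, spec = sensitivity_specificity(ys, preds)
print(preds)
print(sens, spec)
```

With 460 candidate proteins and only 32 labelled samples, any protein selection step would need to be repeated inside each cross-validation fold; selecting proteins on the full dataset first would make the estimated error rates optimistic.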



Data file in CSV format

  • Observation
  • Sex
  • Race
  • Disease Status
  • Age
  • Relative Abundance for 460 Proteins (BPG0001 through BPG0460)
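A minimal sketch of reading a file with this layout, using only the Python standard library, is given below. The exact header names, the coding of Disease Status, and the way unidentified samples are marked are assumptions, since the file itself is not reproduced here; the in-memory demo rows are made up.

```python
import csv
import io


def load_samples(handle):
    """Read the case-study CSV into a list of dicts.

    Protein columns (names starting with "BPG") become floats, with
    "NA" or an empty cell converted to None. Other columns stay as text.
    """
    samples = []
    for row in csv.DictReader(handle):
        sample = {}
        for name, value in row.items():
            if name.startswith("BPG"):
                sample[name] = None if value in ("NA", "") else float(value)
            else:
                sample[name] = value
        samples.append(sample)
    return samples


# Tiny stand-in for the real file; header names and status codes are assumed.
demo = io.StringIO(
    "Observation,Sex,Race,Disease Status,Age,BPG0001,BPG0002\n"
    "1,F,A,Active,34,1.20,NA\n"
    "2,M,B,Inactive,51,NA,0.85\n"
    "3,F,A,NA,47,1.05,1.10\n"
)
samples = load_samples(demo)
labelled = [s for s in samples
            if s["Disease Status"] in ("Active", "Inactive")]
print(len(samples), len(labelled))
```

For the real file, the `io.StringIO` stand-in would be replaced by an opened file handle, e.g. `open(path, newline="")`, where `path` is wherever the CSV has been saved.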


We have several publications in preparation and a list of references will be added as they become available.