Clustering for Cardiovascular Risk Prediction in Type 2 Diabetes Using Routinely Reported ECG and EHR Data

2026

Data Source

Dataset: Single CSV file (tabular format)

Organizer

Dr. Dina Labib, Dr. James White; Libin Cardiovascular Institute and Nelson PULSE Centre of the Cumming School of Medicine, University of Calgary


Background

Diabetes is a growing global health problem, affecting approximately 830 million individuals worldwide in 2022 (1), with type 2 diabetes accounting for >90% of all cases (2). Diabetes is a major cardiovascular risk factor: adults with diabetes have a two- to four-fold higher cardiovascular risk than those without diabetes. Despite advances in care, existing risk prediction models based on demographic and clinical variables remain limited in their ability to accurately identify patients at highest cardiovascular risk.

Routine clinical care generates extensive data, including laboratory tests, diagnostic codes, medication records, and electrocardiographic (ECG) variables. However, these data sources are rarely integrated for disease characterization or risk stratification in type 2 diabetes. A recently developed risk score for these patients, based on a traditional Cox model predicting a composite outcome of non-fatal acute myocardial infarction, non-fatal stroke, and all-cause mortality, achieved modest performance, with a C-index of 0.74 (3). This model used 30 routinely collected variables from claims data, including demographics, cardiovascular risk factors, prior events, and medications.
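To make the C-index quoted above concrete, here is a minimal sketch of Harrell's concordance index using only the Python standard library. The function name, the toy data, and the tie-handling choices are assumptions made for illustration; they are not taken from the cited study or from the challenge dataset.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    A pair is comparable when the patient with the shorter observed time
    actually had the event. Ties in risk score count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # this sketch skips tied times and pairs censored first
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy data, invented for illustration: follow-up (years), event flag, predicted risk.
times = [1.0, 2.5, 3.0, 4.0, 5.0]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Established survival-analysis libraries provide tested implementations that would normally be preferred over a hand-written loop.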

Clustering, an unsupervised machine learning technique, can uncover hidden structure within complex datasets and has shown promise in identifying phenotypic subgroups across cardiovascular and metabolic disorders. Such approaches may reveal clinically meaningful distinctions that traditional modeling methods overlook.

Few studies have applied clustering specifically to patients with type 2 diabetes, and those that have typically relied on relatively small cohorts, limited variable sets, or advanced biomarkers not routinely available in clinical care (4–6). Combining routinely captured electronic health record (EHR) data with standard 12-lead ECG variables offers a unique opportunity to identify novel patient subgroups that differ in cardiovascular risk and long-term outcomes. Insights from such analyses could enhance patient stratification, guide personalized management, and inform future prediction models for major adverse cardiovascular events (MACE).

Research Question

Challenging question
Using a large repository of synthetic patient health data—including EHR and routinely reported 12-lead ECG variables—from patients with type 2 diabetes, can you perform clustering analysis and demonstrate its value for accurate prediction of MACE in individual patients?

Notes:

Clustering should incorporate both EHR and ECG data.
A key requirement is to demonstrate clinical utility by developing a method to assign cluster membership for new (unseen) patients.
Submissions must show the added value of cluster membership for predicting MACE.
For MACE prediction, both classification and survival-based approaches are welcome.
Model performance should be evaluated for short-term (1 year), intermediate-term (3 years), and long-term (5 years) horizons using appropriate metrics.
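One possible way to meet the requirement of assigning cluster membership to unseen patients is sketched below: standardize features using statistics from the training cohort only, fit k-means, and then assign a new patient to the nearest learned centroid. This is a standard-library sketch under stated assumptions, not a prescribed method. The three features (age, QRS duration in ms, HbA1c), the toy values, and the deterministic farthest-point initialization are all invented for illustration; in practice a library implementation such as scikit-learn's KMeans would likely be more appropriate.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Column means and standard deviations from the TRAINING rows only."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(row, means, sds):
    return [(x - m) / s for x, m, s in zip(row, means, sds)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(row, centroids):
    """Index of the centroid closest to `row` (squared Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(row, centroids[k]))


def kmeans(rows, k, n_iter=50):
    # Deterministic farthest-point initialization (a simplification chosen for this sketch).
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids)))
    for _ in range(n_iter):
        labels = [nearest(r, centroids) for r in rows]
        updated = []
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([mean(col) for col in zip(*members)] if members else centroids[c])
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


def assign_cluster(raw_row, means, sds, centroids):
    """Assign an unseen patient using the stored training scaler and centroids."""
    return nearest(scale(raw_row, means, sds), centroids)


# Hypothetical training rows: [age, qrs_duration_ms, hba1c]
train = [[52, 88, 6.8], [55, 92, 7.0], [50, 90, 6.5],
         [78, 128, 9.1], [81, 134, 9.4], [76, 130, 8.8]]
means, sds = fit_scaler(train)
centroids = kmeans([scale(r, means, sds) for r in train], k=2)
train_labels = [assign_cluster(r, means, sds, centroids) for r in train]

new_patient_a = assign_cluster([54, 91, 6.9], means, sds, centroids)
new_patient_b = assign_cluster([79, 131, 9.0], means, sds, centroids)
```

The key design point is that the scaler and centroids are estimated once on training data and then reused unchanged, so that cluster assignment for a new patient never depends on held-out data.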

Variables

Data source and access
Dataset: Single CSV file (tabular format)

Study cohort: A synthetic cohort of approximately 100,000 patients with a diagnosis of type 2 diabetes (ICD-10-CA coded) who had a baseline ECG performed between January 2010 and January 2023, with a minimum follow-up of 12 months.
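Because follow-up varies across patients, a classification label at the 3- or 5-year horizon is not defined for patients who were censored event-free before that horizon. The sketch below shows one simple way to derive horizon-specific labels from an event flag and a follow-up time. The function name and the toy values are hypothetical. Excluding censored patients, as done here, is only one option and may introduce bias; survival-based methods handle censoring directly.

```python
def horizon_label(event, time_years, horizon_years):
    """Binary MACE label at a fixed horizon.

    Returns 1 if MACE occurred within the horizon, 0 if the patient is known
    to be event-free through the horizon, and None if the patient was
    censored before the horizon (status at the horizon is unknown).
    """
    if event and time_years <= horizon_years:
        return 1
    if time_years >= horizon_years:
        return 0  # event-free at the horizon, or event occurred after it
    return None


# Toy (event, follow-up in years) pairs, invented for illustration.
patients = [(1, 0.5), (0, 1.2), (1, 2.0), (0, 4.0), (1, 4.5), (0, 6.0)]

labels_by_horizon = {
    h: [horizon_label(e, t, h) for e, t in patients] for h in (1, 3, 5)
}
```

In this toy example every patient has a defined 1-year label, while one patient lacks a 3-year label and two lack a 5-year label.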

Outcome of interest: MACE, defined as a composite of heart failure hospitalization, acute coronary syndrome, ventricular arrhythmias, ischemic stroke, and all-cause mortality.
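As an illustration of how a composite outcome of this kind is typically constructed, the sketch below takes the earliest observed component event as the MACE time and otherwise treats the patient as censored at the end of follow-up. The component keys and the function name are hypothetical and do not come from the data dictionary; the dataset may already supply the composite outcome directly.

```python
# Hypothetical component names; the real variable names are defined in the data dictionary.
MACE_COMPONENTS = (
    "hf_hospitalization",
    "acute_coronary_syndrome",
    "ventricular_arrhythmia",
    "ischemic_stroke",
    "death",
)


def composite_mace(component_times, followup_years):
    """Return (event, time_years) for the composite outcome.

    `component_times` maps a component name to years from the baseline ECG
    to the first such event, or None / missing if it never occurred.
    """
    observed = []
    for name in MACE_COMPONENTS:
        t = component_times.get(name)
        if t is not None and t <= followup_years:
            observed.append(t)
    if observed:
        return 1, min(observed)  # first component event defines MACE
    return 0, followup_years  # censored at end of follow-up


example_event = composite_mace({"acute_coronary_syndrome": 2.0, "death": 3.5}, 5.0)
example_censored = composite_mace({}, 4.0)
```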

Features: Core demographic variables (age and sex); ICD-10-CA-coded baseline comorbidities and cardiac history; procedural codes for prior cardiac interventions; routinely reported ECG variables; laboratory test results captured around the time of ECG; and active cardiac medications prescribed at the time of the baseline ECG.

Note: Raw vector (waveform) data from the ECGs are not being made available for this challenge.

Data dictionary: Please see this link for a full list of variables and definitions.

Data access: A non-disclosure agreement will be signed by all participating teams, followed by granting access to the dataset hosted in a secure password-protected online environment. The dataset will be made available on January 15, 2026.

References

1. NCD Risk Factor Collaboration (NCD-RisC). Worldwide trends in diabetes prevalence and treatment from 1990 to 2022: a pooled analysis of 1108 population-representative studies with 141 million participants. Lancet. 2024 Nov 23;404(10467):2077–93.
2. Green A, Hede SM, Patterson CC, Wild SH, Imperatore G, Roglic G, et al. Type 1 diabetes in 2017: global estimates of incident and prevalent cases in children and adults. Diabetologia. 2021 Dec;64(12):2741–50.
3. McCoy RG, Swarna KS, Deng Y, Herrin JS, Ross JS, Kent DM, et al. Derivation of an Annualized Claims-Based Major Adverse Cardiovascular Event Estimator in Type 2 Diabetes. JACC: Advances. 2024 Apr;3(4):100852.
4. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol. 2018 May;6(5):361–9.
5. Kahkoska AR, Geybels MS, Klein KR, Kreiner FF, Marx N, Nauck MA, et al. Validation of distinct type 2 diabetes clusters and their association with diabetes complications in the DEVOTE, LEADER and SUSTAIN-6 cardiovascular outcomes trials. Diabetes Obes Metab. 2020 Sep 18;22(9):1537–47.
6. Preechasuk L, Khaedon N, Lapinee V, Tangjittipokin W, Srivanichakorn W, Sriwijitkamol A, et al. Cluster analysis of Thai patients with newly diagnosed type 2 diabetes mellitus to predict disease progression and treatment outcomes: a prospective cohort study. BMJ Open Diabetes Res Care. 2022 Dec;10(6).

Grading points
Your case study report and poster must include:

The research question(s) you sought to address with your analysis.
A discussion on the impact of your assumptions and parameters and the limitations of these types of models.
At least one visualization.
A summary of the key takeaways from your analysis.

The case study competition will be evaluated as follows:

Creative visualizations of the data (23%)
Appropriateness, creativity, and understanding of the strengths and limitations of the model proposed (50%)
Model performance as assessed by appropriate performance metrics (5%)
Quality and clarity of presentation (22%)

Award information
We are pleased to announce that the winning team will receive an award of $3,000. In addition to the financial award, there may be potential for research opportunities/collaborations for the successful team members.

Acknowledgment
This case study was prepared by Dr. James White, Dr. Dina Labib, and Ms Jacqueline Flewitt, with help and guidance from the Case Study Committee of the Statistical Society of Canada. Financial and infrastructure support was provided by the Libin Precision Medicine Initiative, a program enabled by the Nelson PULSE Centre of the Cumming School of Medicine, University of Calgary. Concerns and questions can be directed to the chair of the Case Study Committee of the Statistical Society of Canada, Dr. Chel Hee Lee, by email at chelhee.lee@ucalgary.ca.