Re-engineering a machine learning phenotype to adapt to the changing COVID-19 landscape: A machine learning modelling study from the N3C and RECOVER consortia

Crosskey, M; McIntee, T; Preiss, S; et al., The Lancet Digital Health, August 2025

Short Summary

In this RECOVER study, researchers wanted to update a smart computer program, called a machine learning pipeline, to better identify people with Long COVID. In 2021, the first version of the program, called LCM 1, was created to identify people with or likely to have Long COVID. LCM 1 depended on people having a COVID-19 diagnosis date in their electronic health records (EHR). This meant LCM 1 could miss people who may have taken a COVID-19 test at home. LCM 1 also did not look at information about whether people got COVID-19 more than once. To improve the program and create a new version called LCM 2, researchers used more than 5 million EHRs from a large set of data called the National COVID Cohort Collaborative (N3C). They taught the program to look at a person’s health information over many years, not just starting from their first recorded COVID-19 diagnosis. Researchers found that LCM 2 was very accurate. They used it to estimate that about 1 in 10 people in the database who had COVID-19 went on to develop Long COVID. This study is important because it shows that older machine learning models, like LCM 1, can be updated to keep up with the way an illness is tracked and diagnosed over time. This can help other researchers improve their machine learning models to produce more accurate findings.

This summary was prepared by the RECOVER Initiative.

Publication Details

DOI: 10.1016/j.landig.2025.100887

Abstract

Background: In 2021, we used the National COVID Cohort Collaborative (N3C) as part of the National Institutes of Health RECOVER Initiative to develop a machine learning pipeline to identify patients with a high probability of having post-acute sequelae of SARS-CoV-2 infection or long COVID. However, the increased home testing, missing documentation, and reinfections that characterise the pandemic beyond 2022 necessitated the re-engineering of our original model to account for these changes in the COVID-19 research landscape.

Methods: Trained on 72,745 patient records (36,238 with long COVID and 36,507 with no evidence of long COVID), our updated XGBoost model gathered data for each patient in overlapping 100-day periods that progressed through time and issued a probability of long COVID for each 100-day period. We ran the model on patients in N3C (n=5,875,065) who met at least one of the following criteria from Jan 1, 2020, to June 22, 2023: a U07·1 (COVID-19) diagnosis code; a positive SARS-CoV-2 test; a U09·9 (post-acute sequelae of SARS-CoV-2 infection) diagnosis code; a prescription for nirmatrelvir-ritonavir or remdesivir; or an M35·81 (multisystem inflammatory syndrome in children [MIS-C]) diagnosis code. Each patient was given a model score that predicted long COVID status for each 100-day window in which they were aged ≥18 years. If a patient had known acute COVID-19 during any 100-day window (including reinfections), we censored the data from 7 days before the diagnosis or positive test date to 28 days after. We ran the model on controls selected from pre-2020 data to assess the likelihood of false positives.
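The windowing and censoring steps described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the 100-day window length and the 7-day/28-day censoring bounds come from the abstract, while the 50-day stride, the inclusive boundaries, and all function names are assumptions made for the example.

```python
from datetime import date, timedelta

WINDOW_DAYS = 100    # window length stated in the abstract
STRIDE_DAYS = 50     # ASSUMPTION: the abstract says "overlapping" but gives no stride
CENSOR_BEFORE = 7    # days censored before an acute COVID-19 date (from the abstract)
CENSOR_AFTER = 28    # days censored after an acute COVID-19 date (from the abstract)


def make_windows(start, end, window_days=WINDOW_DAYS, stride_days=STRIDE_DAYS):
    """Return inclusive (window_start, window_end) pairs that step through time."""
    windows = []
    current = start
    while current + timedelta(days=window_days - 1) <= end:
        windows.append((current, current + timedelta(days=window_days - 1)))
        current += timedelta(days=stride_days)
    return windows


def is_censored(event_date, acute_dates):
    """True if event_date lies from 7 days before to 28 days after any acute date.

    ASSUMPTION: both boundaries are treated as inclusive.
    """
    for acute in acute_dates:
        lower = acute - timedelta(days=CENSOR_BEFORE)
        upper = acute + timedelta(days=CENSOR_AFTER)
        if lower <= event_date <= upper:
            return True
    return False


def events_for_window(events, window, acute_dates):
    """Keep (date, code) events inside the window and outside censored intervals."""
    window_start, window_end = window
    return [
        (event_date, code)
        for event_date, code in events
        if window_start <= event_date <= window_end
        and not is_censored(event_date, acute_dates)
    ]
```

In a sketch like this, each window's surviving events would be turned into features and scored separately, so a patient receives one long COVID probability per window without needing a single COVID-19 index date.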

Findings: The updated model had an area under the receiver operating characteristic curve of 0·90. Precision and recall could be adjusted according to a given use case, depending on whether greater sensitivity or specificity was warranted. Using our model, we estimate the overall prevalence of long COVID among the COVID-19 positive cohort within the N3C repository to be 10·4%.
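The trade-off between precision and recall mentioned in the Findings comes from choosing the score cutoff above which a window is labelled as long COVID. The sketch below shows the general idea only; the scores, labels, and thresholds are made up for illustration and are not study data, and the function name is hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true labels (1 = long COVID), for illustration only.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.9, 0.5, 0.2):
    precision, recall = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With these toy values, a high cutoff gives fewer false positives at the cost of missed cases, while a low cutoff finds every case but flags more people incorrectly. Which end is preferable depends on the use case, for example screening for trial recruitment versus estimating prevalence.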

Interpretation: By eschewing the COVID-19 index date as an anchor point for analysis, we can assess the probability of long COVID among patients who might have tested at home, who have suspected (but untested) cases of COVID-19, or who have had multiple SARS-CoV-2 reinfections. We view this exercise as a model for maintaining and updating any machine learning pipeline used for clinical research and operations.

Funding: National Institutes of Health RECOVER Initiative.

Authors

Miles Crosskey, Tomas McIntee, Sandy Preiss, Daniel Brannock, John M Baratta, Yun Jae Yoo, Emily Hadley, Frank Blanceró, Robert Chew, Johanna Loomba, Abhishek Bhatia, Christopher G Chute, Melissa Haendel, Richard Moffitt, Emily R Pfaff, N3C Consortium and the RECOVER EHR cohort
