A machine learning-based phenotype for long COVID in children: An EHR-based study from the RECOVER program
Lorman, V; Razzaghi, H; Song, X; et al., PLOS ONE
Published
August 2023
Journal
PLOS ONE
Abstract
As clinical understanding of pediatric Post-Acute Sequelae of SARS-CoV-2 (PASC) develops, and hence the clinical definition evolves, it is desirable to have a method to reliably identify patients who are likely to have PASC in health systems data. In this study, we developed and validated a machine learning algorithm to classify which patients have PASC (distinguishing between Multisystem Inflammatory Syndrome in Children (MIS-C) and non-MIS-C variants) from a cohort of patients with positive SARS-CoV-2 test results in pediatric health systems within the PEDSnet EHR network. Patient features included in the model were selected from conditions, procedures, performance of diagnostic testing, and medications using a tree-based scan statistic approach. We used an XGBoost model, with hyperparameters selected through cross-validated grid search, and model performance was assessed using 5-fold cross-validation. Model predictions and feature importance were evaluated using Shapley Additive exPlanation (SHAP) values. The model provides a tool for identifying patients with PASC and an approach to characterizing PASC using diagnosis, medication, laboratory, and procedure features in health systems data. Using appropriate threshold settings, the model can be used to identify PASC patients in health systems data at higher precision for inclusion in studies or at higher recall in screening for clinical trials, especially in settings where PASC diagnosis codes are used less frequently or less reliably. Analysis of how specific features contribute to the classification process may assist in gaining a better understanding of features that are associated with PASC diagnoses.
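The threshold trade-off mentioned in the abstract can be illustrated with a minimal sketch. The scores and labels below are hypothetical values invented for illustration, not study data, and `precision_recall_at` is a helper written for this example rather than part of the authors' pipeline. It shows how raising the cutoff on a classifier's score tends to favor precision, while lowering it favors recall.

```python
# Hypothetical model scores and true labels (1 = PASC diagnosis present).
# These values are made up for illustration and are NOT from the study.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]


def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when scores >= threshold are flagged positive."""
    # Keep the true labels of every patient the threshold flags.
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    true_positives = sum(flagged)
    total_positives = sum(labels)
    # Guard against division by zero when nothing is flagged or no positives exist.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# A stricter cutoff (study inclusion) versus a looser one (trial screening).
for threshold in (0.75, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy data, the stricter cutoff flags fewer patients but all of them are true cases, while the looser cutoff finds every true case at the cost of some false positives.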
Authors
Vitaly Lorman, Hanieh Razzaghi, Xing Song, Keith Morse, Levon Utidjian, Andrea J Allen, Suchitra Rao, Colin Rogerson, Tellen D Bennett, Hiroki Morizono, Daniel Eckrich, Ravi Jhaveri, Yungui Huang, Daksha Ranade, Nathan Pajor, Grace M Lee, Christopher B Forrest, L Charles Bailey
Keywords
Child; Humans; Post-Acute COVID-19 Syndrome; COVID-19/diagnosis; SARS-CoV-2; Disease Progression; Machine Learning; Phenotype
Short Summary
To understand Long COVID, researchers must be able to figure out which patients have it. Our understanding of Long COVID is evolving and it has been difficult to know who had Long COVID, especially in children. We need a reliable method to identify who might have Long COVID using existing health data.
The purpose of this study was to create and test a computer program, called an algorithm, to find out which children have Long COVID based on their electronic health records (EHRs). EHRs (digital medical charts that have health data like doctor visits, lab results, and other health history) are an important source of data for research studies on Long COVID. The algorithm looks at EHRs to find patterns in the diagnoses, prescribed medications, procedures, and lab tests children received after having COVID-19. These patterns can be described as a phenotype, or a set of measured or visible traits, that can tell us who had Long COVID.
Using the EHRs, the algorithm correctly identified 67% of the patients who had a Long COVID diagnosis. Among the patients the algorithm flagged as having Long COVID, 91% had a Long COVID diagnosis. Overall, the algorithm was correct about whether a patient had a Long COVID diagnosis 99% of the time. This means the phenotype can be used to recognize which children have Long COVID in EHR data for future studies, or to screen patients to participate in clinical trials.
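The three percentages above correspond to what are commonly called recall, precision, and accuracy. The sketch below shows how each is computed from counts of correct and incorrect classifications. The counts are hypothetical, chosen only so that the results round to the reported figures; they are not the study's actual numbers.

```python
# Hypothetical counts chosen only to reproduce the rounded percentages.
# They are NOT the study's actual confusion matrix.
true_pos = 67     # has a diagnosis, flagged by the algorithm
false_neg = 33    # has a diagnosis, missed by the algorithm
false_pos = 7     # no diagnosis, flagged by the algorithm
true_neg = 3893   # no diagnosis, not flagged

total = true_pos + false_neg + false_pos + true_neg

# Share of diagnosed patients the algorithm found.
recall = true_pos / (true_pos + false_neg)
# Share of flagged patients who really had a diagnosis.
precision = true_pos / (true_pos + false_pos)
# Share of all patients classified correctly.
accuracy = (true_pos + true_neg) / total

print(f"recall={recall:.0%} precision={precision:.0%} accuracy={accuracy:.0%}")
```

With counts like these, accuracy is high largely because most patients have no diagnosis and are correctly left unflagged, which is why recall and precision are usually the more informative figures when the condition is rare.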