A natural language processing pipeline for identifying pediatric Long COVID symptoms and functional impacts in freeform clinical notes: A RECOVER study

Bunnell, HT; Reedy, C; Lorman, V; et al., JAMIA Open, October 2025

Short Summary

In this RECOVER study, researchers wanted to find out whether natural language processing (NLP) could be used to identify Long COVID symptoms in children. NLP is a tool that can find details in the written notes of electronic health records (EHRs), beyond the diagnosis or billing codes that are usually examined (known as standard EHR data). Researchers used an NLP tool to look for 25 signs of Long COVID in children: 21 symptoms (like pain or extreme tiredness) and 4 types of daily life challenges (such as trouble with school). They compared children who had been diagnosed with Long COVID to those who had COVID-19 but did not develop Long COVID. The NLP tool analyzed more than 48,000 doctors’ notes from the EHRs of more than 10,000 children across 12 hospitals. Researchers found that the NLP tool identified almost all of the 25 signs much more often in the children who had Long COVID. The NLP tool also often identified patients’ symptoms that were not recognized when researchers looked only at standard EHR data. The study shows that using NLP to read EHR notes can help researchers understand the symptoms and daily challenges that children with Long COVID experience better than looking only at codes and medication lists can. This supports the idea that NLP should be used in scientific studies that need to identify children with Long COVID.

This summary was prepared by the RECOVER Initiative.

Publication Details

DOI: 10.1093/jamiaopen/ooaf089

Abstract

Objective: To develop a natural language processing (NLP) pipeline for unstructured electronic health record (EHR) data to identify symptoms and functional impacts associated with Long COVID in children.

Materials and methods: We analyzed 48,287 outpatient progress notes from 10,618 pediatric patients from 12 institutions. We evaluated notes obtained 28 to 179 days after a COVID-19 diagnosis or positive test. Two samples were examined: patients with evidence of Long COVID and patients with acute COVID but no evidence of Long COVID based on diagnostic codes. The pipeline identified clinical concepts associated with 21 symptoms and 4 functional impact categories. Subject matter experts (SMEs) screened a sample of 4,586 terms from the NLP output to assess pipeline accuracy. Prevalence and concordance of each of the 25 concepts were compared between the 2 patient samples.
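
The note-selection window described above (28 to 179 days after a COVID-19 diagnosis or positive test) can be illustrated with a short Python sketch. This is not the study's code: the function names, the tuple layout for notes, and the choice to treat both ends of the window as inclusive are assumptions made for illustration only.

```python
from datetime import date, timedelta

# Window bounds taken from the abstract; treating both ends as
# inclusive is an assumption, since the abstract does not say.
WINDOW_START_DAYS = 28
WINDOW_END_DAYS = 179


def in_followup_window(index_date: date, note_date: date) -> bool:
    """Return True if the note falls 28 to 179 days after the index date.

    index_date is the date of the COVID-19 diagnosis or positive test.
    """
    days_since_index = (note_date - index_date).days
    return WINDOW_START_DAYS <= days_since_index <= WINDOW_END_DAYS


def select_notes(index_date: date, notes: list[tuple[date, str]]) -> list[tuple[date, str]]:
    """Keep only the (note_date, text) pairs inside the follow-up window."""
    return [note for note in notes if in_followup_window(index_date, note[0])]


# Hypothetical example data, not drawn from the study.
index = date(2021, 1, 1)
example_notes = [
    (index + timedelta(days=14), "too early"),
    (index + timedelta(days=28), "first eligible day"),
    (index + timedelta(days=179), "last eligible day"),
    (index + timedelta(days=180), "too late"),
]
eligible = select_notes(index, example_notes)
```

In this example only the notes written on day 28 and day 179 remain in `eligible`.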

Results: A binary assertion measure comparing SME and NLP assertions showed moderate accuracy (N = 4,133; F1 = .80) and improved substantially when only high-confidence SME assertions were considered (N = 2,043; F1 = .90). Overall, the 25 Long COVID concept categories were markedly more prevalent in the presumptive Long COVID cohort, and differences were noted between concepts identified in notes versus structured data.
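
For readers unfamiliar with the F1 score reported above, the following Python sketch shows how it is conventionally computed from paired binary labels. The abstract does not describe how the binary assertion measure was derived, so treating the SME assertion as the reference and the NLP assertion as the prediction is an assumption, and the function name and example labels are hypothetical.

```python
def binary_f1(reference: list[int], predicted: list[int]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for paired binary labels.

    reference: 1 where the reviewer (assumed here to be the SME) asserts
               the concept is present, else 0.
    predicted: 1 where the pipeline asserts the concept is present, else 0.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r and p)
    false_pos = sum(1 for r, p in pairs if not r and p)
    false_neg = sum(1 for r, p in pairs if r and not p)

    # Guard against division by zero when a class is absent.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    # F1 is the harmonic mean of precision and recall,
    # equivalently 2*TP / (2*TP + FP + FN).
    denom = 2 * true_pos + false_pos + false_neg
    f1 = 2 * true_pos / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for six terms, not data from the study.
sme_labels = [1, 1, 1, 1, 0, 0]
nlp_labels = [1, 1, 1, 0, 1, 0]
precision, recall, f1 = binary_f1(sme_labels, nlp_labels)
```

With these six made-up labels there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 0.75.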

Discussion: This preliminary analysis illustrates the additional insight into a syndrome such as Long COVID gained from incorporating notes data, characterizing symptoms and functional impacts.

Conclusion: These data support the importance of incorporating NLP methodology, when possible, both to design computable phenotypes and to accurately characterize patients with Long COVID.

Authors

H Timothy Bunnell, Cara Reedy, Vitaly Lorman, Ravi Jhaveri, Andrea Rivera-Sepulveda, Katherine S Salamon, Payal B Patel, Keith E Morse, Mattina A Davenport, Lindsay G Cowell, Levon Utidjian, Dimitri A Christakis, Suchitra Rao, Marion R Sills, Abigail Case, Eneida A Mendonca, Bradley W Taylor, Jacqueline Rutter, Aaron Thomas Martinez, Rebecca Letts, L Charles Bailey, Christopher B Forrest, RECOVER Consortium

Keywords

NLP; PEDSnet; RECOVER; pediatrics