De-black-boxing health AI: Demonstrating reproducible machine learning computable phenotypes using the N3C-RECOVER Long COVID model in the All of Us data repository

Pfaff, ER; Girvin, AT; Crosskey, M; et al., Journal of the American Medical Informatics Association

Published

June 2023

Journal

Journal of the American Medical Informatics Association

Abstract

Machine learning (ML)-driven computable phenotypes are among the most challenging to share and reproduce. Despite this difficulty, the urgent public health considerations around Long COVID make it especially important to ensure the rigor and reproducibility of Long COVID phenotyping algorithms such that they can be made available to a broad audience of researchers. As part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, researchers with the National COVID Cohort Collaborative (N3C) devised and trained an ML-based phenotype to identify patients highly probable to have Long COVID. Supported by RECOVER, N3C and NIH's All of Us study partnered to reproduce the output of N3C's trained model in the All of Us data enclave, demonstrating model extensibility in multiple environments. This case study in ML-based phenotype reuse illustrates how open-source software best practices and cross-site collaboration can de-black-box phenotyping algorithms, prevent unnecessary rework, and promote open science in informatics.

Authors

Emily R Pfaff, Andrew T Girvin, Miles Crosskey, Srushti Gangireddy, Hiral Master, Wei-Qi Wei, V Eric Kerchberger, Mark Weiner, Paul A Harris, Melissa Basford, Chris Lunt, Christopher G Chute, Richard A Moffitt, Melissa Haendel; N3C and RECOVER Consortia

Keywords

SARS-CoV-2; electronic health records; machine learning; phenotype

Short Summary

A quick way for scientists to identify patterns in a large set of data is by teaching computers to find those patterns for them. To do this, the scientist creates a set of instructions for a computer to follow, called an algorithm, to locate exactly what the scientist is looking for. When that algorithm is plugged into a software program, a computer can run the algorithm many times and learn from it. This process improves the algorithm, making it more accurate over time. This is called machine learning. Machine learning can be very helpful and accurate, but it is also challenging to share between researchers because each computer can learn things on its own that can be difficult to recreate. When a computer arrives at an answer that a scientist cannot recreate, the algorithm is known as a black-box algorithm.

For one research team to share its findings with another, the second team must be able to recreate the steps of the algorithm, or “de-black-box” it. Researchers from the National COVID Cohort Collaborative (N3C) did this as part of the National Institutes of Health (NIH) RECOVER Initiative. The N3C team first created and trained a machine learning-based algorithm to identify a phenotype (a set of measured or visible traits) for patients who were at a higher risk of developing Long COVID. Then, with RECOVER’s support, the N3C researchers worked with researchers from another NIH study called All of Us. Together, they were able to re-create the machine learning-based algorithm and arrive at the same phenotype. This means that they were able to “de-black-box” their algorithm. This case can be used as a guide to best practices when sharing algorithms between research teams. This way, algorithms can be used by many research teams to better understand Long COVID.
