Purpose: An early genetic diagnosis can guide the time-sensitive treatment of individuals with genetic epilepsies. However, most genetic diagnoses occur long after disease onset. We aimed to identify early clinical features suggestive of genetic diagnoses in individuals with epilepsy through large-scale analysis of full-text electronic medical records (EMRs). Methods: We extracted 89 million time-stamped standardized clinical annotations using Natural Language Processing from 4,572,783 clinical notes from 32 112 individuals with childhood epilepsy, including 1925 individuals with known or presumed genetic epilepsies. We applied these features to train random forest models to predict SCN1A-related disorders and any genetic diagnosis. Results: We identified 47 774 age-dependent associations of clinical features with genetic etiologies a median of 3.6 years prior to molecular diagnosis. Across all 710 genetic etiologies identified in our cohort, neurodevelopmental differences between 6 and 9 months increased the likelihood of a later molecular diagnosis fivefold (P < .0001, 95% CI = 3.55-7.42). A later diagnosis of SCN1A-related disorders (AUC = 0.91) or an overall positive genetic diagnosis (AUC = 0.82) could be reliably predicted using random forest models. Conclusion: Clinical features predictive of genetic epilepsies precede molecular diagnoses by up to several years in conditions with known precision treatments. An earlier diagnosis facilitated by automated EMR analysis has the potential for earlier targeted therapeutic strategies in the genetic epilepsies.
Commentary on the above:
Li Y. Predicting Pediatric Genetic Epilepsy Through Electronic Medical Records: A Data-Driven Biomarker Discovery Approach. Epilepsy Currents. 2024;0(0). doi:10.1177/15357597241290322
Aiming to identify key clinical features linked to genetic epilepsy syndromes and to predict genetic diagnoses, Galer et al extracted clinical notes from the electronic medical record (EMR) system of the Children's Hospital of Philadelphia Care Network for 32 112 individuals diagnosed with childhood epilepsy between 2010 and 2022, including 1925 individuals with known or presumed genetic epilepsies. A customized natural language processing (NLP) pipeline extracted clinical features in the form of Unified Medical Language System (UMLS) codes. These features were then mapped onto the Human Phenotype Ontology and segmented into 3-month age bins for analysis. A conservative framework restricted the analysis to clinical notes written before an individual's genetic diagnosis, with additional analysis using cumulative time binning. For validation, the authors collected phenotype data from individuals with SCN1A-related epilepsy disorders and control groups in two different cohorts, analyzed the most significant neurological phenotypes associated with SCN1A-related epilepsy, and used random forest models for prediction. Causative genetic etiologies were found in 38% of individuals with known or presumed genetic epilepsy, involving 271 unique genes, 87 of which occurred in two or more individuals. The median time from the first neurological abnormality to genetic diagnosis was 1.4 years in their cohort. The earliest clinical feature associated with a genetic diagnosis occurred a median of 3.62 years before the median age of genetic diagnosis. The authors also identified broad clinical features that predict a positive genetic diagnosis independent of molecular etiology, including muscular hypotonia between 1 and 1.25 years, neurodevelopmental abnormality between 6 and 9 months, and neurodevelopmental delay between 6 and 9 months.
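The age-binned association analysis described above can be illustrated with a minimal sketch. It assumes annotations arrive as (patient, term, age in months) tuples, and it uses an odds ratio with a Wald 95% confidence interval as a stand-in statistic; the function names, the toy data, and the choice of test are all assumptions for illustration and are not taken from the study.

```python
import math

BIN_MONTHS = 3  # bin width reported in the commentary


def age_bin_years(age_months):
    """Return the (start, end) of the 3-month bin, in years, containing an age."""
    start = int(age_months // BIN_MONTHS) * BIN_MONTHS
    return start / 12, (start + BIN_MONTHS) / 12


def feature_table(annotations, genetic_ids, all_ids, term, bin_start_months):
    """Build 2x2 counts (a, b, c, d) for one term in one age bin.

    a: genetic, term present      b: non-genetic, term present
    c: genetic, term absent       d: non-genetic, term absent
    """
    bin_end = bin_start_months + BIN_MONTHS
    with_term = {
        pid for pid, t, age in annotations
        if t == term and bin_start_months <= age < bin_end
    }
    a = len(with_term & genetic_ids)
    b = len(with_term - genetic_ids)
    c = len(genetic_ids - with_term)
    d = len(all_ids - genetic_ids - with_term)
    return a, b, c, d


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval (assumed statistic)."""
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction avoids division by zero
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high


# Hypothetical toy data: (patient id, phenotype term, age in months)
annotations = [
    ("p1", "muscular hypotonia", 13.0),
    ("p2", "muscular hypotonia", 12.5),
    ("p3", "muscular hypotonia", 14.0),
    ("p4", "seizure", 13.0),
]
genetic_ids = {"p1", "p2", "p5"}
all_ids = {"p1", "p2", "p3", "p4", "p5", "p6"}

table = feature_table(annotations, genetic_ids, all_ids, "muscular hypotonia", 12)
print(age_bin_years(13.0), table, odds_ratio_ci(*table))
```

With only six toy patients the interval is far too wide to be informative; the sketch is meant to show the shape of the computation, repeated per term and per age bin, rather than any real result.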
The study offers valuable insights into the clinical applicability of predictive models for genetic diagnoses in epilepsy. The use of NLP allows data to be extracted from real-world observations, facilitating the mapping of clinical phenotype trajectories in genetic epilepsies over time. This not only tracks the natural history over time but also enables the identification of novel pathognomonic clinical features. Such an approach is especially beneficial for rare genetic disorders, where it can surface details that may previously have been overlooked. Additionally, the study highlights the promising combination of NLP with machine learning models to identify significant clinical phenotypes. This integrated approach may aid in predicting genetic diagnoses at an earlier age, offering potential for the application of precision medicine in epilepsy care.
Early diagnosis and optimized treatment are fundamental objectives of medical care, intended to improve patient outcomes and enhance overall healthcare cost-effectiveness. Large-scale modeling of EMR trajectories has been developed for various common medical conditions such as sepsis, heart failure, and cancer, among others. These models leverage recent advances in large language models and deep learning technologies to drive forward the field of precision medicine. While early diagnosis of genetic epilepsies is crucial for timely treatment, the practicality of such an approach needs further research. It is anticipated that clinical features, alone or in combination, could be used to identify patients highly likely to have a genetic cause, prompting genetic testing for confirmation or even consideration of empirical treatment when genetic testing is not an option in resource-limited settings. However, the potential for false positive identifications within EMR systems based on this proposed algorithm needs ongoing evaluation. Additionally, the cost implications and overall cost-effectiveness of these approaches warrant further investigation.
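The concern about false positives can be made concrete with a short sketch of how positive predictive value depends on prevalence. The prevalence below is taken from the cohort counts quoted in this commentary (1925 of 32 112 individuals); the sensitivity and specificity pairs are hypothetical placeholders, since no operating points are reported here, and the function names are illustrative only.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Fraction of flagged patients who truly have the condition (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_positives_per_1000(specificity, prevalence):
    """Expected number of false alerts per 1000 patients screened."""
    return 1000.0 * (1.0 - specificity) * (1.0 - prevalence)


# Known or presumed genetic epilepsies in the cohort described above
COHORT_PREVALENCE = 1925 / 32112

# Hypothetical operating points (assumptions, not reported values)
for sens, spec in [(0.80, 0.90), (0.80, 0.99)]:
    ppv = positive_predictive_value(sens, spec, COHORT_PREVALENCE)
    fp = false_positives_per_1000(spec, COHORT_PREVALENCE)
    print(f"sens={sens:.2f} spec={spec:.2f} PPV={ppv:.2f} FP/1000={fp:.0f}")
```

Because the condition is uncommon within the screened population, even a modest loss of specificity sharply lowers the share of alerts that are true positives, which is why false positive rates would need to be tracked once such a model is deployed.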
Furthermore, applying these discoveries beyond SCN1A syndromes and integrating them into clinical practice may encounter limits in generalizability resulting from data heterogeneity or data insufficiency, a common challenge for prediction models based on EMR trajectories. Algorithms trained on notes from pediatric epilepsy patients within a tertiary center network may rely on specific terms or syndrome labels that differ from the documentation practices of general neurologists or adult neurologists. These discrepancies could stem from less detailed history taking by general providers or from inadequate data due to recall bias among patients and their guardians, ultimately leading to underdocumented clinical symptoms. For instance, muscular hypotonia between 1 and 1.25 years of age, identified in this study as an independent clinical biomarker for genetic epilepsy diagnosis, might not be consistently recorded in adult neurology practice, where families must recall events from the remote past. Additionally, adult-onset genetic epilepsies exhibit distinct genetic mutations and clinical features that can differ notably from those observed in childhood epilepsy. Therefore, extending the training and application of similar methodologies to more diverse populations could offer valuable insights for the field.