Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2025-01-01 |
Series: | IEEE Access |
Subjects: | |
Online Access: | https://ieeexplore.ieee.org/document/11048558/ |
Summary: | In this study, we propose the Context-Aware Gene Embedding Pipeline (CGEP), a framework designed to address critical limitations in genomic deep learning by integrating both functional and regulatory genomic contexts. Unlike conventional approaches, our framework prioritizes disease-associated genes identified through contribution analysis (validated via a breast cancer case study) while remaining extensible to any pathology for which patient-derived FASTA sequences are available. The pipeline processes each target gene together with its upstream and downstream regulatory regions (±2.5 kbp), capturing promoter/enhancer dynamics critical for disease mechanisms. To enable flexible downstream analysis, CGEP generates four distinct embeddings per gene, which may be used independently by task-specific models or fused for enhanced predictive power. By leveraging publicly available reference genomes (GRCh37/GRCh38) as healthy baselines, our method lowers data procurement barriers and supports decentralized training paradigms that align with institutional data governance requirements. Implementation relies on lightweight feed-forward architectures, ensuring computational accessibility without sacrificing performance. Benchmarking against state-of-the-art models demonstrates competitive performance, with accuracy of 99.0%, F1-score of 0.99, and AUC-ROC of 0.99, together with superior Generalization Capacity (GC) and reduced Overfitting Likelihood (OL). To foster reproducibility, we provide open-source access to the entire pipeline, including modular scripts for data curation, embedding generation, and model training. This work bridges computational innovation with clinical pragmatism, enabling scalable and interpretable genomic analysis for precision medicine. |
---|---|
ISSN: | 2169-3536 |
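
The summary describes processing each target gene together with ±2.5 kbp of flanking sequence. A minimal sketch of that windowing step is shown below. It is an illustration only, not the authors' implementation: the function names `flanked_window` and `extract_region`, the 0-based half-open coordinate convention, and the clamping at chromosome ends are all assumptions made for this example.

```python
from typing import Tuple

# Flank size taken from the summary (±2.5 kbp); everything else here is assumed.
FLANK_BP = 2500


def flanked_window(gene_start: int, gene_end: int, chrom_length: int,
                   flank: int = FLANK_BP) -> Tuple[int, int]:
    """Return the 0-based, half-open [start, end) window covering the gene
    plus `flank` bases on each side, clamped to the chromosome bounds."""
    if not (0 <= gene_start < gene_end <= chrom_length):
        raise ValueError("gene coordinates must lie within the chromosome")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    start = max(0, gene_start - flank)           # do not run past the 5' end
    end = min(chrom_length, gene_end + flank)    # do not run past the 3' end
    return start, end


def extract_region(chrom_seq: str, gene_start: int, gene_end: int,
                   flank: int = FLANK_BP) -> str:
    """Slice the gene and its flanking regions out of a chromosome sequence
    (hypothetical helper; the sequence is assumed to be already loaded)."""
    start, end = flanked_window(gene_start, gene_end, len(chrom_seq), flank)
    return chrom_seq[start:end]
```

Under these assumptions, a gene at positions 10,000 to 12,000 yields a 7,000-base window, and a gene near a chromosome end yields a shorter window rather than an error. The same function could be applied to both a patient-derived sequence and the reference genome so that the two windows are comparable.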