Context-Aware Gene Embedding Pipeline (CGEP): An Accessible, Regulatory-Region-Inclusive Embedding Framework for Interpretable Disease-Agnostic Prediction

Bibliographic Details
Main Authors: Twaha Ahmed Minai, Zubair Ahmed Shaikh, Asim Imdad Wagan, M. Kamran Azim, Syed Muhammad Muaz, Muhammad Shoaib Siddiqui
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11048558/
Description
Summary: In this study, we propose a novel Context-Aware Gene Embedding Pipeline (CGEP) designed to address critical limitations in genomic deep learning by integrating both functional and regulatory genomic contexts. Unlike conventional approaches, our framework prioritizes disease-associated genes identified through contribution analysis (validated via a breast cancer case study) while remaining extensible to any pathology for which patient-derived FASTA sequences are available. The pipeline processes each target gene together with its upstream and downstream regulatory regions (±2.5 kbp), capturing promoter/enhancer dynamics critical for disease mechanisms. To enable flexible downstream analysis, CGEP generates four distinct embeddings per gene, which may be used independently by task-specific models or fused for enhanced predictive power. By using publicly available reference genomes (GRCh37/GRCh38) as healthy baselines, our method lowers data procurement barriers and supports decentralized training paradigms that align with institutional data governance requirements. The implementation relies on lightweight feed-forward architectures, ensuring computational accessibility without sacrificing performance. Benchmarking against state-of-the-art models demonstrates competitive performance, with accuracy as high as 99.0%, an F1-score of 0.99, and an AUC-ROC of 0.99, together with superior Generalization Capacity (GC) and reduced Overfitting Likelihood (OL). To foster reproducibility, we provide open-source access to the entire pipeline, including modular scripts for data curation, embedding generation, and model training. This work bridges computational innovation with clinical pragmatism, enabling scalable and interpretable genomic analysis for precision medicine.
ISSN: 2169-3536
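
The summary describes processing each target gene alongside ±2.5 kbp of flanking sequence. As a minimal illustration of that windowing step only, the sketch below slices a gene and its two flanks out of a sequence string. It is not taken from the authors' released code: the function name `gene_window`, the 0-based half-open coordinates, the plus-strand orientation, and the clamping at sequence ends are all assumptions made for the example.

```python
FLANK_BP = 2500  # the ±2.5 kbp flank stated in the summary


def gene_window(chrom_seq: str, gene_start: int, gene_end: int,
                flank: int = FLANK_BP) -> tuple[str, str, str]:
    """Return (upstream, gene, downstream) slices of a sequence.

    Assumptions (hypothetical, not from the paper): coordinates are
    0-based and half-open, and "upstream" means lower coordinates,
    i.e. a plus-strand gene. Flanks are clamped at the sequence ends,
    so they can be shorter than `flank`.
    """
    if not (0 <= gene_start < gene_end <= len(chrom_seq)):
        raise ValueError("gene interval must lie within the sequence")
    up_start = max(0, gene_start - flank)
    down_end = min(len(chrom_seq), gene_end + flank)
    upstream = chrom_seq[up_start:gene_start]
    gene = chrom_seq[gene_start:gene_end]
    downstream = chrom_seq[gene_end:down_end]
    return upstream, gene, downstream


# Toy sequence: 3000 bp of "A", a 1000 bp "gene" of "G", 3000 bp of "T".
toy_seq = "A" * 3000 + "G" * 1000 + "T" * 3000
up, gene, down = gene_window(toy_seq, 3000, 4000)
print(len(up), len(gene), len(down))
```

For a gene on the minus strand the roles of the two flanks would swap, and a real pipeline would read sequences from FASTA files rather than from an in-memory string; both points are outside what the summary specifies.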