CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences
(1) Background: Although whole genome sequencing (WGS) has enabled the comprehensive analyses of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb) traditionally detected through cytogenetic testing from germline SVs. (2) Metho...
Saved in:
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
MDPI AG
2025-06-01
|
Series: | Life |
Subjects: | |
Online Access: | https://www.mdpi.com/2075-1729/15/6/929 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | (1) Background: Although whole genome sequencing (WGS) has enabled the comprehensive analyses of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb) traditionally detected through cytogenetic testing from germline SVs. (2) Methods: A customized machine learning pipeline (CYTO-SV-ML) under Snakemake automation workflow was developed with a user interface to identify somatic cytogenetic SVs in WGS data. And this tool was applied for characterizing structural variation profiles in the whole blood of patients with myelodysplastic syndromes (MDSs). Known SVs mapped from well-established open databases were split into training and validation subsets for an AUTO-ML machine learning model in a CYTO-SV-ML pipeline. (3) Results: The benchmarking performance of the CYTO-SV-ML pipeline on somatic cytogenetic SV classification displayed an area under the receiver operating characteristic curve (AUCROC) of 0.94 for translocations and 0.92 for non-translocations, a sensitivity of 0.83 for translocations and 0.85 for non-translocations, and a specificity of 0.96 for translocations and 0.82 for non-translocations. Our method (207 somatic cytogenetic SVs) outperformed a conventional SV calling pipeline (143 somatic cytogenetic SVs) in an independent validation of clinical cytogenetic records. In addition, the CYTO-SV-ML pipeline uncovered novel somatic cytogenetic SVs in 49 (89%) of 55 patients without successful clinical cytogenetic results. (4) Conclusions: Our study demonstrates the high-performance machine learning approach of CYTO-SV-ML on benchmarking SV classification from genomic sequencing data, and further validations of novel anomalies by orthogonal methods will be essential to unlock its full clinical potential of cytogenetic diagnostics. |
---|---|
ISSN: | 2075-1729 |