CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences

(1) Background: Although whole genome sequencing (WGS) has enabled the comprehensive analyses of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb) traditionally detected through cytogenetic testing from germline SVs. (2) Metho...

Bibliographic Details
Main Authors: Tao Zhang, Paul Auer, Stephen R. Spellman, Jing Dong, Wael Saber, Yung-Tsi Bolon
Format: Article
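Language: English
Published: MDPI AG 2025-06-01
Series: Life
Subjects: structural variants; somatic cells; cytogenetic abnormality; transplant; whole genome sequencing; machine learning
Online Access: https://www.mdpi.com/2075-1729/15/6/929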
_version_ 1839653493025013760
author Tao Zhang
Paul Auer
Stephen R. Spellman
Jing Dong
Wael Saber
Yung-Tsi Bolon
collection DOAJ
description (1) Background: Although whole genome sequencing (WGS) has enabled comprehensive analysis of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb), traditionally detected through cytogenetic testing, from germline SVs. (2) Methods: A customized machine learning pipeline (CYTO-SV-ML), automated as a Snakemake workflow and equipped with a user interface, was developed to identify somatic cytogenetic SVs in WGS data. The tool was then applied to characterize structural variation profiles in the whole blood of patients with myelodysplastic syndromes (MDSs). Known SVs mapped from well-established open databases were split into training and validation subsets for an AUTO-ML machine learning model in the CYTO-SV-ML pipeline. (3) Results: In benchmarking of somatic cytogenetic SV classification, the CYTO-SV-ML pipeline achieved an area under the receiver operating characteristic curve (AUCROC) of 0.94 for translocations and 0.92 for non-translocations, a sensitivity of 0.83 for translocations and 0.85 for non-translocations, and a specificity of 0.96 for translocations and 0.82 for non-translocations. Our method (207 somatic cytogenetic SVs) outperformed a conventional SV calling pipeline (143 somatic cytogenetic SVs) in an independent validation against clinical cytogenetic records. In addition, the CYTO-SV-ML pipeline uncovered novel somatic cytogenetic SVs in 49 (89%) of 55 patients without successful clinical cytogenetic results. (4) Conclusions: Our study demonstrates the high performance of the CYTO-SV-ML machine learning approach in benchmarking SV classification from genomic sequencing data; further validation of novel anomalies by orthogonal methods will be essential to unlock its full clinical potential in cytogenetic diagnostics.
format Article
id doaj-art-a62f5b09c77f45d0afc7dcf6c9da8b35
institution Matheson Library
issn 2075-1729
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Life
spelling doaj-art-a62f5b09c77f45d0afc7dcf6c9da8b35
2025-06-25T14:05:44Z
eng
MDPI AG
Life
2075-1729
2025-06-01
Volume 15, Issue 6, Article 929
10.3390/life15060929
CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences
Tao Zhang (0): CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA
Paul Auer (1): Division of Biostatistics, Institute for Health and Equity, Medical College of Wisconsin, Milwaukee, WI 53226, USA
Stephen R. Spellman (2): CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA
Jing Dong (3): Medical College of Wisconsin Cancer Center, Milwaukee, WI 53226, USA
Wael Saber (4): CIBMTR® (Center for International Blood and Marrow Transplant Research), Medical College of Wisconsin, Milwaukee, WI 53226, USA
Yung-Tsi Bolon (5): CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA
https://www.mdpi.com/2075-1729/15/6/929
structural variants
somatic cells
cytogenetic abnormality
transplant
whole genome sequencing
machine learning
title CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences
topic structural variants
somatic cells
cytogenetic abnormality
transplant
whole genome sequencing
machine learning
url https://www.mdpi.com/2075-1729/15/6/929