CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences
(1) Background: Although whole genome sequencing (WGS) has enabled the comprehensive analyses of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb) traditionally detected through cytogenetic testing from germline SVs. (2) Metho...
Main Authors: | Tao Zhang, Paul Auer, Stephen R. Spellman, Jing Dong, Wael Saber, Yung-Tsi Bolon |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2025-06-01 |
Series: | Life |
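Subjects: | structural variants; somatic cells; cytogenetic abnormality; transplant; whole genome sequencing; machine learning |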
Online Access: | https://www.mdpi.com/2075-1729/15/6/929 |
_version_ | 1839653493025013760 |
---|---|
author | Tao Zhang; Paul Auer; Stephen R. Spellman; Jing Dong; Wael Saber; Yung-Tsi Bolon |
author_facet | Tao Zhang; Paul Auer; Stephen R. Spellman; Jing Dong; Wael Saber; Yung-Tsi Bolon |
author_sort | Tao Zhang |
collection | DOAJ |
description | (1) Background: Although whole genome sequencing (WGS) has enabled the comprehensive analysis of structural variants (SVs), more accurate and efficient methods are needed to distinguish large somatic SVs (SV size ≥ 1 Mb), traditionally detected through cytogenetic testing, from germline SVs. (2) Methods: A customized machine learning pipeline (CYTO-SV-ML), automated as a Snakemake workflow and provided with a user interface, was developed to identify somatic cytogenetic SVs in WGS data. The tool was applied to characterize structural variation profiles in the whole blood of patients with myelodysplastic syndromes (MDSs). Known SVs mapped from well-established open databases were split into training and validation subsets for an AUTO-ML machine learning model in the CYTO-SV-ML pipeline. (3) Results: In benchmarking, the CYTO-SV-ML pipeline classified somatic cytogenetic SVs with an area under the receiver operating characteristic curve (AUCROC) of 0.94 for translocations and 0.92 for non-translocations, a sensitivity of 0.83 for translocations and 0.85 for non-translocations, and a specificity of 0.96 for translocations and 0.82 for non-translocations. Our method (207 somatic cytogenetic SVs) outperformed a conventional SV calling pipeline (143 somatic cytogenetic SVs) in an independent validation against clinical cytogenetic records. In addition, the CYTO-SV-ML pipeline uncovered novel somatic cytogenetic SVs in 49 (89%) of 55 patients without successful clinical cytogenetic results. (4) Conclusions: Our study demonstrates the high performance of the CYTO-SV-ML machine learning approach in benchmark SV classification from genomic sequencing data. Further validation of novel anomalies by orthogonal methods will be essential to unlock its full clinical potential in cytogenetic diagnostics. |
format | Article |
id | doaj-art-a62f5b09c77f45d0afc7dcf6c9da8b35 |
institution | Matheson Library |
issn | 2075-1729 |
language | English |
publishDate | 2025-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Life |
spelling | doaj-art-a62f5b09c77f45d0afc7dcf6c9da8b35; 2025-06-25T14:05:44Z; eng; MDPI AG; Life; 2075-1729; 2025-06-01; volume 15, issue 6, article 929; 10.3390/life15060929; CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences; Tao Zhang (0), Paul Auer (1), Stephen R. Spellman (2), Jing Dong (3), Wael Saber (4), Yung-Tsi Bolon (5); (0) CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA; (1) Division of Biostatistics, Institute for Health and Equity, Medical College of Wisconsin, Milwaukee, WI 53226, USA; (2) CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA; (3) Medical College of Wisconsin Cancer Center, Milwaukee, WI 53226, USA; (4) CIBMTR® (Center for International Blood and Marrow Transplant Research), Medical College of Wisconsin, Milwaukee, WI 53226, USA; (5) CIBMTR® (Center for International Blood and Marrow Transplant Research), NMDP (National Marrow Donor Program), Minneapolis, MN 55401, USA; https://www.mdpi.com/2075-1729/15/6/929; structural variants; somatic cells; cytogenetic abnormality; transplant; whole genome sequencing; machine learning |
title | CYTO-SV-ML: A Machine Learning Tool for Cytogenetic Structural Variant Analysis in Somatic Cell Type Using Genome Sequences |
title_sort | cyto sv ml a machine learning tool for cytogenetic structural variant analysis in somatic cell type using genome sequences |
topic | structural variants; somatic cells; cytogenetic abnormality; transplant; whole genome sequencing; machine learning |
url | https://www.mdpi.com/2075-1729/15/6/929 |
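
The description above reports sensitivity, specificity, and AUCROC for the classifier. As a reading aid, the following minimal Python sketch shows how those three metrics are conventionally computed from labels and scores. It is not the CYTO-SV-ML implementation: the function names, the 0.5 threshold, and the toy labels and scores are all hypothetical and chosen only for illustration. The one figure taken from the record is the 49 of 55 patients, which rounds to the stated 89%.

```python
from typing import Sequence


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of actual negatives that were rejected."""
    return tn / (tn + fp)


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (the rank-based definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = somatic, 0 = germline); NOT from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

# Hypothetical decision threshold of 0.5 to obtain hard calls.
calls = [int(s >= 0.5) for s in scores]
tp = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 1)
fn = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 0)
tn = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 0)
fp = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 1)

print("sensitivity:", sensitivity(tp, fn))
print("specificity:", specificity(tn, fp))
print("AUC:", auc_roc(labels, scores))

# Arithmetic check on the figure stated in the record: 49 of 55 patients.
print("49/55 as a percentage:", round(100 * 49 / 55))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the record reports all three.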