Interpretable Machine Learning Models and Symbolic Regressions Reveal Transfer of Per- and Polyfluoroalkyl Substances (PFASs) in Plants: A New Small-Data Machine Learning Method to Augment Data and Obtain Predictive Equations

Bibliographic details
Main authors: Yuan Zhang, Yanting Li, Yang Li, Lin Zhao, Yongkui Yang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Toxics
Subjects:
Links: https://www.mdpi.com/2305-6304/13/7/579
Description
Summary: Machine learning (ML) techniques are becoming increasingly valuable for modeling the transport of pollutants in plant systems. However, two challenges (small sample sizes and a lack of quantitative calculation functions) remain when using ML to predict migration in hydroponic systems. For the bioaccumulation of per- and polyfluoroalkyl substances (PFASs), we studied the key factors and quantitative calculation equations based on data augmentation, ML, and symbolic regression. First, feature expansion was performed on the input data after data preprocessing; the most important step was data augmentation. The original training set was expanded nine times by combining the synthetic minority oversampling technique (SMOTE) and a variational autoencoder (VAE). Subsequently, four ML models were applied to the test set to predict the selected output parameters. Categorical boosting (CatBoost) had the highest prediction accuracy (R² = 0.83). The SHapley Additive exPlanations (SHAP) values indicated that molecular weight and exposure time were the most important parameters. We applied three symbolic regression models to obtain accurate prediction equations based on the original and augmented data. Based on augmented data, the high-dimensional sparse interaction equation exhibited the highest accuracy (R² = 0.776). Our results indicate that this method could provide crucial insights into absorption and accumulation in plant roots.
ISSN: 2305-6304
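
The summary mentions expanding a small training set with the synthetic minority oversampling technique and scoring models by R². The block below is a minimal, standard-library sketch of the interpolation idea behind SMOTE, not the authors' implementation. Several things here are assumptions for illustration: the function names `smote_like_augment` and `r2_score`, the toy rows (made-up numbers, not data from the article), the choice to interpolate the target value as well as the features (classic SMOTE interpolates only features of minority-class samples), and the omission of the variational autoencoder step.

```python
import math
import random


def smote_like_augment(rows, n_new, k=3, seed=0):
    """Return n_new synthetic (features, target) rows.

    Each synthetic row lies on the line segment between a randomly chosen
    row and one of its k nearest neighbours (Euclidean distance on the
    features). Interpolating the target too is an assumption made here so
    the sketch works for regression data.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base_x, base_y = rows[i]
        # Rank every other row by its distance to the base row.
        others = [r for j, r in enumerate(rows) if j != i]
        others.sort(key=lambda r: math.dist(base_x, r[0]))
        nb_x, nb_y = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        new_x = [a + t * (b - a) for a, b in zip(base_x, nb_x)]
        new_y = base_y + t * (nb_y - base_y)
        synthetic.append((new_x, new_y))
    return synthetic


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy rows: ([molecular weight, exposure time], response).
train = [
    ([300.0, 1.0], 0.5),
    ([400.0, 2.0], 0.8),
    ([500.0, 3.0], 1.1),
    ([350.0, 7.0], 0.9),
]
augmented = train + smote_like_augment(train, n_new=6, k=2, seed=1)
print(len(augmented), "rows after augmentation")
```

In practice the features would usually be scaled before computing distances, since raw molecular weight would otherwise dominate the neighbour search; that step is left out to keep the sketch short.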