Size or diversity? Synthetic dataset recommendations for machine learning heating energy prediction models in early design stages for residential buildings
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2025-09-01 |
Series: | Energy and AI |
Subjects: | |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2666546825000898 |
Summary: | One promising means to reduce building energy use for a more sustainable environment is to conduct early-stage building energy optimization using simulation, yet today’s simulation engines are computationally intensive. Recently, machine learning (ML) energy prediction models have shown promise in replacing these simulation engines. However, such ML models are often difficult to develop because suitable datasets are lacking. Synthetic datasets can provide a solution, but determining the optimal quantity and diversity of synthetic data remains challenging. Furthermore, the compatibility between different ML algorithms and the characteristics of synthetic datasets is poorly understood. To fill these gaps, this study conducted multiple ML experiments using residential buildings in Sweden to determine the best-performing ML algorithm, as well as the characteristics of the corresponding synthetic dataset. A parametric model was developed to generate a wide range of synthetic datasets varying in size and in building shape, the latter referred to as diversity. Five ML algorithms selected through a literature review were trained on the different datasets. Results show that the Support Vector Machine performed best overall. Multiple Linear Regression performed well with small, low-diversity datasets, while the Artificial Neural Network performed well with large, high-diversity datasets. We conclude that, when generating synthetic training datasets, developers should prioritize increasing diversity over size once the dataset size reaches around 1440. This study offers insights for researchers and practitioners, such as software tool developers, who develop ML building energy prediction models for early-stage optimization. |
---|---|
ISSN: | 2666-5468 |
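
The size-versus-diversity experiment described in the summary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the shape families, their factors, the toy heating formula standing in for a simulation engine, and the k-nearest-neighbour regressor, which is a simple stand-in and not one of the algorithms named in the record. The sketch only shows the structure of such a grid, where training sets vary in sample count and in the number of building shapes while the test set covers all shapes.

```python
import math
import random

# Hypothetical shape families and envelope factors; toy values, not from the study.
SHAPES = {"box": 1.0, "L-shape": 1.2, "U-shape": 1.4, "courtyard": 1.6}


def make_dataset(size, n_shapes, rng):
    """Draw `size` toy buildings using only the first `n_shapes` shape families."""
    factors = list(SHAPES.values())[:n_shapes]
    data = []
    for _ in range(size):
        area = rng.uniform(80.0, 300.0)  # floor area, assumed range
        factor = rng.choice(factors)
        heating = 60.0 * area * factor   # toy heating demand, arbitrary units
        data.append(((area, factor), heating))
    return data


def knn_predict(train, x, k=3):
    """Average the targets of the k nearest training points (scaled distance)."""
    def dist(a, b):
        # Divide by each feature's range so both features weigh equally.
        return math.hypot((a[0] - b[0]) / 220.0, (a[1] - b[1]) / 0.6)

    nearest = sorted(train, key=lambda row: dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(train, test):
    """Root-mean-square error of the k-NN stand-in on the test set."""
    errors = [(knn_predict(train, x) - y) ** 2 for x, y in test]
    return math.sqrt(sum(errors) / len(errors))


def run_grid(sizes, diversities, seed=0):
    """Evaluate every (size, diversity) training set against one shared test set."""
    rng = random.Random(seed)
    test = make_dataset(200, len(SHAPES), rng)  # test set covers every shape family
    return {
        (n, d): rmse(make_dataset(n, d, rng), test)
        for n in sizes
        for d in diversities
    }


results = run_grid(sizes=[20, 80, 320], diversities=[1, 2, 4])
for (n, d), err in sorted(results.items()):
    print(f"size={n:4d} diversity={d} RMSE={err:10.1f}")
```

In this toy setting, a training set restricted to one shape family cannot represent the other shapes in the test set, so adding samples alone does little to reduce the error. The numbers it prints say nothing about the study's actual results.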