Size or diversity? Synthetic dataset recommendations for machine learning heating energy prediction models in early design stages for residential buildings

Bibliographic Details
Main Authors: Xinyue Wang, Yinan Yu, Robin Teigland, Alexander Hollberg
Format: Article
Language: English
Published: Elsevier 2025-09-01
Series: Energy and AI
Online Access: http://www.sciencedirect.com/science/article/pii/S2666546825000898
Description
Summary: One promising means of reducing building energy use for a more sustainable environment is to conduct early-stage building energy optimization using simulation, yet today’s simulation engines are computationally intensive. Recently, machine learning (ML) energy prediction models have shown promise in replacing these simulation engines. However, such ML models are often difficult to develop because suitable datasets are lacking. Synthetic datasets can provide a solution, but determining the optimal quantity and diversity of synthetic data remains challenging. Furthermore, the compatibility between different ML algorithms and the characteristics of synthetic datasets is poorly understood. To fill these gaps, this study conducted multiple ML experiments using residential buildings in Sweden to determine the best-performing ML algorithm and the characteristics of the corresponding synthetic dataset. A parametric model was developed to generate a wide range of synthetic datasets varying in size and in building shape, the latter referred to as diversity. Five ML algorithms selected through a literature review were trained on the different datasets. Results show that the Support Vector Machine performed best overall. Multiple Linear Regression performed well with small, low-diversity datasets, while the Artificial Neural Network performed well with large, high-diversity datasets. We conclude that, when generating synthetic training datasets, developers should focus on increasing diversity rather than size once the dataset size reaches around 1440. This study offers insights for researchers and practitioners, such as software tool developers, who develop ML building energy prediction models for early-stage optimization.
ISSN: 2666-5468
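
Illustrative sketch: The size-versus-diversity experiment outlined in the summary can be pictured as a grid of training runs, one per combination of dataset size and number of building shapes. The Python sketch below is only a toy illustration of that idea and is not the authors' method. The demand function `toy_heating_demand`, the feature ranges, the k-nearest-neighbour regressor, and all other names are assumptions made for this example; the study itself used a parametric building model, simulation, and five ML algorithms that are not reproduced here.

```python
import math
import random


def toy_heating_demand(area, shape_factor):
    """Hypothetical stand-in for a simulation engine (not the study's model)."""
    return 50.0 + 0.8 * area + 30.0 * shape_factor ** 2


def make_dataset(size, n_shapes, rng):
    """Sample `size` toy buildings drawn from `n_shapes` distinct shape factors.

    Here "diversity" is approximated by the number of distinct shape factors.
    """
    shapes = [0.5 + 1.5 * i / max(n_shapes - 1, 1) for i in range(n_shapes)]
    data = []
    for _ in range(size):
        area = rng.uniform(80.0, 250.0)   # assumed floor-area range
        shape = rng.choice(shapes)
        data.append((area, shape, toy_heating_demand(area, shape)))
    return data


def knn_predict(train, area, shape, k=3):
    """Average the targets of the k nearest training rows (features rescaled)."""
    def dist(row):
        return math.hypot((row[0] - area) / 170.0, (row[1] - shape) / 1.5)
    nearest = sorted(train, key=dist)[:k]
    return sum(row[2] for row in nearest) / len(nearest)


def rmse(train, test):
    """Root-mean-square error of the k-NN predictions on the test rows."""
    errors = [(knn_predict(train, a, s) - y) ** 2 for a, s, y in test]
    return math.sqrt(sum(errors) / len(errors))


def run_grid(sizes, shape_counts, seed=0):
    """Return {(size, n_shapes): test RMSE} for every cell of the grid."""
    rng = random.Random(seed)
    # The test set uses continuous shape factors, so it covers shapes that a
    # low-diversity training set never contains.
    test = []
    for _ in range(100):
        area, shape = rng.uniform(80.0, 250.0), rng.uniform(0.5, 2.0)
        test.append((area, shape, toy_heating_demand(area, shape)))
    return {
        (n, d): rmse(make_dataset(n, d, rng), test)
        for n in sizes
        for d in shape_counts
    }


if __name__ == "__main__":
    for (n, d), err in sorted(run_grid([60, 240], [1, 4, 8]).items()):
        print(f"size={n:4d}  shapes={d}  RMSE={err:.1f}")
```

Reading the printed grid along one axis at a time shows how error changes with size at fixed diversity and with diversity at fixed size, which is the kind of comparison the summary describes. The numbers from this toy setup carry no meaning for real buildings.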