A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis
Mendeley Data

Bibliographic Details
Main Authors: Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi
Format: Article
Language: English
Published: Elsevier 2025-10-01
Series: Data in Brief
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340925006390
Description
Summary: This data article describes a curated, crowdsourced speech dataset in Luganda and Kiswahili, created to support text-to-speech (TTS) development in low-resource settings. The dataset is derived from Mozilla’s Common Voice corpus and includes only validated utterances from female speakers. A multi-step curation process was used to improve the consistency and quality of the data. Speakers were first selected manually based on similarities in intonation, pitch, and rhythm, and the selection was then validated using acoustic clustering on pitch features and mel-frequency cepstral coefficients (MFCCs). Audio files were trimmed of leading and trailing silence using WebRTC voice activity detection, denoised with a causal waveform-based DEMUCS model, and filtered using WV-MOS, an automatic speech quality scoring tool. Only clips with a predicted mean opinion score (MOS) of 3.5 or higher were retained. The final dataset contains over 19 h of Luganda and 15 h of Kiswahili recordings from six female speakers per language, each paired with a text transcription. The dataset is designed to support speech generation research in Luganda and Kiswahili and to enable reproducible experimentation with end-to-end TTS systems.
ISSN:2352-3409
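
The trimming and quality-filtering steps described in the summary can be illustrated with a minimal sketch. This is not the authors' code: it assumes that per-frame speech/non-speech flags have already been produced by a voice activity detector, and that each clip already carries a predicted MOS from a scorer such as WV-MOS. The names `Clip`, `speech_bounds`, `trim_silence`, and `filter_by_mos` are hypothetical; only the 3.5 threshold comes from the summary.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

MOS_THRESHOLD = 3.5  # retention cutoff stated in the summary


@dataclass
class Clip:
    clip_id: str          # hypothetical identifier for the recording
    transcript: str       # text paired with the audio
    predicted_mos: float  # assumed to come from an external quality predictor


def speech_bounds(frame_is_speech: Sequence[bool]) -> Tuple[int, int]:
    """Return (start, end) frame indices from the first to the last speech frame.

    `end` is exclusive. Returns (0, 0) when no frame is flagged as speech.
    """
    voiced = [i for i, flag in enumerate(frame_is_speech) if flag]
    if not voiced:
        return (0, 0)
    return (voiced[0], voiced[-1] + 1)


def trim_silence(samples: Sequence[int],
                 frame_is_speech: Sequence[bool],
                 frame_len: int) -> List[int]:
    """Drop leading and trailing non-speech frames; pauses inside speech are kept."""
    start, end = speech_bounds(frame_is_speech)
    return list(samples[start * frame_len:end * frame_len])


def filter_by_mos(clips: Sequence[Clip],
                  threshold: float = MOS_THRESHOLD) -> List[Clip]:
    """Keep only clips whose predicted MOS is at or above the threshold."""
    return [c for c in clips if c.predicted_mos >= threshold]
```

In this sketch the comparison is inclusive, matching the "3.5 or higher" wording, and internal pauses are left untouched because only leading and trailing silence is described as removed.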