Confidence-Based Knowledge Distillation to Reduce Training Costs and Carbon Footprint for Low-Resource Neural Machine Translation

The transformer-based deep learning approach represents the current state-of-the-art in machine translation (MT) research. Large-scale pretrained transformer models produce state-of-the-art performance across a wide range of MT tasks for many languages. However, such deep neural network (NN) models...

Full description

Saved in:
Bibliographic Details
Main Authors: Maria Zafar, Patrick J. Wall, Souhail Bakkali, Rejwanul Haque
Format: Article
Language:English
Published: MDPI AG 2025-07-01
Series:Applied Sciences
Subjects:
Online Access:https://www.mdpi.com/2076-3417/15/14/8091
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The transformer-based deep learning approach represents the current state-of-the-art in machine translation (MT) research. Large-scale pretrained transformer models produce state-of-the-art performance across a wide range of MT tasks for many languages. However, such deep neural network (NN) models are often data-, compute-, space-, power-, and energy-hungry, typically requiring powerful GPUs or large-scale clusters to train and deploy. As a result, they are often regarded as “non-green” and “unsustainable” technologies. Distilling knowledge from large deep NN models (teachers) to smaller NN models (students) is a widely adopted sustainable development approach in MT as well as in broader areas of natural language processing (NLP), including speech, and image processing. However, distilling large pretrained models presents several challenges. First, increased training time and cost that scales with the volume of data used for training a student model. This could pose a challenge for translation service providers (TSPs), as they may have limited budgets for training. Moreover, CO<sub>2</sub> emissions generated during model training are typically proportional to the amount of data used, contributing to environmental harm. Second, when querying teacher models, including encoder–decoder models such as NLLB, the translations they produce for low-resource languages may be noisy or of low quality. This can undermine sequence-level knowledge distillation (SKD), as student models may inherit and reinforce errors from inaccurate labels. In this study, the teacher model’s confidence estimation is employed to filter those instances from the distilled training data for which the teacher exhibits low confidence. We tested our methods on a low-resource Urdu-to-English translation task operating within a constrained training budget in an industrial translation setting. Our findings show that confidence estimation-based filtering can significantly reduce the cost and CO<sub>2</sub> emissions associated with training a student model without drop in translation quality, making it a practical and environmentally sustainable solution for the TSPs.
ISSN:2076-3417