Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning
The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems.
Main Authors: | Ramya Jonnala, Jeong Yang, Young Lee, Gongbo Liang, Zechun Cao |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | LLMs; vertex AI; CoT; fine-tuning; Python; code generation |
Online Access: | https://ieeexplore.ieee.org/document/11069268/ |
_version_ | 1839625334031384576 |
---|---|
author | Ramya Jonnala; Jeong Yang; Young Lee; Gongbo Liang; Zechun Cao |
author_facet | Ramya Jonnala; Jeong Yang; Young Lee; Gongbo Liang; Zechun Cao |
author_sort | Ramya Jonnala |
collection | DOAJ |
description | The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often suffers from inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address this issue, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompassed execution time, memory utilization, and peak memory consumption, while the functional correctness of the generated code was verified throughout. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment across a spectrum of machine configurations, the study used a fixed seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. Contrary to our expectations, however, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. In conclusion, this study provides empirical evidence that high-CPU machine configurations, in combination with the GPT-4o-Mini model and CoT prompting, yield demonstrably more efficient and accurate LLM-generated Python code, particularly in computationally intensive application scenarios. |
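The three efficiency metrics named in the abstract (execution time, memory utilization, and peak memory consumption) can all be captured for a Python snippet with the standard library alone. Below is a minimal sketch of such a harness, not taken from the paper; the `measure_efficiency` helper and the `fib` stand-in for an LLM-generated solution are illustrative assumptions, and `tracemalloc` traces only Python-level allocations rather than total process memory.

```python
import time
import tracemalloc

def measure_efficiency(func, *args, **kwargs):
    """Run func once; return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()                      # begin tracing Python allocations
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _current, peak = tracemalloc.get_traced_memory()  # (bytes now, peak bytes)
    tracemalloc.stop()
    return result, elapsed, peak

def fib(n):
    """Stand-in for an LLM-generated solution under test."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

if __name__ == "__main__":
    _, seconds, peak_bytes = measure_efficiency(fib, 100_000)
    print(f"elapsed: {seconds:.4f} s, peak memory: {peak_bytes / 1024:.1f} KiB")
```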
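For the CoT-prompting and fixed-seed setup the abstract describes, a request to one of the studied models might look like the sketch below, written against the OpenAI Python client. The prompt wording, seed value, and example task are assumptions rather than the paper's actual protocol, and the `seed` parameter only makes sampling best-effort reproducible.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Chain-of-Thought instruction: ask for stepwise reasoning about efficiency
# before the final program. Wording is illustrative, not from the paper.
COT_INSTRUCTION = (
    "Think step by step about the algorithm and its time and memory "
    "complexity, then output only the final, efficient Python solution."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    seed=42,          # fixed seed for best-effort reproducible sampling
    temperature=0,
    messages=[
        {"role": "system", "content": COT_INSTRUCTION},
        {"role": "user", "content": "Write a Python function that returns the "
                                    "n-th Fibonacci number efficiently."},
    ],
)
print(response.choices[0].message.content)
```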
format | Article |
id | doaj-art-f8d90c3451a24f0dbb548f6c7b4fa3b2 |
institution | Matheson Library |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-f8d90c3451a24f0dbb548f6c7b4fa3b2; 2025-07-17T23:02:00Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13, pp. 119657-119681; doi:10.1109/ACCESS.2025.3585742; document 11069268; Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning; Ramya Jonnala (https://orcid.org/0009-0002-9366-2443); Jeong Yang (https://orcid.org/0000-0002-3819-3544); Young Lee (https://orcid.org/0000-0003-3589-3120); Gongbo Liang (https://orcid.org/0000-0002-6700-6664); Zechun Cao (https://orcid.org/0000-0002-4542-7791); all authors: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA; The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often suffers from inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address this issue, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompassed execution time, memory utilization, and peak memory consumption, while the functional correctness of the generated code was verified throughout. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment across a spectrum of machine configurations, the study used a fixed seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. Contrary to our expectations, however, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. In conclusion, this study provides empirical evidence that high-CPU machine configurations, in combination with the GPT-4o-Mini model and CoT prompting, yield demonstrably more efficient and accurate LLM-generated Python code, particularly in computationally intensive application scenarios.; https://ieeexplore.ieee.org/document/11069268/; LLMs; vertex AI; CoT; fine-tuning; Python; code generation |
spellingShingle | Ramya Jonnala; Jeong Yang; Young Lee; Gongbo Liang; Zechun Cao; Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning; IEEE Access; LLMs; vertex AI; CoT; fine-tuning; Python; code generation |
title | Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning |
title_full | Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning |
title_fullStr | Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning |
title_full_unstemmed | Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning |
title_short | Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning |
title_sort | measuring and improving the efficiency of python code generated by llms using cot prompting and fine tuning |
topic | LLMs; vertex AI; CoT; fine-tuning; Python; code generation |
url | https://ieeexplore.ieee.org/document/11069268/ |
work_keys_str_mv | AT ramyajonnala measuringandimprovingtheefficiencyofpythoncodegeneratedbyllmsusingcotpromptingandfinetuning AT jeongyang measuringandimprovingtheefficiencyofpythoncodegeneratedbyllmsusingcotpromptingandfinetuning AT younglee measuringandimprovingtheefficiencyofpythoncodegeneratedbyllmsusingcotpromptingandfinetuning AT gongboliang measuringandimprovingtheefficiencyofpythoncodegeneratedbyllmsusingcotpromptingandfinetuning AT zechuncao measuringandimprovingtheefficiencyofpythoncodegeneratedbyllmsusingcotpromptingandfinetuning |