Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning

Bibliographic Details
Main Authors: Ramya Jonnala (ORCID: 0009-0002-9366-2443), Jeong Yang (ORCID: 0000-0002-3819-3544), Young Lee (ORCID: 0000-0003-3589-3120), Gongbo Liang (ORCID: 0000-0002-6700-6664), Zechun Cao (ORCID: 0000-0002-4542-7791)
Affiliation: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA (all authors)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 119657-119681
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3585742
Collection: DOAJ
Subjects: LLMs; Vertex AI; CoT; fine-tuning; Python; code generation
Online Access: https://ieeexplore.ieee.org/document/11069268/
Description: The growing sophistication of Artificial Intelligence (AI) has driven the rapid adoption of Large Language Models (LLMs) in software development. These models are increasingly employed to automate the generation of functionally correct code, solve complex computational problems, and assist in debugging existing software systems. However, LLM-generated code often suffers from inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address these issues, this study rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics covered execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Using the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a range of machine configurations, the study applied a consistent seed parameter to ensure experimental reproducibility. It further investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. The findings reveal a significant improvement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo under CoT prompting; this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, GPT-4o-Mini was selected for subsequent fine-tuning, with the aim of further improving both its computational efficiency and accuracy. Contrary to expectations, however, fine-tuning led to a discernible degradation in both accuracy and computational efficiency. In conclusion, the study provides empirical evidence that pairing high-CPU machine configurations with the GPT-4o-Mini model and CoT prompting yields measurably more efficient and accurate LLM-generated Python code, particularly in computationally intensive scenarios.
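The efficiency metrics described above (execution time, memory utilization, and peak memory consumption) can be collected for a single generated solution with Python's standard library alone. The following is a minimal sketch under that assumption, not the authors' actual EffiBench harness; `candidate_solution` and its test case are hypothetical stand-ins for LLM-generated code.

```python
import time
import tracemalloc

def candidate_solution(nums):
    """Hypothetical stand-in for an LLM-generated solution."""
    return sorted(nums)

def profile_solution(fn, *args):
    """Run fn once, returning (result, seconds, current_bytes, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, current, peak

# Correctness is checked first; efficiency only counts for correct code.
result, elapsed, current, peak = profile_solution(candidate_solution, [3, 1, 2])
assert result == [1, 2, 3], "generated code must remain functionally correct"
print(f"time: {elapsed:.6f} s, memory: {current} B, peak: {peak} B")
```

In a full evaluation, this measurement would be repeated for every benchmark problem and test case, with metrics recorded only for solutions that pass their tests.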
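Two of the levers named in the description, a fixed seed parameter and Chain-of-Thought prompting, can be combined in a single API call. Below is a minimal sketch assuming the OpenAI Python SDK (v1.x); the prompt wording, task, and model alias are illustrative rather than the study's exact configuration, and the API documents `seed` as best-effort rather than guaranteed determinism.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical CoT instruction; the study's exact prompt is not reproduced here.
COT_INSTRUCTION = (
    "Think step by step: restate the problem, outline an efficient "
    "algorithm, analyze its time and space complexity, and only then "
    "emit the final Python solution in a single code block."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    seed=42,        # fixed seed for (best-effort) reproducibility
    temperature=0,
    messages=[
        {"role": "system", "content": COT_INSTRUCTION},
        {
            "role": "user",
            "content": "Write a Python function that returns the k most "
                       "frequent elements of a list of integers.",
        },
    ],
)
print(response.choices[0].message.content)
```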
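For the fine-tuning arm, a job on the GPT-4o-Mini base model could be launched roughly as follows, again assuming the OpenAI Python SDK. The training file name and snapshot identifier are assumptions for illustration; the training data would presumably be chat-formatted examples pairing benchmark problems with efficient reference solutions.

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples
# (hypothetical file name; each line holds one {"messages": [...]} record).
training_file = client.files.create(
    file=open("effibench_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job against a fine-tunable GPT-4o-Mini snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```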