Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning

Bibliographic Details
Main Authors: Ramya Jonnala (ORCID: 0009-0002-9366-2443), Jeong Yang (ORCID: 0000-0002-3819-3544), Young Lee (ORCID: 0000-0003-3589-3120), Gongbo Liang (ORCID: 0000-0002-6700-6664), Zechun Cao (ORCID: 0000-0002-4542-7791)
Affiliation: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA (all authors)
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, vol. 13, pp. 119657-119681
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3585742
Collection: DOAJ
Subjects: LLMs; Vertex AI; CoT; fine-tuning; Python; code generation
Online Access: https://ieeexplore.ieee.org/document/11069268/
Description: The growing sophistication of Artificial Intelligence (AI) has driven the rapid adoption of Large Language Models (LLMs) in software development. These models are increasingly employed to automate the generation of functionally correct code, solve complex computational problems, and assist in debugging existing software systems. However, LLM-generated code often suffers from inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address these issues, this study rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics covered execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Using the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a range of machine configurations, the study applied a consistent seed parameter to ensure experimental reproducibility. It further investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. The findings reveal a significant improvement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo under CoT prompting; this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, GPT-4o-Mini was selected for subsequent fine-tuning, with the aim of further improving both its computational efficiency and accuracy. Contrary to expectations, however, fine-tuning led to a discernible degradation in both accuracy and computational efficiency. In conclusion, the study provides empirical evidence that pairing high-CPU machine configurations with the GPT-4o-Mini model and CoT prompting yields measurably more efficient and accurate LLM-generated Python code, particularly in computationally intensive scenarios.
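The efficiency metrics described above (execution time, memory utilization, and peak memory consumption) can be collected for a single generated solution with Python's standard library alone. The following is a minimal sketch under that assumption, not the authors' actual EffiBench harness; `candidate_solution` and its test case are hypothetical stand-ins for LLM-generated code.

```python
import time
import tracemalloc

def candidate_solution(nums):
    """Hypothetical stand-in for an LLM-generated solution."""
    return sorted(nums)

def profile_solution(fn, *args):
    """Run fn once, returning (result, seconds, current_bytes, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, current, peak

# Correctness is checked first; efficiency only counts for correct code.
result, elapsed, current, peak = profile_solution(candidate_solution, [3, 1, 2])
assert result == [1, 2, 3], "generated code must remain functionally correct"
print(f"time: {elapsed:.6f} s, memory: {current} B, peak: {peak} B")
```

In a full evaluation, this measurement would be repeated for every benchmark problem and test case, with metrics recorded only for solutions that pass their tests.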
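Two of the levers named in the description, a fixed seed parameter and Chain-of-Thought prompting, can be combined in a single API call. Below is a minimal sketch assuming the OpenAI Python SDK (v1.x); the prompt wording, task, and model alias are illustrative rather than the study's exact configuration, and the API documents `seed` as best-effort rather than guaranteed determinism.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical CoT instruction; the study's exact prompt is not reproduced here.
COT_INSTRUCTION = (
    "Think step by step: restate the problem, outline an efficient "
    "algorithm, analyze its time and space complexity, and only then "
    "emit the final Python solution in a single code block."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    seed=42,        # fixed seed for (best-effort) reproducibility
    temperature=0,
    messages=[
        {"role": "system", "content": COT_INSTRUCTION},
        {
            "role": "user",
            "content": "Write a Python function that returns the k most "
                       "frequent elements of a list of integers.",
        },
    ],
)
print(response.choices[0].message.content)
```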
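For the fine-tuning arm, a job on the GPT-4o-Mini base model could be launched roughly as follows, again assuming the OpenAI Python SDK. The training file name and snapshot identifier are assumptions for illustration; the training data would presumably be chat-formatted examples pairing benchmark problems with efficient reference solutions.

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples
# (hypothetical file name; each line holds one {"messages": [...]} record).
training_file = client.files.create(
    file=open("effibench_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job against a fine-tunable GPT-4o-Mini snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```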