
Increasing Code Completion Accuracy in Pythia Models for Non-Standard Python Libraries

Dissertation Code Completion Machine Learning Natural Language Processing Neural Networks Python Modules Source Code Analysis
Author: David Buksbaum

Abstract

Background: Integrated Development Environments (IDEs) have become central to modern software development, offering features such as code completion to boost developer productivity. Code completion tools rely on predictive models to suggest relevant methods or functions as developers write code. Recent advances, such as the Pythia model, use natural language processing techniques and long short-term memory (LSTM) recurrent neural networks (RNNs) to improve prediction accuracy, but these models tend to perform significantly better on Python’s standard libraries than on third-party libraries. This disparity arises because training datasets are typically dominated by standard library usage, leaving third-party libraries underrepresented and their code completion predictions less accurate.

Objective: This dissertation aims to address the imbalance in code completion accuracy between Python standard and third-party libraries within the Pythia model. The primary objective is to improve the predictive accuracy for third-party libraries by expanding and refining the training dataset, ensuring that these libraries are better represented during model training. The goal is to close the performance gap so that developers working with third-party libraries can benefit from code completion tools as effectively as those using standard libraries.

Method: The research replicates and extends the Pythia code completion system by systematically increasing the number of GitHub repositories containing targeted third-party libraries in the training dataset. A universal dataset of Python projects is constructed, and repositories are selected based on their usage of specific third-party libraries. The data collection, preprocessing, and training pipeline mirrors Pythia’s original approach, using word2vec embeddings and LSTM RNNs, but introduces mechanisms to dynamically adjust the dataset composition. Evaluation metrics such as Top-k accuracy and Mean Reciprocal Rank (MRR) are used to assess improvements in prediction quality for both standard and third-party libraries.
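The repository selection step described above can be illustrated with a minimal sketch. It assumes that "usage of a library" is detected by parsing each Python file and collecting the top-level names it imports; the function names `imported_top_level_modules` and `repo_uses_targets` are hypothetical and are not taken from the dissertation or from Pythia.

```python
import ast


def imported_top_level_modules(source):
    """Return the set of top-level module names imported by a source string."""
    modules = set()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Assumption: files that do not parse are simply skipped.
        return modules
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import, which refers to local code.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def repo_uses_targets(file_sources, targets):
    """Return the subset of target libraries imported by any file in a repo."""
    targets = set(targets)
    found = set()
    for source in file_sources:
        found |= imported_top_level_modules(source) & targets
    return found
```

A repository would then be a candidate for the expanded dataset when `repo_uses_targets` returns a non-empty set for the libraries of interest, for example `{"markdown", "tqdm", "yaml"}`.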

Results: Expanding the training dataset with additional repositories referencing target third-party libraries led to measurable improvements in code completion accuracy for those libraries. For example, increasing the dataset by 100 and then 270 repositories for libraries like markdown, tqdm, and yaml resulted in significant gains in Top-1 and Top-5 accuracy rates and MRR, particularly for libraries that initially had lower prediction quality. The improvements were most pronounced for libraries with the greatest initial accuracy deficits, while libraries with already high baseline accuracy saw more modest gains. Importantly, these enhancements did not degrade the predictive performance for standard libraries or the overall model quality.
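For readers unfamiliar with the metrics, the sketch below shows the standard definitions of Top-k accuracy and MRR over ranked completion lists. The function names and the example data are hypothetical, and the exact variant used in the dissertation (for instance, how targets missing from the candidate list are scored) may differ.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of cases where the true completion is among the first k suggestions."""
    hits = sum(
        1 for preds, target in zip(ranked_predictions, targets)
        if target in preds[:k]
    )
    return hits / len(targets)


def mean_reciprocal_rank(ranked_predictions, targets):
    """Average of 1/rank of the true completion; assumed 0 when it is not suggested."""
    total = 0.0
    for preds, target in zip(ranked_predictions, targets):
        if target in preds:
            total += 1.0 / (preds.index(target) + 1)
    return total / len(targets)


# Illustrative data only: three completion requests with ranked suggestions.
suggestions = [
    ["load", "dump", "safe_load"],   # true completion ranked 1st
    ["update", "close", "write"],    # true completion ranked 2nd
    ["a", "b", "c"],                 # true completion not suggested
]
truth = ["load", "close", "z"]

top1 = top_k_accuracy(suggestions, truth, k=1)
top5 = top_k_accuracy(suggestions, truth, k=5)
mrr = mean_reciprocal_rank(suggestions, truth)
```

Top-1 rewards only an exact first suggestion, while MRR also gives partial credit when the correct method appears lower in the list, which is why the two are typically reported together.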

Conclusion: The research demonstrates that targeted expansion of the training dataset to better represent third-party Python libraries can significantly improve the accuracy of code completion predictions in Pythia models. This approach reduces the disparity between standard and third-party library support, enabling more consistent and efficient developer experiences across a wider range of Python libraries. The findings suggest that similar data-driven strategies can be applied to other programming languages and code completion systems, paving the way for more inclusive and effective intelligent developer tools.
