Conferences >2024 IEEE/ACM 32nd Internatio...

Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks

Download PDF
Download References
Request Permissions
Save to
Alerts

Abstract:

AI-based code generators have become pivotal in assisting developers in writing software starting from natural language (NL). However, they are trained on large amounts o...Show More

Metadata

Abstract:

AI-based code generators have become pivotal in assisting developers in writing software starting from natural language (NL). However, they are trained on large amounts of data, often collected from unsanitized online sources (e.g., GitHub, HuggingFace). As a consequence, AI models become an easy target for data poisoning, i.e., an attack that injects malicious samples into the training data to generate vulnerable code. To address this threat, this work investigates the security of AI code generators by devising a targeted data poisoning strategy. We poison the training data by injecting increasing amounts of code containing security vulnerabilities and assess the attack’s success on different state-of-the-art models for code generation. Our study shows that AI code generators are vulnerable to even a small amount of poison. Notably, the attack success strongly depends on the model architecture and poisoning rate, whereas it is not influenced by the type of vulnerabilities. Moreover, since the attack does not impact the correctness of code generated by pretrained models, it is hard to detect. Lastly, our work offers practical insights into understanding and potentially mitigating this threat. CCS CONCEPTS • Computing methodologies

$\rightarrow$ Machine translation;

$\bullet$ Security and privacy

$\rightarrow$ Software security engineering.

Published in: 2024 IEEE/ACM 32nd International Conference on Program Comprehension (ICPC)

Date of Conference: 15-16 April 2024

Date Added to IEEE Xplore: 18 June 2024

ISBN Information:

ISSN Information:

Conference Location: Lisbon, Portugal

Contents

1 Introduction

Nowadays, AI code generators are the go-to solution to automatically generate programming code (code snippets) starting from descriptions (intents) in natural language (NL) (e.g., English). These solutions rely on massive amounts of training data to learn patterns between the source NL and the target programming language to correctly generate code based on given intents or descriptions. Since single-handedly collecting this data is often too time-consuming and expensive, developers and AI practitioners frequently resort to downloading datasets from the Internet or collecting traimng data from online sources, including code repositories and open-source communities (e.g., GitHub, Hugging Face, StackOverflow) [5]. Indeed, it is a common practice to download datasets from AI open-source commumties to fine-tune AI models on a specific downstream task [18, 26]. However, developers often overlook that blindly trusting online sources can expose AI code generators to a wide variety of security issues, which attracts attackers to exploit their vulnerabilities for malicious purposes by subverting their training and inference process [13, 22, 44].

References is not available for this document.

Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks

Abstract:

Metadata

Abstract:

ISSN Information:

1 Introduction

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Vulnerabilities in AI Code Generators: Exploring Targeted Data Poisoning Attacks

Alerts

Abstract:

Metadata

Abstract:

ISSN Information:

1 Introduction

Authors

Figures

References

Citations

Keywords

Metrics

Footnotes

References

IEEE Account

Purchase Details

Profile Information

Need Help?