Defense against Prompt Injection Attacks via Mixture of Encodings

1UC San Diego, 2Microsoft
NAACL 2025
Paper Code More from ROAR AI

Abstract

Large Language Models (LLMs) have emerged as a dominant approach for a wide range of NLP tasks, with their access to external information further enhancing their capabilities. However, this introduces new vulnerabilities, known as prompt injection attacks, where external content embeds malicious instructions that manipulate the LLM’s output. Recently, the Base64 defense has been recognized as one of the most effective methods for reducing success rate of prompt injection attacks. Despite its efficacy, this method can degrade LLM performance on certain NLP tasks. To address this challenge, we propose a novel defense mechanism: mixture of encodings, which utilizes multiple character encodings, including Base64. Extensive experimental results show that our method achieves one of the lowest attack success rates under prompt injection attacks, while maintaining high performance across all NLP tasks, outperforming existing character encoding-based defense methods. This underscores the effectiveness of our mixture of encodings strategy for both safety and task performance metrics.

Example of Prompt Injection Attack

Example of Prompt Injection Attack

Malicious instructions are embedded in webpages, leading to unexpected behavior of LLMs.

Our Method

Flowchart of Our Method

An overview of the mixture of encodings defense against prompt injection attacks. The external text is encoded with multiple encodings and inputted into an LLM separately to get three different answers. Based on these answers, the LLM then generates the final output.

Safety Benchmark

Method Email Table Abstract Code
GPT-4 + No Defense 14.30 34.52 25.40 1.96
GPT-4 + Datamark 7.03 10.83 23.64 4.57
GPT-4 + Ignoring 10.55 29.76 23.00 0.10
GPT-4 + Base64 3.40 10.40 8.66 0.15
GPT-4 + Caesar 2.20 1.66 5.83 0
GPT-4 + Ours 1.20 3.75 6.79 0.07
GPT-4o + No Defense 12.00 36.80 26.00 7.59
GPT-4o + Datamark 9.75 13.79 22.67 5.67
GPT-4o + Ignoring 7.17 24.25 14.06 6.41
GPT-4o + Base64 1.90 1.40 5.70 0
GPT-4o + Caesar 3.90 11.10 12.00 0
GPT-4o + Ours 1.50 1.00 1.00 0

Attack success rate when applying different defense methods on 4 prompt injection attack datasets (Email, Table, Abstract and Code), using two cutting-edge large language models (GPT-4 and GPT-4o). The best results are shown in red, and the second best results are shown in olive.

BibTeX

@inproceedings{
        zhang2025defense,
        title={Defense against Prompt Injection Attacks via Mixture of Encodings},
        author={Ruiyi Zhang and David Sullivan and Kyle Jackson and Pengtao Xie and Mei Chen},
        booktitle={Annual Conference of the Nations of the Americas Chapter of the ACL (NAACL)},
        year={2025}
        }