How good are LLMs at patching vulnerabilities?

May 29, 2024

A quantitative benchmark of the best LLMs to see how good they are at patching vulnerabilities. This post also explores where these models struggle and some possible workarounds.

The application security landscape is abuzz with AI-enabled products and feature sets. GitHub showcased security fixing with Copilot, Checkmarx added fix features, Semgrep included AI-driven capabilities, and startups are building entire companies around AI patching. The common message is this: LLMs can fix the vulnerabilities in your code with little to no oversight. But how true is that? And given that a partial patch is as good as no patch, is it a truly viable option? Let's find out.

Over the last 6 months, we have carefully curated an evaluation benchmark called the StaticAnalysisEval to answer this exact question. Our intention is not to start the next episode of ‘AppSec Wars’, so we will limit our investigation to the relative efficacy of commercial and open-source LLMs given the same patching prompt. We start with core LLM performance and then explore the limitations as well as workarounds to improve performance.

Core LLM Performance

The StaticAnalysisEval is a dataset of 76 Python programs taken from real open-source Python projects (the top 1,000 on GitHub), where each program is a file containing exactly one vulnerability as detected by a particular static analyzer (Semgrep). The benchmark allows one patch generation per vulnerability (i.e., pass@1) and evaluates each model on whether its patch clears the Semgrep finding, the time taken to generate the patch, and the associated cost.
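To make the evaluation procedure concrete, here is a minimal sketch of what a pass@1 harness along these lines might look like. The Semgrep CLI invocation is real, but the helper names, the use of `--config auto`, and the pass criterion shown here are illustrative assumptions, not the benchmark's actual code.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def count_findings(path: str) -> int:
    """Run Semgrep on a single file and count the reported findings."""
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", path],
        capture_output=True, text=True,
    )
    return len(json.loads(result.stdout).get("results", []))


def evaluate_pass_at_1(samples: list, generate_patch) -> float:
    """Score one patch attempt per vulnerable file (pass@1).

    `samples` is a list of paths to vulnerable Python files;
    `generate_patch` is any callable that asks an LLM to return
    the fixed contents of a file.
    """
    passed = 0
    for sample in samples:
        patched_source = generate_patch(Path(sample).read_text())
        with tempfile.NamedTemporaryFile(
            "w", suffix=".py", delete=False
        ) as tmp:
            tmp.write(patched_source)
        # A patch counts as a pass only if the analyzer finding is gone.
        if count_findings(tmp.name) == 0:
            passed += 1
    return passed / len(samples)
```

Here's a summary of the results: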

Accuracy rate, run time, and cost for different LLMs; the full leaderboard is here.

Beyond the predictable ordering, there are two key takeaways from the benchmark:

  1. Accuracy: Even the best models only manage to patch about two-thirds of all vulnerabilities on their first attempt.
  2. Convergence: The best commercial and open-source models are fast converging to similar performance levels, with the latter being significantly cheaper.

Is a 69% accuracy rate good enough for you? The answer will vary based on how overwhelmed you are with patching vulnerabilities, and how security-sensitive your application is. But we can all agree that we would like the number to be higher. So where exactly do LLMs fall short?

Constraints and Limitations

While LLMs can patch some (or most) vulnerabilities, they are not without their constraints and limitations. Here are some common points of failure across the various models:

Sample vulnerability fix that doesn't include the necessary import statement, causing the build to fail.
  1. Dependency/Imports: LLMs often struggle to understand and resolve dependencies and imports in a codebase. A common case is an LLM rewriting code to use secure serialization or sanitization methods from a library without adding the corresponding import or updating the application manifest (see the sketch after this list).
  2. Breaking Changes: Automated fixes can introduce breaking changes, especially when the vulnerability is localized but its impact is not. A classic example is a weak encryption algorithm or an insecure protocol: naively switching to a secure alternative will invariably break your application; replacing MD5 password hashes with bcrypt, for instance, invalidates every previously stored hash.
  3. Code Context: LLMs can lack the context needed to make nuanced fixes that align with the overall architecture and design patterns of a codebase. This can stem from limited context length and/or irrelevant code being pulled into the context.
  4. External Dependencies: Some fixes require more than patching the application code. Hardcoded credentials, for example, may need to be replaced with a reference to a secrets management service, and the compromised credential itself must be regenerated; both steps require access to more than just the codebase.
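To make the first failure mode concrete, here is a hand-written example in the shape we commonly observe (illustrative, not actual model output): the insecure call is correctly replaced, but the module the replacement needs is never imported, so the "fixed" file fails as soon as it runs.

```python
# Before: flagged by the analyzer for deriving a security token from a
# non-cryptographic random number generator.
import random


def make_reset_token() -> str:
    return str(random.random())


# After (typical failed patch): the model correctly switches to the
# `secrets` module but never adds `import secrets`, so calling this
# function raises NameError: name 'secrets' is not defined.
def make_reset_token_patched() -> str:
    return secrets.token_urlsafe(32)
```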

Solutions and Workarounds

To mitigate these limitations, several strategies can be employed:

  1. Custom Prompts: Custom prompts that give the LLM more context about the code and the specific vulnerability can lead to more accurate fixes. You can also specify preferences such as your logging library, encryption algorithms, and random number generators to direct how open-ended vulnerabilities are addressed.
  2. Retries: A retry mechanism in which the LLM generates multiple candidate fixes for the same issue increases the chances of arriving at a correct one. Combined with your static analysis tool as the validator, this is a powerful way to iterate on a vulnerability fix, especially with a low-cost LLM like Llama-3 (see the sketch after this list).
  3. Abstract Syntax Tree (AST) Analysis: Leveraging AST analysis can help in understanding the structure of the code, allowing for more targeted and context-aware fixes. This is especially useful in avoiding breaking changes.
  4. Vulnerability Selection: Certain vulnerabilities may have a high failure rate when it comes to LLM-based patches. These can be eliminated from your automated patch generation process altogether to reduce noise and review load.
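As a rough illustration of how the first two strategies compose, here is a minimal sketch of a retry loop that embeds project conventions in a custom prompt and uses Semgrep as the judge of success. The prompt wording, the `ask_llm` placeholder, and the retry budget are assumptions for illustration, not a prescribed implementation.

```python
import json
import subprocess

# Hypothetical prompt embedding project-specific conventions.
PROMPT_TEMPLATE = """You are fixing a security issue in a Python file.
Finding: {finding}
Conventions: use the `logging` module for logging, `secrets` for
randomness, and `hashlib.sha256` or stronger for hashing. Include
any new imports. Return only the complete, corrected file.

{source}
"""


def semgrep_findings(path: str) -> list:
    """Return the list of Semgrep findings for a single file."""
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", path],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout).get("results", [])


def fix_with_retries(path: str, finding: str, ask_llm, max_attempts: int = 3) -> bool:
    """Re-prompt until the analyzer no longer reports the finding.

    `ask_llm` stands in for whichever chat-completion client you use;
    it takes a prompt string and returns patched file contents.
    """
    source = open(path).read()
    for _ in range(max_attempts):
        prompt = PROMPT_TEMPLATE.format(finding=finding, source=source)
        patched = ask_llm(prompt)
        with open(path, "w") as f:
            f.write(patched)
        if not semgrep_findings(path):
            return True
        source = patched  # start the next attempt from the latest patch
    return False
```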

Patchwork, our open-source automation framework, enables your team to incorporate these strategies to enhance the usability and effectiveness of LLMs in patching vulnerabilities. With Patchwork, users can create customized patchflows that use these techniques to achieve better results. Our 'autofix' patchflow extracts the right code context as inputs and orchestrates a series of prompts to generate fixes from an LLM, validate those fixes, and apply them to the codebase, all while ensuring patch compatibility and maintaining code integrity. You can refine it to incorporate concerns specific to your codebase, and pick and choose which vulnerability types (CWEs) you want the LLM to fix. You can also create specialized patchflows that call other tools and services to effect patches requiring changes to more than just the source code.
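Conceptually, a patchflow chains these steps into a single pipeline. The sketch below shows the general shape of such a flow in plain Python; it is an illustration of the idea only, and every name in it (`Finding`, `EXCLUDED_CWES`, `ask_llm`) is hypothetical rather than Patchwork's actual API.

```python
from dataclasses import dataclass

# Hypothetical vulnerability-selection list: CWEs with a high LLM
# failure rate that we exclude from automated patching.
EXCLUDED_CWES = {"CWE-798"}  # e.g. hardcoded credentials


@dataclass
class Finding:
    cwe: str
    path: str
    snippet: str


def autofix_flow(findings: list, ask_llm) -> list:
    """Illustrative shape of an autofix-style flow, not Patchwork's code."""
    patches = []
    for finding in findings:
        if finding.cwe in EXCLUDED_CWES:  # skip low-yield vulnerability types
            continue
        prompt = (
            f"Fix this {finding.cwe} issue and include any new imports:\n"
            f"{finding.snippet}"
        )
        patch = ask_llm(prompt)  # generate a candidate fix
        patches.append((finding.path, patch))  # validate/apply steps follow
    return patches
```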

To give you an idea of the performance you can expect by incorporating some (not all) of these improvements, here are the accuracy numbers with Patchwork for a few select LLMs.

Performance improvements by using retries and custom prompts

Conclusion

LLMs hold significant potential for automating vulnerability fixes in codebases, but their performance varies widely based on the model and the context in which they are used. By understanding their limitations and employing strategies to mitigate these issues, we can harness their capabilities more effectively. Patchwork by Patched is designed to provide a flexible, customizable, and efficient framework to leverage LLMs for automated vulnerability fixing, making the process transparent, tailored, and reliable. If you're a forward-looking software development team, Patchwork is an ideal starting point: it gives you complete control and flexibility while also providing a strong foundation to build upon. You can get started with the Patched app, or our open-source framework, today.
