A quantitative benchmark of leading LLMs to see how well they patch vulnerabilities. This post also explores where these models struggle and some possible workarounds.
The application security landscape is abuzz with AI-enabled products and feature sets. GitHub showcased security fixing with Copilot, Checkmarx added fix features, Semgrep included AI-driven capabilities, and startups are building entire companies around AI patching. The common message is this: LLMs can fix vulnerabilities in your code with little to no oversight. But how true is that? And given that a partial patch is as good as no patch, is it a truly viable option? Let's find out.
Over the last 6 months, we have carefully curated an evaluation benchmark called the StaticAnalysisEval to answer this exact question. Our intention is not to start the next episode of ‘AppSec Wars’, so we will limit our investigation to the relative efficacy of commercial and open-source LLMs given the same patching prompt. We start with core LLM performance, then explore its limitations and the workarounds that improve results.
The StaticAnalysisEval is a dataset of 76 Python programs taken from real open-source Python projects (the top 1,000 on GitHub), where each program is a file containing exactly one vulnerability as detected by a particular static analyzer (Semgrep). The benchmark allows one patch generation per vulnerability (i.e., pass@1) and evaluates models on whether the generated patch clears the static-analysis finding, the time taken, and the associated cost.
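To make the protocol concrete, here is a minimal sketch of a pass@1 harness in the spirit of the benchmark. This is not the actual evaluation code: the helper `generate_patch` is a hypothetical one-shot LLM call, and the sketch assumes the Semgrep CLI is installed and that each sample records the rule ID that originally flagged it.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def count_findings(path: str, rule_id: str) -> int:
    """Run Semgrep on a single file and count findings for one rule."""
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", path],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(result.stdout)
    return sum(1 for r in report["results"] if r["check_id"] == rule_id)


def pass_at_1(samples: list[dict], generate_patch) -> float:
    """samples: [{'code': str, 'rule_id': str}, ...]. generate_patch is a
    hypothetical function that calls an LLM once and returns the patched file."""
    fixed = 0
    for sample in samples:
        patched = generate_patch(sample["code"], sample["rule_id"])  # exactly one attempt
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(patched)
            tmp_path = f.name
        # The patch counts as a fix only if the original finding is gone.
        if count_findings(tmp_path, sample["rule_id"]) == 0:
            fixed += 1
        Path(tmp_path).unlink()
    return fixed / len(samples)
```

A real harness would also verify that the patched file still parses (and ideally still passes tests), since a model that simply deletes the vulnerable code would otherwise score a "fix". Here's a summary of the results: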
Beyond the predictable ordering, there are two key takeaways from the benchmark:
Is a 69% accuracy rate good enough for you? The answer will vary based on how overwhelmed you are with patching vulnerabilities, and how security-sensitive your application is. But we can all agree that we would like the number to be higher. So where exactly do LLMs fall short?
While LLMs can patch some (or most) vulnerabilities, they are not without their constraints and limitations. Here are some common points of failure across the various models:
To mitigate these limitations, several strategies can be employed:
Patchwork, our open-source automation framework, enables your team to incorporate these strategies to enhance the usability and effectiveness of LLMs in patching vulnerabilities. With Patchwork, users can create customized patchflows that use these techniques to achieve better results. Our 'autofix' patchflow extracts the right code context as inputs and orchestrates a series of prompts to generate fixes from an LLM, validate those fixes, and apply them to the codebase, all while ensuring patch compatibility and maintaining code integrity. You can refine it to incorporate concerns specific to your codebase, as well as pick and choose which vulnerability types (CWEs) you want the LLM to fix. You can also create specialized patchflows that call other tools and services to effect patches requiring changes to more than just the source code.
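To make that concrete, here is a minimal sketch of such a generate-validate-apply loop. This is not the actual Patchwork API: the prompt, the fixed 10-line context window, the retry policy, and the helpers `llm_complete`, `still_flagged`, and `autofix` are all assumptions made for illustration.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def still_flagged(code: str, rule_id: str) -> bool:
    """Re-run Semgrep on the candidate patch to check whether the finding remains."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        tmp_path = f.name
    result = subprocess.run(
        ["semgrep", "--config", "auto", "--json", tmp_path],
        capture_output=True, text=True, check=False,
    )
    Path(tmp_path).unlink()
    return any(r["check_id"] == rule_id
               for r in json.loads(result.stdout)["results"])


def autofix(file_path: str, finding: dict, llm_complete, max_attempts: int = 2) -> bool:
    """finding: {'rule_id': str, 'start_line': int, 'end_line': int} (1-based lines).
    llm_complete stands in for any LLM completion client."""
    lines = Path(file_path).read_text().splitlines()
    lo = max(0, finding["start_line"] - 1 - 10)     # 10 lines of leading context
    hi = min(len(lines), finding["end_line"] + 10)  # 10 lines of trailing context
    snippet = "\n".join(lines[lo:hi])               # prompt with a snippet, not the whole file

    for _ in range(max_attempts):
        prompt = (
            f"Fix the {finding['rule_id']} vulnerability in this Python snippet. "
            f"Return only the corrected snippet.\n\n{snippet}"
        )
        candidate = "\n".join(lines[:lo] + llm_complete(prompt).splitlines() + lines[hi:])

        # Validate before applying: the file must still parse and the finding must be gone.
        try:
            compile(candidate, file_path, "exec")
        except SyntaxError:
            continue
        if still_flagged(candidate, finding["rule_id"]):
            continue
        Path(file_path).write_text(candidate)
        return True
    return False
```

In practice you would run the packaged autofix patchflow through the patchwork CLI rather than hand-rolling a loop like this, and then extend it with your own validation steps.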
To give you an idea of the performance you can expect by incorporating some (though not all) of these improvements, here are the accuracy numbers with Patchwork for a few select LLMs.
LLMs hold significant potential for automating vulnerability fixes in codebases, but their performance varies widely based on the model and the context in which they are used. By understanding their limitations and employing strategies to mitigate these issues, we can harness their capabilities more effectively. Patchwork by Patched is designed to provide a flexible, customizable, and efficient framework to leverage LLMs for automated vulnerability fixing, making the process transparent, tailored, and reliable. If you’re a forward-looking software development team, Patchwork is an ideal starting point: it gives you complete control and flexibility while providing a strong foundation to build upon. You can get started with the Patched app, or our open-source framework, today.