We introduce Patched Round-Trip Correctness (Patched RTC), a novel, unsupervised evaluation technique for Large Language Models (LLMs) used across diverse software development tasks. Patched RTC is a self-evaluating framework that measures the consistency and robustness of model responses without human intervention.
In the past couple of years, LLMs have shown great progress in helping developers with various software development tasks. Typical evaluation of LLMs on coding-related tasks focuses mostly on “first-party” (or inner development loop) problems like code generation, summarization and unit testing. Most of these tasks happen within the developer's IDE, often assisted by a GitHub Copilot-like plugin. Relatively little attention has been paid to “second-party” (or outer development loop) tasks like bug fixing, code review, refactoring, pull requests, code integration, documentation updates and security patching. We argue that a large majority of software development time is spent on these second-party outer-loop activities rather than on actual coding. Accelerating software development requires automating these tasks, and LLMs can be used to do so effectively.
To ascertain the effectiveness of LLMs at automating developer outer-loop tasks, we need a good evaluation mechanism. The most popular benchmarks for evaluating LLMs on coding-related tasks are HumanEval (and MBPP) and their extensions like HumanEvalPack and EvalPlus. The benefit of these benchmarks is that they can be run completely unsupervised, and evaluating the results does not require any human intervention or review. However, they do not adequately capture real-world scenarios. Although there have been attempts at more complex benchmarks like BigCodeBench and task-specific ones like static-analysis-eval, the current gold standard in LLM evaluation is the LMSYS Chatbot Arena, where humans rate model responses via pairwise comparisons that feed an Elo rating system.
As of the date of publishing this post, the coding category on the Arena is led by the frontier models from Anthropic, OpenAI and Google.
Evaluating models on the Arena is expensive and time-consuming, as it requires crowd-sourced input from humans. To combine the best of both worlds, and building on the experience of Chatbot Arena, newer unsupervised evaluation benchmarks have been proposed (e.g. Arena-Hard-Auto) that show a high correlation with the original human-rated results reported in the Arena. These benchmarks use the LLM-as-Judge (or Jury) paradigm and generate scores automatically without human review. Concerns around contamination of benchmark data can be addressed via private or continuously updated datasets like LiveCodeBench and LiveBench.
In our efforts to identify the best model-prompt combinations for our patchflows, we explored a technique for model evaluation (Patched RTC) that is based on the notion of Round-Trip Correctness (RTC). This approach was first introduced by Google DeepMind and applied to code LLMs. We extended and expanded the original technique to work with any LLM on any downstream task. We found that it has multiple advantages, as described below.
The generic implementation of RTC is simple and works as follows:
1) Say we have a model M that is used to generate a response R for a user query Q. We wish to evaluate whether the response R for Q is “correct”.
2) We take Q and R and prompt the model to generate an alternate query Q1, such that Q1 is sufficient to recreate the response R.
3) We take the new query Q1 and ask the model to generate another response R1.
4) We check whether R and R1 are similar by computing a similarity score (0-1).
5) If the score is above a threshold (say 0.8), we say that the response R (for the query Q) is correct (w.r.t. RTC).
The similarity check in step 4) can also be done without an LLM, if we choose to rely on another similarity metric like cosine similarity (or the number of unit tests passed, in the case of code generation).
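To make the procedure concrete, here is a minimal sketch of generic RTC using an OpenAI-compatible client. The prompts, the model name and the LLM-based similarity judge below are illustrative assumptions, not the exact ones used by Patched RTC.

```python
# A minimal sketch of generic RTC (steps 2-5 above), assuming an
# OpenAI-compatible client. The prompts and the LLM similarity judge are
# illustrative, not the exact ones used by Patched RTC.
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o"    # any chat model can be substituted here


def chat(system: str, user: str) -> str:
    """Single-turn helper around the chat completions API."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content


def rtc_pass(q: str, r: str, threshold: float = 0.8) -> bool:
    # Step 2: generate an alternate query Q1 that could recreate R.
    q1 = chat("You write prompts for an AI assistant.",
              f"Write a single query that would make an assistant produce "
              f"exactly this response.\n\nOriginal query:\n{q}\n\nResponse:\n{r}")
    # Step 3: generate a new response R1 from Q1.
    r1 = chat("You are a helpful assistant.", q1)
    # Step 4: score the similarity of R and R1 (an LLM judge here; cosine
    # similarity over embeddings or unit tests passed would also work).
    score = float(chat(
        "You compare two texts and reply with only a number between 0 and 1.",
        f"How similar in meaning are these two responses?\n\nA:\n{r}\n\nB:\n{r1}"))
    # Step 5: accept R as correct (w.r.t. RTC) if the score clears the threshold.
    return score >= threshold
```

In practice the judge's numeric output should be parsed defensively, and the similarity threshold and measure tuned per task, as discussed later in the post.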
We define patchflows as workflows that automate outer-loop development tasks like bug fixes, pull request reviews, documentation updates and library upgrades. Our open-source framework patchwork makes it easy for developers to build and run patchflows. One of the challenges with using LLM-assisted patchflows is that it is hard to evaluate the effectiveness of using them in practice.
Patched RTC can easily be adapted to evaluate patchflows as follows:
A patch (or commit) has two parts: before_code and after_code.
Patchflows either 1) take a patch as input (e.g. for pull request review) or 2) generate a patch as output (e.g. for bug fixes). For the user prompt Q and response R, the two cases can be handled as shown in the sketch below.
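One plausible way to construct (Q, R) for the two cases is sketched here; the field names, prompt layout and helper function are illustrative assumptions, not the exact format used inside patchwork.

```python
# A hedged illustration of mapping a patchflow onto (Q, R) for Patched RTC.
from dataclasses import dataclass


@dataclass
class Patch:
    before_code: str
    after_code: str


def as_query_response(task: str, patch: Patch, llm_output: str | None = None):
    """Return (Q, R) for the two kinds of patchflows."""
    if llm_output is not None:
        # Case 1: the patch is part of the input (e.g. pull request review).
        # Q contains the full diff; R is whatever the patchflow produced.
        q = (f"{task}\n--- before ---\n{patch.before_code}"
             f"\n--- after ---\n{patch.after_code}")
        return q, llm_output
    # Case 2: the patch is the output (e.g. a bug fix).
    # Q contains only the original code; R is the generated after_code.
    q = f"{task}\n--- code ---\n{patch.before_code}"
    return q, patch.after_code
```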
Almost all patchflows and their corresponding tasks can be classified into one of these two categories. We list some of these tasks in the table below:
Without an unsupervised technique like Patched RTC, it would be very hard to evaluate the correctness of LLMs applied to such tasks, as doing so would require human annotations or checks for each of them. We have implemented several such tasks as patchflows in our open-source framework patchwork:
One of the common challenges in adoption of these patchflows by developers is the assurance around accuracy and consistency of the outputs. In the next section, we will see how we can use Patched RTC to address this issue.
We first demonstrate the usefulness of Patched RTC across a generic set of diverse tasks by comparing it with the Arena-Hard-Auto benchmark. The table below shows the performance of different models when evaluated with Patched RTC versus the LLM-as-Judge paradigm that is standard in Arena-Hard-Auto. We run our tests at a high similarity threshold (0.95).
As seen from the table below, we notice that there is a correlation (with a Pearson coefficient of 0.81) with the numbers in Arena-Hard-Auto, showing that Patched RTC can be used as an evaluation mechanism in place of LLM-as-Judge for generic and diverse tasks. However, there are also some differences compared to Arena-Hard-Auto: gpt-4-0125-preview is the best-performing model on Patched RTC, and llama-3-70b-instruct performs better than gpt-4o. These differences arise because Patched RTC measures robustness and consistency by checking the model's ability to invert itself, which is not necessarily the same as alignment with desired responses (as rated by humans).
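For reference, the correlation between the two benchmarks can be computed with a few lines of Python; the scores below are placeholders for illustration, not the actual benchmark numbers.

```python
# Computing the Pearson correlation between Patched RTC pass rates and
# Arena-Hard-Auto scores. The values are placeholders, not real results.
from scipy.stats import pearsonr

rtc_pass_rate = [0.90, 0.85, 0.80, 0.70]      # placeholder Patched RTC scores
arena_hard_auto = [82.0, 78.0, 75.0, 60.0]    # placeholder Arena-Hard-Auto scores

r, p_value = pearsonr(rtc_pass_rate, arena_hard_auto)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```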
Next, we apply Patched RTC to compare the performance of different patchflows. The following table shows the numbers for each of the patchflows supported by our open-source framework patchwork. We selected a sample of the most active GitHub repositories in three different languages (Python, Java and JavaScript) and ran the patchflows on these repositories, including their issues and pull requests, against the main branch. We ran each patchflow only once; however, a patchflow may make several calls to the LLM during a run, depending on how it is implemented.
We ran these experiments at a similarity threshold of 0.8, as we found that higher thresholds tend to reject many responses that are equivalent apart from small changes in the comments or structure of the generated code. We chose to compare the gpt-3.5-turbo and gpt-4o models, as they are the most used models with our framework based on usage data and offer excellent price-versus-performance trade-offs. We expect the results to generalize to other models. The next chart shows the performance when comparing these models with Patched RTC.
Unsurprisingly, gpt-4o performs better than gpt-3.5-turbo across all the tasks, but for some of the more complex patchflows like AutoFix and PRReview the difference between the two models is more pronounced. This suggests that a model with better reasoning capabilities, like gpt-4o, is needed for a patchflow like AutoFix, and that a less capable model will not be sufficient.
We can also see that certain tasks are simply harder than others: ResolveIssue is the most difficult patchflow, as both models have their lowest RTC pass scores on it, while GenerateREADME is one of the easier tasks, as both models score highest on it. Patched RTC is thus useful for comparing model performance across diverse tasks.
Now, to check whether better performance on Patched RTC does indeed correlate with actual accuracy on the task, we need to evaluate the responses further. This is usually the hardest part of designing an eval for a new task. In the absence of expensive human annotation and review, we can define oracles to make the final judgment of accuracy. Oracles are task-specific and need to be carefully designed to ensure they capture the intended definition of accuracy.
For instance, in the AutoFix patchflow we can use a static analyzer (Semgrep) as an oracle: we scan the fixed code with Semgrep to ascertain whether the vulnerability has indeed been fixed. (Similarly, for the ResolveIssue patchflow, unit test results can serve as an oracle.) The table below shows the results when using a static analyzer as an oracle:
In total, there are 103 tests in the AutoFix dataset, corresponding to 106 vulnerabilities (there may be more than one instance of a vulnerability in a test). We define the Fix % as the percentage of vulnerabilities that are fixed by the AutoFix patchflow. It is calculated as follows:
Fix % = 100 × (number of vulnerabilities before running AutoFix - number of vulnerabilities after running AutoFix) / number of vulnerabilities before running AutoFix
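A minimal sketch of this oracle-based evaluation is shown below; the Semgrep ruleset ("auto"), the paths and the way findings are counted are assumptions for illustration, and the actual AutoFix evaluation may configure the scan differently.

```python
# A minimal sketch of using Semgrep finding counts as the oracle and
# computing the Fix %. The ruleset, paths and result counting are assumptions.
import json
import subprocess


def count_findings(path: str) -> int:
    """Run Semgrep on a directory and count the reported findings."""
    out = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", path],
        capture_output=True, text=True, check=False,
    )
    return len(json.loads(out.stdout).get("results", []))


vulns_before = count_findings("repo_before_autofix")  # hypothetical paths
vulns_after = count_findings("repo_after_autofix")
fix_pct = 100 * (vulns_before - vulns_after) / vulns_before
print(f"Fix % = {fix_pct:.1f}")
```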
Based on the results, we see that the actual fix rate (or accuracy) on the AutoFix task is 52.8% versus the RTC pass score of 83.5%. However, the fix rate for responses that pass Patched RTC is higher (55.2%) than for those that fail (42.1%). This suggests that RTC is able to distinguish more accurate responses by measuring robustness (or consistency). To test this hypothesis, we add a very simple one-line consistency prompt (similar to the “think step-by-step” prompt that seems to help models reason better) to all the tests and check whether it improves the fix rate.
Consistency prompt: “Respond with clarity, consistency, and precision, maintaining a structured format throughout.”
The above prompt was prepended to the system prompt of the request for all the tests, and we computed the Fix % of the responses. We saw that this improves the fix rate by 14.4%.
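As a concrete illustration, this is how the consistency prompt can be prepended to a request's system prompt; the base system prompt, model and user message below are placeholders, not the actual AutoFix prompts.

```python
# Prepending the consistency prompt to the system prompt of a request.
# The base system prompt, model and user message are placeholders.
from openai import OpenAI

CONSISTENCY_PROMPT = ("Respond with clarity, consistency, and precision, "
                      "maintaining a structured format throughout.")
BASE_SYSTEM_PROMPT = "You are an expert software engineer fixing a vulnerability."

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # The consistency prompt goes in front of the existing system prompt.
        {"role": "system", "content": f"{CONSISTENCY_PROMPT}\n{BASE_SYSTEM_PROMPT}"},
        {"role": "user", "content": "Fix the SQL injection in the attached code."},
    ],
)
print(response.choices[0].message.content)
```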
This shows that making responses more consistent can improve accuracy, and that the Patched RTC pass rate can be used as an indicator of how well a model will perform on the task. Next, we use the consistency prompt and re-evaluate the Patched RTC pass rate for all the tasks. We see that adding this prompt generally improves the overall pass rate across the patchflows for both models, gpt-3.5-turbo and gpt-4o. The increase is more pronounced and consistent for the more capable gpt-4o than for gpt-3.5-turbo, as can be seen in the radar charts below.
Patched RTC does incur an additional inference cost of roughly 3x, depending on how the similarity measure is computed. It is therefore most useful during the testing and evaluation of patchflows and for guiding the refinement of prompts. When used as an active inference technique, the improvements in accuracy and robustness need to be balanced against the increased inference cost. The similarity threshold and similarity measure are also likely to be task-dependent, so experimenting with a few options before choosing one will likely lead to better results.
That said, we have found Patched RTC quite useful for evaluating open-domain tasks in software development, where it is hard to ascertain the accuracy of models or complex workflows without human annotations or reviews.
RTC isn't strictly measuring "correctness" in the traditional sense, as we don't have a ground truth to compare against. Instead, it's measuring something more nuanced:
The key benefits of Patched RTC are:
Our work on RTC is just a beginning; there are many directions we can explore further:
In this article, we introduced Patched RTC, a self-evaluating framework that works across diverse tasks. Patched RTC measures the consistency and robustness of LLM responses and is correlated with oracle-based accuracy metrics. It presents an alternative to the LLM-as-Judge paradigm, currently one of the most common ways to evaluate models on open-domain tasks. We also showed that prompt changes that elicit more consistent responses from models help improve overall accuracy.
To get access to Patched RTC:
Use your patched_api_key with our OpenAI-compatible endpoint available at patched.codes and simply change the base URL to https://patchwork.patched.codes/evaluate/v1. When using this endpoint, only responses that pass Patched RTC will be returned; otherwise the response will be empty. If you want to compare with how the response would have looked without Patched RTC, you can send the same request through our usual OpenAI-compatible endpoint at https://patchwork.patched.codes/v1.
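A minimal sketch of calling the evaluate endpoint with the standard OpenAI Python client is shown below; the environment variable name and model string are placeholders.

```python
# Calling the Patched RTC endpoint via the standard OpenAI Python client.
# The environment variable name and model string are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PATCHED_API_KEY"],
    base_url="https://patchwork.patched.codes/evaluate/v1",  # Patched RTC endpoint
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What does Patched RTC measure?"}],
)

# Only responses that pass Patched RTC are returned; an empty response means
# the answer did not survive the round-trip check.
print(response.choices[0].message.content)
```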
If you found the article useful, cite as:
@misc{sharma2024patchedrtcevaluatingllms,
title={Patched RTC: evaluating LLMs for diverse software development tasks},
author={Asankhaya Sharma},
year={2024},
eprint={2407.16557},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2407.16557},
}