Evaluating code-to-readme generation using LLMs

September 21, 2024

We introduce Generate README Eval, a new benchmark to evaluate how well LLMs can generate README files from entire code repositories. Our results show that Gemini-1.5-Flash is the best-performing model on this benchmark.

In this blog post, we explore how well LLMs can generate a README.md file from an entire source code repository. A README file is a popular way for developers to introduce users to their project and is widely used in open source projects on GitHub. We have created a new benchmark and dataset for README file generation called Generate README Eval.

Dataset Creation

We curate our dataset from open source projects on GitHub. We scan the top 1000 Python projects on GitHub that have over 1000 stars and 100 forks, then keep those that have a README.md file and whose repository content fits in 100k tokens. This ensures that we can prompt the LLM in a single call with the content of the repository and ask it to generate the README.md file. Most frontier LLMs (from Google, Anthropic, and OpenAI, as well as Llama-3.1 from Meta) support at least 128k tokens of context, so keeping the repositories under 100k tokens allows us to benchmark against all of them.
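As an illustration, the token-count filter can be approximated with the tiktoken tokenizer; the sketch below is an assumption about how such a check could look, not the exact script:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(repo_content: str, max_tokens: int = 100_000) -> bool:
    # Keep only repositories whose concatenated content fits within 100k tokens.
    # disallowed_special=() avoids errors if a file happens to contain special tokens.
    return len(enc.encode(repo_content, disallowed_special=())) <= max_tokens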

The script that generates the dataset is available here. We curated a total of 198 repositories and readmes, which were then randomly divided into train (138) and test (40) splits. The train split is meant for fine-tuning and further analysis, while the test split is meant to be used for evaluation.

Evaluation Metrics

To run the evaluation, we first prompt the model with the entire content of the repository and have it generate the README.md as the response. The prompt has both system and user parts and looks as follows:

system_prompt = """You are an AI assistant tasked with creating a README.md file for a GitHub repository.
Your response should contain ONLY the content of the README.md file, without any additional explanations or markdown code blocks.
The README should include the following sections:    
1. Project Title    
2. Description    
3. Installation    
4. Usage    
5. Features    
6. Contributing    
7. License    
Ensure that your response is well-structured, informative, and directly usable as a README.md file."""

user_prompt = f"Here is the content of the repository:\n\n{item['repo_content']}\n\nBased on this content, please generate a README.md file."             
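For reference, here is a minimal sketch of how this prompt could be sent to an OpenAI-compatible chat API, reusing the system_prompt defined above (the helper name and default model are placeholders):

from openai import OpenAI

client = OpenAI()  # assumes the API key is set in the environment

def generate_readme(item, model="gpt-4o-mini-2024-07-18"):
    # Single call: the entire repository content goes into the user message.
    user_prompt = (
        f"Here is the content of the repository:\n\n{item['repo_content']}\n\n"
        "Based on this content, please generate a README.md file."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content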

To evaluate the generated readme files, we use a number of different metrics. We compute traditional NLP metrics like BLEU, ROUGE scores, and cosine similarity between the readme generated by the LLM and the readme in the repository. In addition, we use several other metrics: readability, structural similarity, code consistency, and information retrieval. We describe them below:

Readability 

We use the Flesch reading-ease score (FRES) as a metric for readability. It measures how difficult English text is to understand, with higher scores indicating easier reading.
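A minimal sketch of this metric using the textstat package (whether the score is further clamped or rescaled in the evaluation script is not shown here):

import textstat

def readability_score(readme_text: str) -> float:
    # Flesch reading-ease: higher is easier to read
    # (60-70 is roughly plain English, 30-50 is college-level text).
    return textstat.flesch_reading_ease(readme_text)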

Structural Similarity

We extract the sections and their titles from the readme files and then calculate the section differences and title similarities to compute this score.
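A rough sketch of such a comparison, using markdown headings as section titles and difflib for fuzzy title matching (this approximates the idea; the exact scoring in the evaluation script may differ):

import re
from difflib import SequenceMatcher

def extract_section_titles(readme: str) -> list[str]:
    # Treat markdown headings (e.g. "# Installation", "## Usage") as section titles.
    return [m.group(2).strip() for m in re.finditer(r"^(#{1,6})\s+(.+)$", readme, re.MULTILINE)]

def structural_similarity(generated: str, reference: str) -> float:
    gen, ref = extract_section_titles(generated), extract_section_titles(reference)
    if not ref:
        return 0.0
    # Penalize differences in the number of sections.
    count_score = 1 - abs(len(gen) - len(ref)) / max(len(gen), len(ref), 1)
    # For each reference title, take the best fuzzy match among generated titles.
    title_score = sum(
        max((SequenceMatcher(None, r.lower(), g.lower()).ratio() for g in gen), default=0.0)
        for r in ref
    ) / len(ref)
    return 0.5 * count_score + 0.5 * title_score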

Code Consistency

To compute code-readme consistency, we extract the class and method definitions from the repository and then count how many of them are actually mentioned in the readme.
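A simplified sketch of this metric using Python's ast module (the evaluation script may filter or weight definitions differently):

import ast

def code_consistency(python_sources: list[str], readme: str) -> float:
    # Collect the names of classes and functions defined in the repository.
    defined = set()
    for source in python_sources:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                defined.add(node.name)
    if not defined:
        return 0.0
    # Fraction of defined names that are actually mentioned in the readme.
    mentioned = sum(1 for name in defined if name in readme)
    return mentioned / len(defined)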

Information Retrieval

For this metric, we count how many of the key sections mentioned in the prompt are actually present in the generated readme file.
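A minimal sketch, assuming the "key sections" are the ones listed in the system prompt above:

import re

# Sections requested in the system prompt; the project title is taken to be
# the first heading, so it is not matched by name.
KEY_SECTIONS = ["Description", "Installation", "Usage",
                "Features", "Contributing", "License"]

def information_retrieval(readme: str) -> float:
    headings = [m.group(1).lower() for m in re.finditer(r"^#{1,6}\s+(.+)$", readme, re.MULTILINE)]
    has_title = 1 if headings else 0
    found = sum(1 for section in KEY_SECTIONS if any(section.lower() in h for h in headings))
    # Fraction of the 7 prompted sections present in the generated readme.
    return (has_title + found) / 7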

Based on these metrics we compute a final score that is a weighted average of the individual metrics. We use the following weights in the implementation:

weights = {
    'bleu': 0.1,
    'rouge-1': 0.033,
    'rouge-2': 0.033,
    'rouge-l': 0.034,
    'cosine_similarity': 0.1,
    'structural_similarity': 0.1,
    'information_retrieval': 0.2,
    'code_consistency': 0.2,
    'readability': 0.2
}
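Assuming each metric is normalized to the same 0-100 range, the final score is then just the weighted sum; a minimal sketch:

def final_score(metrics: dict[str, float]) -> float:
    # `metrics` maps the metric names used in `weights` to their 0-100 scores.
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)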

To see the full implementation and more details of the metrics you can refer to the evaluation script that is available here.

Experiments

We conducted three sets of experiments with this benchmark.

Comparing performance of different models

We compared the following models on the benchmark. Note that because of the way the evaluation metrics are defined, the final score of the ground-truth readmes found in the code repositories won't be 100. We compute an oracle score by evaluating the actual readmes of the test split; the oracle score for the benchmark is 56.79. We run all the models with the same system and user prompt. Due to the large context required (up to 100k tokens), we only ran the benchmark with models that support at least 128k input tokens of context. We evaluated gemini-1.5-flash-8b-exp-0827, gemini-1.5-flash-exp-0827, gpt-4o-mini-2024-07-18, gpt-4o-2024-08-06, and o1-mini-2024-09-12 using the API. We also evaluated llama-3.1-8b-instruct and mistral-nemo-instruct-2407 by running them locally with ollama.
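For the locally hosted models, the same prompts can be sent through the ollama Python client; a rough sketch (the model tag is an assumption):

import ollama

def generate_readme_local(item, model="llama3.1:8b-instruct-q8_0"):
    # Reuses the same system and user prompts, served by a local ollama instance.
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Here is the content of the repository:\n\n"
                                        f"{item['repo_content']}\n\n"
                                        "Based on this content, please generate a README.md file."},
        ],
    )
    return response["message"]["content"]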

Based on our evaluation, we found gemini-1.5-flash-exp-0827 to be the best model, with a final score of 33.43. We discuss the results in more detail in the last section.

With few-shot 

To evaluate the performance when the model is given a few examples, we also ran the experiments in a few-shot setting. Note that due to the large context length of the dataset, the only model we could realistically use for few-shot prompting was Gemini from Google. We evaluated the best model, gemini-1.5-flash-exp-0827, with 1, 3, 5, and 7 shots. At 7 shots we are already close to the 1 million token maximum context length supported by the model, so that was the largest setup we could evaluate. We find that few-shot performance doesn't improve much with more examples, unlike on other benchmarks. We hypothesize that the likely reason is that each individual example is itself quite large, and models are not yet able to effectively utilize such long contexts. This is in line with results we obtained while benchmarking Gemini on the long-context hash-hop task (recently proposed by magic.dev).
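A sketch of how the few-shot prompt could be assembled, prepending k repository-to-readme pairs from the train split as prior chat turns (the field names and formatting here are assumptions):

def build_few_shot_messages(train_items, test_item, k=1):
    def user_message(item):
        return {"role": "user",
                "content": f"Here is the content of the repository:\n\n"
                           f"{item['repo_content']}\n\n"
                           "Based on this content, please generate a README.md file."}
    messages = [{"role": "system", "content": system_prompt}]
    for example in train_items[:k]:
        messages.append(user_message(example))
        # The ground-truth readme acts as the example answer.
        messages.append({"role": "assistant", "content": example["readme"]})
    messages.append(user_message(test_item))
    return messages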

With fine-tuning

Looking at the large gap between the oracle score (56.79) and the final score (33.43) of the best model we believe it should be possible to fine-tune and improve the performance on this benchmark.

Unfortunately, neither OpenAI nor Google allow fine-tuning of their models at such large context lengths. The maximum supported input token length for OpenAI fine-tuning is 64k, although they have plans to increase it to 128k as mentioned here. Similarly, the maximum supported input length for fine-tuning Gemini models is currently 40k characters, as they mention here.

We also explored fine-tuning open-weight models like Llama 3.1 with QLoRA, but even with the brilliant unsloth library you can only get up to a 48k context length on a single 80GB H100/A100 GPU. So, at this point we leave fine-tuning with such a large dataset as a challenge for the future.
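For reference, loading Llama 3.1 for QLoRA fine-tuning with unsloth at that sequence length would look roughly like this (the model name, sequence length, and LoRA hyperparameters are placeholders, and the achievable context depends on the exact setup):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=48_000,   # roughly the limit on a single 80GB GPU
    load_in_4bit=True,       # QLoRA: 4-bit base weights with LoRA adapters on top
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)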

Results

The radar chart below shows the overall performance of the models across all the metrics. The red shaded area corresponds to the oracle score, i.e. the ground-truth readmes from the test split. It shows that traditional NLP metrics (like BLEU and ROUGE) alone won't capture the nuance of evaluating README.md generation, as the maximum score along those axes can only be attained by exactly matching the existing files in the repositories.

You can explore the radar chart as an interactive visualization here.

Another thing to note is that the existing readmes in the repositories do not score highly on the readability axis. This suggests that most README files in open-source repositories are too complex and not easy to understand. The highest score we get on this metric is 46.47, which on the FRES scale corresponds to college-level text that is difficult to read.

The results of the few-shot experiment are given below.

We see a slight increase in performance with 1-shot prompting, reaching the maximum final score of 35.40, but after that the performance drops across all the metrics. Thus, the best model on this benchmark is 1-shot prompted gemini-1.5-flash-exp-0827 with a score of 35.40.
