SWE-bench Verified is the most commonly reported coding benchmark, so it's worth understanding what it actually measures. These 5 questions cover the basics: what tasks agents solve, what information they get, and how solutions are evaluated. The explanations after each question fill in the details.

I publish Agentic Coding Weekly every Monday. If you want new quizzes like these in your inbox, subscribe below.


Questions

Q1: What does SWE-bench actually test?

A) An LLM's ability to write code from scratch given a natural language specification
B) An LLM agent's ability to generate a patch that resolves a real GitHub issue in an existing codebase
C) An LLM's ability to review pull requests and suggest improvements
D) An LLM's ability to write and run unit tests for open-source projects


Q2: SWE-bench sources its tasks from how many open-source repositories, and in which language?

A) 50 repositories across Python, JavaScript, and TypeScript
B) 12 Python repositories
C) 25 Python and Java repositories
D) 100+ repositories across multiple languages


Q3: In SWE-bench, an agent gets some information and has to figure out the fix. But what exactly does the agent get to work with?

A) The issue description, the codebase, and the failing unit tests
B) The issue description, the codebase, and the original PR discussion
C) The issue description and the codebase only
D) The issue description, the codebase, and a diff of the expected solution


Q4: OpenAI collaborated with the SWE-bench authors to create a subset, SWE-bench Verified. What was the primary goal?

A) To make the benchmark harder by adding more complex tasks
B) To create a larger dataset with more diverse programming languages
C) To filter out problematic samples that were causing the original benchmark to underestimate model capabilities
D) To add multi-file editing tasks that better represent real-world development


Q5: How is a proposed solution by an agent for an issue evaluated in SWE-bench Verified?

A) By comparing the generated patch to the original PR diff using exact match
B) By running FAIL_TO_PASS tests (which should now pass) and PASS_TO_PASS tests (which should still pass)
C) By having human reviewers grade the solution for correctness
D) By running the full repository test suite and checking for zero failures


Answers and Explanations

Q1: Correct Answer is B.
Explanation: SWE-bench is built from real resolved GitHub issues. Each task challenges an agent to understand a problem, navigate an existing codebase, and produce a patch that fixes the issue. So when you see a model's SWE-bench score, it tells you about its ability to fix bugs and resolve issues in existing codebases, not about greenfield code generation, code review, or test writing.

Q2: Correct Answer is B.
Explanation: All tasks come from just 12 open-source Python repositories, such as scikit-learn, Django, and other well-known libraries. When a model scores well on SWE-bench, that performance is validated only on Python codebases, and specifically on mature, well-maintained open-source projects. It doesn't directly tell you how well the model handles TypeScript, Go, Rust, or other languages.

Q3: Correct Answer is C.
Explanation: The agent only gets the issue text and access to the codebase. It does not see the unit tests that will evaluate its solution, the PR discussion, or the expected fix.
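To make the split between visible and hidden information concrete, here is a minimal sketch of a SWE-bench task instance. The field names follow the published dataset schema; the values are invented for illustration.

```python
# Illustrative SWE-bench task instance. Field names match the published
# dataset; all values here are made up for illustration.
instance = {
    "repo": "django/django",          # source repository
    "base_commit": "abc123",          # commit the agent's patch is applied on top of
    "problem_statement": "Model.save() raises ... when ...",  # issue text: SHOWN to the agent
    "FAIL_TO_PASS": ["tests/test_models.py::test_bug"],       # HIDDEN from the agent
    "PASS_TO_PASS": ["tests/test_models.py::test_existing"],  # HIDDEN from the agent
    "patch": "diff --git ...",        # the gold fix: HIDDEN from the agent
}

# What the agent actually works with: just the issue text plus the codebase.
agent_input = {
    "problem_statement": instance["problem_statement"],
    "codebase": f"checkout of {instance['repo']} at {instance['base_commit']}",
}
```

Everything used for grading (the test lists and the gold patch) stays on the evaluation side; the agent never sees it.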

Q4: Correct Answer is C.
Explanation: The original SWE-bench had problems such as vague problem descriptions and overly specific unit tests that would reject perfectly valid solutions. 93 professional software developers manually reviewed tasks for quality, checking whether issue descriptions were clear enough to act on and whether the unit tests would fairly accept valid solutions. 68.3% of the original samples were filtered out due to underspecification, unfair tests, or other problems. The remaining 500 verified samples form SWE-bench Verified.

Q5: Correct Answer is B.
Explanation: There are two types of tests. FAIL_TO_PASS tests verify the issue is actually fixed. PASS_TO_PASS tests verify nothing else broke: they already pass before the patch and must continue passing. Both sets of tests must pass for a solution to count as resolved. The evaluation is automated and deterministic; there's no subjective human grading at eval time. This makes SWE-bench Verified reproducible and comparable across models.
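The pass criterion above can be sketched in a few lines. This is a simplified model, not the actual harness: it assumes the test runner has already been invoked and just checks the resulting booleans.

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Minimal sketch of the SWE-bench pass criterion.

    Each dict maps a test name to whether it passed AFTER applying the
    agent's patch. A solution counts as resolved only if every
    FAIL_TO_PASS test now passes AND every PASS_TO_PASS test still passes.
    """
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

# Issue fixed and nothing broke -> resolved
print(is_resolved({"test_bug": True}, {"test_existing": True}))   # True
# Patch fixes the issue but breaks an existing test -> not resolved
print(is_resolved({"test_bug": True}, {"test_existing": False}))  # False
```

The all-or-nothing conjunction is what makes a regression just as disqualifying as a failed fix.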
