SWE-bench Multimodal provides 617 task instances for evaluating AI systems on real-world software engineering problems. The dataset, created by SWE-bench, was last updated on April 29, 2025. It is designed to test the ability of language models to resolve actual GitHub issues.
Use Cases
- Benchmarking AI code generation systems on real-world GitHub issue resolution tasks.
- Evaluating multimodal reasoning over code combined with issue descriptions.
- Training models for automated software maintenance using the dataset's task instances.
Strengths
- Contains 617 distinct task instances for evaluation.
- Focuses on real-world GitHub issues, providing practical relevance.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Beyond the total of 617 task instances, no size details are documented, which may limit suitability assessment.
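Because field semantics have to be inferred after download, a first pass over the rows can at least show which fields exist, what value types they hold, and how often they are empty. The following is a minimal sketch, not an official tool: the helper name summarize_fields and the field names in sample_rows (instance_id, problem_statement, repo) are hypothetical placeholders, and the real names must be read from the downloaded data.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(rows: Iterable[Mapping[str, Any]]) -> dict[str, dict[str, Any]]:
    """Report, per field, the observed value types, missing count, and one example."""
    rows = list(rows)
    types: dict[str, set[str]] = defaultdict(set)
    present: dict[str, int] = defaultdict(int)
    examples: dict[str, Any] = {}

    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # an explicit None counts as missing
            types[name].add(type(value).__name__)
            present[name] += 1
            examples.setdefault(name, value)  # keep the first non-missing value

    all_names = {name for row in rows for name in row}
    return {
        name: {
            "types": sorted(types[name]),
            "missing": len(rows) - present[name],
            "example": examples.get(name),
        }
        for name in sorted(all_names)
    }


# Hypothetical rows for illustration only; these are NOT the documented schema.
sample_rows = [
    {"instance_id": "example-1", "problem_statement": "Button is misaligned", "repo": "org/project"},
    {"instance_id": "example-2", "problem_statement": None, "repo": "org/project"},
]

summary = summarize_fields(sample_rows)
for name, info in summary.items():
    print(name, info)
```

The same function accepts any iterable of dict-like rows, so it can be pointed at the downloaded task instances once they are loaded. A field that is missing in many rows, or that holds more than one type, is a good candidate for manual inspection before relying on it.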
Provenance
- Source: SWE-bench
- Collection Method: Likely gathered from real GitHub issues.
- Time Range: Not specified
- Freshness: Last updated 2025-04-29 20:50:23; freshness should be verified.
- Geography: Not specified