Description
DeepSWE is a benchmark of 113 original software engineering tasks in TypeScript, Go, Python, JavaScript, and Rust, drawn from active open-source repositories. It was created by datacurve to measure frontier coding agents and uses isolated environments and program-based verifiers. It was last updated on June 1, 2026.
Use Cases
Benchmarking AI code generation agents on long-horizon tasks from open-source repositories.
Comparing agent performance across the five included programming languages.
Testing program-based verification systems using the isolated task environments described.
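The cross-language comparison above could be computed along the following lines. This is a minimal sketch, not part of the benchmark's tooling: the function name `pass_rate_by_language` and the sample outcomes are hypothetical, and it assumes each task yields a single pass/fail result from its verifier.

```python
from collections import defaultdict


def pass_rate_by_language(results):
    """Return {language: fraction of tasks passed}.

    `results` is an iterable of (language, passed) pairs, where `passed`
    is the boolean outcome reported by a verifier (assumed format).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        if passed:
            passes[language] += 1
    return {language: passes[language] / totals[language] for language in totals}


# Hypothetical verifier outcomes, invented for illustration only.
sample_results = [
    ("Python", True),
    ("Python", False),
    ("Go", True),
    ("Rust", False),
]
rates = pass_rate_by_language(sample_results)
```

Reporting one rate per language, rather than a single aggregate, keeps a weak result in one language from being hidden by strong results in the others.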
Strengths
The task count is fixed at 113, so results can be reported against a known total.
Tasks are drawn from active open-source repositories, suggesting real-world relevance.
Covers five distinct programming languages: TypeScript, Go, Python, JavaScript, and Rust.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count and file size of the published dataset are not stated, which makes suitability harder to assess before download.
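Because column documentation is absent, a first pass after download might simply tabulate what each field contains. The sketch below assumes the records have already been loaded as a list of dictionaries; the function name `summarize_columns` and the field names in the sample rows are hypothetical and are not taken from the dataset.

```python
def summarize_columns(rows):
    """Return, per column, the set of observed value types and the non-null count."""
    summary = {}
    for row in rows:
        for name, value in row.items():
            info = summary.setdefault(name, {"types": set(), "non_null": 0})
            if value is not None:
                info["types"].add(type(value).__name__)
                info["non_null"] += 1
    return summary


# Hypothetical records; the real column names must be read from the dataset.
sample_rows = [
    {"task_id": "t-001", "language": "Go", "notes": None},
    {"task_id": "t-002", "language": "Rust", "notes": "flaky"},
]
summary = summarize_columns(sample_rows)
```

The length of `rows` also answers the row-count question directly once the data is loaded.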
Provenance
Source
datacurve on Hugging Face
Collection Method
Tasks drawn from active open-source repositories.
Freshness
Last updated 2026-06-01 23:15:04 (time zone not stated); check the source for a newer revision before use.
License
Unknown; terms of use must be verified before the dataset is used.