Available on 1 platform
A benchmark set of 75 items for evaluating language models on complex, multi-constraint instructions, created by SurgeAI. Each item is a realistic prompt paired with 10–40 evaluation criteria, totaling 1,559 criteria for rubric-based grading. The dataset was last updated on June 3, 2026.
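As a rough illustration of how rubric-based grading over such items could be aggregated, here is a minimal Python sketch. The dataset's actual schema and official scoring procedure are not stated here, so the names `RubricItem`, `score_item`, and `score_benchmark` are hypothetical, and the pass/fail judgments are assumed to come from some external grader.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """Hypothetical container: one prompt plus its evaluation criteria."""
    prompt: str
    criteria: list[str]


def score_item(judgments: list[bool]) -> float:
    """Return the fraction of an item's criteria judged as satisfied."""
    if not judgments:
        raise ValueError("an item needs at least one criterion judgment")
    return sum(judgments) / len(judgments)


def score_benchmark(all_judgments: list[list[bool]]) -> dict[str, float]:
    """Aggregate per-item judgments two ways.

    macro: mean of per-item pass rates (each item weighted equally).
    micro: pooled pass rate (each criterion weighted equally).
    """
    if not all_judgments:
        raise ValueError("no items to score")
    per_item = [score_item(j) for j in all_judgments]
    total_passed = sum(sum(j) for j in all_judgments)
    total_criteria = sum(len(j) for j in all_judgments)
    return {
        "macro": sum(per_item) / len(per_item),
        "micro": total_passed / total_criteria,
    }
```

Because items carry different numbers of criteria, the macro and micro averages can differ, so it is worth reporting which one is used.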
The license is unknown; verify the terms of use before using the dataset.