DARE-Bench contains between 1,000 and 10,000 records for evaluating large language model (LLM) agents on data science modeling and instruction fidelity. Developed by Snowflake AI Research and the University of Houston for ICLR 2026, the dataset focuses on tool use and text generation within data science workflows. This release is a selected subset of a larger benchmark and tests how models handle complex data manipulation instructions.
The dataset is released under the Apache 2.0 license. For detailed methodology and evaluation metrics, see arXiv paper 2602.24288.