DARE-Bench: 1,000+ Data Science Tasks for LLM Agent Evaluation

Name: DARE-Bench: 1,000+ Data Science Tasks for LLM Agent Evaluation
Creator: Snowflake
Published: 2026-02-25T00:55:21
Keywords: Size Categories: 1K<n<10K, Task Categories: Text Generation, Library: polars, Task Categories: Question Answering, Language: en, Modality: text, Data Science, Tool Use, Library: mlcroissant, arXiv: 2602.24288, Library: datasets, Library: pandas, SFT, Region: us, Reinforcement Learning, JSON, Agent, License: Apache 2.0

Available on 1 platform

Description

DARE-Bench contains between 1,000 and 10,000 records designed to evaluate Large Language Model (LLM) agents on data science modeling and instruction fidelity. Developed by Snowflake AI Research and the University of Houston for ICLR 2026, the dataset focuses on tool-use and text generation within data science workflows. It provides a selected subset of a larger benchmark for testing how models handle complex data manipulation instructions.

Use Cases

Evaluating LLM agent fidelity using Pandas and Polars libraries
Benchmarking text generation for data science question answering
Supervised Fine-Tuning (SFT) for data science tool-use agents

Strengths

1,000 to 10,000 records
ICLR 2026 peer-reviewed research origin
Supports Pandas and Polars library integration for tool-use evaluation

Limitations

Subset of full benchmark
Specific column schema and field definitions are not documented in the metadata
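
Because the field definitions are not documented, a quick first step is to infer the schema from the records themselves. The following is a minimal sketch, assuming the records are available as JSON Lines with one flat object per line. The sample records and their field names (`task_id`, `instruction`, `metric`) are hypothetical placeholders, not the actual DARE-Bench schema.

```python
import json
from collections import defaultdict


def infer_schema(lines):
    """Map each field name to the sorted list of Python value types seen for it."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical sample records; the real field names may differ.
sample = [
    '{"task_id": "t1", "instruction": "Fit a model", "metric": 0.5}',
    '{"task_id": "t2", "instruction": "Clean the data", "metric": null}',
]

schema = infer_schema(sample)
for field, types in schema.items():
    print(field, types)
```

In practice, the same function can be passed an open file handle for a downloaded JSON Lines file. A field that reports more than one type (for example a number and a null) is worth checking before loading the data into Pandas or Polars.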

Provenance

Source: Snowflake AI Research and University of Houston
Freshness: Last updated March 2026.

Released under the Apache 2.0 license. Users should refer to arXiv paper 2602.24288 for detailed methodology and evaluation metrics.

Quality Score

D37

Community

57 downloads

3 likes

0 views

Dataset Info

Author: Snowflake
Created: Feb 25, 2026
Updated: Mar 2, 2026
Last synced: Apr 30, 2026
