Name: DTLBench: Benchmark for Deployment-Time Learning of LLM Agents
Creator: guosy
Published: 2025-12-07T13:37:12
Keywords: LLM Benchmark, Agent Evaluation, Benchmark, Healthcare, Text, Tabular, Multi-Domain, Finance, Deployment-Time Learning

Description

DTLBench is a benchmark dataset for evaluating large language model agents in deployment-time learning scenarios. It was introduced by author guosy in the paper 'CASCADE: Case-Based Continual Adaptation for Large Language Models…' and is hosted on Hugging Face. The dataset collects diverse task streams spanning multiple domains.
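The listing does not describe how a task stream is scored, so the following is only a generic sketch of sequential evaluation, not DTLBench's or the paper's protocol. It assumes each task is an (input, target) pair, that the agent is scored before it sees feedback, and that exact match is the metric. The names `evaluate_stream` and `MemoAgent` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


class MemoAgent:
    """Toy stand-in for an LLM agent: it memorizes targets it has already seen."""

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def predict(self, task_input: str) -> str:
        # Answer from memory if this input was seen before, else give up.
        return self.memory.get(task_input, "unknown")

    def update(self, task_input: str, target: str) -> None:
        # Assumption: the true target is revealed after each prediction.
        self.memory[task_input] = target


def evaluate_stream(agent: MemoAgent, tasks: Iterable[Tuple[str, str]]) -> List[float]:
    """Process tasks in order and return the running exact-match accuracy."""
    running: List[float] = []
    correct = 0
    for seen, (task_input, target) in enumerate(tasks, start=1):
        # Score first, then reveal feedback, so later tasks can benefit from it.
        if agent.predict(task_input) == target:
            correct += 1
        agent.update(task_input, target)
        running.append(correct / seen)
    return running
```

A rising accuracy curve over the stream would indicate that the agent is improving from experience gathered during deployment, which is the behavior this kind of benchmark is meant to measure.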

Use Cases

Benchmarking LLM agent performance on medical diagnosis task streams.
Evaluating the legal analysis capabilities of LLM agents.
Testing the operational reasoning and financial prediction performance of LLM agents.
Assessing LLM agent performance on text-to-SQL and embodied decision-making tasks.
Evaluating tabular reasoning on electronic health records (EHRs) and performance on deep search tasks.

Strengths

Covers a diverse range of domains, including medical, legal, financial, and embodied reasoning.
Introduced in a referenced academic paper, suggesting a research foundation.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it hard to judge suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
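Because neither the columns nor the row count are documented, a first inspection pass after download might look like the sketch below. It assumes the data has been exported as JSON Lines with one record per line. The file layout, the `load_jsonl` and `summarize_records` helpers, and any field names are hypothetical, since the listing does not document the actual format.

```python
import json
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line (assumed export format)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_records(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Report the row count and, per field, observed value types and missing counts."""
    rows = list(records)
    # Collect every field name that appears in at least one record.
    field_names = sorted({name for row in rows for name in row})
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    missing: Counter = Counter()
    for row in rows:
        for name in field_names:
            value = row.get(name)
            if value is None:
                # Absent keys and explicit nulls are both counted as missing.
                missing[name] += 1
            else:
                type_counts[name][type(value).__name__] += 1
    return {
        "rows": len(rows),
        "fields": {
            name: {"types": dict(type_counts[name]), "missing": missing[name]}
            for name in field_names
        },
    }
```

Calling `summarize_records(load_jsonl(path))` on a downloaded file gives a quick view of which fields exist, how consistently they are typed, and how many records there are. Field semantics still have to be inferred by reading sample rows.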

Provenance

Source: Hugging Face dataset repository by author guosy.
Collection Method: Collected for a benchmark; specific gathering method is not detailed.
Freshness: Last updated 2026-05-12 03:33:06; freshness should be verified.

License is unknown, which may restrict usage.
