Name: Paradigm Bench: A Sampled Benchmark Suite for Language Agent Reasoning
Creator: henggg
Published: 2026-04-09T06:07:11
Keywords: Language Agents, Benchmark, LLM Evaluation, Text, Reasoning Paradigms, Benchmark Suite

Description

The PARADIGM Benchmark Suite is a sampled subset of 10 benchmarks used in the paper 'Select-then-Solve: Paradigm Routing as Inference-Time Optimization for Language Agents'. It was created by 'henggg' and last updated on April 9, 2026. It provides a fixed sample for evaluating six reasoning paradigms across four frontier LLMs.

Use Cases

Benchmarking reasoning paradigms (e.g., CoT, ReAct) against each other on a fixed task sample.
Comparing frontier LLMs on the same fixed set of tasks.
Studying paradigm routing as an inference-time optimization method for language agents.
Reproducing or extending the analysis from the associated CoLM 2026 submission.

Strengths

Provides a fixed sampled set for controlled evaluation, enabling direct comparisons.
Covers six distinct reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode).
The evaluation covers roughly 18,000 task-paradigm-model combinations, according to the dataset description.
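
The figures quoted above (10 benchmarks, six paradigms, four LLMs, roughly 18,000 combinations) allow a rough back-of-the-envelope estimate of the sample size. The sketch below assumes that every task is run under every paradigm on every model, and that tasks are spread evenly across benchmarks. Neither assumption is confirmed by the dataset description, so the result is an estimate rather than a reported row count.

```python
# Figures taken from the dataset description.
NUM_BENCHMARKS = 10
NUM_PARADIGMS = 6
NUM_MODELS = 4
TOTAL_COMBINATIONS = 18_000  # "roughly", per the description

# Assumption: full crossing, i.e. each task is evaluated once per (paradigm, model) pair.
runs_per_task = NUM_PARADIGMS * NUM_MODELS

# Implied number of distinct tasks under that assumption.
implied_tasks = TOTAL_COMBINATIONS // runs_per_task

# Assumption: tasks are sampled evenly across the benchmarks.
implied_tasks_per_benchmark = implied_tasks / NUM_BENCHMARKS

print(f"runs per task: {runs_per_task}")
print(f"implied tasks: {implied_tasks}")
print(f"implied tasks per benchmark: {implied_tasks_per_benchmark:.0f}")
```

Under these assumptions the sample would hold on the order of 750 tasks, or about 75 per benchmark. The actual counts should be checked against the downloaded files.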

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to judge whether the sample is large enough for a given use.
The dataset is a sampled subset; the full scope of the original benchmarks is not provided.
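
Because column-level documentation is absent, a first step after download is usually to profile the fields directly. The following is a minimal sketch of such a profile, assuming the rows can be read as a list of dictionaries. The helper name `summarize_fields` and the sample rows are hypothetical and used only for illustration. The real field names of this dataset are not known from the description.

```python
from collections import Counter


def summarize_fields(records):
    """Profile a list of dict rows: value types, missing count, and one example per field."""
    # Collect field names in first-seen order so the report is stable.
    field_names = []
    for row in records:
        for name in row:
            if name not in field_names:
                field_names.append(name)

    summary = {}
    for name in field_names:
        values = [row.get(name) for row in records]
        present = [v for v in values if v is not None]
        summary[name] = {
            "types": Counter(type(v).__name__ for v in present),
            "missing": len(values) - len(present),
            "example": present[0] if present else None,
        }
    return summary


# Hypothetical rows for illustration only; not taken from the dataset.
sample_rows = [
    {"task_id": "t1", "benchmark": "example_a", "question": "2 + 2?", "answer": "4"},
    {"task_id": "t2", "benchmark": "example_b", "question": "Capital of France?", "answer": None},
]

field_summary = summarize_fields(sample_rows)
for name, info in field_summary.items():
    print(name, dict(info["types"]), "missing:", info["missing"], "example:", info["example"])
```

The same function can be applied to rows read from the downloaded files, for example by iterating over a split loaded with the Hugging Face `datasets` library, once the repository identifier and file layout have been confirmed.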

Provenance

Source: Hugging Face
Collection Method: Sampled subset created for the associated research paper.
Time Range: Not specified
Freshness: Last updated 2026-04-09 06:07:16; freshness should be verified.
Geography: Not specified

License is unknown; terms of use must be verified before application.
