Description
This dataset contains approximately 9,769 zero-shot prompt samples aggregated from 5 source benchmarks for evaluating large language models on legal reasoning tasks. It was created by nguha and last updated on 2026-04-17. It consolidates 202 distinct tasks from the following benchmarks: legalbench, barexam, lexam, housingqa, and legal_hallucinations.
Use Cases
Benchmarking model performance on legal reasoning across the tasks aggregated from 5 source benchmarks.
Running cost-efficient LLM evaluations with the pre-formatted zero-shot prompts.
Comparing model results across different legal task types through the unified flat schema.
Analyzing model performance on specific legal domains by filtering on the source benchmark and task_name fields.
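The per-benchmark and per-task analysis described above could be sketched roughly as follows. This is a minimal illustration, not the dataset's own tooling: only `task_name` is named in this card, while the keys `source`, `answer`, and `prediction` are hypothetical placeholders, and the sample rows are invented for demonstration. Case-insensitive exact match is also an assumption; some legal tasks may need a different metric.

```python
from collections import defaultdict


def accuracy_by_group(rows, keys=("source", "task_name")):
    """Return {group: (n_correct, n_total, accuracy)} for each key combination.

    `rows` is an iterable of dicts. The key names other than `task_name`
    are hypothetical; adjust them to the actual columns after download.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for row in rows:
        group = tuple(row[k] for k in keys)
        counts[group][1] += 1
        # Assumed scoring rule: case-insensitive exact match.
        if row["prediction"].strip().lower() == row["answer"].strip().lower():
            counts[group][0] += 1
    return {g: (c, n, c / n) for g, (c, n) in counts.items()}


# Invented rows for illustration only; these are not real dataset samples.
rows = [
    {"source": "legalbench", "task_name": "task_a", "answer": "Yes", "prediction": "yes"},
    {"source": "legalbench", "task_name": "task_a", "answer": "No", "prediction": "yes"},
    {"source": "barexam", "task_name": "task_b", "answer": "B", "prediction": "B"},
]

results = accuracy_by_group(rows)
for group, (correct, total, acc) in sorted(results.items()):
    print(group, f"{correct}/{total}", f"{acc:.2f}")
```

Passing `keys=("source",)` would collapse the report to one line per source benchmark, which may be the more useful view when tasks have very few samples each.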
Strengths
Aggregates approximately 9,769 samples across 202 tasks, providing a substantial testbed.
Consolidates data from 5 distinct source benchmarks into a single flat schema.
Samples are pre-formatted as zero-shot prompts, ready for direct model input.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Exact row count is not confirmed (the 9,769 figure is approximate), which may limit suitability assessment.
Last updated 2026-04-17 05:24:43; freshness should be verified.
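Because column-level documentation is absent, a quick inspection pass after download can help infer field semantics. The sketch below assumes the rows have already been loaded as a list of dicts (for example from JSON Lines or CSV). The column names in the sample (`prompt`, `answer`) are hypothetical placeholders, and the helper `summarize_columns` is illustrative rather than part of any dataset tooling.

```python
from collections import defaultdict


def summarize_columns(rows, max_examples=2):
    """Collect observed value types, missing counts, and a few distinct
    example values for every column present in `rows` (a list of dicts)."""
    summary = defaultdict(lambda: {"types": set(), "missing": 0, "examples": []})
    for row in rows:
        for name, value in row.items():
            info = summary[name]
            if value is None or value == "":
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if len(info["examples"]) < max_examples and value not in info["examples"]:
                info["examples"].append(value)
    return dict(summary)


# Invented rows for illustration only; real column names may differ.
rows = [
    {"task_name": "task_a", "prompt": "Question: ...", "answer": "Yes"},
    {"task_name": "task_b", "prompt": "Question: ...", "answer": None},
]

summary = summarize_columns(rows)
for name, info in summary.items():
    print(name, sorted(info["types"]), "missing:", info["missing"], info["examples"])
```

Counting the loaded rows with `len(rows)` at the same time would also settle the exact sample count.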
Provenance
Source
Aggregated from 5 benchmarks: legalbench, barexam, lexam, housingqa, and legal_hallucinations.
Collection Method
Unified aggregation into a single flat schema.
Freshness
2026-04-17
License
Unknown; verify before use.