The dataset contains 508 evaluation instances across six environment categories (KQ, LC, CD, DD, CI, CR) and three difficulty levels. It is constructed from 322 source documents: 160 legal articles, 93 civil judgments, and 69 criminal judgments.
Use Cases
- Evaluate the performance of large language models on legal reasoning tasks using the Level I, II, and III difficulty tiers.
- Compare model accuracy across different legal domains by analyzing results from the Civil Judgment Document and Criminal Judgment Document sources.
- Benchmark information extraction and reasoning capabilities across the six environment types (KQ, LC, CD, DD, CI, CR).
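
The comparisons above come down to grouping per-instance results by environment type or difficulty level and computing accuracy within each group. A minimal sketch follows, assuming each result is a dict with hypothetical fields `env`, `level`, and `correct`; these names and the toy records are illustrative and are not the dataset's actual schema or real results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} after grouping records on `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Toy records invented for illustration; field names are assumptions.
records = [
    {"env": "KQ", "level": "I", "correct": True},
    {"env": "KQ", "level": "II", "correct": False},
    {"env": "CR", "level": "III", "correct": True},
    {"env": "CR", "level": "I", "correct": True},
]

by_env = accuracy_by_group(records, "env")      # accuracy per environment type
by_level = accuracy_by_group(records, "level")  # accuracy per difficulty level
```

The same helper works for any grouping field, so a source-type breakdown (civil versus criminal judgments) would only need an extra field on each record.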
Strengths
- Contains 508 evaluation instances distributed across six environment types: KQ (98), LC (62), CD (93), DD (93), CI (93), and CR (69).
- Features a hierarchical structure with three distinct difficulty levels (Level I, Level II, Level III) for benchmarking legal reasoning depth.
- Integrates 322 source documents, specifically 160 legal articles, 93 civil judgment documents, and 69 criminal judgment documents.
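
The counts listed above can be checked against the stated totals with a few lines of arithmetic. The sketch below uses only the figures given in this section; the variable names are illustrative.

```python
# Instance counts per environment type, as listed above.
env_counts = {"KQ": 98, "LC": 62, "CD": 93, "DD": 93, "CI": 93, "CR": 69}

# Source document counts per type, as listed above.
source_counts = {
    "legal articles": 160,
    "civil judgment documents": 93,
    "criminal judgment documents": 69,
}

total_instances = sum(env_counts.values())
total_sources = sum(source_counts.values())

# Both sums should match the totals stated in the text.
assert total_instances == 508
assert total_sources == 322
```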