Legal LLM Benchmark: Safety and Utility Evaluations for 12 Models

Name: Legal LLM Benchmark: Safety and Utility Evaluations for 12 Models
Creator: marvintong
Published: 2025-11-12T01:01:01
Keywords: size_categories:1K<n<10K, task_categories:text-generation, safety, library:polars, task_categories:question-answering, language:en, modality:text, modality:tabular, library:mlcroissant, library:datasets, benchmark, library:pandas, llm-evaluation, over-refusal, region:us, legal, task_categories:text-classification, JSON, license:mit

Available on 1 platform

Description

This dataset contains between 1,000 and 10,000 records that benchmark safety-utility trade-offs across 12 large language models in the legal domain. It was published by marvintong in 2025. The data includes legal questions, multi-phase evaluations, and contract text, and is intended for measuring model performance and over-refusal tendencies. It is organized into separate subsets for questions, evaluations, and legal documents.

Use Cases

Benchmarking legal question-answering accuracy using the 'questions' subset
Measuring model over-refusal rates by analyzing 'phase1_evaluations' and 'phase3_evaluations'
Testing legal document understanding and information extraction using the 'contracts' data
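
As a rough illustration of the over-refusal use case, the sketch below computes a per-model over-refusal rate, taking over-refusal to mean a refusal of a benign prompt. The listing does not document the evaluation columns, so the field names `model`, `is_benign`, and `refused`, along with the sample rows, are hypothetical and would need to be mapped to the actual schema.

```python
from collections import defaultdict

# Hypothetical records: the field names "model", "is_benign" and "refused"
# are assumptions for illustration, not the dataset's documented columns.
sample_rows = [
    {"model": "model-a", "is_benign": True, "refused": True},
    {"model": "model-a", "is_benign": True, "refused": False},
    {"model": "model-a", "is_benign": False, "refused": True},
    {"model": "model-b", "is_benign": True, "refused": False},
    {"model": "model-b", "is_benign": True, "refused": False},
]


def over_refusal_rates(rows):
    """Return, per model, the share of benign prompts that were refused."""
    benign = defaultdict(int)
    refused = defaultdict(int)
    for row in rows:
        if not row["is_benign"]:
            continue  # refusing a harmful prompt is not an over-refusal
        benign[row["model"]] += 1
        if row["refused"]:
            refused[row["model"]] += 1
    # Models with no benign prompts are omitted, which avoids division by zero.
    return {model: refused[model] / benign[model] for model in benign}


rates = over_refusal_rates(sample_rows)
```

The same function could be applied separately to rows from 'phase1_evaluations' and 'phase3_evaluations' to compare the two phases, once the real column names are known.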

Strengths

Evaluates 12 different Large Language Models
Scale of 1,000 to 10,000 records
Includes multi-phase evaluation data (Phase 1 and Phase 3)

Limitations

Geographic scope is restricted to the United States legal system
Lack of detailed column-level documentation in the primary summary

Provenance

Source: marvintong via Hugging Face
Freshness: Last updated in November 2025.
Geography: United States

Loading the data requires the Hugging Face datasets library. Subsets such as 'questions' or 'phase1_evaluations' can be loaded independently with the load_dataset function. The dataset is released under the MIT license.
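
A minimal sketch of loading subsets independently is given below. It assumes each subset is exposed as a configuration name passed as the second argument to `load_dataset`. The listing does not state the Hugging Face repository ID, so the one in the usage comment is a placeholder, and treating 'contracts' as a subset name is likewise an assumption. The helper `load_subsets` and its `loader` parameter are hypothetical and are not part of the datasets library.

```python
from typing import Callable, Dict, Iterable, Optional

# Subset names mentioned in this listing; whether all four exist as
# loadable configurations is an assumption.
SUBSETS = ("questions", "phase1_evaluations", "phase3_evaluations", "contracts")


def load_subsets(
    repo_id: str,
    names: Iterable[str] = SUBSETS,
    loader: Optional[Callable] = None,
) -> Dict[str, object]:
    """Load each named subset separately and return them keyed by name."""
    if loader is None:
        # Imported lazily so the helper can be exercised with a stub loader
        # when the datasets library or network access is unavailable.
        from datasets import load_dataset
        loader = load_dataset
    return {name: loader(repo_id, name) for name in names}


# Usage (placeholder repository ID, replace with the real one):
# subsets = load_subsets("<owner>/<dataset-name>", ["questions"])
# print(subsets["questions"])
```

Loading subsets one at a time keeps memory use down when only the questions or a single evaluation phase is needed.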

Quality Score

Overall: D37
Description: 46
Source: 36
Reputation: 30
Access: 22

Community

168 downloads
1 like
0 views

Dataset Info

Author: marvintong
Created: Nov 12, 2025
Updated: Nov 12, 2025
