509 commercial legal contracts from the CUAD dataset, each provided as a PDF alongside its cleaned extracted text. The dataset was created by dvgodoy and last updated on 2025-01-29. One contract from the original collection was removed because it was a scanned copy.
Use Cases
- Train contract clause classification models based on the full text of legal documents.
- Develop PDF-to-text extraction and cleaning pipelines using the provided base64-encoded PDFs and cleaned text.
- Benchmark named entity recognition for legal entities and terms within commercial contracts.
- Analyze the structure and language patterns of commercial legal agreements.
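For the PDF-to-text use case, the first step is turning a base64-encoded field back into PDF bytes. The following is a minimal sketch using only the Python standard library. The field name `pdf_bytes_base64`, the helper `decode_pdf`, and the stand-in record are assumptions for illustration, since the dataset's column names are not documented here; substitute the real field name after inspecting the data.

```python
import base64

PDF_MAGIC = b"%PDF-"  # every PDF file begins with this header


def decode_pdf(b64_text):
    """Decode a base64 string (or bytes) and check that it looks like a PDF."""
    raw = base64.b64decode(b64_text)
    if not raw.startswith(PDF_MAGIC):
        raise ValueError("decoded bytes do not start with the PDF header")
    return raw


# Stand-in record; the field names here are hypothetical, not the dataset's.
sample_bytes = b"%PDF-1.4\n% placeholder, not a real contract\n"
record = {
    "pdf_bytes_base64": base64.b64encode(sample_bytes).decode("ascii"),
    "text": "placeholder cleaned text",
}

pdf_bytes = decode_pdf(record["pdf_bytes_base64"])

# The decoded bytes can then be written to disk or handed to a PDF parser.
with open("contract.pdf", "wb") as fh:
    fh.write(pdf_bytes)
```

Checking the header before writing the file is a cheap way to catch a wrong field name or a truncated value early, before a PDF parser fails with a less obvious error.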
Strengths
- Contains 509 commercial legal contracts, providing a substantial corpus for analysis.
- Includes both the original PDFs (base64 encoded) and the cleaned extracted text for each contract.
- Text was cleaned using the clean-text library, which likely improves consistency for NLP tasks.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated in the metadata; it presumably matches the 509 contracts but should be confirmed after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
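Because field semantics and row count have to be checked after download, a quick profiling pass can help. The sketch below assumes the records have already been loaded as a list of dictionaries; the helper `summarize_fields` and the sample field names are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def summarize_fields(records):
    """Return the row count and, per field, observed types, max length, and missing count."""
    summary = defaultdict(lambda: {"types": set(), "max_len": 0, "missing": 0})
    for row in records:
        for name, value in row.items():
            info = summary[name]
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Length is only meaningful for sized values such as str or bytes.
            if hasattr(value, "__len__"):
                info["max_len"] = max(info["max_len"], len(value))
    return len(records), dict(summary)


# Stand-in rows; real field names must be read from the downloaded data.
rows = [
    {"file_name": "a.pdf", "text": "hello"},
    {"file_name": "bb.pdf", "text": None},
]

row_count, fields = summarize_fields(rows)
for name, info in fields.items():
    print(name, sorted(info["types"]), info["max_len"], info["missing"])
```

Comparing the reported row count against the expected 509 contracts, and looking at which fields hold very long strings, gives a first hint of which column carries the encoded PDF and which carries the cleaned text.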
Provenance
- Source: Original CUAD (Contract Understanding Atticus Dataset).
- Collection Method: PDFs were encoded in base64, and text was extracted and cleaned.
- Freshness: Last updated 2025-01-29 18:32:20; freshness should be verified.