This corpus contains 16 GB of Vietnamese legal text crawled from four public repositories: Thư Viện Pháp Luật, Văn Bản Pháp Luật, LuatVietnam, and LawNet. It was created by ntphuc149, last updated on 2026-04-19, and used to pre-train the ViLegalLM model.
Use Cases
- Pre-training large language models for the Vietnamese legal domain.
- Fine-tuning models for legal text classification or summarization.
- Benchmarking model performance on Vietnamese legal NLP tasks.
- Studying linguistic patterns and terminology within Vietnamese legal documents.
Strengths
- Corpus size is explicitly stated as 16 GB.
- Sources are clearly named as four specific Vietnamese legal repositories.
- The dataset has a defined purpose, having been used to train the ViLegalLM model.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, which makes it harder to judge whether the corpus suits a given task before downloading it.
- The description metadata is limited; actual data quality requires manual inspection after download.
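Because the schema and row count are undocumented, a first pass over the downloaded data could look like the sketch below. It assumes the records can be loaded as dictionaries (for example from JSON Lines), which the description does not confirm. The function name `summarize_records` and the sample field names `text` and `source` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def summarize_records(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Count rows and record, per field, the value types seen and how many values were empty."""
    row_count = 0
    field_types: Dict[str, Counter] = defaultdict(Counter)
    empty_counts: Counter = Counter()
    for record in records:
        row_count += 1
        for name, value in record.items():
            # Track which Python types appear under each field name.
            field_types[name][type(value).__name__] += 1
            # Treat None and the empty string as "empty" values.
            if value is None or value == "":
                empty_counts[name] += 1
    return {
        "rows": row_count,
        "fields": {name: dict(types) for name, types in field_types.items()},
        "empty": dict(empty_counts),
    }


# Hypothetical sample records; the real corpus schema is undocumented.
sample = [
    {"text": "Điều 1. Phạm vi điều chỉnh", "source": "TVPL"},
    {"text": "", "source": "LawNet"},
    {"text": "Điều 2. Đối tượng áp dụng", "source": None},
]
summary = summarize_records(sample)
print(summary["rows"])
print(summary["fields"])
print(summary["empty"])
```

Since the function accepts any iterable, it can be fed a generator that streams records from disk, which matters for a corpus of this size.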
Provenance
- Source: Crawled from Thư Viện Pháp Luật (TVPL), Văn Bản Pháp Luật (VBPL), LuatVietnam, and LawNet.
- Collection Method: Web crawling from publicly available legal repositories.
- Time Range: Not specified.
- Freshness: Last updated 2026-04-19 08:53:27.
- Geography: Vietnam