ViLegalTexts: A 16GB Vietnamese Legal Pre-training Corpus | DataSalon

Government & Legal

Name: ViLegalTexts: A 16GB Vietnamese Legal Pre-training Corpus
Creator: ntphuc149
Published: 2026-04-19T08:06:32
Keywords: Pre Training Corpus, Vietnamese, Legal Text, Text, Natural Language Processing

Available on 1 platform

Description

16 GB of Vietnamese legal text crawled from four public repositories: Thư Viện Pháp Luật, Văn Bản Pháp Luật, LuatVietnam, and LawNet. The corpus was created by author ntphuc149 and last updated on 2026-04-19. It was used to pre-train the ViLegalLM model.

Use Cases

Pre-training large language models for the Vietnamese legal domain.
Fine-tuning models for legal text classification or summarization on domain-specific content.
Benchmarking model performance on Vietnamese legal NLP tasks using the source material.
Studying linguistic patterns and terminology in Vietnamese legal documents.

Strengths

Corpus size is explicitly stated as 16 GB.
Sources are clearly named as four specific Vietnamese legal repositories.
The dataset has a defined purpose, having been used to train the ViLegalLM model.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count is not reported, which makes it harder to judge whether the corpus suits a given task.
The description metadata is limited; actual data quality requires manual inspection after download.
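
Because the fields and row count are undocumented, a first inspection pass after download might look like the sketch below. It assumes the data is stored as JSON Lines (one JSON object per line), which the listing does not confirm; the function name `summarize_records` and the sample records are hypothetical and for illustration only.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_records(lines: Iterable[str], max_rows: int = 1000) -> dict:
    """Count rows and tally field names and value types in JSON Lines text.

    Assumption: each non-blank line is one JSON object. The real file
    format of the corpus is not documented in the listing.
    """
    field_types: dict = {}
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            continue  # ignore lines that are not JSON objects
        rows += 1
        for key, value in record.items():
            # Tally the Python type name seen for each field
            field_types.setdefault(key, Counter())[type(value).__name__] += 1
        if rows >= max_rows:
            break  # inspect only a bounded sample of a large file
    return {
        "rows_inspected": rows,
        "fields": {name: dict(counts) for name, counts in field_types.items()},
    }


if __name__ == "__main__":
    # Hypothetical sample; the actual field names are unknown.
    sample = [
        '{"text": "Điều 1. Phạm vi điều chỉnh", "source": "example"}',
        '{"text": "Điều 2. Đối tượng áp dụng"}',
    ]
    print(summarize_records(sample))
```

With a real file, an open text handle can be passed directly, for example `with open(path, encoding="utf-8") as f: summarize_records(f)`, where `path` is wherever the download was saved. A field that appears in only some rows, or with mixed types, is a prompt for closer manual inspection.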

Provenance

Source: Crawled from Thư Viện Pháp Luật (TVPL), Văn Bản Pháp Luật (VBPL), LuatVietnam, and LawNet.
Collection Method: Web crawling from publicly available legal repositories.
Time Range: Not specified
Freshness: Last updated 2026-04-19 08:53:27.
Geography: Vietnam

Quality Score

D38

Community

1 like

0 views

Dataset Info

Author: ntphuc149
Created: Apr 19, 2026
Updated: Apr 19, 2026
Last synced: May 1, 2026
