Name: Persian Punctuation Restoration Dataset with 17 Million Samples
Creator: MohammadJRanjbar
Published: 2025-08-06T17:10:41
Keywords: Token Classification, library:polars, Sequence Labeling, Punctuation, arxiv:2603.05314, Persian, modality:text, size_categories:100K<n<1M, library:mlcroissant, ParsBERT, library:datasets, library:pandas, license:cc-by-4.0, Parquet, language:fa, region:us, Natural Language Processing, Farsi, Punctuation Restoration, task_categories:token-classification

Description

PersianPunc is a large-scale dataset for Persian punctuation restoration. It contains 17 million samples for token-level sequence labeling, aggregated from 6 source corpora. The dataset was created by MohammadJRanjbar, and the associated work was accepted at the EACL 2026 SilkRoad NLP Workshop.

Use Cases

Train a sequence labeling model on 17 million samples for Persian punctuation restoration.
Fine-tune a ParsBERT model for token-level punctuation prediction.
Benchmark Persian punctuation restoration models on the token-level sequence labeling task.
Analyze punctuation patterns across the 6 aggregated source corpora included in the dataset.
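
For the benchmarking use case, a minimal sketch of per-label precision, recall, and F1 over aligned token labels is shown below. The label names (`O`, `COMMA`, `PERIOD`, `QUESTION`) and the helper `per_label_f1` are assumptions made for illustration; the dataset's actual label set is not documented here.

```python
from collections import Counter

def per_label_f1(gold, pred, ignore="O"):
    """Per-label precision/recall/F1 for two aligned label sequences.

    `ignore` is the hypothetical "no punctuation" label; it is excluded
    from scoring so that the majority class does not dominate.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1  # predicted a mark that is not in the gold data
            if g != ignore:
                fn[g] += 1  # missed a mark that is in the gold data
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy example with assumed label names.
gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION"]
pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "PERIOD"]
scores = per_label_f1(gold, pred)
```

Scoring each punctuation label separately, rather than reporting overall accuracy, keeps rare marks visible when most tokens carry no punctuation.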

Strengths

17 million samples provide a substantial foundation for model training.
Data is aggregated from 6 distinct source corpora, which may increase textual diversity.
Dataset is designed for token-level sequence labeling, a standard NLP task format.
Peer-reviewed work accepted at the EACL 2026 SilkRoad NLP Workshop.
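
To illustrate the token-level sequence labeling format mentioned above, the sketch below turns punctuated Persian text into word and label pairs, where each label names the mark that follows the word. The label names, the set of marks, and the helper `text_to_labels` are hypothetical; the dataset's own schema and label scheme may differ.

```python
# Hypothetical label scheme; the dataset's actual labels are not documented here.
PUNCT_LABELS = {
    "،": "COMMA",        # Persian comma
    ".": "PERIOD",
    "؟": "QUESTION",     # Persian question mark
    "!": "EXCLAMATION",
    "؛": "SEMICOLON",    # Persian semicolon
    ":": "COLON",
}

def text_to_labels(text):
    """Split on whitespace and label each word by the mark that trails it."""
    tokens, labels = [], []
    for raw in text.split():
        word, label = raw, "O"
        # Strip trailing marks one at a time; when several marks trail a word,
        # the one closest to the word ends up as the label.
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
        elif tokens and labels[-1] == "O":
            # A free-standing mark is attached to the preceding word.
            labels[-1] = label
    return tokens, labels

tokens, labels = text_to_labels("سلام، حال شما چطور است؟")
```

A model trained on such pairs receives the unpunctuated tokens as input and predicts one label per token; restoring punctuation then means re-inserting the predicted mark after each word.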

Limitations

The column structure and features are not described, so the granularity of the data is unclear.
The dataset description is brief and gives no details on label distribution, class imbalance, or data quality metrics.
Geographic and temporal coverage of the source texts is unspecified.
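
Because the label distribution is not reported, users may want to measure it themselves before training. The sketch below counts label frequencies over a collection of label sequences; the label names and the helper `label_distribution` are assumptions, and the input is toy data rather than samples from the dataset.

```python
from collections import Counter

def label_distribution(label_sequences):
    """Return each label's share of all tokens, most frequent first."""
    counts = Counter(label for seq in label_sequences for label in seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}

# Toy input with assumed label names.
toy_labels = [
    ["O", "O", "COMMA", "O", "PERIOD"],
    ["O", "O", "O", "QUESTION"],
]
dist = label_distribution(toy_labels)
```

A heavily skewed result, with the "no punctuation" label dominating, would suggest reporting per-label metrics or applying class weighting during training.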

Provenance

Source: MohammadJRanjbar via Hugging Face.
Collection Method: Aggregated from 6 source corpora; specific gathering method not detailed.
Freshness: Last updated on 2026-03-20.
Geography: Persian (Farsi) language data; specific geographic coverage unknown.

The keywords list a CC BY 4.0 license tag; users should verify the licensing terms on the dataset page. The dataset is intended for token classification tasks in Persian.
