Name: Grammar Correction: 125,000 Ungrammatical-Grammatical Sentence Pairs
Creator: agentlans
Published: 2024-12-29T04:48:50
Keywords: Text Pairs, Text, Natural Language Processing, Synthetic Data, Synthetic

Description

The grammar-correction dataset is a refined subset of the liweili/c4_200m dataset, which is itself derived from Google's C4_200M Synthetic Dataset for Grammatical Error Correction. It contains 100,000 training and 25,000 validation sentence pairs, each with an ungrammatical input and a grammatical output. The dataset was authored by agentlans and last updated on 2024-12-29.

Use Cases

Train grammatical error correction models based on ungrammatical-grammatical sentence pairs.
Benchmark model performance on synthetic grammatical error correction tasks.
Fine-tune large language models for text refinement and proofreading applications.
Develop educational tools for language learning based on error correction examples.
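
As a rough illustration of the benchmarking use case, the sketch below scores a correction function by exact match against the reference outputs. The field names "input" and "output" are assumptions taken from the description's wording, not confirmed column names, and the two sample rows are invented for the example rather than drawn from the dataset. Exact match is only one possible metric and is stricter than the edit-based scores often used for grammatical error correction.

```python
from typing import Callable, Iterable, Mapping


def exact_match_rate(
    pairs: Iterable[Mapping[str, str]],
    correct: Callable[[str], str],
    input_key: str = "input",    # assumed field name, verify after download
    output_key: str = "output",  # assumed field name, verify after download
) -> float:
    """Return the fraction of pairs where correct(input) equals the reference."""
    total = 0
    hits = 0
    for row in pairs:
        total += 1
        # Compare after trimming surrounding whitespace only.
        if correct(row[input_key]).strip() == row[output_key].strip():
            hits += 1
    return hits / total if total else 0.0


# Invented toy rows, not taken from the dataset.
sample_pairs = [
    {"input": "She go to school.", "output": "She goes to school."},
    {"input": "I has a cat.", "output": "I have a cat."},
]


def identity_baseline(text: str) -> str:
    """Baseline that returns the text unchanged."""
    return text


baseline_score = exact_match_rate(sample_pairs, identity_baseline)
```

An identity baseline like this gives a floor for comparison: a model is only useful if it beats the score obtained by leaving the input untouched.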

Strengths

Contains 125,000 total entries, with a defined split of 100,000 for training and 25,000 for validation.
Derived from a known large-scale source, Google's C4_200M Synthetic Dataset for Grammatical Error Correction.
Specifically structured for a clear task, pairing ungrammatical inputs with grammatical outputs.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Whether any rows exist beyond the documented training and validation splits is unknown, which makes suitability harder to assess.
Data may reflect bias inherent to the synthetic generation process of the source dataset.
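
Because column documentation is absent, one way to infer field semantics after download is to summarize a sample of rows. The sketch below is a minimal helper for that, assuming the rows are available as dictionaries; the helper name summarize_fields, the field names, and the toy rows are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(
    rows: Iterable[Mapping[str, Any]], sample_size: int = 1000
) -> Dict[str, Dict[str, Any]]:
    """Report, per field, the value types seen and how many values were empty."""
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    empty_counts: Counter = Counter()
    seen = 0
    for row in rows:
        if seen >= sample_size:
            break  # inspect only the first sample_size rows
        seen += 1
        for name, value in row.items():
            type_counts[name][type(value).__name__] += 1
            if value is None or value == "":
                empty_counts[name] += 1
    return {
        name: {
            "types": dict(counts),
            "present_in": sum(counts.values()),
            "empty": empty_counts[name],
        }
        for name, counts in type_counts.items()
    }


# Invented toy rows with assumed field names, not taken from the dataset.
toy_rows = [
    {"input": "He dont like it.", "output": "He doesn't like it."},
    {"input": "", "output": "They were late."},
]

field_summary = summarize_fields(toy_rows)
```

A summary like this shows which fields exist, whether they are consistently strings, and how many values are empty, which is usually enough to confirm which column holds the ungrammatical text and which holds the correction.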

Provenance

Source: Derived from Google's C4_200M Synthetic Dataset for Grammatical Error Correction via the liweili/c4_200m dataset.
Collection Method: Refined subset of a larger synthetic dataset.
Freshness: Last updated 2024-12-29 04:59:44; freshness should be verified.

License is unknown and should be verified before use.
