Description
The grammar-correction dataset is a refined subset of the liweili/c4_200m dataset, which is derived from Google's C4_200M Synthetic Dataset for Grammatical Error Correction. It contains 100,000 training and 25,000 validation sentence pairs, each consisting of an ungrammatical input and a grammatical output. The dataset was authored by agentlans and last updated on 2024-12-29.
Use Cases
Train grammatical error correction models based on ungrammatical-grammatical sentence pairs.
Benchmark model performance on synthetic grammatical error correction tasks.
Fine-tune large language models for text refinement and proofreading applications.
Develop educational tools for language learning based on error correction examples.
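As a rough illustration of the training and benchmarking use cases above, the following Python sketch turns a sentence pair into a prompt/target example and scores a corrector by exact match against the reference output. The field names `input` and `output`, the prompt prefix, both helper functions, and the toy sentences are assumptions made for this example; the card does not document the actual column names, so check them after download.

```python
from typing import Callable, Dict, Iterable

# Hypothetical field names: the dataset card does not document its columns.
SOURCE_FIELD = "input"   # assumed to hold the ungrammatical sentence
TARGET_FIELD = "output"  # assumed to hold the grammatical sentence


def to_training_example(record: Dict[str, str],
                        prefix: str = "Fix the grammar: ") -> Dict[str, str]:
    """Turn one sentence pair into a prompt/target pair for fine-tuning."""
    return {
        "prompt": prefix + record[SOURCE_FIELD].strip(),
        "target": record[TARGET_FIELD].strip(),
    }


def exact_match_rate(records: Iterable[Dict[str, str]],
                     correct: Callable[[str], str]) -> float:
    """Fraction of records where the corrector's output equals the reference."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        correct(r[SOURCE_FIELD]).strip() == r[TARGET_FIELD].strip()
        for r in records
    )
    return hits / len(records)


# Toy pairs invented for illustration; they are not taken from the dataset.
toy_pairs = [
    {"input": "She go to school.", "output": "She goes to school."},
    {"input": "I has a cat.", "output": "I have a cat."},
]


def identity_baseline(sentence: str) -> str:
    """A trivial corrector that returns its input unchanged."""
    return sentence


if __name__ == "__main__":
    print(to_training_example(toy_pairs[0]))
    print(exact_match_rate(toy_pairs, identity_baseline))
```

Exact match is a strict metric chosen here only for simplicity; an actual evaluation would likely use an edit-based metric and a real model in place of `identity_baseline`.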
Strengths
Contains 125,000 total entries, with a defined split of 100,000 for training and 25,000 for validation.
Derived from a known large-scale source, Google's C4_200M Synthetic Dataset for Grammatical Error Correction.
Specifically structured for a clear task, pairing ungrammatical inputs with grammatical outputs.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row counts are known only for the two provided splits; whether any further data exists is unknown, which may limit suitability assessment.
Data may reflect bias inherent to the synthetic generation process of the source dataset.
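Because column documentation is missing, a first step after download is usually to inspect the records directly. The sketch below, in Python, lists each column with the value types seen in it and compares split sizes against the counts stated on this card. The split names `train` and `validation`, the helper names, and the sample records are assumptions for illustration; the records could come from any loader that yields dictionaries.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Counts are taken from the card; the split names themselves are assumed.
EXPECTED_SIZES = {"train": 100_000, "validation": 25_000}


def describe_columns(records: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each column name to the sorted Python type names observed in it."""
    seen: Dict[str, set] = {}
    for record in records:
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


def size_mismatches(
    actual_sizes: Dict[str, int],
    expected: Dict[str, int] = EXPECTED_SIZES,
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return splits whose row count differs, as name -> (expected, actual)."""
    return {
        name: (count, actual_sizes.get(name))
        for name, count in expected.items()
        if actual_sizes.get(name) != count
    }


if __name__ == "__main__":
    # Sample records invented for illustration, including a missing value.
    sample = [
        {"input": "He dont know.", "output": "He doesn't know."},
        {"input": "They was late.", "output": None},
    ]
    print(describe_columns(sample))
    print(size_mismatches({"train": 100_000, "validation": 24_999}))
```

A column that shows more than one type (for example both `str` and `NoneType`) suggests missing values that need handling before training.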
Provenance
Source
Derived from Google's C4_200M Synthetic Dataset for Grammatical Error Correction via the liweili/c4_200m dataset.
Collection Method
Refined subset of a larger synthetic dataset.
Freshness
Last updated 2024-12-29 04:59:44 (time zone not stated); verify that this is still current before use.
License is unknown and should be verified before use.