KirillNik created a corpus of synthetic tweets generated by three open or API-based large language models. The dataset is designed for controlled-variable studies on detecting machine-generated social-media text, with topics extracted from real human tweets. It was last updated on June 4, 2026.
Use Cases
- Training detectors for machine-generated text on a corpus built with controlled variables.
- Studying stylistic differences between the outputs of multiple LLMs under different prompting strategies.
- Analyzing topic-conditioned text generation by holding the subject matter constant across model outputs.
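The topic-held-constant comparison in the last use case could be sketched roughly as follows. The column names `topic`, `model`, and `text` are assumptions made for illustration, since the card does not document the schema, and the sample rows are made up rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def compare_by_topic(rows, topic_key="topic", model_key="model", text_key="text"):
    """Return {topic: {model: mean character length}} for topics covered by every model.

    The key names are hypothetical; replace them with the real column names
    once the downloaded files have been inspected.
    """
    lengths = defaultdict(lambda: defaultdict(list))
    models = set()
    for row in rows:
        models.add(row[model_key])
        lengths[row[topic_key]][row[model_key]].append(len(row[text_key]))
    # Keep only topics that every model generated for, so that the topic
    # really is held constant across the outputs being compared.
    return {
        topic: {m: mean(vals) for m, vals in per_model.items()}
        for topic, per_model in lengths.items()
        if set(per_model) == models
    }


# Made-up rows for illustration only; not taken from the dataset.
sample_rows = [
    {"topic": "weather", "model": "model_a", "text": "Rain again."},
    {"topic": "weather", "model": "model_b", "text": "So much rain today!"},
    {"topic": "sports", "model": "model_a", "text": "What a game."},
]
topic_stats = compare_by_topic(sample_rows)
print(topic_stats)
```

Character length is only a stand-in here; any per-text stylistic measure could be substituted in the same grouping.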
Strengths
- Topics are held constant across models and prompts, so differences in output can be attributed to the model and prompt rather than to subject matter.
- Built specifically as a controlled-variable corpus for studying machine-generated text detection.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated, so it is hard to judge before download whether the corpus is large enough for a given task.
- The dataset description is sparse; data quality can only be assessed by inspecting the files after download.
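Because the schema and size have to be worked out after download, a first inspection pass might look like the sketch below. It assumes the rows have already been loaded as a list of dictionaries. The function `summarize_records` and the field names in the sample are hypothetical and do not come from the dataset.

```python
from collections import Counter


def summarize_records(rows, max_examples=2):
    """Summarize row count plus, per field, value types, distinct count, and examples."""
    summary = {"row_count": len(rows), "fields": {}}
    for row in rows:
        for name, value in row.items():
            field = summary["fields"].setdefault(
                name, {"types": Counter(), "distinct": set(), "examples": []}
            )
            field["types"][type(value).__name__] += 1
            # repr() keeps unhashable values (lists, dicts) countable.
            field["distinct"].add(repr(value))
            if len(field["examples"]) < max_examples:
                field["examples"].append(value)
    # Replace each set of seen values with its size for a compact report.
    for field in summary["fields"].values():
        field["distinct"] = len(field["distinct"])
    return summary


# Made-up records with hypothetical field names, for illustration only.
sample_records = [
    {"text": "Rain again.", "source": "model_a"},
    {"text": "What a game.", "source": "model_b"},
    {"text": "Rain again.", "source": "model_a"},
]
report = summarize_records(sample_records)
print(report["row_count"], sorted(report["fields"]))
```

A field with very few distinct values is a likely candidate for a model or prompt label, while a field with nearly as many distinct values as rows is probably the tweet text itself.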
Provenance
- Source: Hugging Face
- Collection Method: Synthetic tweets generated by three open/API LLMs, conditioned on topics from real human tweets.
- Freshness: Last updated 2026-06-04 16:46:04; freshness should be verified.