This dataset contains 60,000 rows of unified training data assembled from multiple source files. It was created by collusion-paper-anon1 and last updated on May 7, 2026. It ships in two versions: a deterministically ordered one for auditing and a shuffled one for training.
Use Cases
- Training language models on unified behavioral and role-based data.
- Auditing training data lineage and composition using the deterministic row ordering.
- Producing reproducible shuffled training batches from a fixed seed.
- Analyzing dataset composition and per-source contributions using the provided metadata.
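The fixed-seed shuffling use case can be illustrated with a minimal sketch. The function name `shuffle_rows` and the seed value are hypothetical; the card does not state which seed or shuffling routine produced the shuffled version, so this only shows the general technique.

```python
import random

def shuffle_rows(rows, seed=0):
    """Return a shuffled copy of rows; the same seed always gives the same order."""
    rng = random.Random(seed)   # local RNG, so global random state is untouched
    shuffled = list(rows)       # copy, leaving the deterministic order intact
    rng.shuffle(shuffled)
    return shuffled

# Stand-in rows; real rows would be loaded from the deterministic version.
rows = [{"id": i} for i in range(10)]
first = shuffle_rows(rows, seed=42)
second = shuffle_rows(rows, seed=42)
print(first == second)  # identical order on every run with the same seed
```

Keeping the deterministic rows untouched and shuffling a copy means the audit ordering and the training ordering can both be derived from one file.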
Strengths
- Contains 60,000 rows, providing a substantial corpus for training.
- Offers both deterministic and shuffled versions, supporting reproducibility and standard training workflows.
- Includes a manifest file with counts and checksums for data integrity verification.
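A manifest check along these lines might look like the sketch below. The manifest layout is an assumption: the card does not document it, so the keys `rows` and `sha256`, the choice of SHA-256, and the helper names are all hypothetical and should be adapted to the actual manifest.

```python
import hashlib

def stats_from_lines(lines):
    """Return (non_empty_row_count, sha256_hex) for an iterable of byte lines."""
    digest = hashlib.sha256()
    rows = 0
    for line in lines:
        digest.update(line)   # hash every byte, including blank lines
        if line.strip():
            rows += 1         # count only non-empty lines as rows
    return rows, digest.hexdigest()

def file_stats(path):
    """Compute the same statistics for a JSONL file on disk."""
    with open(path, "rb") as f:
        return stats_from_lines(f)

def check_entry(name, expected, actual_rows, actual_sha):
    """Return a list of mismatch messages for one (assumed) manifest entry."""
    problems = []
    if actual_rows != expected["rows"]:
        problems.append(
            f"{name}: expected {expected['rows']} rows, found {actual_rows}"
        )
    if actual_sha.lower() != expected["sha256"].lower():
        problems.append(f"{name}: sha256 mismatch")
    return problems
```

An empty list from `check_entry` means the file matches its manifest entry under these assumptions; any message indicates a row-count or checksum discrepancy worth investigating before training.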
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
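Because column documentation is absent, a first inspection pass could tally which fields appear and what value types they hold. The sketch below assumes each row is a JSON object on its own line, which the .jsonl sources suggest but the card does not confirm; `summarize_fields` is a hypothetical helper name.

```python
import json
from collections import Counter, defaultdict

def summarize_fields(lines):
    """Map each top-level key to counts of the Python type names seen for it."""
    summary = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        record = json.loads(line)         # assumes one JSON object per line
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in summary.items()}

# Made-up sample rows for illustration only; not taken from the dataset.
sample = ['{"a": 1, "b": "x"}', '{"a": null}']
print(summarize_fields(sample))
```

Fields that appear in only some rows, or that mix types, are good candidates for closer manual review.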
Provenance
- Source: Hugging Face
- Collection Method: Built from final_assembly/*.jsonl files by a Python script.
- Freshness: Last updated 2026-05-07 02:25:01; freshness should be verified.