Description
The MetaQuery dataset, created by xcpan, was last updated on June 30, 2025. It is derived from the MMC4 dataset and contains 2.4 million instruction-response pairs, which suggests it is intended for training or fine-tuning AI models. The data is released under multiple licenses and terms, including CC-BY-NC, ODC-BY, and the Common Crawl terms of use.
Use Cases
Fine-tuning language models for instruction-following based on the 2.4M instruction-response pairs.
Training multimodal AI systems, if the data retains the interleaved image-text structure of its MMC4 source (this is not confirmed by the metadata).
Benchmarking model performance on instruction-based tasks using the structured query-response format.
Strengths
Contains 2.4 million instruction-response pairs, providing a substantial scale for model training.
Derived from the established MMC4 dataset, so its upstream origin is known.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The exact row count is not confirmed by the metadata; the 2.4 million figure comes from the description and should be checked after download.
Description metadata is limited; actual data quality requires manual inspection after download.
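Because field semantics have to be inferred after download, a first pass over a small sample of rows can help. The following is a minimal sketch, assuming the rows have already been loaded as a list of Python dictionaries. The helper `summarize_fields` and the field names `instruction`, `response`, and `image_url` are hypothetical and used only for illustration; the real MetaQuery schema is not documented.

```python
from collections import Counter


def summarize_fields(records):
    """Return {field: {"types": Counter, "missing": int}} for a list of dict rows."""
    # Collect every field name that appears in at least one row.
    all_fields = set()
    for rec in records:
        all_fields.update(rec.keys())

    summary = {name: {"types": Counter(), "missing": 0} for name in sorted(all_fields)}
    for rec in records:
        for name in all_fields:
            value = rec.get(name)
            if value is None:
                # Field is absent from this row or explicitly null.
                summary[name]["missing"] += 1
            else:
                summary[name]["types"][type(value).__name__] += 1
    return summary


# Hypothetical sample rows; replace with a sample of the downloaded data.
sample = [
    {"instruction": "Describe the image.", "response": "A cat on a sofa."},
    {"instruction": "Count the dogs.", "response": None},
    {"instruction": "Name the color.", "response": "Red",
     "image_url": "http://example.com/a.jpg"},
]

summary = summarize_fields(sample)
for name, info in summary.items():
    print(name, dict(info["types"]), "missing:", info["missing"])
```

Running the same summary over a larger sample gives a rough picture of which fields are always present and which value types they hold, which is a reasonable starting point before any manual quality review.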
Provenance
Source
Sourced from the MMC4 dataset.
Collection Method
Likely involves processing and structuring data from MMC4 into instruction-response pairs.
Freshness
Last updated 2025-06-30 19:22:09; freshness should be verified.
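One simple way to check freshness is to compute how many days have passed since the listed timestamp. This is a minimal sketch: the helper `days_since_update` is hypothetical, and it assumes the timestamp is in UTC, which the listing does not state.

```python
from datetime import datetime, timezone


def days_since_update(last_updated, now):
    """Whole days between a 'YYYY-MM-DD HH:MM:SS' timestamp (assumed UTC) and `now`."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    updated = updated.replace(tzinfo=timezone.utc)  # assumption: listing time is UTC
    return (now - updated).days


# Fixed reference date for a reproducible example; in practice pass
# datetime.now(timezone.utc) instead.
reference = datetime(2025, 7, 30, 19, 22, 9, tzinfo=timezone.utc)
age = days_since_update("2025-06-30 19:22:09", reference)
print(age)  # 30
```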
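License
The dataset is released under a CC-BY-NC (non-commercial) license and incorporates third-party content subject to other terms, including ODC-BY and the Common Crawl terms of use. Review the legal obligations carefully before use, particularly for any commercial application.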