Miriad 5.8M contains 5.8 million medical question-answer pairs distilled from peer-reviewed biomedical literature using large language models. Released in June 2025 by the Miriad research team, the dataset provides structured data for medical instruction tuning and retrieval-augmented generation. It is a large-scale resource for training models on knowledge grounded in peer-reviewed literature rather than general web content.
The dataset is distributed in Parquet format and is compatible with the Hugging Face datasets library. For details on the distillation methodology, see arXiv preprint 2506.06091.
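As a rough illustration of how the Parquet release might be consumed for instruction tuning, the sketch below streams records through the Hugging Face datasets library and maps each one to a prompt/completion pair. The repository ID and the column names (question, answer) are assumptions that this description does not state, so check them against the dataset card before use. The helper names are hypothetical.

```python
from typing import Dict, Iterator

# Assumptions (not confirmed by the description above): the Hub repository ID
# and the names of the question and answer columns.
DATASET_ID = "miriad/miriad-5.8M"
QUESTION_COLUMN = "question"
ANSWER_COLUMN = "answer"


def to_instruction_example(record: Dict[str, str]) -> Dict[str, str]:
    """Map one question-answer record to a prompt/completion pair."""
    return {
        "prompt": record[QUESTION_COLUMN].strip(),
        "completion": record[ANSWER_COLUMN].strip(),
    }


def stream_examples(
    dataset_id: str = DATASET_ID, split: str = "train"
) -> Iterator[Dict[str, str]]:
    """Yield prompt/completion pairs without downloading every shard up front."""
    # Imported lazily so to_instruction_example works without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the remote Parquet shards on demand.
    dataset = load_dataset(dataset_id, split=split, streaming=True)
    for record in dataset:
        yield to_instruction_example(record)


if __name__ == "__main__":
    # Offline demonstration on a placeholder record (not real dataset content).
    placeholder = {"question": " Example question? ", "answer": "Example answer. "}
    print(to_instruction_example(placeholder))
```

Streaming is used here because a 5.8-million-row dataset can be slow to download in full. Drop streaming=True if a local cached copy is preferred.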