Mendel Dataset is a bilingual (English and Russian) collection of examples for training language models on Mendelian genetics. Each example is an instruction-input-output triple in Alpaca format and describes genotypes, phenotypes, and inheritance rules for animals. The dataset was created by taylonmcfly and was last updated on Hugging Face in February 2026.
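To make the instruction-input-output structure concrete, the sketch below shows what one Alpaca-style record might look like and how it could be rendered into a training prompt. The record text, the `AlpacaRecord` type, and the `format_prompt` helper are all illustrative assumptions; none of them are taken from the dataset itself, and the section headers follow a common Alpaca-style layout rather than anything the dataset card prescribes.

```python
from typing import TypedDict


class AlpacaRecord(TypedDict):
    # The three fields of an Alpaca-format example.
    instruction: str
    input: str
    output: str


# Hypothetical record: the wording is invented for illustration only.
example: AlpacaRecord = {
    "instruction": "Predict the offspring genotype ratio for this cross.",
    "input": "Parent 1: Bb, Parent 2: Bb",
    "output": "1 BB : 2 Bb : 1 bb",
}


def format_prompt(record: AlpacaRecord) -> str:
    """Render a record as a prompt, omitting the Input section when it is empty."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record["input"].strip():
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


prompt = format_prompt(example)
# During fine-tuning, the target text is the record's output appended to the prompt.
print(prompt + example["output"])
```

Skipping the Input section for empty inputs is a common convention for Alpaca-style data; whether this dataset actually contains empty inputs would need to be checked after download.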
Use Cases
- Fine-tuning language models for genetics question answering on the dataset's instruction-response pairs.
- Training models to predict Punnett square outcomes from the inheritance rules described in the examples.
- Developing educational chatbots for bilingual biology instruction, using the dataset's English and Russian content.
- Benchmarking model performance on structured genetics reasoning, using the Alpaca-format examples.
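For the Punnett square and benchmarking use cases, a small reference implementation can help check model answers. The sketch below is a minimal example, assuming a single gene with two alleles written as one uppercase (dominant) and one lowercase (recessive) letter. The `punnett` function is hypothetical and is not part of the dataset or any library.

```python
from collections import Counter
from itertools import product


def punnett(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross, e.g. 'Bb' x 'Bb'."""
    offspring = Counter()
    # Each parent contributes one allele; enumerate every combination.
    for a, b in product(parent1, parent2):
        # Sort so the uppercase (dominant) allele is written first: 'bB' -> 'Bb'.
        genotype = "".join(sorted(a + b))
        offspring[genotype] += 1
    return offspring


cross = punnett("Bb", "Bb")
for genotype, count in sorted(cross.items()):
    print(f"{genotype}: {count}/4")
```

The counts can then be compared with the ratio a model produces for the same cross. Multi-gene crosses or other inheritance patterns would need a more general approach.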
Strengths
- Dataset is bilingual, containing examples in both English and Russian.
- Examples are structured in the Alpaca format, which is a common standard for instruction-following data.
- License is explicitly stated as Apache-2.0, providing clear usage rights.
Limitations
- Dataset size is categorized as 'n<1K', indicating it contains fewer than 1,000 examples.
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
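Because field semantics and data quality have to be checked by hand, a quick summary pass over the downloaded records can be a useful first step. The sketch below assumes the records have already been loaded into a list of dictionaries. The `summarize` and `has_cyrillic` helpers and the two sample records are hypothetical, and treating any Cyrillic text as Russian is only a rough heuristic.

```python
from collections import Counter


def has_cyrillic(text: str) -> bool:
    """Rough language check: True if any character is in the basic Cyrillic block."""
    return any("\u0400" <= ch <= "\u04FF" for ch in text)


def summarize(records: list[dict]) -> dict:
    """Report which fields occur, how often they are empty, and a rough language split."""
    field_counts = Counter()
    empty_counts = Counter()
    languages = Counter()
    for rec in records:
        for key, value in rec.items():
            field_counts[key] += 1
            if not str(value).strip():
                empty_counts[key] += 1
        text = " ".join(str(v) for v in rec.values())
        languages["ru" if has_cyrillic(text) else "en"] += 1
    return {
        "fields": dict(field_counts),
        "empty": dict(empty_counts),
        "languages": dict(languages),
    }


# Invented sample records standing in for the downloaded data.
sample = [
    {"instruction": "Describe the phenotype.", "input": "Genotype: bb",
     "output": "The recessive trait is shown."},
    {"instruction": "Опишите фенотип.", "input": "",
     "output": "Проявляется рецессивный признак."},
]
report = summarize(sample)
print(report)
```

With fewer than 1,000 examples, reading the flagged records directly after a pass like this is practical.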
Provenance
- Source: Hugging Face
- Freshness: last updated 2026-02-07 18:59:51; check the current Hugging Face listing for later revisions.