GPT-4 annotated the severity of over 17,500 phenotypic abnormalities catalogued in the Human Phenotype Ontology (HPO). Each annotation rates a phenotype on nine clinical characteristics and how frequently each occurs. Benchmarking against ground-truth labels gave a mean true positive recall of 97%. Kitty B. Murphy published the dataset on figshare in May 2026.
Use Cases
- Prioritize phenotypes for gene therapy based on the generated quantitative severity metrics.
- Benchmark other LLMs or automated methods for clinical metadata annotation using the provided ground-truth comparisons.
- Integrate severity scores into phenome-wide analyses to rank phenotypes by impact on health and quality of life.
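As a rough illustration of the ranking use case, the sketch below sorts phenotypes by a severity score. The field names `phenotype` and `severity_score`, the identifiers, and the score values are all assumptions made for this example; the dataset's real column names are undocumented and must be checked after download.

```python
# Hypothetical records: identifiers and scores are invented for illustration
# and do not come from the dataset.
phenotypes = [
    {"phenotype": "PHENO_A", "severity_score": 12.5},
    {"phenotype": "PHENO_B", "severity_score": 71.0},
    {"phenotype": "PHENO_C", "severity_score": 40.2},
]


def rank_by_severity(records, top_n=None):
    """Return records sorted from most to least severe.

    Assumes a higher `severity_score` means a more severe phenotype.
    """
    ranked = sorted(records, key=lambda r: r["severity_score"], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Print the two highest-scoring phenotypes.
for row in rank_by_severity(phenotypes, top_n=2):
    print(row["phenotype"], row["severity_score"])
```

If lower scores turn out to mean higher severity in the actual file, drop `reverse=True`.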
Strengths
- Annotations cover over 17,500 phenotypic abnormalities across more than 8,600 rare diseases.
- Benchmarking demonstrated strong performance with true positive recall rates ranging from 89% to 100% (mean = 97%).
- The severity scoring system integrates both the nature of nine clinical characteristics and their frequency of occurrence.
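One way a score could combine the nature of each characteristic with its frequency is a weighted sum normalized to a 0 to 100 range. The sketch below is an assumption about the general shape of such a score, not the dataset's published formula. The frequency categories, their numeric values, the characteristic names, and the weights are all hypothetical.

```python
# Hypothetical mapping from frequency category to a numeric value.
FREQUENCY_VALUE = {"never": 0, "rarely": 1, "often": 2, "always": 3}


def severity_score(annotations, weights):
    """Combine per-characteristic frequencies into one 0-100 score.

    annotations: dict mapping characteristic name -> frequency category
    weights:     dict mapping characteristic name -> importance weight
    Both the names and the weights are illustrative assumptions.
    """
    max_freq = max(FREQUENCY_VALUE.values())
    total = sum(weights[c] * FREQUENCY_VALUE[f] for c, f in annotations.items())
    max_total = sum(weights[c] * max_freq for c in annotations)
    return 100.0 * total / max_total if max_total else 0.0


# Two made-up characteristics; the real dataset uses nine.
example_weights = {"characteristic_1": 2.0, "characteristic_2": 1.0}
example_annotations = {"characteristic_1": "always", "characteristic_2": "never"}
print(severity_score(example_annotations, example_weights))
```

Normalizing by the maximum attainable total keeps scores comparable even when a phenotype is annotated on fewer characteristics.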
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not documented, so coverage cannot be assessed without downloading the file.
- The dataset is 56.2 KB, which suggests aggregated scores or metadata rather than the full set of raw annotations.
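Because column documentation and row count are missing, a first step after download is to list the header and count the rows. The sketch below assumes the file is delimited text such as CSV or TSV, which the listing does not confirm. The sample content is invented so that the snippet runs without the real file.

```python
import csv
import io


def summarize_table(handle, delimiter=","):
    """Return (header, row_count) for a delimited text file.

    `handle` is any open text file or file-like object. Blank rows are
    not counted.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])
    row_count = sum(1 for row in reader if row)
    return header, row_count


# Invented sample standing in for the downloaded file.
sample = io.StringIO("phenotype,severity_score\nPHENO_A,12.5\nPHENO_B,71.0\n")
columns, n_rows = summarize_table(sample)
print(columns, n_rows)
```

For the real file, pass a handle from `open(path, newline="", encoding="utf-8")`, and set `delimiter="\t"` if it turns out to be tab-separated.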
Provenance
- Source: figshare
- Collection Method: GPT-4 was employed to annotate severity based on clinical characteristics, with outputs benchmarked against ground-truth labels within the HPO.
- Freshness: Last updated 2026-05-21 05:44:10; freshness should be verified.
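The benchmarking step in the collection method reports true positive recall, which is the share of ground-truth positives that the model also labeled positive. The sketch below shows how that figure could be computed for one characteristic. The label lists are invented, and the function name is an assumption rather than part of any published pipeline.

```python
def true_positive_recall(predicted, truth):
    """Recall = TP / (TP + FN) over paired boolean labels.

    Returns None when the ground truth contains no positives, since
    recall is undefined in that case.
    """
    tp = sum(1 for p, t in zip(predicted, truth) if t and p)
    fn = sum(1 for p, t in zip(predicted, truth) if t and not p)
    return tp / (tp + fn) if (tp + fn) else None


# Invented labels for a single characteristic.
model_labels = [True, False, True, True]
ground_truth = [True, True, True, False]
print(true_positive_recall(model_labels, ground_truth))
```

Averaging this value across characteristics would give a mean recall comparable in form to the one reported for the dataset.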