Description
A dataset for evaluating professional medical AI, created by OpenAI and last updated on April 22, 2026. It contains conversational examples between a user and an assistant, each annotated with physician-assigned rubric items, difficulty ratings, and medical specialties. The examples are categorized by use case and include both good-faith and red-teaming interactions.
Use Cases
Benchmarking AI assistant performance in medical consultations based on annotated conversation examples.
Training or fine-tuning models for medical writing tasks using the categorized 'writing' use case examples.
Evaluating model robustness against adversarial medical queries using the 'red_teaming' type examples.
Analyzing AI performance across medical specialties and difficulty levels based on the provided physician ratings.
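The benchmarking and per-specialty analysis use cases above could be sketched as follows. This is a minimal illustration, not the dataset's documented scoring method. The field names (`rubric_items`, `points`, `specialty`, `difficulty`) and the `met` set of satisfied criterion indices are assumptions; the grader that produces `met` is hypothetical and not part of the dataset. Treating negative points as penalties and clipping the score to the range 0 to 1 is also an assumption.

```python
from collections import defaultdict


def rubric_score(rubric_items, met):
    """Return earned points as a fraction of achievable positive points.

    rubric_items: list of dicts with a numeric "points" field (assumed name).
    met: set of indices of rubric items a grader judged as satisfied
         (hypothetical grader output, not a dataset field).
    """
    possible = sum(item["points"] for item in rubric_items if item["points"] > 0)
    earned = sum(item["points"] for i, item in enumerate(rubric_items) if i in met)
    if possible == 0:
        return 0.0
    # Clip so that penalties cannot push the score below zero (assumption).
    return max(0.0, min(1.0, earned / possible))


def mean_score_by(examples, key):
    """Average rubric_score over examples grouped by an assumed label field."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[key]].append(rubric_score(ex["rubric_items"], ex["met"]))
    return {label: sum(scores) / len(scores) for label, scores in groups.items()}


# Toy records with invented values, only to show the expected shape.
examples = [
    {"specialty": "cardiology", "difficulty": "hard",
     "rubric_items": [{"criterion": "a", "points": 5}, {"criterion": "b", "points": 5}],
     "met": {0}},
    {"specialty": "cardiology", "difficulty": "easy",
     "rubric_items": [{"criterion": "c", "points": 4}],
     "met": {0}},
    {"specialty": "dermatology", "difficulty": "hard",
     "rubric_items": [{"criterion": "d", "points": 5}, {"criterion": "e", "points": -5}],
     "met": {0, 1}},
]

by_specialty = mean_score_by(examples, "specialty")
by_difficulty = mean_score_by(examples, "difficulty")
```

Grouping by `"difficulty"` instead of `"specialty"` reuses the same function, so both breakdowns come from one pass per label.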
Strengths
Examples include physician-assigned difficulty ratings and medical specialty labels.
Each example is annotated with specific rubric items containing criterion text and points.
Data is categorized by distinct use cases (consult, writing, research) and interaction types (good_faith, red_teaming).
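The categorization described above lends itself to simple slicing. The sketch below counts examples per category pair and selects a subset. The category values come from the description, but the field names `use_case` and `type` are assumptions and should be checked against the downloaded data.

```python
from collections import Counter


def count_by_category(examples):
    """Count examples per (use_case, type) pair; field names are assumed."""
    return Counter((ex["use_case"], ex["type"]) for ex in examples)


def select(examples, use_case=None, interaction_type=None):
    """Return examples matching the given use case and/or interaction type."""
    return [
        ex for ex in examples
        if (use_case is None or ex["use_case"] == use_case)
        and (interaction_type is None or ex["type"] == interaction_type)
    ]


# Toy records with invented ids, only to show the expected shape.
examples = [
    {"id": 1, "use_case": "consult", "type": "good_faith"},
    {"id": 2, "use_case": "writing", "type": "good_faith"},
    {"id": 3, "use_case": "consult", "type": "red_teaming"},
    {"id": 4, "use_case": "research", "type": "red_teaming"},
]

counts = count_by_category(examples)
red_team = select(examples, interaction_type="red_teaming")
writing_only = select(examples, use_case="writing")
```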
Limitations
Row count and total scale of the dataset are unknown.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
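Because the row count and column semantics are undocumented, a first inspection pass after download might look like the sketch below. It assumes the data is stored as JSON Lines (one JSON object per line), which this listing does not confirm; the sample rows are invented for illustration.

```python
import json
from collections import defaultdict


def summarize_jsonl(lines):
    """Return (row_count, {field_name: sorted list of Python type names}).

    lines: any iterable of strings, such as an open file handle.
    Blank lines are skipped.
    """
    row_count = 0
    field_types = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        row_count += 1
        for name, value in row.items():
            field_types[name].add(type(value).__name__)
    return row_count, {name: sorted(types) for name, types in field_types.items()}


# Invented sample rows; with a real file, pass the open file object instead.
sample = [
    '{"use_case": "consult", "difficulty": 3}',
    "",
    '{"use_case": "writing", "difficulty": 2.5, "rubric_items": []}',
]

row_count, schema = summarize_jsonl(sample)
```

A field that reports more than one type, or that is missing from some rows, is a good candidate for manual inspection.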
Provenance
Source
OpenAI
Collection Method
Not documented. The title and description refer to an 'eval', which suggests the dataset was created to evaluate medical AI systems.
Freshness
Last updated 2026-04-22 16:09:30; freshness should be verified.
License
Unknown; terms of use must be verified on the dataset page.