Name: HealthBench Professional: Medical AI Evaluation Conversations with Physician Rubrics
Creator: openai
Published: 2026-04-21T18:33:51
Keywords: Healthcare, Text, Clinical Benchmark, Medical Evaluation, Rubric Scoring

Description

A benchmark dataset from OpenAI for evaluating AI assistants on professional medical tasks, last updated on April 22, 2026. It contains conversations between a user and an assistant, each annotated with physician-assigned rubric items, a difficulty rating, and a medical specialty. Examples are categorized by use case (consult, writing, research) and by interaction type (good_faith, red_teaming).

Use Cases

Benchmarking AI assistant performance in medical consultations based on annotated conversation examples.
Training or fine-tuning models for medical writing tasks using the categorized 'writing' use case examples.
Evaluating model robustness against adversarial medical queries using the 'red_teaming' type examples.
Analyzing AI performance across medical specialties and difficulty levels based on the provided physician ratings.
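
The filtering and per-specialty analysis described above can be sketched as follows. This is a minimal illustration only: the field names (type, use_case, specialty, difficulty) and the per-example score are assumptions, since the card does not document the actual columns, and the three records are invented.

```python
from collections import defaultdict

# Invented records. Field names are hypothetical; the real schema is
# undocumented and must be checked after download. "score" stands for a
# per-example grade produced by whatever grading step is used.
examples = [
    {"type": "good_faith", "use_case": "consult", "specialty": "cardiology", "difficulty": 2, "score": 0.5},
    {"type": "red_teaming", "use_case": "consult", "specialty": "cardiology", "difficulty": 4, "score": 1.0},
    {"type": "good_faith", "use_case": "writing", "specialty": "oncology", "difficulty": 3, "score": 0.25},
]


def select(records, **filters):
    """Return the records whose fields equal every given filter value."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]


def mean_score_by(records, key):
    """Average the hypothetical 'score' field within each value of `key`."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(key)].append(r["score"])
    return {k: sum(v) / len(v) for k, v in groups.items()}


red_team_subset = select(examples, type="red_teaming")
by_specialty = mean_score_by(examples, "specialty")
```

The same two helpers cover the other groupings mentioned in the list, for example select(examples, use_case="writing") or mean_score_by(examples, "difficulty"), once the real field names are known.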

Strengths

Examples include physician-assigned difficulty ratings and medical specialty labels.
Each example is annotated with specific rubric items containing criterion text and points.
Data is categorized by distinct use cases (consult, writing, research) and interaction types (good_faith, red_teaming).
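
Because each rubric item carries a criterion and a point value, a per-example score can be aggregated from the items a grader judges to be met. The sketch below shows one plausible aggregation, not a documented scoring method for this dataset: the keys criterion and points, the possibility of negative point values, and the clipping to the range 0 to 1 are all assumptions.

```python
def rubric_score(rubric_items, met):
    """Score one response against its rubric.

    rubric_items: list of dicts with hypothetical keys 'criterion' and 'points'.
    met: set of criterion texts that a grader judged satisfied.
    Assumed rule: earned points divided by the total of positive points,
    clipped to [0, 1]. Negative points (if present) act as penalties.
    """
    max_points = sum(item["points"] for item in rubric_items if item["points"] > 0)
    if max_points == 0:
        return 0.0
    earned = sum(item["points"] for item in rubric_items if item["criterion"] in met)
    return min(1.0, max(0.0, earned / max_points))


# Invented rubric for illustration only.
rubric = [
    {"criterion": "criterion A", "points": 5},
    {"criterion": "criterion B", "points": 3},
    {"criterion": "criterion C", "points": -4},  # assumed penalty item
]

partial = rubric_score(rubric, {"criterion A", "criterion C"})  # (5 - 4) / 8
```

Normalizing by the positive points keeps scores comparable across examples whose rubrics differ in length, which matters when averaging across specialties or difficulty levels.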

Limitations

Row count and total scale of the dataset are unknown.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
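
Since the row count and field semantics have to be established after download, a first inspection pass might look like the sketch below. It assumes the data can be read as JSON Lines, which the card does not state; the file path is a placeholder and the helper names are hypothetical.

```python
import json
from collections import Counter, defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumes a JSON Lines file)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize_fields(records):
    """Map each field name to a count of the Python types observed for it."""
    seen = defaultdict(Counter)
    for rec in records:
        for key, value in rec.items():
            seen[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in seen.items()}


# Usage with a placeholder path:
#   records = load_jsonl("path/to/downloaded_file.jsonl")
#   print(len(records))               # row count
#   print(summarize_fields(records))  # fields and their observed types
```

Fields that appear in only some rows, or that mix types such as str and NoneType, are the ones most worth checking by hand before relying on them.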

Provenance

Source: OpenAI
Collection Method: Not documented. The dataset was likely created to evaluate medical AI systems, as the title and description both refer to evaluation.
Freshness: Last updated 2026-04-22 16:09:30; check the dataset page for later revisions.

License: Unknown; terms of use must be verified on the dataset page.
