Nuclear Decision-Making Benchmark: Geopolitical Scenarios for Evaluating LLMs
by Pollack, Martin / Meridian Hill · Updated 1mo ago
Description
The Nuclear Decision-Making Benchmark is one of the first benchmarks for large language models to cover nuclear escalation, proliferation, non-proliferation, and arms control. Developed with subject-matter experts in international relations, it presents geopolitical scenarios as multiple-choice questions whose country pairs and phrasings can be exchanged. It also includes evaluation results from seven state-of-the-art large language models.
Use Cases
Benchmarking LLM performance on multiple-choice nuclear policy questions.
Analyzing LLM tendencies toward escalation or arms control in geopolitical scenarios.
Evaluating whether LLM responses stay consistent when the country pair or the phrasing of a scenario is exchanged.
Comparing the outputs of state-of-the-art models like GPT-5.2 and Gemini 3 Pro Preview on non-proliferation topics.
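One way to approach the consistency use case is to group a model's answers by scenario and measure how often the variants of that scenario agree. The sketch below is illustrative only: the field names `scenario_id`, `country_pair`, `phrasing`, and `answer` are assumptions, since the dataset's columns are undocumented, and the modal-answer share is just one possible consistency measure, not necessarily the one the authors use.

```python
from collections import Counter, defaultdict


def consistency_by_scenario(records):
    """Return, per scenario, the share of variants that picked the modal answer.

    A score of 1.0 means every country-pair/phrasing variant of the scenario
    received the same answer.
    """
    answers = defaultdict(list)
    for rec in records:
        answers[rec["scenario_id"]].append(rec["answer"])

    scores = {}
    for scenario_id, choices in answers.items():
        # most_common(1) returns [(answer, count)] for the most frequent answer.
        modal_count = Counter(choices).most_common(1)[0][1]
        scores[scenario_id] = modal_count / len(choices)
    return scores


# Hypothetical model responses; all field names and values are made up.
records = [
    {"scenario_id": "s1", "country_pair": "A-B", "phrasing": 0, "answer": "B"},
    {"scenario_id": "s1", "country_pair": "A-B", "phrasing": 1, "answer": "B"},
    {"scenario_id": "s1", "country_pair": "C-D", "phrasing": 0, "answer": "B"},
    {"scenario_id": "s1", "country_pair": "C-D", "phrasing": 1, "answer": "C"},
    {"scenario_id": "s2", "country_pair": "A-B", "phrasing": 0, "answer": "A"},
    {"scenario_id": "s2", "country_pair": "C-D", "phrasing": 0, "answer": "A"},
]

scores = consistency_by_scenario(records)
```

A low score for a scenario would indicate that the model's choice depends on which countries are named or how the question is worded, rather than on the scenario itself.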
Strengths
Developed with subject-matter experts in international relations, which suggests the scenarios are grounded in domain knowledge.
Includes evaluation results from seven state-of-the-art large language models, providing a comparative baseline.
Features exchangeable country pairs and phrasings in its scenarios, which may allow testing robustness to changes in actors and wording.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it hard to assess suitability for large-scale training.
Description metadata is limited; actual data quality and structure require manual inspection after download.
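Because the columns and row count are undocumented, a first inspection pass after download might look like the following sketch. It assumes the data ships as a CSV file with a header row, which the listing does not confirm; the column names in the demo input are hypothetical, and `summarize_csv` is a helper written for this illustration.

```python
import csv
import io


def summarize_csv(handle, sample_size=3):
    """Count rows and collect a few distinct example values per column."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    samples = {name: [] for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = row.get(name)
            # Keep only non-empty, not-yet-seen values, up to sample_size.
            if value and value not in samples[name] and len(samples[name]) < sample_size:
                samples[name].append(value)
    return {"row_count": row_count, "columns": columns, "samples": samples}


# Stand-in for the downloaded file; replace with open(path, newline="").
demo = io.StringIO("scenario_id,answer\ns1,A\ns2,B\ns3,A\n")
summary = summarize_csv(demo)
```

The sample values per column give a starting point for inferring field semantics, and the row count helps judge whether the dataset is large enough for the intended use.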
Provenance
Source
Pollack, Martin; Meridian Hill
Collection Method
Developed with subject-matter experts; scenarios presented as multiple-choice questions.
Freshness
Last updated 2026-05-05 17:46:29; freshness should be verified.
License
License is unknown; terms of use must be verified before use.