Description
Falconer Benchmarks is an open evaluation dataset comparing the Falconer AI assistant against Notion AI, Atlassian Rovo, Claude Code, and Codex. It contains every question, every assistant's full answer, and every LLM-judge score for two scenarios, with no summarization. The dataset was created by FalconerAI and was last updated on June 18, 2026.
Use Cases
Benchmarking AI assistant performance using the side-by-side comparison of multiple assistants.
Analyzing answer quality and consistency using the LLM-judge scores provided for each response.
Studying assistant behavior in document-grounded customer support scenarios using the 'wix/' folder scenario.
Strengths
Provides complete receipts for evaluation, including every question and every assistant's full answer.
Includes LLM-judge scores for each answer, offering a quantitative performance measure.
Compares Falconer against multiple prominent AI assistants (Notion AI, Atlassian Rovo, Claude Code, Codex) in a structured benchmark.
Limitations
Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
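Because the row count and field semantics are undocumented, a first pass after download could simply count records and tally which fields appear. The sketch below is a minimal illustration, not the dataset's actual schema: it assumes the data is available as JSON Lines (one JSON object per line), and the field names in the sample rows (question, assistant, score) are hypothetical placeholders.

```python
import json
from collections import Counter


def summarize_records(lines):
    """Count rows and tally field names and the value types seen for each."""
    row_count = 0
    field_counts = Counter()
    field_types = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        row_count += 1
        for key, value in record.items():
            field_counts[key] += 1
            field_types.setdefault(key, set()).add(type(value).__name__)
    return row_count, field_counts, field_types


# Hypothetical sample rows; the real field names are not documented.
sample = [
    '{"question": "q1", "assistant": "A", "score": 4}',
    '{"question": "q2", "assistant": "B", "score": 3.5}',
]
rows, fields, types = summarize_records(sample)
print(rows, dict(fields), types)
```

The function accepts any iterable of lines, so an open file handle for the downloaded file can be passed in place of the sample list. Fields that appear in fewer records than the row count, or with more than one value type, are the ones most worth inspecting by hand.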
Provenance
Source
FalconerAI
Collection Method
Likely contains evaluation data generated by querying multiple AI assistants and scoring their responses.
Freshness
Last updated 2026-06-18 22:27:53; freshness should be verified.
License is unknown; terms of use must be verified before application.