The Eval Cards Backend Dataset contains pre-computed evaluation data for 5,678 models across 798 benchmarks. Generated by the eval-cards backend pipeline, it powers the Eval Cards frontend and includes 1,321 metric-level evaluations. The dataset was last generated on May 5, 2026.
Use Cases
- Compare model performance across the 798 covered benchmarks.
- Analyze metric-level results for specific models using the 1,321 metric-level evaluations.
- Track evaluation trends over time using the pipeline-generated data.
- Validate new model outputs against established benchmark results.
Strengths
- Includes evaluations for 5,678 distinct models.
- Covers 798 different benchmarks.
- Provides 1,321 metric-level evaluation records.
- Contains metadata for 240 benchmark cards.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported, which makes suitability harder to assess before download.
- The dataset was last updated 2026-06-03 09:29:44, while the data itself was last generated on 2026-05-05; verify freshness before use.
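Because column-level documentation is absent, a first pass over the downloaded records can help infer field semantics. The following is a minimal sketch, assuming the records have already been loaded as a list of dictionaries. The helper `summarize_fields` and the field names `model_id`, `benchmark`, and `score` are hypothetical illustrations, not the dataset's actual schema.

```python
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report, per field, the observed Python types and how often the value is missing."""
    summary: Dict[str, Dict[str, Any]] = {}
    total = 0
    for record in records:
        total += 1
        for name, value in record.items():
            entry = summary.setdefault(name, {"types": set(), "present": 0})
            if value is not None:
                entry["present"] += 1
                entry["types"].add(type(value).__name__)
    # A field counts as missing when it is None or absent from a record.
    for entry in summary.values():
        entry["missing"] = total - entry["present"]
    return summary


# Hypothetical sample rows; the real field names must be read from the download.
sample = [
    {"model_id": "model-a", "benchmark": "bench-1", "score": 0.71},
    {"model_id": "model-b", "benchmark": "bench-1", "score": None},
    {"model_id": "model-c", "benchmark": "bench-2"},
]

report = summarize_fields(sample)
for field, info in report.items():
    print(field, sorted(info["types"]), "missing:", info["missing"])
```

Running the same helper over the real records also yields the row count (the loop's `total`), which the dataset does not report.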
Provenance
- Source
  - Generated by the eval-cards backend pipeline.
- Collection Method
  - Pre-computed evaluation data.
- Freshness
  - Last generated 2026-05-05T11:30:42.961096Z
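The generation timestamp can be checked programmatically before relying on the data. The following is a rough sketch that parses the timestamp above and measures its age against a reference time. The function names `parse_utc` and `age_in_days` are hypothetical, and treating the "last updated" time 2026-06-03 09:29:44 as UTC is an assumption, since the source gives no time zone for it.

```python
from datetime import datetime, timezone

# Timestamp copied from the Freshness entry.
LAST_GENERATED = "2026-05-05T11:30:42.961096Z"


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime."""
    # Older Python versions reject a trailing 'Z', so swap in an explicit offset.
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def age_in_days(generated: str, reference: datetime) -> int:
    """Whole days elapsed between the generation time and the reference time."""
    return (reference - parse_utc(generated)).days


# Assumption: the "last updated" time is in UTC.
reference = datetime(2026, 6, 3, 9, 29, 44, tzinfo=timezone.utc)
age = age_in_days(LAST_GENERATED, reference)
print(f"Data was generated {age} whole days before the reference time")
```

In practice, `datetime.now(timezone.utc)` would replace the fixed reference time, and the acceptable age depends on how quickly the evaluated models and benchmarks change.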