FormalRx-Test is the official test split of the FormalRx diagnostic evaluation framework (Wang et al., 2025). It contains 7,030 pairs of natural-language and Lean 4 statements annotated under the SCI Error Taxonomy (Semantic, Constraint, Implementation). The dataset was created by LARK-Lab and last updated on Hugging Face in May 2026.
Use Cases
- Evaluate autoformalization model performance based on alignment verdicts.
- Diagnose and categorize errors in formalized statements based on the SCI Error Taxonomy.
- Localize errors within natural-language / Lean 4 statement pairs.
- Benchmark diagnostic capabilities of formalization tools.
- Train models to generate actionable feedback for formalization tasks.
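As a rough illustration of the first two use cases, the sketch below scores predicted alignment verdicts and SCI categories against gold annotations. The field names `aligned` and `category` and the toy records are assumptions made for this example; the dataset's actual column names are not documented.

```python
from collections import Counter

# Hypothetical records: "aligned" and "category" are assumed field names,
# not the dataset's documented schema.
GOLD = [
    {"aligned": False, "category": "Semantic"},
    {"aligned": False, "category": "Constraint"},
    {"aligned": True, "category": None},
    {"aligned": False, "category": "Implementation"},
]
PRED = [
    {"aligned": False, "category": "Semantic"},
    {"aligned": False, "category": "Semantic"},
    {"aligned": True, "category": None},
    {"aligned": True, "category": None},
]


def score(gold, pred):
    """Return verdict accuracy, category accuracy, and a confusion count."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    pairs = list(zip(gold, pred))
    # Verdict accuracy is computed over every pair.
    verdict_hits = sum(g["aligned"] == p["aligned"] for g, p in pairs)
    # Category accuracy is computed only over pairs whose gold label is misaligned.
    misaligned = [(g, p) for g, p in pairs if not g["aligned"]]
    category_hits = sum(g["category"] == p["category"] for g, p in misaligned)
    # Count (gold category, predicted category) combinations.
    confusion = Counter((g["category"], p["category"]) for g, p in misaligned)
    return {
        "verdict_accuracy": verdict_hits / len(pairs),
        "category_accuracy": category_hits / len(misaligned) if misaligned else None,
        "confusion": confusion,
    }


result = score(GOLD, PRED)
print(result["verdict_accuracy"], result["category_accuracy"])
```

Restricting category accuracy to gold-misaligned pairs is one possible design choice: it keeps correctly aligned pairs, which carry no error category, from inflating the score.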
Strengths
- Provides 7,030 annotated statement pairs for evaluation.
- Supports four diagnostic capabilities: alignment verdicts, error categorization, error localization, and a fourth capability that the available description does not name.
- Annotated under a structured SCI Error Taxonomy (Semantic, Constraint, Implementation).
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The row count is known (7,030), but the file format, download size, and license are not documented.
- The dataset is a test split only; no training or development data is included.
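Because the columns are undocumented, a first pass after download is usually to summarize each field's types, missing values, and a few example values. The sketch below does this for rows represented as Python dictionaries; the field names and placeholder values in `SAMPLE_ROWS` are hypothetical and only stand in for whatever the real files contain.

```python
from collections import defaultdict


def summarize_fields(rows, max_examples=2):
    """Summarize observed types, missing counts, and sample values per field."""
    summary = defaultdict(lambda: {"types": set(), "missing": 0, "examples": []})
    # Collect every key first so fields absent from some rows count as missing.
    all_keys = set()
    for row in rows:
        all_keys.update(row)
    for row in rows:
        for key in all_keys:
            info = summary[key]
            value = row.get(key)
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if len(info["examples"]) < max_examples and value not in info["examples"]:
                info["examples"].append(value)
    return dict(summary)


# Hypothetical rows: these field names and values are placeholders,
# not the dataset's real schema or content.
SAMPLE_ROWS = [
    {
        "nl_statement": "placeholder statement A",
        "lean_statement": "theorem a : True := trivial",
        "error_category": "Semantic",
    },
    {
        "nl_statement": "placeholder statement B",
        "lean_statement": "theorem b : True := trivial",
        "error_category": None,
    },
]

schema = summarize_fields(SAMPLE_ROWS)
for name, info in sorted(schema.items()):
    print(name, sorted(info["types"]), "missing:", info["missing"])
```

A field that is frequently missing, such as an error category that is empty for aligned pairs, is a useful hint about its semantics.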
Provenance
- Source: LARK-Lab
- Collection Method: Likely created as part of the FormalRx framework research (Wang et al., 2025).
- Freshness: Last updated 2026-05-07 03:34:00; freshness should be verified.