This dataset records an evaluation of the StarCoder2 3B base model on the Mostly Basic Python Problems (MBPP) benchmark. It contains raw evaluation metrics, execution telemetry logs, and structural syntax outputs captured from automated conversational pipelines. It was authored by ShahzebKhoso and last updated on May 28, 2026.
Use Cases
- Benchmarking code generation model performance based on MBPP evaluation metrics
- Analyzing model behavioral dynamics based on execution telemetry logs
- Studying structural syntax patterns in generated code outputs
- Comparing foundational model weights in automated conversational pipelines
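For the benchmarking use case, MBPP results are commonly summarized as pass@k. The dataset description does not say which metrics it actually reports, so the following is only a sketch of the widely used unbiased pass@k estimator, assuming each problem has `n` generated samples of which `c` passed the tests. The function name and argument names are illustrative, not taken from the dataset.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated for the problem (assumed to be recorded per problem)
    c: samples that passed all tests
    k: sampling budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts, not values from this dataset.
    print(pass_at_k(n=10, c=3, k=1))
```

Averaging this value over all problems gives the benchmark-level score. If the dataset stores only one sample per problem, pass@1 reduces to the plain fraction of problems solved.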
Strengths
- Focuses on a specific, established benchmark (MBPP) for Python code generation.
- Captures multiple data types including raw metrics, telemetry logs, and syntax outputs.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated, which makes it hard to assess suitability before download.
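Because row count and column semantics have to be established after download, a small profiling pass is a reasonable first step. The sketch below assumes the records have already been loaded into a list of dictionaries (for example from JSON lines). The field names in the sample (`task_id`, `passed`, `stderr`) are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Report row count plus observed value types and missing counts per column."""
    type_counts = defaultdict(Counter)  # column -> Counter of type names
    present = Counter()                 # column -> rows in which the key appears
    for row in records:
        for key, value in row.items():
            present[key] += 1
            type_counts[key][type(value).__name__] += 1

    total = len(records)
    columns = {}
    for key in present:
        absent = total - present[key]              # key not in the row at all
        nulls = type_counts[key].get("NoneType", 0)  # key present but None
        columns[key] = {
            "types": dict(type_counts[key]),
            "missing": absent + nulls,
        }
    return {"rows": total, "columns": columns}


# Hypothetical sample rows; the real schema must be read from the download.
sample = [
    {"task_id": 11, "passed": True, "stderr": None},
    {"task_id": 12, "passed": False, "stderr": "NameError"},
    {"task_id": 13, "passed": True},
]

if __name__ == "__main__":
    report = profile_records(sample)
    print(report["rows"])
    for name, info in report["columns"].items():
        print(name, info)
```

Columns with mixed types or many missing values are the ones most likely to need manual inspection before the data is used for comparison.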
Provenance
- Source: Hugging Face
- Collection Method: Running the MBPP benchmark against the StarCoder2 3B base model.
- Freshness: Last updated 2026-05-28 12:58:19; freshness should be verified.