Name: RouterBench: LLM Responses Across Eleven Models
Creator: withmartian
Published: 2024-03-04T21:31:01
Keywords: Size Categories: 10K<n<100K, Task Categories: Text Generation, Task Categories: Question Answering, Language: en, Code, Model Evaluation, Benchmarking, Text, AI Performance, Region: us, DOI: 10.57967/hf/1996, Large Language Models

Description

The dataset covers over 30,000 prompts drawn from standard benchmarks such as MBPP, GSM8K, and MMLU, with responses from 11 different large language models. Created by 'withmartian', it includes each prompt, each model's response, an estimated cost for the response, and a performance score indicating answer correctness. It was published on Hugging Face in March 2024.

Use Cases

Compare answer correctness scores across the 11 different LLMs for the same prompt.
Analyze the relationship between estimated cost and performance score for model responses.
Benchmark a new model's response against the 11 provided LLM responses on standard benchmark prompts.
Study response patterns and errors across models using the prompt and generated text fields.
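
The first two use cases can be sketched in a few lines of Python. This is a minimal illustration only: the field names (`prompt_id`, `model`, `score`, `cost`), the long one-row-per-response layout, and the sample values are all assumptions made for the example, since the actual column names are documented only on the Hugging Face dataset page. Adapt the keys after inspecting the real schema.

```python
from collections import defaultdict

# Hypothetical records: one row per (prompt, model) pair.
# Field names and values are placeholders, not real dataset rows.
rows = [
    {"prompt_id": "p1", "model": "model_a", "score": 1.0, "cost": 0.002},
    {"prompt_id": "p1", "model": "model_b", "score": 0.0, "cost": 0.0005},
    {"prompt_id": "p2", "model": "model_a", "score": 1.0, "cost": 0.004},
    {"prompt_id": "p2", "model": "model_b", "score": 1.0, "cost": 0.0005},
]


def summarize_by_model(rows):
    """Return per-model mean score, mean cost, and row count."""
    totals = defaultdict(lambda: {"score": 0.0, "cost": 0.0, "n": 0})
    for row in rows:
        t = totals[row["model"]]
        t["score"] += row["score"]
        t["cost"] += row["cost"]
        t["n"] += 1
    return {
        model: {
            "mean_score": t["score"] / t["n"],
            "mean_cost": t["cost"] / t["n"],
            "n": t["n"],
        }
        for model, t in totals.items()
    }


def cheapest_correct(rows):
    """For each prompt, pick the lowest-cost model with a score of at least 1.0."""
    best = {}
    for row in rows:
        if row["score"] < 1.0:
            continue  # skip responses not scored as fully correct
        current = best.get(row["prompt_id"])
        if current is None or row["cost"] < current["cost"]:
            best[row["prompt_id"]] = row
    return {pid: r["model"] for pid, r in best.items()}


summary = summarize_by_model(rows)
routing = cheapest_correct(rows)

for model, stats in sorted(summary.items()):
    print(model, stats)
print(routing)
```

The per-model summary supports the cost-versus-score comparison, while `cheapest_correct` shows the kind of per-prompt question a routing study might ask. Prompts with no correct response are simply absent from its result.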

Strengths

Over 30,000 prompts provide a substantial evaluation corpus.
Includes responses from 11 distinct LLMs for comparative analysis.
Prompts are sourced from multiple established benchmarks (e.g., MBPP, GSM8K, MMLU).

Limitations

Row counts per model and per benchmark are not stated, so coverage should be checked before any granular analysis.
The 'estimated cost' column's calculation method and currency are not detailed.
The dataset's composition and potential class imbalance across benchmarks are unspecified.

Provenance

Source: Hugging Face
Collection Method: Prompts collected from standard benchmarks; model responses generated and scored by the dataset author.
Freshness: Last updated March 2024.

License details are unknown and should be verified before use. The full description and potential column details are only available on the linked Hugging Face dataset page.
