This dataset pairs over 30,000 prompts from standard benchmarks such as MBPP, GSM8K, and MMLU with responses from 11 different large language models. Created by 'withmartian', it records each prompt, the model's response, an estimated cost for that response, and a performance score indicating answer correctness. It was published on Hugging Face in March 2024.
The license is unknown and should be verified before use. The full description and column details are available only on the linked Hugging Face dataset page.
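
Because each record carries a cost and a correctness score alongside the response, one natural use is to compare models on accuracy versus spend. The following is a minimal sketch of that aggregation in Python. The field names (`prompt`, `model`, `response`, `cost`, `score`), the model names, and the sample rows are all hypothetical placeholders, since the actual column names are only documented on the Hugging Face page.

```python
from collections import defaultdict

# Hypothetical rows: the field names and values are illustrative only and
# do not reflect the dataset's real schema or contents.
rows = [
    {"prompt": "p1", "model": "model-a", "response": "r1", "cost": 0.002, "score": 1.0},
    {"prompt": "p1", "model": "model-b", "response": "r2", "cost": 0.010, "score": 1.0},
    {"prompt": "p2", "model": "model-a", "response": "r3", "cost": 0.004, "score": 0.0},
    {"prompt": "p2", "model": "model-b", "response": "r4", "cost": 0.012, "score": 1.0},
]


def summarize(records):
    """Return the mean score and total estimated cost for each model."""
    totals = defaultdict(lambda: {"n": 0, "cost": 0.0, "score": 0.0})
    for record in records:
        entry = totals[record["model"]]
        entry["n"] += 1
        entry["cost"] += record["cost"]
        entry["score"] += record["score"]
    return {
        model: {
            "mean_score": entry["score"] / entry["n"],
            "total_cost": entry["cost"],
        }
        for model, entry in totals.items()
    }


summary = summarize(rows)
for model, stats in sorted(summary.items()):
    print(f"{model}: mean_score={stats['mean_score']:.2f}, total_cost={stats['total_cost']:.4f}")
```

To apply this to the real data, replace the sample rows with records loaded from the dataset and rename the keys to match the columns listed on its Hugging Face page.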