Description
Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. The benchmark uses a 5-layer recording system, DOM-match, and an LLM judge for evaluation, with a reported top score of 33.3%. It was created by reacher-z and last updated on 2026-05-19.
Use Cases
Benchmarking browser-based AI agents on the 153 defined tasks.
Evaluating agent robustness and generalization across 144 different live websites.
Developing new evaluation methods for web agents based on the 5-layer recording and LLM judge framework.
Strengths
Benchmarks performance on 153 distinct, everyday online tasks.
Evaluates agents across 144 live websites, providing real-world context.
Employs a multi-faceted evaluation method combining 5-layer recording, DOM-match, and an LLM judge.
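How the DOM-match and LLM-judge signals are combined is not documented here, so the following is only a minimal sketch of one plausible arrangement: a task counts as solved when a DOM check and a judge verdict both pass. Every name in it (`TaskResult`, `dom_match`, `evaluate`, `success_rate`, the selector-to-text mapping) is hypothetical and not taken from the benchmark's code, and the judge is a plain callable standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Mapping


@dataclass
class TaskResult:
    """Hypothetical container for one agent run on one task."""
    task_id: str
    final_dom: Mapping[str, str]  # assumed shape: CSS selector -> text content
    transcript: str               # assumed: the agent's action log as text


def dom_match(final_dom: Mapping[str, str], expected: Mapping[str, str]) -> bool:
    """Return True when every expected selector holds the expected text."""
    return all(final_dom.get(sel) == text for sel, text in expected.items())


def evaluate(
    result: TaskResult,
    expected: Mapping[str, str],
    judge: Callable[[str], bool],
) -> bool:
    """Assumption: a task passes only if the DOM check and the judge agree."""
    return dom_match(result.final_dom, expected) and judge(result.transcript)


def success_rate(outcomes: Iterable[bool]) -> float:
    """Percentage of passed tasks; 0.0 for an empty run."""
    outcomes = list(outcomes)
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Requiring both checks is the strictest way to combine them; an "either check passes" rule or a weighted score would be just as easy to express, and the benchmark may use something different.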
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and overall dataset scale are not reported, which makes it hard to assess suitability before download.
The top reported score is 33.3%, which suggests that current agents still fail most of the tasks.
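Because column documentation and row counts are missing, a quick survey after download can recover both. The sketch below assumes the records are stored as JSON Lines, one JSON object per line. That format is a guess, not something the listing states, and `summarize_fields` is a hypothetical helper name.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_fields(lines: Iterable[str]) -> Tuple[int, Dict[str, List[str]]]:
    """Count records and collect the Python type names seen for each field.

    Assumes each non-blank line is a JSON object (JSON Lines format).
    """
    row_count = 0
    field_types = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        row_count += 1
        for key, value in record.items():
            field_types[key].add(type(value).__name__)
    return row_count, {key: sorted(types) for key, types in field_types.items()}
```

Passing an open file handle works, since files iterate line by line, for example `with open(path, encoding="utf-8") as fh: rows, fields = summarize_fields(fh)`, where `path` is whatever file the download produces. Fields that show more than one type, or `NoneType`, are the ones most worth checking by hand.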
Provenance
Source
GitHub
Collection Method
Likely involves automated task recording and evaluation on live websites.