Description
MERRIN is a human-annotated benchmark for evaluating search-augmented agents on multi-hop reasoning over noisy, multimodal web sources. It measures agents' ability to identify relevant modalities, retrieve evidence from the open web, and reason over conflicting sources spanning text, images, video, and audio. The dataset was created by HanNight and was last updated on 2026-04-16.
Use Cases
Benchmarking multimodal search-augmented agents on multi-hop reasoning tasks.
Evaluating an agent's ability to identify relevant modalities without explicit cues.
Testing retrieval and reasoning over noisy, conflicting, and incomplete sources spanning text, images, video, and audio.
Strengths
Human annotation suggests curated quality suitable for evaluation.
Designed to measure specific, complex agent capabilities: modality identification, evidence retrieval, and reasoning over noise.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and license are unknown, which makes suitability hard to assess before download.
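Because column-level documentation is absent, a first pass after download could be a quick field inventory. The following is a minimal sketch, assuming the records have already been loaded into a list of Python dictionaries by whatever means the actual file format allows. The function name `summarize_fields` and the sample field names (`question`, `answer`, `modalities`) are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(rows: Iterable[Mapping[str, Any]]) -> dict[str, dict[str, Any]]:
    """Per field: observed Python type names, missing count, and one example value."""
    rows = list(rows)
    summary: dict[str, dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "missing": 0, "example": None}
    )
    # Collect every field name first so absent keys count as missing.
    field_names = {name for row in rows for name in row}
    for row in rows:
        for name in field_names:
            value = row.get(name)
            info = summary[name]
            if value is None:
                # Covers both an absent key and an explicit null.
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            if info["example"] is None:
                info["example"] = value
    return dict(summary)


# Hypothetical records; the real MERRIN field names are not documented.
sample_rows = [
    {"question": "placeholder question 1", "answer": "A", "modalities": ["text", "video"]},
    {"question": "placeholder question 2", "answer": None, "modalities": ["image"]},
    {"question": "placeholder question 3", "answer": "1999"},
]

for name, info in sorted(summarize_fields(sample_rows).items()):
    print(name, sorted(info["types"]), "missing:", info["missing"])
```

An inventory like this only reveals names, types, and gaps. The meaning of each field still has to be confirmed against the dataset's own documentation once it is available.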
Provenance
Source
HanNight via Hugging Face.
Collection Method
Human-annotated; the underlying sources were likely gathered from the noisy open web.
Freshness
Last updated 2026-04-16 02:20:14 (time zone not stated); check the source for newer revisions before use.
License
License is unknown; users must verify permissions before use.