Description
A processed version of the millawell/wikipedia_field_of_science dataset, prepared for retrieval-augmented generation systems with limited context windows. The dataset was created by user Laz4rz and last updated on Hugging Face on June 12, 2024. Longer Wikipedia science articles have been split into smaller entries, with each chunk designed to be around 256 tokens.
Use Cases
Testing retrieval performance in RAG systems based on chunked scientific text.
Benchmarking small-context language models using factual science passages.
Building knowledge bases for scientific Q&A applications using processed Wikipedia content.
Strengths
Chunks are processed for a specific small-context RAG use case, with titles added as prefixes.
A related version with 512-token chunks is available, which suggests the same preparation pipeline was applied at more than one chunk size.
The dataset was last updated on 2024-06-12, so it was still being maintained as of mid-2024.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and total dataset size are unknown, which makes it hard to assess suitability before download.
Descriptive metadata is sparse, so data quality has to be checked by manual inspection after download.
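Since column semantics have to be inferred after download, a quick per-column summary can help. The following is a minimal sketch, assuming the rows have already been loaded as a list of Python dictionaries. The helper name summarize_columns and the column names in the usage note are hypothetical, because the actual schema is not documented.

```python
from collections import defaultdict


def summarize_columns(rows):
    """Collect observed types, non-null counts, and max string length per column.

    `rows` is assumed to be an iterable of dicts (one dict per dataset row).
    """
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "max_len": 0})
    for row in rows:
        for name, value in row.items():
            info = summary[name]  # registers the column even if the value is None
            if value is None:
                continue
            info["types"].add(type(value).__name__)
            info["non_null"] += 1
            if isinstance(value, str):
                info["max_len"] = max(info["max_len"], len(value))
    return dict(summary)
```

For example, calling summarize_columns on a handful of rows with hypothetical "title" and "text" fields would show whether both are strings and how long the text field gets, which is a first hint at which column holds the chunk.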
Provenance
Source
Processed from the millawell/wikipedia_field_of_science dataset.
Collection Method
Wikipedia pages were split into smaller token-based chunks, with titles added as prefixes.
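The collection method described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the dataset author's actual script: whitespace-separated words stand in for real tokenizer tokens, the "title: " prefix format is a guess, the title is not counted against the token budget, and the function name chunk_article is hypothetical.

```python
def chunk_article(title, text, max_tokens=256):
    """Split `text` into chunks of at most `max_tokens` tokens, each prefixed with `title`.

    Assumption: tokens are approximated by whitespace-separated words;
    the original pipeline most likely used a model tokenizer instead.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        body = " ".join(tokens[start:start + max_tokens])
        # Hypothetical prefix format; the real separator is not documented.
        chunks.append(f"{title}: {body}")
    return chunks
```

Passing max_tokens=512 would produce the coarser variant, in line with the related 512-token version mentioned under Strengths.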
Freshness
Last updated 2024-06-12 15:57:16.
License is unknown and should be verified before use.