Description
A processed version of the millawell/wikipedia_field_of_science dataset, prepared for retrieval-augmented generation systems with limited context windows. The author Laz4rz split longer Wikipedia pages into smaller entries, with each chunk targeting approximately 512 tokens and the page title added as a prefix. The dataset was last updated on June 12, 2024.
Use Cases
Benchmarking RAG system performance on scientific queries based on the chunked Wikipedia text.
Training or fine-tuning embedding models for scientific document retrieval based on the 512-token text chunks.
Developing question-answering systems for science topics based on the processed Wikipedia entries.
Strengths
Chunks are sized for small-context RAG use, and each entry carries its page title as a prefix.
A variant with 256-token chunks is also available, offering flexibility for different model constraints.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, so the dataset's size cannot be assessed without downloading it.
Description metadata is limited; actual data quality requires manual inspection after download.
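Since the columns and the row count are undocumented, a first pass after download might simply tally them. The following is a minimal sketch, assuming the rows have already been loaded as an iterable of dictionaries by whatever loader is used; the helper name `summarize_rows` and the column names in the usage example are hypothetical, not taken from the dataset.

```python
from collections import defaultdict


def summarize_rows(rows):
    """Count rows and record which Python types appear in each column.

    `rows` is any iterable of dict-like records (hypothetical input shape).
    """
    row_count = 0
    column_types = defaultdict(set)
    for row in rows:
        row_count += 1
        for name, value in row.items():
            column_types[name].add(type(value).__name__)
    return {
        "row_count": row_count,
        "columns": {name: sorted(types) for name, types in column_types.items()},
    }


if __name__ == "__main__":
    # Hypothetical sample records; the real column names must be checked after download.
    sample = [
        {"text": "Physics: Physics is the natural science of matter.", "id": 1},
        {"text": "Chemistry: Chemistry is the study of substances.", "id": None},
    ]
    print(summarize_rows(sample))
```

A column that reports more than one type (for example both `int` and `NoneType`) is a hint that missing values are present and worth a closer look.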
Provenance
Source
Processed from millawell/wikipedia_field_of_science dataset on Hugging Face.
Collection Method
Longer Wikipedia pages were split into smaller entries, with titles added as a prefix.
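The splitting step described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the author's actual pipeline: whitespace-separated words stand in for tokenizer tokens, the `"title: body"` prefix format is a guess, and the title is not counted toward the chunk budget. The function name `chunk_page` is hypothetical.

```python
def chunk_page(title: str, text: str, max_tokens: int = 512) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` words, each prefixed with `title`.

    Assumption: words approximate tokens; a real pipeline would use the
    target model's tokenizer to measure chunk length.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        body = " ".join(words[start:start + max_tokens])
        # Assumed prefix format; the dataset's actual separator is not documented.
        chunks.append(f"{title}: {body}")
    return chunks


if __name__ == "__main__":
    page = " ".join(["word"] * 1200)
    for chunk in chunk_page("Physics", page):
        print(len(chunk.split()), chunk[:30])
```

Passing `max_tokens=256` would produce entries comparable in size to the smaller variant mentioned under Strengths, again only approximately, since word counts and token counts differ.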
Freshness
Last updated 2024-06-12 15:57:20; freshness should be verified.
License
The license is unknown and should be verified before use.