Wikipedia Science Articles Chunked for Small-Context RAG Systems

Name: Wikipedia Science Articles Chunked for Small-Context RAG Systems
Creator: Laz4rz
Published: 2024-06-12T14:57:00
Keywords: RAG, Text, Text Chunks, Wikipedia, Science

Available on 1 platform

Description

A processed version of the millawell/wikipedia_field_of_science dataset, prepared for retrieval-augmented generation systems with limited context windows. The author Laz4rz split longer Wikipedia pages into smaller entries, with each chunk targeting approximately 512 tokens and the page title added as a prefix. The dataset was last updated on June 12, 2024.
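The splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual procedure: it assumes whitespace-separated words as a stand-in for tokens (the real tokenizer is not stated), a hypothetical `"Title: "` prefix format, and that the title does not count toward the 512-token budget.

```python
def chunk_page(title, text, max_tokens=512):
    """Split a page into title-prefixed chunks of at most max_tokens words.

    Assumption: words split on whitespace approximate tokens; the dataset's
    real tokenizer and prefix separator are not documented.
    """
    words = text.split()
    chunks = []
    # Walk the page in fixed-size windows; the last chunk may be shorter.
    for start in range(0, len(words), max_tokens):
        body = " ".join(words[start:start + max_tokens])
        chunks.append(f"{title}: {body}")
    return chunks


if __name__ == "__main__":
    page = "word " * 1200
    chunks = chunk_page("Photosynthesis", page)
    print(len(chunks))  # 3 chunks: 512 + 512 + 176 words
```

A 256-word limit can be tried by passing `max_tokens=256`, which mirrors the smaller variant mentioned under Strengths.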

Use Cases

Benchmarking RAG system performance on scientific queries against the chunked Wikipedia text.
Training or fine-tuning embedding models for scientific document retrieval using the 512-token chunks.
Developing question-answering systems for science topics on top of the processed Wikipedia entries.
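For the benchmarking use case, one common retrieval metric is hit rate at k: the share of queries for which at least one relevant chunk appears in the top k results. The sketch below assumes you supply your own queries and relevance labels, since the listing does not mention any shipped with the dataset; the chunk ids and the function name are hypothetical.

```python
def hit_rate_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of queries whose top-k retrieved chunk ids include a relevant id.

    ranked_ids:   one ranked list of chunk ids per query (best first).
    relevant_ids: one collection of relevant chunk ids per query.
    Both inputs are assumed to be user-built; ids here are placeholders.
    """
    if not ranked_ids:
        return 0.0
    hits = 0
    for retrieved, relevant in zip(ranked_ids, relevant_ids):
        # A query counts as a hit if any relevant id is in its top k.
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(ranked_ids)


if __name__ == "__main__":
    ranked = [["c1", "c2", "c3"], ["c9", "c4", "c7"]]
    relevant = [{"c2"}, {"c7"}]
    print(hit_rate_at_k(ranked, relevant, k=2))  # 0.5
    print(hit_rate_at_k(ranked, relevant, k=3))  # 1.0
```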

Strengths

Chunks are sized for small-context RAG, and each entry carries its page title as a prefix.
A variant with 256-token chunks is also available, offering flexibility for different model constraints.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
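Because columns and row count are undocumented, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded into a list of dictionaries; the loading step is left out because the listing gives neither the repository id nor the file format. The field names `title` and `text` in the example are hypothetical.

```python
from collections import Counter


def summarize_records(records):
    """Report the row count and, for each field, the value types observed."""
    field_types = {}
    for row in records:
        for name, value in row.items():
            # Count how often each Python type appears under this field.
            field_types.setdefault(name, Counter())[type(value).__name__] += 1
    return {
        "rows": len(records),
        "fields": {name: dict(counts) for name, counts in field_types.items()},
    }


if __name__ == "__main__":
    # Hypothetical sample rows; real field names must be checked after download.
    sample = [
        {"title": "Photosynthesis", "text": "Photosynthesis: ..."},
        {"title": "Entropy", "text": None},
    ]
    print(summarize_records(sample))
```

Mixed types or `NoneType` counts under a field are a quick signal that the column needs cleaning before embedding or indexing.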

Provenance

Source: Processed from millawell/wikipedia_field_of_science dataset on Hugging Face.
Collection Method: Longer Wikipedia pages were split into smaller entries, with titles added as a prefix.
Freshness: Last updated 2024-06-12 15:57:20; freshness should be verified.

License is unknown and should be verified before use.

Quality Score

Overall: D34
Description: 42
Source: 36
Reputation: 20
Access: 26

Community

Downloads: 45
Likes: 4
Views: 0

Dataset Info

Author: Laz4rz
Created: Jun 12, 2024
Updated: Jun 12, 2024
Last synced: Jul 2, 2026
