Wikipedia Science Articles Chunked for Small-Context RAG Systems

Name: Wikipedia Science Articles Chunked for Small-Context RAG Systems
Creator: Laz4rz
Published: 2024-06-12T15:50:08
Keywords: RAG, Text, Text Chunks, Wikipedia, Science

by Laz4rz, updated 2y ago

Description

A processed version of the millawell/wikipedia_field_of_science dataset, prepared for retrieval-augmented generation systems with limited context windows. The dataset was created by user Laz4rz and last updated on Hugging Face on June 12, 2024. Longer Wikipedia science articles have been split into smaller entries, with each chunk designed to be around 256 tokens.

Use Cases

Testing retrieval performance in RAG systems based on chunked scientific text.
Benchmarking small-context language models using factual science passages.
Building knowledge bases for scientific Q&A applications using processed Wikipedia content.
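The first use case, testing retrieval over chunked text, can be illustrated with a minimal sketch. It assumes a plain word-overlap scorer as a stand-in for a real retriever, and the function names (`score`, `retrieve`, `recall_at_k`) and sample chunks are hypothetical, not part of the dataset or any library.

```python
from collections import Counter


def score(query, chunk):
    """Count lowercase word overlaps between query and chunk (stand-in scorer)."""
    q = Counter(query.lower().split())
    c = Counter(chunk.lower().split())
    return sum(min(q[t], c[t]) for t in q)


def retrieve(query, chunks, k=3):
    """Return the indices of the k chunks with the highest overlap score."""
    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(query, chunks[i]),
                    reverse=True)
    return ranked[:k]


def recall_at_k(examples, chunks, k=3):
    """Fraction of (query, gold_index) pairs whose gold chunk is in the top k."""
    hits = sum(1 for query, gold in examples
               if gold in retrieve(query, chunks, k))
    return hits / len(examples)


# Hypothetical title-prefixed chunks, mimicking the format described on this page.
chunks = [
    "Photosynthesis: plants convert light into chemical energy",
    "Mitosis: a cell divides into two daughter cells",
    "Gravity: masses attract each other",
]
examples = [
    ("how do plants convert light", 0),
    ("how does a cell divide into two", 1),
]
print(recall_at_k(examples, chunks, k=1))
```

In practice the overlap scorer would be swapped for an embedding model or BM25 implementation; the recall calculation stays the same.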

Strengths

Chunks are processed for a specific small-context RAG use case, with titles added as prefixes.
A related 512-token chunked version is available, suggesting a structured preparation method.
Dataset was published and last updated on 2024-06-12, so all entries date from a single release.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and total dataset size are unknown, which makes it hard to assess suitability before download.
The description metadata is limited; actual data quality requires manual inspection after download.
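Because the fields are undocumented, a first inspection after download might look like the sketch below. It assumes the rows have already been loaded as a list of dictionaries; the field names `title` and `text` in the sample are hypothetical placeholders, since the real column names are not listed here.

```python
def summarize_fields(records):
    """Collect, per field, the value types seen, non-null count, and longest string."""
    summary = {}
    for row in records:
        for name, value in row.items():
            info = summary.setdefault(
                name, {"types": set(), "non_null": 0, "max_len": 0})
            if value is None:
                continue  # nulls are recorded only by their absence from non_null
            info["types"].add(type(value).__name__)
            info["non_null"] += 1
            if isinstance(value, str):
                info["max_len"] = max(info["max_len"], len(value))
    return summary


# Hypothetical sample rows; real field names must be checked after download.
records = [
    {"title": "Atom", "text": "Atom: smallest unit"},
    {"title": None, "text": "x"},
]
for field, info in summarize_fields(records).items():
    print(field, info)
print("rows:", len(records))
```

The same pass also yields the row count, which is not reported on this page.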

Provenance

Source: Processed from millawell/wikipedia_field_of_science dataset.
Collection Method: Wikipedia pages were split into smaller token-based chunks, with titles added as prefixes.
Freshness: Last updated 2024-06-12 15:57:16.
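
The collection method can be sketched roughly as follows. This is not the creator's actual code: it assumes whitespace-separated words as a stand-in for tokens, a "title: body" prefix format, and a hypothetical function name `chunk_article`. The 256 default follows the chunk size given in the description; the tokenizer actually used is not stated.

```python
def chunk_article(title, text, max_tokens=256):
    """Split text into pieces of at most max_tokens words, each prefixed with the title.

    Assumption: whitespace words approximate tokens, and the title prefix
    is not counted toward the max_tokens budget.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        body = " ".join(words[start:start + max_tokens])
        chunks.append(f"{title}: {body}")
    return chunks


# A 600-word article yields three chunks: 256 + 256 + 88 words.
article = " ".join(["word"] * 600)
pieces = chunk_article("Photosynthesis", article)
print(len(pieces))
```

A real pipeline would likely count tokens with the target model's tokenizer, so chunk boundaries would differ from this word-based approximation.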

License is unknown and should be verified before use.

Quality Score

Overall: D34
Description: 42
Source: 36
Reputation: 19
Access: 26

Community

34 downloads
3 likes
0 views

Dataset Info

Author: Laz4rz
Created: Jun 12, 2024
Updated: Jun 12, 2024
Last synced: Jul 2, 2026
