Description
Omnibook is a collection of general culture texts compiled by OvercastLab. It includes complete works, historical scientific texts, philosophy, political thought, poetry, drama, and essays sourced from Project Gutenberg. The dataset was last updated on 2026-05-11.
Use Cases
Train language models on general culture based on the collection of complete works.
Fine-tune models for literary analysis based on poetry, drama, and essays.
Study the evolution of ideas based on historical scientific texts and philosophy.
Generate text in historical styles based on unmodified original sources.
Strengths
Texts are manually proofread by Project Gutenberg, indicating high fidelity.
Sources are from Project Gutenberg, a known repository of public domain texts.
Near-identical editions of the same work are deduplicated.
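The dataset card does not say how near-identical editions are detected. As a rough illustration only, one common approach compares word shingles with Jaccard similarity. The function names, the shingle size `k`, and the `threshold` value below are assumptions made for this sketch, not the method used by OvercastLab.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles of a text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words collapse to a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(texts, threshold: float = 0.8, k: int = 5):
    """Keep the first text of each near-identical group.

    Hypothetical helper: compares every text against all kept texts,
    so it is quadratic and only suitable for small collections.
    """
    kept = []
    kept_shingles = []
    for text in texts:
        current = shingles(text, k)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept
```

For a collection of full-length books, the pairwise comparison would likely be too slow; an approximate scheme such as MinHash is the usual substitute, but that is outside what the card documents.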
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not documented, which makes it hard to assess suitability before download.
The description metadata is limited; actual data quality requires manual inspection after download.
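Because neither the columns nor the row count are documented, a first inspection pass after download might look like the sketch below. It assumes the records have already been parsed into dictionaries (for example from JSON Lines); the file format is not stated in the card, and the field names in the usage example are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_records(records):
    """Return the row count and, per field, observed value types and longest string.

    `records` is any iterable of dicts; this is an assumption about the data
    layout, since the card does not document the schema.
    """
    row_count = 0
    types = defaultdict(Counter)
    max_len = defaultdict(int)
    for record in records:
        row_count += 1
        for key, value in record.items():
            # Count which Python types appear under each field name.
            types[key][type(value).__name__] += 1
            if isinstance(value, str):
                max_len[key] = max(max_len[key], len(value))
    fields = {
        key: {"types": dict(counts), "max_str_len": max_len.get(key, 0)}
        for key, counts in types.items()
    }
    return {"row_count": row_count, "fields": fields}


# Usage with made-up records; "title", "text", and "year" are hypothetical fields.
sample = [
    {"title": "Hamlet", "text": "To be"},
    {"title": "Odes", "text": "Thou still", "year": 1819},
]
summary = summarize_records(sample)
print(summary["row_count"])
print(sorted(summary["fields"]))
```

Fields that appear in only some rows, or that mix types, show up directly in the per-field type counts, which gives a starting point for inferring field semantics.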
Provenance
Source
Project Gutenberg
Collection Method
Text bodies are taken unmodified from the original Project Gutenberg sources; only the boilerplate headers and footers are removed.
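The card does not describe how the boilerplate is removed. A minimal sketch, assuming the raw files carry the `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF THE PROJECT GUTENBERG EBOOK ... ***` marker lines commonly found in Project Gutenberg plain-text files (older files often use `THIS` instead of `THE`), could look like this. The function name and patterns are illustrative, and marker wording varies across files, so real data may need more patterns.

```python
import re

# Marker lines as commonly seen in Project Gutenberg plain-text files.
# Assumption: the wording varies between files and these two patterns
# are not exhaustive.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    # Ignore an end marker that appears before the start marker.
    stop = end.start() if end and end.start() >= begin else len(raw)
    return raw[begin:stop].strip()
```

Leaving the text untouched when a marker is missing is a deliberately conservative choice: it avoids silently deleting body text from files that use a different marker format.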
Freshness
Last updated 2026-05-11 17:12:16.
License is unknown; terms of use must be verified.