Sign in to view source links and access this dataset
Description
27 public-domain educational texts published before 1930 form this supervised fine-tuning dataset. The texts, sourced from the Internet Archive, span natural science, history, law, philosophy, and grammar, and are written in a question-and-answer catechism format. The dataset was created by zachnorton03 and last updated on June 19, 2026.
Use Cases
Instruction tuning of language models based on the question-and-answer format described.
Training models on historical and formal English language styles based on the pre-1930 texts.
Developing educational chatbots using the structured, pedagogical content from the source materials.
Analyzing the evolution of language and knowledge presentation across disciplines mentioned in the description.
Strengths
Derived from 27 distinct source texts, providing a multi-disciplinary corpus.
Texts are in a structured question-and-answer format, which is naturally suited for instruction tuning.
All source texts are in the public domain, simplifying legal use and redistribution.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which may limit suitability assessment.
Data may reflect temporal and disciplinary bias inherent to the selected pre-1930 educational texts.
Provenance
Source
Internet Archive
Collection Method
Derived from 27 public-domain educational texts.
Time Range
Texts published before 1930, spanning the 19th and early 20th centuries.
Freshness
Last updated 2026-06-19 18:28:46; freshness should be verified.
License is unknown; users should verify the license status before use.