Name: Symbolic Pretraining Pile: Procedurally Generated Data for Formal Reasoning
Creator: reasoning-core
Published: 2025-12-16T10:17:03
Keywords: formal logic, library:polars, procedural, task_categories:question-answering, library:dask, language:en, size_categories:10M<n<100M, modality:text, library:mlcroissant, library:datasets, pretraining, text, Parquet, SFT, region:us, reasoning, arxiv:2509.18083, arxiv:2603.02208, synthetic data, license:mit, mathematical computation, synthetic, formal, symbolic reasoning

Description

The Symbolic Pretraining Pile (SPT) is a dataset for symbolic and formal pre-training, mid-training, and supervised fine-tuning. It is procedurally generated on CPU, has adjustable difficulty, and can be scaled to the trillion-token range. The dataset was created by reasoning-core and last updated on March 23, 2026.

Use Cases

Pre-training language models on formal reasoning tasks such as planning and conjecture entailment.
Supervised fine-tuning for logic-based natural language inference using the logic_nli task category.
Training models for mathematical computation on arithmetic and equation-system tasks.
Generating synthetic data for proof reconstruction and evidence retrieval tasks.

Strengths

Data is procedurally generated, so it can be scaled up to the trillion-token range.
Task difficulty is adjustable via a single parameter, enabling controlled experimentation.
Covers multiple formal reasoning categories including planning, logic, and mathematical computation.
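
As a rough illustration of what "procedurally generated with a single difficulty parameter" can mean, here is a minimal Python sketch of a toy arithmetic task generator. It is not the dataset's actual generator: the function names, the `prompt`/`answer`/`difficulty` field names, and the way difficulty maps to operand count and magnitude are all assumptions made for this example.

```python
import random


def make_arithmetic_task(difficulty: int, rng: random.Random) -> dict:
    """Build one toy arithmetic problem (hypothetical schema, not the dataset's).

    `difficulty` controls both the number of operands (difficulty + 1)
    and their magnitude (up to 10 ** difficulty).
    """
    if difficulty < 1:
        raise ValueError("difficulty must be >= 1")
    n_operands = difficulty + 1
    operands = [rng.randint(1, 10 ** difficulty) for _ in range(n_operands)]
    operators = [rng.choice("+-*") for _ in range(n_operands - 1)]

    expression = str(operands[0])
    for op, value in zip(operators, operands[1:]):
        expression += f" {op} {value}"

    # The expression is built only from integers and + - *, so evaluating it
    # with builtins disabled is safe; the result is the ground-truth answer.
    answer = eval(expression, {"__builtins__": {}})
    return {
        "prompt": f"Compute: {expression}",
        "answer": str(answer),
        "difficulty": difficulty,
    }


def generate(n: int, difficulty: int, seed: int = 0) -> list:
    """Generate `n` tasks deterministically from a seed."""
    rng = random.Random(seed)
    return [make_arithmetic_task(difficulty, rng) for _ in range(n)]


if __name__ == "__main__":
    for task in generate(3, difficulty=2, seed=0):
        print(task)
```

Because the answer is computed rather than annotated, generation needs only CPU time, and raising `difficulty` or `n` scales the output without any additional labeling effort.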

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Exact row count and file size are not stated (the size tag gives only a range of 10M to 100M rows), which may limit suitability assessment.
Data is synthetic and procedurally generated, which may not reflect real-world complexity or distribution.

Provenance

Source: huggingface
Collection Method: Procedurally generated on CPU.
Time Range: Synthetic data; no temporal coverage.
Freshness: Last updated 2026-03-23 19:19:21.
Geography: Synthetic data; no spatial coverage.

The dataset tags list an MIT license, but no license text is reproduced here; verify the terms of use before download.
