Hebrew Bible Numerical Data Based on the Leningrad Codex
by Shaked, Guy / TwoHillsLab Dataverse · Updated 2026-04-10
Description
This structured, quantitative dataset represents the Hebrew Bible (Tanakh) as numerical data extracted from the Leningrad Codex. It provides data points such as word frequencies and verse metrics in a machine-readable CSV format for computational analysis. The dataset was created by Guy Shaked of TwoHillsLab Dataverse and was last updated on April 10, 2026.
Use Cases
Perform stylometric authorship analysis based on word frequency patterns.
Conduct statistical analysis of textual structure based on verse and chapter metrics.
Train models for computational linguistics tasks based on the machine-readable representation of the text.
Compare linguistic features across biblical books based on systematic character and word counts.
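As a rough illustration of the frequency-based comparisons listed above, the following Python sketch turns per-book word counts into relative frequencies and compares two profiles with cosine similarity. Cosine similarity is only one common choice for this kind of comparison, not a method the dataset prescribes. The book names, tokens, and counts are invented placeholders, since the dataset's actual columns are undocumented.

```python
import math
from collections import Counter


def relative_frequencies(counts):
    """Convert raw word counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    shared = set(p) & set(q)
    dot = sum(p[w] * q[w] for w in shared)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


# Hypothetical counts; placeholder tokens stand in for real word forms.
book_a = Counter({"w1": 30, "w2": 10, "w3": 5})
book_b = Counter({"w1": 60, "w2": 20, "w3": 10})
book_c = Counter({"w1": 5, "w2": 10, "w4": 30})

profile_a = relative_frequencies(book_a)
profile_b = relative_frequencies(book_b)
profile_c = relative_frequencies(book_c)

sim_ab = cosine_similarity(profile_a, profile_b)
sim_ac = cosine_similarity(profile_a, profile_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Working with relative rather than raw frequencies keeps books of very different lengths comparable, which matters when comparing short and long books of the Tanakh.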
Strengths
Data is derived from the authoritative Leningrad Codex source.
Covers the complete Tanakh, providing full textual coverage.
Data is structured in a portable, machine-readable CSV format (UTF-8).
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and file size are unknown, so suitability is hard to assess before download.
Description metadata is limited; actual data quality requires manual inspection after download.
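Because the columns are undocumented, a first inspection pass after download might look like the Python sketch below. It uses only the standard library to list the column names, count the rows, and guess whether each column is numeric or text. The sample rows and column names (book, chapter, verse, word_count) are made up for illustration and are not the dataset's real schema.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column names, row count, inferred kind per column) for a CSV stream."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    kinds = {name: "empty" for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = (row.get(name) or "").strip()
            if not value:
                continue
            try:
                float(value)
                kind = "numeric"
            except ValueError:
                kind = "text"
            # A column stays numeric only if every non-empty value parses as a number.
            if kinds[name] == "empty":
                kinds[name] = kind
            elif kinds[name] != kind:
                kinds[name] = "text"
    return columns, row_count, kinds


# Made-up sample standing in for the downloaded file.
sample = io.StringIO(
    "book,chapter,verse,word_count\n"
    "Genesis,1,1,7\n"
    "Genesis,1,2,14\n"
)
columns, row_count, kinds = summarize_csv(sample)
print(columns)
print(row_count)
print(kinds)
```

For the real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever the CSV was saved.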
Provenance
Source
Codex Leningradensis (Leningrad Codex)
Collection Method
Systematic extraction of numerical data points from the source text.
Time Range
Covers the complete Tanakh (Hebrew Bible).
Freshness
Last updated 2026-04-10 16:03:22; freshness should be verified.
Geography
Ancient Near Eastern religious texts.
License
Unknown; terms of use must be verified before the data is used.