OLMo Mix 1124 is a collection of data used to train the OLMo-2-1124 models, released in November 2024. The collection was created by AllenAI. Most of it, 3.70 trillion tokens, comes from the DCLM-Baseline source; the other components include arXiv papers, peS2o, StarCoder, and Algebraic Stack.
License information is provided per component (e.g., CC-BY-4.0, ODC-BY); users must verify compliance for intended use.