HRM-Text: A Cleaned Pretraining Dataset for Efficient Language Model Scaling | DataSalon

Genomics & Bioinformatics

Name: HRM-Text: A Cleaned Pretraining Dataset for Efficient Language Model Scaling
Creator: sapientinc
Published: 2026-05-15T09:13:54
Keywords: Cleaned Data, Pretraining, Text, Text Data, Hrm Text

Available on 1 platform

Description

HRM-Text is a pre-built dataset for language model pretraining, created by applying data_io cleaning scripts to raw text data. The dataset was uploaded by the organization sapientinc and was last updated on May 21, 2026. It is associated with a research paper titled 'HRM-Text: Efficient Pretraining Beyond Scaling'.

Use Cases

Pretraining language models on the cleaned text corpus.
Benchmarking efficient pretraining scaling laws based on the HRM-Text methodology.
Studying the impact of data cleaning scripts on pretraining performance.

Strengths

Dataset has been cleaned using specific data_io scripts, suggesting a structured preprocessing pipeline.
Associated with the research paper 'HRM-Text: Efficient Pretraining Beyond Scaling' (Wang et al., 2026), providing academic context.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and total data size are not listed, which makes it harder to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
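
Because column documentation and row counts are missing, a first inspection pass after download might look like the sketch below. It assumes the downloaded data is stored as JSON Lines (one JSON object per line), which this listing does not confirm; the function name summarize_records and the sample field names "text" and "source" are hypothetical and used only for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_records(lines: Iterable[str]) -> dict:
    """Count rows and tally (field name, value type) pairs.

    Assumption: each non-empty line is one JSON object (JSON Lines).
    The actual HRM-Text file format is not documented in the listing.
    """
    rows = 0
    fields: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows += 1
        if not isinstance(record, dict):
            # Record is not an object; tally its type under a placeholder key.
            fields[("<non-object>", type(record).__name__)] += 1
            continue
        for key, value in record.items():
            fields[(key, type(value).__name__)] += 1
    return {"rows": rows, "fields": dict(fields)}


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the dataset.
    sample = [
        '{"text": "hello world", "source": "web"}',
        '{"text": "second document"}',
    ]
    print(summarize_records(sample))
```

To run this on a real file, pass an open file handle (for example, open(path, encoding="utf-8")) in place of the sample list. Fields that appear in fewer rows than the total row count are optional, and any field that shows up with more than one value type is worth checking by hand.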

Provenance

Source: sapientinc
Collection Method: Created by applying data_io cleaning scripts to raw text data.
Freshness: Last updated 2026-05-21 03:34:11; verify against the source before use.

License is unknown; users should verify terms of use before downloading.

Quality Score

D39

Community

1.8K downloads

4 likes

0 views

Dataset Info

Author: sapientinc
Created: May 15, 2026
Updated: May 21, 2026
Last synced: Jul 3, 2026
