DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Education

OpenHermes 2.5: 300,466 High-Perplexity Instruction Samples

Name: OpenHermes 2.5: 300,466 High-Perplexity Instruction Samples
Creator: Malum0x
Published: 2026-03-07T21:32:45
Keywords: Library: polars, Modality: text, Size category: 100K<n<1M, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, Region: US, JSON, License: Apache 2.0

Available on 1 platform

Description

This dataset contains 300,466 high-perplexity text samples that Malum0x filtered from the OpenHermes 2.5 dataset in March 2026. The samples are the top 30% of source records ranked by cross-entropy loss, as scored by Qwen2.5-3B-Instruct.
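
Perplexity is the exponential of the mean per-token cross-entropy loss, so ranking records by loss is the same as ranking them by perplexity. The sketch below illustrates that relationship in plain Python. It is not the creator's scoring code: the function names and the toy logits are assumptions for illustration, and a real run would take its logits from Qwen2.5-3B-Instruct.

```python
import math
from typing import Sequence


def mean_cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean per-token cross-entropy (in nats) for one sequence.

    logits[t] holds the unnormalized scores over the vocabulary at position t,
    and targets[t] is the id of the token that actually occurs there.
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        # Negative log-probability of the target token.
        total += log_z - row[target]
    return total / len(targets)


def perplexity(mean_loss: float) -> float:
    """Perplexity is exp(mean cross-entropy)."""
    return math.exp(mean_loss)


# Toy input (hypothetical): 4 positions, vocabulary of 8, uniform scores.
uniform_logits = [[0.0] * 8 for _ in range(4)]
loss = mean_cross_entropy(uniform_logits, [0, 1, 2, 3])
ppl = perplexity(loss)
```

With uniform scores over 8 tokens the model is maximally uncertain, so the loss equals ln 8 and the perplexity equals the vocabulary size.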

Use Cases

Fine-tuning language models on high-difficulty instruction samples to improve performance on edge cases
Analyzing model loss patterns by inspecting high-perplexity text records
Curating high-signal training data for small language models

Strengths

300,466 filtered samples
Objective filtering via Qwen2.5-3B-Instruct cross-entropy loss
Apache 2.0 licensed

Limitations

High perplexity may capture malformed text or noise rather than just complex logic
Model-specific bias from the Qwen2.5-3B-Instruct evaluator
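
One way to act on the first limitation is to pre-screen records for obviously malformed text before training. The following is only a rough heuristic sketch and is not part of the dataset or its pipeline: the function names and the 0.4 threshold are hypothetical, and legitimate code-heavy or math-heavy samples may also be flagged, so results should be reviewed by hand.

```python
def symbol_ratio(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0  # Treat empty text as maximally suspicious.
    odd = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return odd / len(text)


def looks_noisy(text: str, max_symbol_ratio: float = 0.4) -> bool:
    """Flag text whose symbol ratio exceeds an arbitrary, hypothetical threshold."""
    return symbol_ratio(text) > max_symbol_ratio


# Hypothetical records, for illustration only.
records = ["Explain how binary search works.", "@@##$$%%^^&&", ""]
flagged = [looks_noisy(r) for r in records]
```

A flag here only means "inspect this record"; it does not distinguish noise from genuinely hard content on its own.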

Provenance

Source: teknium/OpenHermes-2.5
Collection Method: Filtered from source using cross-entropy loss scoring by Qwen2.5-3B-Instruct (4-bit NF4)
Freshness: Last updated March 2026.

Samples were selected using the 70th-percentile loss as the threshold, which keeps the top 30% of records by loss; users should verify whether high perplexity indicates complexity or data noise for their specific use case.
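
The percentile-based selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the creator's code: it uses a nearest-rank percentile and a strict "greater than" comparison, neither of which is confirmed by the source, and the function names and loss values are hypothetical.

```python
import math
from typing import List, Sequence


def percentile_threshold(losses: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(losses)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def select_high_loss(losses: Sequence[float], pct: float = 70.0) -> List[int]:
    """Return indices of records whose loss is strictly above the pct-th percentile."""
    threshold = percentile_threshold(losses, pct)
    return [i for i, loss in enumerate(losses) if loss > threshold]


# Hypothetical per-record losses, for illustration only.
example_losses = [5, 1, 9, 3, 7, 2, 8, 4, 10, 6]
threshold = percentile_threshold(example_losses, 70.0)
kept = select_high_loss(example_losses, 70.0)
```

On these ten values the threshold is 7 and three records are kept, which is the top 30%.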

Quality Score

D40

Community

251 downloads

1 like

0 views

Dataset Info

Author: Malum0x
Created: Mar 7, 2026
Updated: Mar 8, 2026
Last synced: Apr 29, 2026
