Name: Propella Annotations: Multilingual Document Labels for LLM Training Data
Creator: openeurollm
Published: 2026-01-12T17:15:05
Keywords: Language: ara, Language: dan, Language: glg, Language: est, Language: cat, Language: hin, Language: deu, Language: ben, Language: bul, Language: bos, Language: fra, Language: heb, Language: gsw, Language: ell, Document Quality, Language: fin, Language: eus, Language: fas, Text, Multilingual, Language: ces, Language: eng, LLM Training, Language: gle, Geospatial, Text Annotation

Description

These annotations label text documents on 18 properties and were generated by propella-1-4b, a small multilingual language model. The properties are organized into six categories, including core content, quality, and safety. The dataset was created by openeurollm and last updated on March 20, 2026.

Use Cases

Filtering training datasets by the annotated document quality and value scores
Selecting documents for specific audiences based on annotated purpose and audience properties
Curating multilingual text corpora using geographic relevance and language annotations
Screening documents for safety and compliance concerns using the annotated safety category
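
The filtering and screening use cases above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's documented schema: the field names `quality_score` and `safety`, the numeric scale, and the label `"unsafe"` are all hypothetical placeholders, since the column-level semantics are not documented and must be checked after download.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Row = Dict[str, Any]


def filter_annotations(
    rows: Iterable[Row],
    min_quality: int = 3,
    blocked_safety: Tuple[str, ...] = ("unsafe",),
) -> Iterator[Row]:
    """Yield rows that pass a quality threshold and a safety screen.

    ASSUMPTION: "quality_score" and "safety" are hypothetical field names
    used for illustration; replace them with the dataset's real columns.
    """
    for row in rows:
        quality = row.get("quality_score")
        # Drop rows with a missing or too-low quality annotation.
        if quality is None or quality < min_quality:
            continue
        # Drop rows whose safety label is on the block list.
        if row.get("safety") in blocked_safety:
            continue
        yield row


# Made-up records, only to show the call pattern.
sample_rows = [
    {"id": "a", "quality_score": 4, "safety": "safe"},
    {"id": "b", "quality_score": 1, "safety": "safe"},
    {"id": "c", "quality_score": 5, "safety": "unsafe"},
    {"id": "d", "safety": "safe"},
]
kept_ids = [row["id"] for row in filter_annotations(sample_rows)]
```

Writing the filter as a generator keeps memory use flat, which may matter here because the row count is unknown.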

Strengths

Annotations cover 18 distinct properties across six categories, providing multi-faceted document labels
Annotations were produced by a multilingual model, suggesting applicability to text in multiple languages
Dataset was last updated on 2026-03-20, indicating recent maintenance

Limitations

Column-level documentation is absent; field semantics must be inferred after download
Row count is not reported, which makes it harder to assess whether the dataset is large enough for a given use
The listed description is incomplete and points to a full description on an external page, so complete details require a click-through
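
Because field semantics have to be inferred after download, one possible first step is to tabulate the most common values per field in a sample of rows. The sketch below assumes only that each row is available as a Python dictionary; the helper name `summarize_fields` and the example rows are made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def summarize_fields(
    rows: Iterable[Dict[str, Any]], max_values: int = 5
) -> Dict[str, List[Tuple[str, int]]]:
    """Return the most frequent values (as repr strings) seen for each field."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        for field, value in row.items():
            # repr() makes unhashable values (lists, dicts) countable.
            counts[field][repr(value)] += 1
    return {
        field: counter.most_common(max_values)
        for field, counter in counts.items()
    }


# Hypothetical rows; real field names must come from the downloaded data.
summary = summarize_fields([{"a": 1}, {"a": 1}, {"a": 2, "b": "x"}])
```

Fields with a handful of repeated values are likely categorical labels, while fields with mostly unique values are more likely identifiers or free text.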

Provenance

Source: huggingface
Collection Method: Annotations generated by the propella-1-4b language model.
Freshness: Last updated 2026-03-20 09:22:36
