Ro Sft Finepdfs: Romanian PDF Page Images and Extracted Text for VLM Training

Name: Ro Sft Finepdfs: Romanian PDF Page Images and Extracted Text for VLM Training
Creator: OpenLLM-Ro
Published: 2026-06-04T14:45:28
Keywords: Pdf Corpus, Vision Language, Romanian Language, Multimodal Training, Computer Vision, Large Scale, Natural Language Processing, Document Ocr, Multimodal

by OpenLLM-RoUpdated 27d ago

Available on 1 platform

Sign in to view source links and access this dataset

Description

A Romanian-language subset of the FinePDFs corpus, prepared for OCR, containing pairs of page images and extracted text. The dataset is part of an instruction fine-tuning protocol for Romanian Vision-Language Models proposed in the paper 'Înțelegi românește?'. It was created by OpenLLM-Ro and last updated on June 5, 2026.

Use Cases

Train Romanian vision-language models based on image-text pairs.
Fine-tune OCR models for Romanian text extraction from PDFs.
Benchmark multimodal learning techniques on a non-English language.
Develop instruction-following AI agents for Romanian visual content.

Strengths

Derived from FinePDFs, described as the largest publicly available corpus sourced exclusively from PDFs.
Specifically prepared for OCR tasks with aligned image-text pairs.
Part of a documented research protocol for Romanian VLMs.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
The specific size, file formats, and license for this Romanian split are unknown.

Provenance

Source: OpenLLM-Ro, derived from the FinePDFs corpus.
Collection Method: Likely extracted and processed from PDF documents for OCR.
Freshness: Last updated 2026-06-05 06:05:11; freshness should be verified.
Geography: Romanian language focus.

License is unknown, which may restrict commercial use.

Multimodal Pdf Corpus Vision Language Romanian Language Multimodal Training Computer Vision Large Scale Natural Language Processing Document Ocr

Related Datasets

Quality Score

D39

Description

42

Source

36

Reputation

46

Access

26

Community

1.2K downloads

1 likes

0 views

Dataset Info

Author: OpenLLM-Ro
Created: Jun 4, 2026
Updated: Jun 5, 2026
Last synced: Jun 16, 2026

Access

26

Community

1.2K downloads

1 likes

0 views

Dataset Info

Author: OpenLLM-Ro
Created: Jun 4, 2026
Updated: Jun 5, 2026
Last synced: Jun 16, 2026

Ro Sft Finepdfs: Romanian PDF Page Images and Extracted Text for VLM Training

Description

Use Cases

Strengths

Limitations

Provenance

Related Topics

Related Datasets

Quality Score

Community

Dataset Info

Community

Dataset Info