Name: Multilingual Financial Retrieval Dataset in Arabic, German, and English
Creator: Farizeh
Published: 2026-06-07T11:20:07
Keywords: Financial Text, Text, Multilingual, Regulatory Documents, Retrieval, Finance

Description

The dataset, created by Farizeh, contains 171,838 records constructed from 1,400 PDF documents issued by central banks and supervisory authorities. It is designed for multilingual contrastive retrieval tasks and is divided into a training split of 150,000 records and a validation split of 21,838 records.

Use Cases

Train multilingual retrieval models on financial and regulatory text.
Benchmark cross-lingual search performance on documents from central banks.
Fine-tune language models for financial domain adaptation using authoritative source material.
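
For the benchmarking use case, retrieval quality is commonly summarized with recall@k and mean reciprocal rank (MRR). The following is a minimal sketch, not part of the dataset itself: the query and passage identifiers are hypothetical, and it assumes each query has exactly one relevant passage, which the dataset documentation does not confirm.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], relevant: Dict[str, str], k: int) -> float:
    """Fraction of queries whose relevant passage appears in the top-k results."""
    hits = sum(1 for q, docs in ranked.items() if relevant[q] in docs[:k])
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Average of 1/rank of the relevant passage; a query contributes 0 if it is missing."""
    total = 0.0
    for q, docs in ranked.items():
        if relevant[q] in docs:
            total += 1.0 / (docs.index(relevant[q]) + 1)
    return total / len(ranked)


# Hypothetical cross-lingual run: German and Arabic queries ranked against English passages.
ranked = {
    "q_de_1": ["p_en_3", "p_en_1", "p_en_7"],
    "q_ar_1": ["p_en_2", "p_en_9", "p_en_4"],
}
relevant = {"q_de_1": "p_en_1", "q_ar_1": "p_en_2"}

print(recall_at_k(ranked, relevant, k=1))
print(mean_reciprocal_rank(ranked, relevant))
```

Reporting these metrics separately per query language makes it easier to see whether one language lags the others.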

Strengths

Contains 171,838 total records, providing substantial scale for training.
Sourced from 1,400 publicly available PDFs from authoritative institutions like Deutsche Bundesbank, ECB, BIS, and SAMA.
Explicitly structured for contrastive retrieval with defined training (150,000) and validation (21,838) splits.
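
Contrastive retrieval training typically optimizes an InfoNCE-style loss with in-batch negatives. The sketch below illustrates that general technique in plain Python; it is not the creator's training code. The embeddings are toy vectors, the temperature value is an assumption, and the pairing of each query with one positive passage is assumed rather than documented.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(queries, passages, temperature: float = 0.05) -> float:
    """Mean InfoNCE loss where passages[i] is the positive for queries[i]
    and every other passage in the batch acts as a negative."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities scaled by the temperature.
        logits = [dot(qi, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


# Toy 2-D embeddings (hypothetical): aligned pairs versus swapped pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce_loss(queries, aligned))
print(info_nce_loss(queries, swapped))
```

The loss is close to zero when each query is most similar to its own passage and grows large when the positives are mismatched.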

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic and institutional bias inherent to the selected central banks and authorities.
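
Because column-level documentation is absent, a first step after download is to profile the fields directly. The sketch below summarizes field names, value types, and coverage for records already loaded as dictionaries. The field names in the sample (`query`, `positive`, `lang`) are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report, per field: observed value types, share of records that
    contain the field, and the first value seen as an example."""
    types = defaultdict(Counter)
    examples: Dict[str, Any] = {}
    count = 0
    for rec in records:
        count += 1
        for key, value in rec.items():
            types[key][type(value).__name__] += 1
            examples.setdefault(key, value)
    return {
        key: {
            "types": dict(counter),
            "coverage": sum(counter.values()) / count,
            "example": examples[key],
        }
        for key, counter in types.items()
    }


# Hypothetical sample; real field names must be read from the downloaded files.
sample = [
    {"query": "Leitzins der EZB", "positive": "The ECB key interest rate ...", "lang": "de"},
    {"query": "capital requirements", "positive": "Banks must hold ..."},
]

summary = summarize_fields(sample)
for field, info in summary.items():
    print(field, info["types"], info["coverage"])
```

If a field identifying the issuing institution turns out to exist, the same per-field counts can help quantify the institutional skew noted above.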

Provenance

Source: Public PDF documents from central banks, supervisory authorities, and development banks (e.g., Deutsche Bundesbank, ECB, BIS, SAMA).
Collection Method: Constructed from publicly available PDF documents.
Freshness: Last updated 2026-06-07 11:21:38; whether the underlying source documents are still current should be verified.
Geography: Likely covers regions associated with the listed institutions (e.g., Eurozone, Saudi Arabia), but specific coverage is not detailed.

License is unknown; terms of use must be verified before application.
