Description
171,838 records constructed from 1,400 PDF documents issued by central banks and supervisory authorities. The dataset, created by Farizeh, is designed for multilingual contrastive retrieval tasks. It includes a training split of 150,000 records and a validation split of 21,838 records.
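The split sizes quoted above can be checked against the stated total with a few lines of arithmetic. This is a minimal sketch using only the numbers on this card; the variable names are illustrative and not part of the dataset.

```python
# Figures taken from the description above
total_records = 171_838
train_records = 150_000
validation_records = 21_838

# The two splits should account for every record
splits_match_total = (train_records + validation_records) == total_records

# Share of each split, as a percentage rounded to one decimal place
train_share = round(100 * train_records / total_records, 1)
validation_share = round(100 * validation_records / total_records, 1)

print(splits_match_total, train_share, validation_share)
```

On these figures the validation split is roughly one eighth of the data, which is worth keeping in mind when comparing results against benchmarks that use a different ratio.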
Use Cases
Train multilingual retrieval models on financial and regulatory text.
Benchmark cross-lingual search performance using central bank documents.
Fine-tune language models for financial domain adaptation using authoritative source material.
Strengths
Contains 171,838 total records, providing substantial scale for training.
Draws on 1,400 publicly available PDFs issued by authoritative institutions such as Deutsche Bundesbank, the ECB, the BIS, and SAMA.
Explicitly structured for contrastive retrieval with defined training (150,000) and validation (21,838) splits.
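The card does not say which training objective the records are meant for. A common choice for contrastive retrieval is an InfoNCE loss with in-batch negatives, sketched below in plain Python. The sketch assumes each record yields a (query, positive passage) pair that an encoder has already turned into vectors. The pairing, the function names, and the temperature value are all assumptions made for illustration, not documented properties of this dataset.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    Passage i is treated as the positive for query i; every other
    passage in the batch acts as a negative (an assumed setup).
    """
    queries = [normalize(q) for q in query_vecs]
    passages = [normalize(p) for p in passage_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages in the batch
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values)
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when each query is closest to its own passage and large when the pairs are mismatched, which is the behaviour a retrieval model trained on such pairs would be optimised for.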
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic and institutional bias inherent to the selected central banks and authorities.
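Because the columns are undocumented, a first step after download is to profile a sample of records and see which fields exist, what types they hold, and how often they are empty. The sketch below assumes the records are available as a list of dictionaries. The field names in the sample (`query`, `positive`, `lang`) are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict

def summarize_fields(records, max_examples=1):
    """Collect, per field, the observed Python types, a few example values,
    and the number of records where the field is absent or None."""
    records = list(records)
    summary = defaultdict(lambda: {"types": set(), "examples": [], "missing": 0})
    for record in records:
        for name, value in record.items():
            entry = summary[name]
            entry["types"].add(type(value).__name__)
            if value is not None and len(entry["examples"]) < max_examples:
                entry["examples"].append(value)
    # Count records in which each field is absent or None
    for name, entry in summary.items():
        entry["missing"] = sum(1 for r in records if r.get(name) is None)
    return dict(summary)

# Hypothetical rows; the real column names are not documented
sample = [
    {"query": "What is the policy rate?", "positive": "placeholder passage", "lang": "en"},
    {"query": "Wie hoch ist der Leitzins?", "positive": "Platzhaltertext", "lang": None},
]
field_summary = summarize_fields(sample)
for name, info in field_summary.items():
    print(name, sorted(info["types"]), "missing:", info["missing"])
```

A profile like this only shows structure. What each field actually means still has to be confirmed by reading example values.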
Provenance
Source
Public PDF documents from central banks, supervisory authorities, and development banks (e.g., Deutsche Bundesbank, ECB, BIS, SAMA).
Collection Method
Constructed from publicly available PDF documents.
Freshness
Last updated 2026-06-07 11:21:38; freshness should be verified.
Geography
Likely covers regions associated with the listed institutions (e.g., Eurozone, Saudi Arabia), but specific coverage is not detailed.
License
Unknown; terms of use must be verified before application.