Description
This dataset contains 35,996 clinical text samples in Spanish, averaging about 700 tokens each, and is described as the largest publicly available corpus for Spanish clinical NLP research. It aggregates texts from diverse open sources, including medical journals, annotated corpora from shared tasks, and supplementary materials. It was created by IIC and last updated on the Hugging Face platform in May 2026.
Use Cases
Train Spanish clinical language models based on the large collection of medical texts.
Benchmark clinical named entity recognition and relation extraction models using the annotated corpora from shared tasks.
Conduct research on medical concept normalization using the aggregated corpus from journals and textbooks.
Develop text classification models for Spanish clinical documents based on the diverse sample collection.
Strengths
35,996 text samples at roughly 700 tokens each amount to about 25 million tokens, a substantial corpus for model training.
The corpus aggregates content from multiple sources, including medical journals and annotated shared task data, suggesting diversity.
An average of ~700 tokens per sample indicates documents of meaningful length for analysis.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not independently confirmed: the 35,996-sample figure comes from the description and should be checked after download.
Last updated 2026-05-12 07:32:55; freshness should be verified.
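Since field semantics and the sample count have to be checked after download, a short inspection pass can help. The following is a minimal sketch, not the dataset's official tooling: it assumes the records have already been loaded as a list of Python dicts, and the function name `summarize_records`, the sample records, and the field names `id`, `text`, and `source` are hypothetical. It counts whitespace-separated tokens, which will differ from whatever tokenizer produced the ~700-token average.

```python
from collections import Counter
from statistics import mean


def summarize_records(records, text_field=None):
    """Report row count, observed field types, and average whitespace tokens.

    `records` is a list of dicts. If `text_field` is None, the string field
    with the longest average length is guessed to be the main text column.
    """
    field_types = {}    # field name -> Counter of Python type names
    string_lengths = {}  # field name -> list of string lengths

    for record in records:
        for key, value in record.items():
            field_types.setdefault(key, Counter())[type(value).__name__] += 1
            if isinstance(value, str):
                string_lengths.setdefault(key, []).append(len(value))

    # Heuristic guess only: the longest string field is treated as the text.
    if text_field is None and string_lengths:
        text_field = max(string_lengths, key=lambda k: mean(string_lengths[k]))

    token_counts = [
        len(record[text_field].split())
        for record in records
        if text_field is not None and isinstance(record.get(text_field), str)
    ]

    return {
        "rows": len(records),
        "fields": {name: dict(types) for name, types in field_types.items()},
        "text_field": text_field,
        "avg_whitespace_tokens": mean(token_counts) if token_counts else 0.0,
    }


# Hypothetical records standing in for the downloaded data.
sample = [
    {"id": 1, "text": "Paciente con fiebre y tos seca.", "source": "journal"},
    {"id": 2, "text": "Dolor abdominal agudo.", "source": "shared_task"},
]
summary = summarize_records(sample)
print(summary)
```

Run against the full download, the `rows` value can be compared with the stated 35,996 samples, and the `fields` mapping gives a starting point for documenting each column.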
Provenance
Source
Aggregated from diverse open sources including medical journals, annotated corpora from shared tasks, Wikipedia, and medical textbooks.
Collection Method
Aggregation of existing public clinical text resources.
Freshness
2026-05-12 07:32:55
License information is unknown; users should verify the terms of use before using the data.