Description

ClinText-SP contains 35,996 Spanish clinical text samples, averaging about 700 tokens each, and is described as the largest publicly available Spanish clinical corpus for NLP research. The dataset aggregates texts from diverse open sources, including medical journals, annotated corpora from shared tasks, and supplementary materials such as Wikipedia and medical textbooks. It was created by IIC and last updated on the Hugging Face platform on 12 May 2026.

Use Cases

Train Spanish clinical language models on the large collection of medical texts.
Benchmark clinical named entity recognition and relation extraction models using the annotated corpora from shared tasks.
Conduct research on medical concept normalization using the aggregated text from journals and textbooks.
Develop text classification models for Spanish clinical documents using the diverse sample collection.

Strengths

35,996 text samples provide a substantial corpus for model training.
The corpus aggregates content from multiple source types, including medical journals and annotated shared task data, which suggests variety in style and subject matter.
An average of ~700 tokens per sample indicates documents of meaningful length for analysis.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not independently confirmed beyond the 35,996 samples cited in the description; the actual size should be checked after download.
Last updated 2026-05-12 07:32:55; freshness should be verified.
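
Because column semantics and the row count have to be checked after download, a small inspection pass can help. The following is a minimal sketch, not part of the dataset's tooling: `summarize_records` is a hypothetical helper, the `text` and `source` column names are invented stand-ins (the real columns are undocumented), and whitespace splitting is only a rough proxy for the ~700-token average, since the tokenizer behind that figure is not stated.

```python
from collections import defaultdict
from statistics import mean


def summarize_records(records):
    """Summarize an iterable of dict-like rows.

    Returns the row count, the Python value types seen in each column,
    and the mean whitespace-token count for each column holding strings.
    """
    row_count = 0
    column_types = defaultdict(set)
    token_counts = defaultdict(list)
    for row in records:
        row_count += 1
        for column, value in row.items():
            column_types[column].add(type(value).__name__)
            if isinstance(value, str):
                # Whitespace tokens: a rough length estimate only.
                token_counts[column].append(len(value.split()))
    return {
        "row_count": row_count,
        "column_types": {c: sorted(t) for c, t in column_types.items()},
        "mean_tokens": {c: mean(n) for c, n in token_counts.items()},
    }


# Hypothetical stand-in rows; the real column names are not documented.
sample_rows = [
    {"text": "Paciente de 45 años con dolor torácico", "source": "journal"},
    {"text": "Se realiza radiografía de tórax", "source": "shared_task"},
]
summary = summarize_records(sample_rows)
print(summary)
```

If the corpus is loaded with the Hugging Face `datasets` library, iterating over a split yields dict rows, so a split could likely be passed to this helper directly; the repository identifier is not given here and would need to be looked up.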

Provenance

Source: Aggregated from diverse open sources including medical journals, annotated corpora from shared tasks, Wikipedia, and medical textbooks.
Collection Method: Aggregation of existing public clinical text resources.
Freshness: 2026-05-12 07:32:55

License information is unknown; verify the terms of use before using the data.

Tags: Text, Clinical NLP, Spanish-language, Healthcare, Natural Language Processing

ClinText-SP: Largest Public Spanish Clinical Corpus for NLP
