DataSalon

Acronym Identification: 10K-100K Expert-Annotated Scientific Sentences | DataSalon

Multimodal & LLM

Name: Acronym Identification: 10K-100K Expert-Annotated Scientific Sentences
Creator: amirveyseh
Published: 2022-03-02T23:29:22
Keywords: source_datasets:original, size_categories:10K<n<100K, library:polars, language:en, language_creators:found, modality:text, library:mlcroissant, library:datasets, library:pandas, parquet, arxiv:2010.14678, region:us, multilinguality:monolingual, acronym-identification, license:mit, annotations_creators:expert-generated, task_categories:token-classification

Available on 1 platform

Description

This dataset contains between 10,000 and 100,000 expert-annotated sentences for token-level acronym identification in the scientific domain. It was published by amirveyseh for the AAAI-21 Workshop on Scientific Document Understanding (SDU) and ships with standard training, validation, and test splits.

Use Cases

Training token classification models to identify acronym labels within scientific text
Benchmarking sequence labeling algorithms using the provided training, validation, and test splits
Developing information extraction systems to isolate acronym tokens from pre-tokenized sentences
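
The use cases above all work on per-token labels over pre-tokenized sentences. As a minimal sketch, the snippet below turns a label sequence into acronym spans and scores predictions against gold labels with exact-match span precision, recall, and F1. The BIO-style label names (B-long, I-long, B-short, I-short, O) and the example sentence are assumptions for illustration; this listing does not state the label scheme. The scoring function is a generic span metric, not the workshop's official scorer.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (kind, start index, end index exclusive)


def extract_spans(labels: List[str]) -> List[Span]:
    """Collect (kind, start, end) spans from a BIO label sequence.

    Assumes labels look like "B-long", "I-long", "B-short", "I-short", "O"
    (a hypothetical scheme; check the dataset's actual label names).
    """
    spans: List[Span] = []
    kind: Optional[str] = None
    start = 0
    for i, label in enumerate(labels):
        # Close the open span unless this label continues it.
        if kind is not None and label != f"I-{kind}":
            spans.append((kind, start, i))
            kind = None
        # A B- tag always opens a new span; stray I- tags are ignored.
        if label.startswith("B-"):
            kind, start = label[2:], i
    if kind is not None:
        spans.append((kind, start, len(labels)))
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match span precision, recall, and F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = set(extract_spans(g)), set(extract_spans(p))
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sentence, not taken from the dataset.
tokens = ["We", "use", "a", "convolutional", "neural", "network", "(", "CNN", ")", "."]
gold = ["O", "O", "O", "B-long", "I-long", "I-long", "O", "B-short", "O", "O"]
pred = ["O", "O", "O", "O", "B-long", "I-long", "O", "B-short", "O", "O"]

for kind, start, end in extract_spans(gold):
    print(kind, " ".join(tokens[start:end]))

print(span_f1([gold], [pred]))
```

Here the prediction finds the short form but truncates the long form, so only one of the two gold spans matches exactly. Exact-match scoring is strict by design; a token-level metric would give partial credit for the overlapping tokens.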

Strengths

Expert-generated annotations from the AAAI-21 SDU workshop
Scale of 10,000 to 100,000 records
MIT license allows for open research and commercial use

Limitations

Restricted to scientific document domain
Monolingual English coverage

Provenance

Source: AAAI-21 Workshop on Scientific Document Understanding
Collection Method: Expert-annotated
Time Range: 2021
Freshness: Last updated January 2024

Released under the MIT license. The data is distributed as Parquet files, so reading it requires a Parquet-capable tool such as pandas or Polars.

Quality Score

D37

Community

14.2K downloads

23 likes

0 views

Dataset Info

Author: amirveyseh
Created: Mar 2, 2022
Updated: Jan 9, 2024
Last synced: Jul 26, 2026
