Sindhi Open Lexicon Dataset with 223,000+ Entries

Name: Sindhi Open Lexicon Dataset with 223,000+ Entries
Creator: SindhiLanguage.org
Published: 2026-05-13T16:16:57
Keywords: Computational Linguistics, Text, Sindhi Language, Lexicon, Large Scale, Natural Language Processing, Low Resource Language

by SindhiLanguage.org / Harvard Dataverse
Updated 1mo ago

Available on 1 platform

Description

Over 223,000 structured lexical entries for the Sindhi language, created by SindhiLanguage.org and hosted on Harvard Dataverse. The dataset includes definitions, linguistic metadata, normalized forms, and domain classifications, aiming to support NLP research and preserve linguistic heritage. It was last updated on May 13, 2026.

Use Cases

Training language models for Sindhi based on the large-scale lexicon.
Building search engines or chatbots using the definitions and normalized text forms.
Developing OCR systems for Sindhi text leveraging the variants with and without diacritics.
Supporting computational linguistics research with the provided linguistic metadata and domain classifications.
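
The with/without-diacritics pairing mentioned above can be illustrated with a minimal sketch. It assumes the diacritics in question are the usual Arabic-script harakat (U+064B to U+0652, plus U+0670); the dataset's own normalization rule is not documented here, so the `HARAKAT` set and the `strip_diacritics` helper are hypothetical.

```python
# Harakat (short-vowel and related marks) commonly used in Arabic-script text.
# This set is an assumption for illustration, not the dataset's documented rule.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return text with the marks in HARAKAT removed; base letters are untouched."""
    return "".join(ch for ch in text if ch not in HARAKAT)


# An illustrative word carrying two kasra marks (U+0650), spelled out as escapes.
with_marks = "\u0633\u0650\u0646\u068C\u0650\u064A"
without_marks = strip_diacritics(with_marks)
```

The function filters code points directly instead of decomposing the string first, since Unicode decomposition would also split some precomposed Arabic-script letters into a base letter plus a combining mark.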

Strengths

Contains over 223,000 entries, providing substantial lexical coverage for a low-resource language.
Includes multiple features per entry such as definitions, variants, normalized text, and domain classification.
Offers data in multiple structured formats (CSV, JSONL, SQLite), which may aid integration.
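
As a rough sketch of how the three formats might be read with the Python standard library alone: the sample data, the `entries` table, and the `word` and `definition` field names below are hypothetical stand-ins, since the actual file names and schema are not given on this page.

```python
import csv
import io
import json
import sqlite3


def read_jsonl(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_csv(fileobj):
    """Return CSV rows as dicts keyed by the header row."""
    return list(csv.DictReader(fileobj))


def list_tables(conn):
    """Return the table names in a SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


# Tiny in-memory stand-ins; the field and table names are hypothetical.
jsonl_sample = io.StringIO(
    '{"word": "a", "definition": "x"}\n\n{"word": "b", "definition": "y"}\n'
)
csv_sample = io.StringIO("word,definition\na,x\nb,y\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (word TEXT, definition TEXT)")

jsonl_rows = list(read_jsonl(jsonl_sample))
csv_rows = read_csv(csv_sample)
tables = list_tables(conn)
```

With the real downloads, the same functions would take files opened with `encoding="utf-8"` (and `newline=""` for CSV), and `sqlite3.connect` would take the path to the database file.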

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The exact row count is not stated beyond "over 223,000 entries", which may limit suitability assessment for specific tasks.
Freshness should be verified against the source: the recorded last update is 2026-05-13, the same day the dataset was created.
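
Because field semantics and row counts have to be checked after download, a small profiling pass can help. The sketch below assumes the records have already been loaded as dictionaries; `profile_records` and the sample field names are hypothetical and not part of the dataset.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Count rows and summarize, per field, how often it is filled and with what types."""
    total = 0
    filled = Counter()
    types = defaultdict(Counter)
    for record in records:
        total += 1
        for key, value in record.items():
            # Treat None and the empty string as "not filled".
            if value not in (None, ""):
                filled[key] += 1
            types[key][type(value).__name__] += 1
    return {
        "rows": total,
        "fields": {
            key: {"filled": filled[key], "types": dict(types[key])}
            for key in types
        },
    }


# Hypothetical sample records standing in for downloaded entries.
sample = [
    {"word": "a", "domain": "law"},
    {"word": "b", "domain": ""},
    {"word": "c", "domain": None},
]
profile = profile_records(sample)
```

Fields that are rarely filled or that mix value types are the ones most worth inspecting by hand before relying on them.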

Provenance

Source: SindhiLanguage.org
Freshness: Last updated 2026-05-13 16:16:57

License information is unknown and should be verified before use.

Related Topics

Text, Computational Linguistics, Sindhi Language, Lexicon, Large Scale, Natural Language Processing, Low Resource Language

Quality Score

Overall: D (36)
Description: 37
Source: 38
Reputation: 35
Access: 31

Community

0 views

Dataset Info

Author: SindhiLanguage.org
Org: Harvard Dataverse
Created: May 13, 2026
Updated: May 13, 2026
Last synced: May 23, 2026
