by SindhiLanguage.org / Harvard Dataverse · Updated 1mo ago
Available on 1 platform
Description
Over 223,000 structured lexical entries for the Sindhi language, created by SindhiLanguage.org and hosted on Harvard Dataverse. The dataset includes definitions, linguistic metadata, normalized forms, and domain classifications, aiming to support NLP research and preserve linguistic heritage. It was last updated on May 13, 2026.
Use Cases
Training language models for Sindhi based on the large-scale lexicon.
Building search engines or chatbots using the definitions and normalized text forms.
Developing OCR systems for Sindhi text leveraging the variants with and without diacritics.
Supporting computational linguistics research with the provided linguistic metadata and domain classifications.
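As a rough illustration of the diacritic use case above, the sketch below derives a diacritic-free form from Arabic-script text by removing a fixed set of vowel marks. The mark range (U+064B to U+0652, plus U+0670) is an assumption made here; the dataset's own normalization rules are not documented in this listing, and the function names are hypothetical.

```python
import re

# ASSUMPTION: these code points (Arabic short-vowel marks, shadda, sukun,
# and superscript alef) are treated as "diacritics". The dataset's actual
# normalization rules may differ.
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove the marks matched by HARAKAT; base letters are left untouched."""
    return HARAKAT.sub("", text)


def lookup_key(text: str) -> str:
    """Hypothetical helper: a diacritic-insensitive key for search or matching."""
    return strip_diacritics(text).strip()
```

Deleting an explicit range, rather than decomposing with NFD and dropping every combining mark, avoids altering letters whose canonical decomposition contains a combining mark (for example U+0622, alef with madda).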
Strengths
Contains over 223,000 entries, providing substantial lexical coverage for a low-resource language.
Includes multiple features per entry such as definitions, variants, normalized text, and domain classification.
Offers data in multiple structured formats (CSV, JSONL, SQLite), which may aid integration.
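A minimal sketch of reading each of the three formats with the Python standard library is shown here. File paths and the SQLite table name are supplied by the caller because the listing does not state them; nothing below reflects the dataset's actual file layout.

```python
import csv
import json
import sqlite3
from typing import Dict, Iterator, List


def iter_csv(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def list_sqlite_tables(path: str) -> List[str]:
    """Return table names, useful because the schema is not documented."""
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


def iter_sqlite(path: str, table: str) -> Iterator[dict]:
    """Yield one dict per row of `table` (pass a name from list_sqlite_tables)."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    try:
        # Table names cannot be bound as parameters, so the name is quoted inline.
        for row in conn.execute(f'SELECT * FROM "{table}"'):
            yield dict(row)
    finally:
        conn.close()
```

All three readers yield plain dicts, so downstream code can stay the same regardless of which format is downloaded.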
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
An exact row count is not published (only "over 223,000 entries" is stated), which may limit suitability assessment for specific tasks.
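Freshness should be verified at the source, since the listed last-update timestamp (2026-05-13 16:16:57) has no time zone and may lie in the future relative to when this listing was generated.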
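Since field semantics and the exact row count have to be worked out after download, a small profiling pass can help. The sketch below counts records and, for each field, reports how many values are non-empty along with a few example values. It accepts any iterable of dicts; the function name and the field names in the usage note are hypothetical, not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def profile_records(records: Iterable[dict], max_examples: int = 3) -> Tuple[int, Dict[str, dict]]:
    """Return (row_count, summary), where summary maps each field name to
    its non-empty count and up to `max_examples` distinct example values."""
    row_count = 0
    non_empty = Counter()
    examples: Dict[str, list] = {}
    for record in records:
        row_count += 1
        for field, value in record.items():
            seen = examples.setdefault(field, [])
            if value in (None, ""):
                continue  # empty values count toward rows but not toward fill rate
            non_empty[field] += 1
            if len(seen) < max_examples and value not in seen:
                seen.append(value)
    summary = {
        field: {"non_empty": non_empty[field], "examples": values}
        for field, values in examples.items()
    }
    return row_count, summary
```

For instance, calling `profile_records` on two records with hypothetical fields `word` and `domain`, where one `domain` is empty, would report a row count of 2 and a non-empty count of 1 for `domain`.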
Provenance
Source: SindhiLanguage.org
Freshness: Last updated 2026-05-13 16:16:57
License information is unknown and should be verified before use.