Belgin PII: Synthetic Turkish Hard-Case Data for Privacy-Sensitive Span Detection

Name: Belgin PII: Synthetic Turkish Hard-Case Data for Privacy-Sensitive Span Detection
Creator: negentropi
Published: 2026-05-19T15:07:53
Keywords: Text, Turkish NLP, Privacy Preserving, PII Detection, Synthetic Data, Synthetic

Description

A fully synthetic dataset for testing and improving PII masking systems for Turkish text. Created by 'negentropi', it targets the detection of names, addresses, emails, phone numbers, dates, account identifiers, URLs, and secrets within complaint-style or support-style narratives. It was last updated on 2026-05-19.
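To make the masking task concrete, here is a minimal sketch of replacing detected spans with label placeholders. The dataset's actual schema is undocumented, so the `(start, end, label)` character-offset format, the label names, the helper `find_span`, and the example sentence are all assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), end exclusive, character offsets.
# This is NOT taken from the dataset, whose columns are undocumented.
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder; spans must not overlap."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the sensitive value
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)


def find_span(text: str, value: str, label: str) -> Span:
    """Illustrative helper: locate a known value to build a span."""
    start = text.index(value)
    return (start, start + len(value), label)


# Invented example sentence in the support-style register the card describes.
text = "Merhaba, ben Ayşe Yılmaz. E-postam ayse@example.com, telefonum 0555 000 00 00."
spans = [
    find_span(text, "Ayşe Yılmaz", "NAME"),
    find_span(text, "ayse@example.com", "EMAIL"),
    find_span(text, "0555 000 00 00", "PHONE"),
]
masked = mask_spans(text, spans)
print(masked)
```

Python strings index by code point, so offsets stay aligned across Turkish characters such as `ş` and `ı`; a pipeline that uses byte or token offsets would need its own conversion.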

Use Cases

Benchmarking PII detection models on hard-case synthetic examples.
Improving masking systems for Turkish text against the span types the dataset targets.
Training models to identify privacy-sensitive entities such as emails and account identifiers in support-style text.
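For the benchmarking use case, one common way to score a span detector is exact-match precision, recall, and F1 over `(start, end, label)` triples. The sketch below assumes that span format and uses invented offsets; it is not the dataset's own evaluation protocol, which is not documented.

```python
from typing import Dict, Set, Tuple

# Hypothetical span format: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match scoring: a prediction counts only if offsets and label all match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented offsets for illustration: the predicted PHONE span is cut short,
# so it does not count as a match under exact-match scoring.
gold = {(13, 24, "NAME"), (35, 51, "EMAIL"), (63, 77, "PHONE")}
pred = {(13, 24, "NAME"), (35, 51, "EMAIL"), (63, 70, "PHONE")}
scores = span_prf(gold, pred)
print(scores)
```

Exact matching is strict on boundary errors, which hard-case data tends to provoke; a partial-overlap metric is a reasonable alternative if boundary precision matters less than coverage.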

Strengths

Designed specifically around 'hard-case' scenarios that challenge PII detection systems.
Covers multiple PII types, including names, addresses, emails, phones, dates, and secrets.
The public release excludes raw complaint text and private user content, which mitigates some privacy risks.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count is not reported, which makes it hard to judge suitability before download.
The dataset is fully synthetic, so it may not capture the distribution of real-world PII occurrences.

Provenance

Source: huggingface
Collection Method: Synthetically generated.
Time Range: Not specified
Freshness: Last updated 2026-05-19 15:26:19; freshness should be verified.
Geography: Not specified

License: Unknown; verify the terms of use before using the dataset.

Quality Score

Overall: D37
Description: 42
Source: 36
Reputation: 35
Access: 26

Community

Likes: 1
Views: 0

Dataset Info

Author: negentropi
Created: May 19, 2026
Updated: May 19, 2026
Last synced: May 26, 2026
