Description
A fully synthetic dataset for testing and improving PII masking systems for Turkish text. Created by the author 'negentropi', it targets the detection of names, addresses, emails, phones, dates, account identifiers, URLs, and secrets in complaint- or support-style narratives. It was last updated on 2026-05-19.
Use Cases
Benchmarking PII detection models on the dataset's hard-case synthetic examples.
Improving masking systems for Turkish text using the span detection targets listed in the description.
Training models to identify privacy-sensitive entities, such as emails and account identifiers, in support-style text.
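The benchmarking use case can be illustrated with a minimal sketch of exact-match span scoring. It assumes spans are given as character offsets plus a label. The `Span` type, the label names `EMAIL` and `PHONE`, and the sample sentence are all hypothetical, since the dataset's actual fields are undocumented. Exact matching is only one possible scoring rule; partial-overlap scoring would give different numbers.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label name, e.g. "EMAIL"


def find_span(text: str, value: str, label: str) -> Span:
    """Build a Span for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return Span(start, start + len(value), label)


def span_prf(gold: Iterable[Span], predicted: Iterable[Span]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mask(text: str, spans: Iterable[Span]) -> str:
    """Replace each span with its bracketed label (assumes non-overlapping spans)."""
    # Work right to left so earlier offsets remain valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


# Made-up Turkish support-style sentence, not taken from the dataset.
text = "E-postam ayse@example.com, telefonum 0555 000 00 00."
gold = [
    find_span(text, "ayse@example.com", "EMAIL"),
    find_span(text, "0555 000 00 00", "PHONE"),
]
predicted = [gold[0]]  # a system that finds the email but misses the phone

precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, round(f1, 3))
print(mask(text, gold))
```

In this toy case the system is fully precise but recovers only half of the gold spans, which is the kind of gap a hard-case benchmark is meant to expose.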
Strengths
Dataset is specifically designed for 'hard-case' scenarios to challenge PII detection systems.
Covers multiple PII types: names, addresses, emails, phones, dates, account identifiers, URLs, and secrets.
Public release excludes raw complaint text and private user content, mitigating some privacy risks.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it hard to judge whether the dataset is large enough for a given task.
Dataset is fully synthetic, which may not fully capture the distribution of real-world PII occurrences.
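Because column documentation is absent, a first step after download is usually to tabulate which fields appear and what value types they hold. The following is a rough sketch that assumes the records have already been loaded as a list of dictionaries. The field names `text` and `spans` in the sample are invented for illustration and may not match the real schema.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> dict[str, dict[str, int]]:
    """Count, per field name, how often each Python value type occurs."""
    type_counts: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            type_counts[field][type(value).__name__] += 1
    return {field: dict(counts) for field, counts in type_counts.items()}


# Hypothetical records; the real field names must be read from the downloaded files.
sample = [
    {"text": "ornek metin", "spans": [[0, 5, "NAME"]]},
    {"text": "baska metin", "spans": None},
]
print(summarize_fields(sample))
```

Fields with mixed types or frequent `NoneType` values are a hint that their semantics need closer manual inspection.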
Provenance
Source
Hugging Face
Collection Method
Synthetically generated.
Time Range
Not specified
Freshness
Last updated 2026-05-19 15:26:19; freshness should be verified.
Geography
Not specified
License is unknown; terms of use must be verified before the dataset is used.