Description
Nemotron-Personas-Korea is a synthetic persona dataset grounded in real-world demographic, geographic, and personality trait distributions of South Korea. Described as the first large-scale Korean-language persona dataset, it is synthesized from attributes such as name, gender, age, marital status, education level, occupation, and residence region, with distributions based on official statistics from sources including the Korean Statistical Information Service (KOSIS), the Supreme Court, the National Health Insurance Service, the Rural Economic Research Institute, and NAVER Cloud. The dataset was created by NVIDIA and is released under a CC BY 4.0 license.
Use Cases
Training or evaluating Korean-language dialogue systems with diverse synthetic personas.
Benchmarking demographic fairness or bias in AI systems using personas reflecting real Korean population distributions.
Generating synthetic user profiles for simulation studies in social science or marketing research, using the dataset's demographic attributes.
Augmenting datasets for natural language processing tasks requiring realistic Korean personal contexts.
Strengths
The dataset is synthesized based on official statistics from multiple authoritative Korean institutions, including KOSIS, the Supreme Court, and the National Health Insurance Service.
It is described as the first large-scale Korean-language persona dataset, designed to broadly reflect the diversity and characteristics of the Korean population.
Attributes include name, gender, age, marital status, education level, occupation, and residence region, so each persona carries a multi-dimensional demographic profile.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count, file formats, and sample data are not documented here, which makes it hard to assess suitability before download.
The dataset is synthetic; its fidelity to real-world distributions requires validation.
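One way to approach the validation mentioned above is to compare the share of each category in a downloaded sample against a reference distribution taken from an official source. The following is a minimal sketch, not the dataset's documented validation method. The field name "region", the toy records, and the reference shares are all hypothetical placeholders, since the actual column names and formats are not documented here. It uses total variation distance, which is 0 for identical distributions and 1 for fully disjoint ones.

```python
from collections import Counter


def category_shares(records, field):
    """Return the share of each value of `field` among records that have it."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.items()}


def total_variation_distance(observed, reference):
    """Half the L1 distance between two categorical distributions."""
    categories = set(observed) | set(reference)
    return 0.5 * sum(
        abs(observed.get(c, 0.0) - reference.get(c, 0.0)) for c in categories
    )


# Toy records standing in for downloaded rows.
# ASSUMPTION: "region" is a hypothetical field name, not a documented column.
sample = [
    {"region": "Seoul"},
    {"region": "Seoul"},
    {"region": "Seoul"},
    {"region": "Busan"},
]

# ASSUMPTION: placeholder reference shares, NOT real statistics.
reference = {"Seoul": 0.5, "Busan": 0.25, "Gyeonggi": 0.25}

observed = category_shares(sample, "region")
tvd = total_variation_distance(observed, reference)
print(f"Observed shares: {observed}")
print(f"Total variation distance: {tvd:.3f}")
```

In practice the reference shares would come from a published table (for example, a KOSIS population table for the matching year), and the same comparison could be repeated per attribute once the real column names are known.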
Provenance
Source
Korean Statistical Information Service (KOSIS), Supreme Court, National Health Insurance Service, Rural Economic Research Institute, NAVER Cloud
Collection Method
Synthetic generation based on real-world statistical distributions.
Freshness
Last updated 2026-04-20 23:22:50; freshness should be verified.
Geography
South Korea
License
Listed as unknown in the catalog metadata, although the description states CC BY 4.0; verify the license on the dataset page before use.