Description

A Korean-language dataset built for supervised fine-tuning (SFT) of large language models as part of a Sungkyunkwan University industry-academic cooperation project. The data was drawn from sources including Stanford Alpaca and OIG-Chip2, then preprocessed and filtered, with ChatGPT-3.5 Turbo 16k used to improve the naturalness of the text. The dataset page was last updated on 2023-09-25.

Use Cases

Supervised fine-tuning of Korean LLMs on the dataset's instruction-response pairs.
Training models for Korean conversational AI using the processed assistant-style data.
Benchmarking model performance on Korean instruction-following tasks.
Research on cross-lingual transfer learning using the adapted Stanford Alpaca data.
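
As a rough illustration of the first use case, the sketch below turns one instruction-response record into a prompt/completion pair for SFT. The field names (instruction, input, output) and the Alpaca-style template are assumptions made for the example; the dataset's actual columns are not documented and should be checked after download.

```python
# Hypothetical field names and template: the dataset card does not document its columns.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_sft_example(record: dict) -> dict:
    """Turn one instruction-response record into a prompt/completion pair."""
    instruction = record["instruction"].strip()
    # Treat a missing or empty "input" field as "no extra context".
    context = (record.get("input") or "").strip()
    if context:
        prompt = PROMPT_WITH_INPUT.format(instruction=instruction, input=context)
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=instruction)
    return {"prompt": prompt, "completion": record["output"].strip()}


example = build_sft_example(
    {"instruction": "한국의 수도는 어디인가요?", "input": "", "output": "서울입니다."}
)
print(example["prompt"] + example["completion"])
```

Keeping the prompt and completion separate makes it straightforward to mask the prompt tokens out of the training loss, if the chosen training setup supports that.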

Strengths

Includes at least 21,155 entries from the 'koalpaca v1.1' subset, according to the dataset description.
Source data was explicitly processed with ChatGPT-3.5 Turbo 16k to improve naturalness.
Filtered to remove translation artifacts such as '<unk>' tokens and empty inputs.
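
The kind of filtering described above can be sketched as follows. This is not the project's actual pipeline, only a minimal approximation: the field names are hypothetical, and the rule "drop a record if a required field is empty or any text field contains '<unk>'" is an assumption about what the filter does.

```python
UNK_TOKEN = "<unk>"
REQUIRED_FIELDS = ("instruction", "output")  # hypothetical column names


def is_clean(record: dict) -> bool:
    """True if required fields are non-empty and no string field contains '<unk>'."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return not any(
        isinstance(value, str) and UNK_TOKEN in value for value in record.values()
    )


def filter_records(records):
    """Keep only the records that pass is_clean."""
    return [record for record in records if is_clean(record)]


sample = [
    {"instruction": "인사말을 써 주세요.", "output": "안녕하세요!"},
    {"instruction": "번역해 주세요.", "output": "이것은 <unk> 입니다."},
    {"instruction": "   ", "output": "내용 없음"},
]
kept = filter_records(sample)
print(f"kept {len(kept)} of {len(sample)} records")
```

Running the same check on the downloaded data is a cheap way to confirm that no such artifacts remain.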

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for the full dataset is unknown, which may limit suitability assessment.
Last updated 2023-09-25 08:36:04; freshness should be verified.
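
Because neither the columns nor the row count are documented, a quick pass over the downloaded records can recover both. The sketch below assumes the records are available as Python dictionaries; the sample rows and their field names are invented for illustration.

```python
from collections import Counter


def summarize_fields(records):
    """Count rows and, per field, how many rows carry a non-empty value."""
    total = 0
    non_empty = Counter()
    for record in records:
        total += 1
        for name, value in record.items():
            if value not in (None, "", [], {}):
                non_empty[name] += 1
            else:
                non_empty[name] += 0  # register the field even if it is always empty
    return total, dict(non_empty)


# Invented sample rows; real field names must be read from the downloaded data.
rows = [
    {"instruction": "a", "input": "", "output": "b"},
    {"instruction": "c", "input": "ctx", "output": "d"},
]
total, counts = summarize_fields(rows)
print(total, counts)
```

If the download turns out to be a JSON Lines file (an assumption, since the file format is not stated), the records can be produced with json.loads applied to each line.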

Provenance

Source: Sungkyunkwan University industry-academic project, with data derived from sources like Stanford Alpaca and OIG-Chip2.
Collection Method: Preprocessed and filtered using ChatGPT-3.5 Turbo 16k; specific entries from Open Assistant and Stanford translation data were removed.
Freshness: Last updated 2023-09-25 08:36:04.

License is unknown; terms of use must be verified before the data is used.

Tags: Text, RLHF, Korean Language, Text Generation, SFT

Korean RLHF Dataset: Instruction-Tuned Data for Korean LLM Training
