Sign in to view source links and access this dataset
Description
LLM-jp provides a Japanese instruction-tuning dataset containing 33,000 entries. The dataset is a Japanese translation of a subset from the English OASST2 dataset, processed using DeepL. It was created by the LLM-jp collaborative project and last updated on April 28, 2024.
Use Cases
Fine-tuning Japanese language models for instruction-following based on the translated instruction-response pairs.
Benchmarking model performance on Japanese conversational tasks using the structured prompts.
Studying the effects of machine translation on instruction-tuning data quality for non-English languages.
Strengths
Contains 33,000 Japanese instruction-response pairs for model training.
Data provenance is documented, being a translation of a known English subset (OASST2).
Created by a named collaborative project (LLM-jp), suggesting organized development.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is known but specific data features and structure are not detailed in the provided metadata.
Data may reflect translation bias inherent to the use of DeepL for processing.
Provenance
Source
Translated from an English subset of the OASST2 dataset.
Collection Method
Machine translation using DeepL, processed from kunishou/oasst2-135k-ja.
Freshness
Last updated 2024-04-28 16:39:03.
License is unknown; terms of use must be verified before application.