Description
A Japanese-to-Simplified Chinese pre-translation dataset extracted from the COM3D2 and CM3D2 video game series. The dataset includes text from the base games, their expansions, and nearly all DLCs up to April 4, 2026. It was created by mollyadams; translations were primarily generated by GPT-5.2 xhigh and refined by GPT-5.4 xhigh. The last recorded update was April 24, 2026.
Use Cases
Training Japanese-to-Simplified Chinese machine translation models on the large-scale translated game text.
Fine-tuning language models on the conversational and informal text found in video game scripts.
Studying differences in translation quality and style between AI models such as GPT-5.2 and GPT-5.4, given the multi-model translation process.
Building parallel corpora for niche domains such as video game localization from the extracted game scripts.
Strengths
Text was translated by a top-tier model, GPT-5.2 xhigh, as stated in the description.
Approximately 40% of the translations were further polished by GPT-5.4 xhigh.
Covers text from multiple game titles and nearly all DLCs, including a limited 10th-anniversary DLC.
Limitations
The description notes that some untranslated text may remain due to GPT refusals that were not detected by the program.
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which may limit suitability assessment.
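Because undetected refusals may leave some rows untranslated, a quick screening pass after download can help. The sketch below is a heuristic only, not the author's method: it flags rows whose target text is empty or still contains Japanese kana, which should not normally appear in Simplified Chinese output. The column names `ja` and `zh` are hypothetical, since the dataset's fields are undocumented.

```python
import re

# Hiragana and katakana letters only. Kanji are excluded because Chinese
# shares them, and the middle dot and long-vowel mark are excluded because
# they can legitimately appear in translated text.
KANA = re.compile(r"[\u3041-\u3096\u30A1-\u30FA]")


def looks_untranslated(target: str) -> bool:
    """Heuristic: True if the target is empty or still contains kana."""
    target = target.strip()
    if not target:
        return True
    return KANA.search(target) is not None


# Hypothetical rows; "ja" and "zh" are assumed column names.
rows = [
    {"ja": "おはようございます", "zh": "早上好"},
    {"ja": "ありがとう", "zh": "ありがとう"},
    {"ja": "メイド", "zh": ""},
]

flagged = [row for row in rows if looks_untranslated(row["zh"])]
for row in flagged:
    print(row["ja"], "->", repr(row["zh"]))
```

This check will miss refusals written in another language (for example, an English refusal message) and source lines written entirely in kanji, so flagged rows are a starting point for review rather than a complete list.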
Provenance
Source
Text extracted from ks script files within arc files of the COM3D2 and CM3D2 video game series.
Collection Method
Text was parsed from game files and translated by AI models (GPT-5.2 xhigh, GPT-5.4 xhigh, deepseek-v3.2 thinking); audio content was transcribed by Qwen3-ASR-1.7B.
Time Range
Covers game content up to April 4, 2026.
Freshness
Last updated 2026-04-24 08:49:06; freshness should be verified.
Geography
Not specified.
License is unknown; users should verify permissions before use.