Name: PangeanicYueJa: 55,000 Cantonese-Japanese Parallel Sentences for Machine Translation
Creator: Pangeanic
Published: 2026-06-18T08:23:41
Keywords: Cantonese, Machine Translation, Japanese, Text, Multilingual, Large Scale, Natural Language Processing, Multilingual Nlp, Parallel Corpus

Description

PangeanicYueJa is a parallel corpus containing 55,000 Cantonese-Japanese sentence pairs sampled from a larger collection of approximately 3.08 million pairs. It was created by Pangeanic and released on Hugging Face, with a last recorded update in June 2026. The corpus is designed for training and evaluating machine translation and multilingual language models.

Use Cases

Train or fine-tune Cantonese-Japanese machine translation models on the parallel sentence pairs.
Supply bilingual text for developing multilingual large language models (LLMs).
Conduct cross-lingual NLP research using the aligned sentences.
Build retrieval-augmented generation (RAG) systems that draw on the bilingual corpus.
Create bilingual embeddings or instruction-tuning data from the parallel text.
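
Most of these uses need held-out data for evaluation, and the listing does not mention predefined splits. The following is a minimal sketch of one way to split sentence pairs deterministically, assuming the pairs have already been read into (source, target) tuples. The function names, the 90/5/5 proportions, and the toy sentences are illustrative assumptions, not part of the dataset.

```python
import hashlib

def assign_split(source: str, target: str, dev_pct: int = 5, test_pct: int = 5) -> str:
    """Assign a sentence pair to train/dev/test from a stable hash of its text.

    Hashing the text (rather than shuffling) keeps the assignment stable across
    runs and sends exact duplicate pairs to the same split.
    """
    digest = hashlib.sha256(f"{source}\t{target}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

def split_pairs(pairs):
    """Group (source, target) tuples into a dict of train/dev/test lists."""
    splits = {"train": [], "dev": [], "test": []}
    for source, target in pairs:
        splits[assign_split(source, target)].append((source, target))
    return splits

# Toy stand-in pairs (Cantonese, Japanese); NOT taken from the dataset.
toy_pairs = [("你好", "こんにちは"), ("多謝", "ありがとう"), ("早晨", "おはよう")]
splits = split_pairs(toy_pairs)
```

Because assignment depends only on the pair's text, adding more data later does not move existing pairs between splits.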

Strengths

Contains 55,000 parallel Cantonese-Japanese sentence pairs in a single release.
Sampled from a larger corpus of approximately 3.08 million sentence pairs, which suggests a route to scaling if the full collection can be obtained.
Described as intended for several NLP tasks, including machine translation, LLM training, and RAG.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Only an approximate size (about 3.08 million pairs) is given for the full source corpus; its exact row count is not stated, which may limit suitability assessment for large-scale projects.
Description metadata is limited; actual data quality and sampling methodology require manual inspection after download.
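
The manual inspection mentioned above can be partly automated. The following is a rough sketch, assuming the downloaded file has already been parsed into a list of dictionaries, one per row. The column names "yue" and "ja" are hypothetical placeholders, since the real field names are undocumented, and summarize_records is an illustrative helper rather than part of any library.

```python
from collections import Counter

def summarize_records(records):
    """Report which columns appear, how many values are empty, and how many rows repeat."""
    columns = Counter()   # column name -> number of rows containing it
    empty = Counter()     # column name -> number of None/blank values
    seen = set()
    duplicates = 0
    for row in records:
        for name, value in row.items():
            columns[name] += 1
            if value is None or not str(value).strip():
                empty[name] += 1
        # Order-independent fingerprint of the row, used to detect exact repeats.
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": len(records),
        "columns": dict(columns),
        "empty": dict(empty),
        "duplicates": duplicates,
    }

# Toy rows with hypothetical column names; NOT taken from the dataset.
sample = [
    {"yue": "你好", "ja": "こんにちは"},
    {"yue": "多謝", "ja": ""},
    {"yue": "你好", "ja": "こんにちは"},
]
report = summarize_records(sample)
```

A report like this only surfaces structural problems (missing fields, blank values, exact duplicates); translation quality and alignment accuracy still need a human reader of both languages.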

Provenance

Source: Pangeanic
Freshness: Last updated 2026-06-18 08:48:19; freshness should be verified.

License is unknown; users should verify terms of use before downloading.
