Description
A dump of the Korean wiki site namu.wiki, containing 571,308 rows of text data. The dataset was extracted and preprocessed using the namu-wiki-extractor tool, with a snapshot date of March 1, 2022. It was uploaded to Hugging Face by user heegyu in January 2023.
Use Cases
Train Korean language models based on the wiki's extensive text corpus.
Analyze the structure and topics of a large Korean collaborative knowledge base.
Build Korean question-answering systems using the wiki's factual content.
Study the evolution of online Korean discourse and information presentation.
Strengths
Contains 571,308 entries, providing a substantial corpus of Korean text.
The raw download is 2.19 GB, almost all of it article text.
Specific preprocessing steps, such as header and table removal, are documented.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The dataset description notes that preprocessing of math markdown inside footnotes has unresolved issues.
Data is from a single snapshot in March 2022 and may not reflect recent wiki updates.
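Because the columns are undocumented, a quick field survey over a sample of rows is one way to infer what each column holds. The sketch below is a minimal illustration using only the Python standard library. The field names `title` and `text` and the sample rows are hypothetical placeholders, not confirmed columns of this dataset.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def summarize_fields(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Collect per-field type counts, non-empty counts, and mean string length."""
    types = defaultdict(Counter)   # field name -> Counter of Python type names
    non_empty = Counter()          # field name -> number of non-empty values
    total_len = Counter()          # field name -> total characters across str values
    n_rows = 0
    for row in rows:
        n_rows += 1
        for name, value in row.items():
            types[name][type(value).__name__] += 1
            if value not in (None, "", [], {}):
                non_empty[name] += 1
            if isinstance(value, str):
                total_len[name] += len(value)

    summary = {}
    for name, counter in types.items():
        n_str = counter.get("str", 0)
        summary[name] = {
            "types": dict(counter),
            "non_empty": non_empty[name],
            "rows_seen": n_rows,
            "mean_chars": total_len[name] / n_str if n_str else None,
        }
    return summary


# Hypothetical sample rows; the real column names must be checked after download.
# With the Hugging Face `datasets` library, rows could instead be streamed via
# load_dataset(<repo id>, split="train", streaming=True). The repo id and the
# split name are not stated in this card, so both are assumptions.
sample_rows = [
    {"title": "예시 문서", "text": "본문 내용입니다."},
    {"title": "빈 문서", "text": ""},
]

for field, stats in summarize_fields(sample_rows).items():
    print(field, stats)
```

Running the same function over the first few thousand real rows should show which fields are free text, which are short identifiers, and how often each one is empty.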
Provenance
Source
namu.wiki
Collection Method
Database dump extracted and preprocessed with namu-wiki-extractor.
Time Range
Snapshot dated 2022-03-01.
Freshness
Last updated 2023-01-15 09:46:31. The content itself still dates from the 2022-03-01 snapshot, so verify freshness before relying on it for recent topics.
Geography
Content is primarily in Korean, likely focused on topics relevant to a Korean-speaking audience.