Namuwiki Extracted: Korean Wiki Database Dump from March 2022

Name: Namuwiki Extracted: Korean Wiki Database Dump from March 2022
Creator: heegyu
Published: 2022-10-01T01:27:07
Keywords: Text, Korean Text, Wiki Dump, Knowledge Base

Description

A dump of the Korean wiki site namu.wiki, containing 571,308 rows of text. The data was extracted and preprocessed with the namu-wiki-extractor tool from a snapshot dated March 1, 2022. User heegyu published it on Hugging Face on October 1, 2022 and last updated it on January 15, 2023.

Use Cases

Train Korean language models based on the wiki's extensive text corpus.
Analyze the structure and topics of a large Korean collaborative knowledge base.
Build Korean question-answering systems using the wiki's factual content.
Study online Korean discourse and information presentation as captured at the snapshot date.

Strengths

Contains 571,308 entries, providing a substantial corpus of Korean text.
The raw download is 2.19 GB, almost all of it article text.
Specific preprocessing steps, such as header and table removal, are documented.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The description notes unresolved issues with footnote preprocessing for math markdown.
Data is from a single snapshot in March 2022 and may not reflect recent wiki updates.
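
Because the columns are undocumented, a first pass over a handful of rows can show which fields exist and what they hold. The following is a minimal sketch, assuming the rows have already been loaded as dictionaries; the field names `title` and `text` and the sample values are hypothetical placeholders, not confirmed by the dataset description.

```python
from collections import Counter, defaultdict


def summarize_fields(rows):
    """Summarize each field seen in an iterable of dict-like rows."""
    counts = defaultdict(
        lambda: {"types": Counter(), "non_empty": 0, "chars": 0, "strings": 0}
    )
    for row in rows:
        for name, value in row.items():
            entry = counts[name]
            # Track which Python types appear in this field.
            entry["types"][type(value).__name__] += 1
            # Count values that are neither None nor an empty string.
            if value is not None and value != "":
                entry["non_empty"] += 1
            # Accumulate string lengths to estimate typical field size.
            if isinstance(value, str):
                entry["chars"] += len(value)
                entry["strings"] += 1

    summary = {}
    for name, entry in counts.items():
        mean_len = entry["chars"] / entry["strings"] if entry["strings"] else None
        summary[name] = {
            "types": dict(entry["types"]),
            "non_empty": entry["non_empty"],
            "mean_str_len": mean_len,
        }
    return summary


# Hypothetical rows; the real field names are not documented in the source.
sample_rows = [
    {"title": "예시 문서", "text": "본문 내용입니다."},
    {"title": "빈 문서", "text": ""},
]

field_summary = summarize_fields(sample_rows)
for name, info in field_summary.items():
    print(name, info)
```

Running the same function over the first few thousand real rows should be enough to tell title-like fields from body-text fields before committing to a schema.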

Provenance

Source: namu.wiki
Collection Method: Database dump extracted and preprocessed with namu-wiki-extractor.
Time Range: Snapshot dated 2022-03-01.
Freshness: Last updated 2023-01-15 09:46:31; freshness should be verified.
Geography: Content is primarily in Korean, likely focused on topics relevant to a Korean-speaking audience.

The description notes specific preprocessing choices and known issues with math markdown in footnotes.
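
One way to gauge how much the footnote math issue matters is to count rows that still contain math markup. The sketch below is illustrative only: the marker patterns `[math(` and `<math>` are assumptions about what leftover markup might look like, not something the dataset description specifies, so they are passed as a parameter and should be adjusted after inspecting real rows.

```python
import re

# Assumed leftover markers; not confirmed by the source.
# Adjust after inspecting real rows.
DEFAULT_PATTERNS = (r"\[math\(", r"<math>")


def find_math_residue(texts, patterns=DEFAULT_PATTERNS):
    """Return indices of texts that match any of the given regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        i for i, text in enumerate(texts)
        if any(c.search(text) for c in compiled)
    ]


# Hypothetical sample texts for illustration.
sample_texts = [
    "일반 문장입니다.",
    "각주에 [math(x^2)] 수식이 남아 있습니다.",
    "No markup here.",
]

flagged = find_math_residue(sample_texts)
print(flagged)
```

The share of flagged rows gives a rough basis for deciding whether to filter them out or clean them before training.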

Quality Score

Overall: D (38)
Description: 51
Source: 39
Reputation: 20
Access: 26

Community

Downloads: 388
Likes: 24
Views: 0

Dataset Info

Author: heegyu
Created: Oct 1, 2022
Updated: Jan 15, 2023
Last synced: May 4, 2026
