MSDD: Arabic Multimodal Web Content with Markdown Structure

Name: MSDD: Arabic Multimodal Web Content with Markdown Structure
Creator: Misraj
Published: 2025-09-24T12:04:47
Keywords: Arabic Language, Web Content, Common Crawl, Tabular, Large Scale, Multimodal

Available on 1 platform

Description

Misraj Structured Data Dump (MSDD) is a large-scale Arabic multimodal dataset created by Misraj. It was extracted and filtered from Common Crawl dumps using a WASM pipeline, and it preserves the structural integrity of web content by providing markdown output. The dataset was last updated on September 29, 2025.

Use Cases

Training Arabic language models on structured web text.
Developing multimodal AI systems that combine Arabic text with other media.
Benchmarking web content extraction and cleaning pipelines against the described WASM process.
Studying the structural patterns of Arabic web content using the preserved markdown format.
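
As an illustration of the last use case, the sketch below counts a few common markdown constructs in a single document. It is a minimal example under stated assumptions: the regular expressions, the function name markdown_profile, and the sample text are all hypothetical and are not taken from MSDD's documentation, which does not describe its markdown dialect or column layout.

```python
import re
from collections import Counter

# Hypothetical patterns for common Markdown constructs.
# These are assumptions for illustration, not MSDD's documented format.
PATTERNS = {
    "heading": re.compile(r"^#{1,6}\s+\S"),
    "list_item": re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S"),
    "table_row": re.compile(r"^\s*\|.*\|\s*$"),
    "image": re.compile(r"!\[[^\]]*\]\([^)]*\)"),
}


def markdown_profile(text: str) -> Counter:
    """Count structural Markdown elements in one document, line by line."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if name == "image":
                # Images are inline, so several may appear on one line.
                counts[name] += len(pattern.findall(line))
            elif pattern.match(line):
                counts[name] += 1
    return counts


# Made-up sample document (not drawn from the dataset).
sample = (
    "# عنوان\n\n"
    "- بند أول\n"
    "- بند ثانٍ\n\n"
    "![صورة](https://example.com/a.png)\n\n"
    "| أ | ب |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
)
profile = markdown_profile(sample)
```

Note that the table_row pattern also counts the separator row of a table, so a table with one header and one body row contributes three matches. Aggregating such profiles over many documents would give a rough picture of how often headings, lists, tables, and images occur.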

Strengths

Created specifically to address the lack of high-quality, structured multimodal data for Arabic.
Preserves structural integrity of web content by providing markdown output.
Extracted and filtered from the large-scale Common Crawl web corpus.

Limitations

Row count, file formats, and column-level documentation are unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
Data may reflect geographic or source bias inherent to the Common Crawl web corpus.
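
Since data quality has to be checked by hand after download, one simple first check is the share of Arabic-script letters in a text sample. The sketch below is only an illustration: the function name arabic_ratio is hypothetical, and the way text samples are obtained from the dataset is left open because the file formats and columns are not documented.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names for Arabic-script letters contain the word "ARABIC".
    arabic = sum(1 for ch in letters if "ARABIC" in unicodedata.name(ch, ""))
    return arabic / len(letters)


# Made-up samples (not drawn from the dataset).
mostly_arabic = arabic_ratio("مرحبا بالعالم")
mixed = arabic_ratio("اب ab")
```

A low ratio on a sample would suggest boilerplate, markup, or non-Arabic text, and could flag that sample for closer manual review. Digits, punctuation, and diacritics are ignored because they are not alphabetic characters.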

Provenance

Source: Common Crawl web dumps.
Collection Method: Extracted and filtered using a WASM pipeline.
Time Range: Not specified.
Freshness: Last updated 2025-09-29 06:31:15; freshness should be verified.
Geography: Likely contains Arabic-language web content, but specific geographic coverage is unknown.

Quality Score

Overall: D (36)
Description: 39
Source: 36
Reputation: 42
Access: 22

Community

Downloads: 17
Likes: 3
Views: 0

Dataset Info

Author: Misraj
Created: Sep 24, 2025
Updated: Sep 29, 2025
Last synced: Jul 14, 2026