Misraj Structured Data Dump (MSDD) is a large-scale Arabic multimodal dataset created by Misraj. It was extracted and filtered from Common Crawl dumps using a WASM pipeline and uniquely preserves the structural integrity of web content by providing markdown output. The dataset was last updated on September 29, 2025.
Use Cases
- Training Arabic language models on structured web text.
- Developing multimodal AI systems that combine Arabic text with other media.
- Benchmarking web content extraction and cleaning pipelines against the WASM-based process described above.
- Studying the structural patterns of Arabic web content through the preserved markdown format.
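As a rough illustration of the last use case above, the sketch below counts a few Markdown structural elements in a single document string. It assumes each record exposes its Markdown as a plain Python string; the function name `markdown_profile`, the pattern set, and the sample text are hypothetical and not part of the dataset or its tooling.

```python
import re
from collections import Counter

# Line-level Markdown patterns: a simplified subset, not a full CommonMark parser.
PATTERNS = {
    "heading": re.compile(r"^#{1,6}\s+\S"),
    "list_item": re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S"),
    "table_row": re.compile(r"^\s*\|.*\|\s*$"),
}
# Inline image syntax: ![alt](target)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def markdown_profile(text: str) -> Counter:
    """Count headings, list items, table rows, and images in a Markdown string."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.match(line):
                counts[name] += 1
        counts["image"] += len(IMAGE.findall(line))
    return counts


# Hypothetical sample document, for illustration only.
sample = "# عنوان\n\n- عنصر أول\n- عنصر ثان\n\n![صورة](img.png)\n"
profile = markdown_profile(sample)
print(dict(profile))
```

Aggregating these counters over a sample of records would give a first approximation of how much structure the markdown output retains. A real analysis would likely use a proper Markdown parser rather than regular expressions.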
Strengths
- Created specifically to address the lack of high-quality, structured multimodal data for Arabic.
- Preserves structural integrity of web content by providing markdown output.
- Extracted and filtered from the large-scale Common Crawl web corpus.
Limitations
- Row count, file formats, and column-level documentation are unknown, which may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
- Data may reflect geographic or source bias inherent to the Common Crawl web corpus.
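Because data quality has to be inspected manually, one simple first check is the share of Arabic-script letters in a text sample. The sketch below is a minimal heuristic under stated assumptions: it only considers the basic Arabic Unicode block (U+0600 to U+06FF), and the function name `arabic_ratio` is hypothetical rather than anything shipped with the dataset.

```python
def arabic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Assumption: only U+0600..U+06FF is counted; supplement and
    # presentation-form blocks are ignored for simplicity.
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


# Hypothetical samples, for illustration only.
print(arabic_ratio("مرحبا"))
print(arabic_ratio("hello"))
```

A low ratio on a sampled record does not prove the record is bad (code snippets or mixed-language pages are legitimate), but it can help prioritize which records to read by hand.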
Provenance
- Source: Common Crawl web dumps.
- Collection Method: Extracted and filtered using a WASM pipeline.
- Time Range: Not specified.
- Freshness: Last updated 2025-09-29 06:31:15; freshness should be verified.
- Geography: Likely contains Arabic-language web content, but specific geographic coverage is unknown.