Description
SEC 10-K Extraction Dataset contains 106 SEC 10-K filing records paired with structured qualitative extractions. The dataset was created by krittamet-rod and was last updated on 2026-05-23. It covers business overview, segments, moat, competition, risks, and strategic initiatives.
Use Cases
Train or evaluate NLP models that extract structured business information, such as the business overview and strategic initiatives categories.
Fine-tune language models on financial text-to-JSON tasks, using the raw SEC filing text as input and the structured JSON extraction as the target.
Analyze qualitative business themes across companies using the extracted competition and risks fields.
Build benchmarks for information retrieval from long-form financial documents, using the raw SEC filing text.
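As a rough illustration of the text-to-JSON use case, the sketch below turns one record into a prompt/completion pair. The field names `text` and `extraction` are hypothetical placeholders, since the dataset's actual column names are not documented here; pass the real names once they are known. The prompt wording is likewise only an example.

```python
import json

# Example instruction; the wording is an assumption, not part of the dataset.
INSTRUCTION = (
    "Extract the business overview, segments, moat, competition, risks, "
    "and strategic initiatives from this 10-K filing as JSON.\n\n"
)

def to_prompt_completion(record, text_key="text", extraction_key="extraction"):
    """Build a prompt/completion pair from one record.

    text_key and extraction_key are hypothetical field names.
    """
    extraction = record[extraction_key]
    # The extraction may be stored as a JSON string or as a nested object;
    # parse strings so both cases serialize the same way.
    if isinstance(extraction, str):
        extraction = json.loads(extraction)
    return {
        "prompt": INSTRUCTION + record[text_key],
        "completion": json.dumps(extraction, sort_keys=True),
    }
```

Serializing the target with `sort_keys=True` keeps key order stable across records, which tends to make completions easier to compare during evaluation.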
Strengths
Contains 106 paired records, providing a defined corpus for training or evaluation.
Data is stored as JSONL, with each record pairing raw filing text with a structured JSON extraction.
Extractions cover multiple defined qualitative categories, including business overview, segments, moat, competition, risks, and strategic initiatives.
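Because the data is described as JSONL (one JSON object per line), it can be read with the standard library alone. The following is a minimal sketch; the file path is supplied by the caller, and no field names are assumed.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSONL file into a list of parsed records."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records
```

If the description above is accurate, `len(load_jsonl(path))` should come out at 106 for the full file.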
Limitations
At 106 records the corpus is small, which may limit suitability for some training or evaluation tasks.
Column-level documentation is absent; field semantics must be inferred after download.
Last updated 2026-05-23 17:28:07; freshness should be verified.
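Since column-level documentation is absent, one way to start inferring field semantics after download is to tally which top-level keys appear and what value types they hold. This is only a sketch: it takes already-parsed records (a list of dicts) and assumes nothing about the actual field names.

```python
from collections import Counter, defaultdict

def profile_fields(records):
    """Map each top-level key to a count of the value types seen for it."""
    profile = defaultdict(Counter)
    for record in records:
        for key, value in record.items():
            profile[key][type(value).__name__] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {key: dict(counts) for key, counts in profile.items()}
```

A key whose count is lower than the number of records is missing from some rows, and a key with more than one type is stored inconsistently; both are worth checking before training on the data.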
Provenance
Source
SEC EDGAR filings across multiple tickers and fiscal years.
Collection Method
Likely involves extracting and structuring text from publicly available SEC filings.