35,000 patent documents and their abstracts, labeled with one of 9 unbalanced classes for long-context text classification. The data is sampled from the BIGPATENT corpus, and the full (non-abstract) documents are selected to exceed 512 tokens.
Use Cases
- Train text classification models to predict one of 9 patent classes using the full document text
- Develop summarization models by treating the patent description as input and the abstract as the target
- Evaluate long-context architectures on documents that exceed the 512-token limit
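
For models capped at 512 input tokens, one common workaround is to split each document into overlapping windows and aggregate the per-window predictions. The snippet below is a minimal sketch of that idea, not part of the dataset's tooling: the function name `chunk_tokens`, the stride of 384, and the toy token list are all assumptions made for illustration, and a real pipeline would use the model's own tokenizer.

```python
from typing import List


def chunk_tokens(tokens: List[str], window: int = 512, stride: int = 384) -> List[List[str]]:
    """Split a token list into overlapping windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap by
    `window - stride` tokens. (Hypothetical helper, for illustration only.)
    """
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("require 0 < stride <= window")
    if len(tokens) <= window:
        return [tokens]

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the document.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


# Toy stand-in for a tokenized patent description (assumed, not real data).
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

Overlapping windows keep some shared context across chunk boundaries at the cost of extra compute. Whether to average, max-pool, or otherwise combine the per-chunk outputs is a separate design choice.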
Strengths
- 35,000 total patent records across 9 unbalanced classification categories
- Includes both full-length patent descriptions and corresponding abstracts
- Documents are specifically selected for lengths exceeding 512 tokens
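
Because the 9 classes are unbalanced, a classifier trained with a plain loss may favor the larger classes. One common mitigation is to weight each class inversely to its frequency. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`. The function name `inverse_frequency_weights` and the toy label list are assumptions for illustration and do not reflect the dataset's actual class distribution.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def inverse_frequency_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return a weight per class equal to n_samples / (n_classes * class_count).

    Rare classes receive weights above 1 and frequent classes below 1.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels with three classes (assumed, not the real 9-class distribution).
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_frequency_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

The resulting dictionary can be passed to a weighted loss or used to drive a weighted sampler. Reporting macro-averaged metrics alongside accuracy is also worth considering when classes are this uneven.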