Name: Patent Document Clustering Benchmark With Nine Categories
Creator: mteb
Published: 2024-05-24T16:35:39
Keywords: size_categories:10K<n<100K, library:polars, library:dask, arxiv:1906.03741, modality:text, library:mlcroissant, source_datasets:jinaai/big-patent-clustering, library:datasets, license:cc-by-4.0, mteb, parquet, text, language:eng, region:us, task_categories:text-classification, multilinguality:monolingual, arxiv:2502.13595, annotations_creators:derived

Description

Test set for clustering documents from the Big Patent dataset, containing documents belonging to nine distinct categories. It is part of the Massive Text Embedding Benchmark (MTEB) for evaluating embedding models on legal and written domain text.

Use Cases

Evaluate embedding models on clustering performance across nine patent document categories.
Benchmark text representation methods for legal and intellectual property documents.
Analyze document similarity within and between the nine specified patent categories.
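As an illustration of the first use case, the sketch below scores a clustering against the nine category labels using V-measure (the harmonic mean of homogeneity and completeness). The choice of metric is an assumption: the description above does not name one. The function names and toy labels are hypothetical, and the cluster assignments would in practice come from running a clustering algorithm over model embeddings, which is not shown here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given), estimated from co-occurrence counts."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_ids):
    """Harmonic mean of homogeneity and completeness, in [0, 1]."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster sequences must have equal length")
    h_true = entropy(true_labels)
    h_pred = entropy(cluster_ids)
    # Homogeneity: each cluster contains members of a single category.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_true
    )
    # Completeness: each category is assigned to a single cluster.
    completeness = (
        1.0 if h_pred == 0
        else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example with two hypothetical categories instead of nine.
categories = ["a", "a", "b", "b"]
predicted = [1, 1, 0, 0]  # cluster ids are arbitrary; only the grouping matters
score = v_measure(categories, predicted)
```

Because the score depends only on how documents are grouped, relabeling the cluster ids does not change it, which suits a benchmark where cluster numbering is arbitrary.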

Strengths

Documents are categorized into nine distinct classes, providing a structured clustering target.
Part of the standardized Massive Text Embedding Benchmark (MTEB) for model evaluation.
Focuses on the legal and written domains, specifically patent documents.

Limitations

The exact row count and column features are not specified; the size category tag only places the dataset between 10K and 100K rows.
Only contains a test set, limiting its use for training or validation tasks.
The temporal coverage and geographic origin of the patent documents are unspecified.

Provenance

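Source: Derived from the Big Patent dataset (source dataset tag: jinaai/big-patent-clustering); hosted on Hugging Face by mteb.
Collection Method: Documents drawn from the Big Patent dataset and grouped into nine categories, with labels derived from the source rather than newly annotated; curated for the MTEB benchmark.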
Time Range: Not specified
Freshness: Not specified
Geography: Not specified

This is a benchmark test set only, intended for evaluation, not for training models. The data is tagged as Parquet, but the specific column structure is not documented here.
