Description
minishlab/tokenlearn-cornstack-docs-coderankembed-v2 contains 600,000 rows of mean token embeddings for code documents, generated by the nomic-ai/CodeRankEmbed model. The dataset was created with Tokenlearn for training Model2Vec models on code retrieval tasks. It includes code from CornStack across 6 programming languages, with 100,000 rows per language.
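To make "mean token embeddings" concrete, here is a minimal sketch of mean pooling in plain Python. It assumes the usual meaning of the term: the per-token output vectors of a document are averaged, with padding positions skipped. The source does not describe the exact pooling Tokenlearn applies, and the function name, toy vectors, and mask below are hypothetical.

```python
def mean_token_embedding(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to average")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (made up): three token vectors of dimension 2; the last is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_token_embedding(tokens, mask)
print(pooled)
```

Each row of the dataset would then hold one such pooled vector per code document, if this assumption matches how the embeddings were produced.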
Use Cases
Training code retrieval models based on pre-computed mean token embeddings.
Distilling static embedding models using the provided CodeRankEmbed targets.
Benchmarking code representation learning across 6 programming languages.
Fine-tuning embedding models for semantic search within codebases.
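The distillation use case can be illustrated with a toy sketch. It assumes a static embedding model is a lookup table of per-token vectors, that a document vector is the mean of its token vectors, and that training minimizes mean squared error against the pre-computed target. This is an illustrative setup only, not Tokenlearn's or Model2Vec's actual training code, and every name and number below is hypothetical.

```python
def static_doc_vector(token_ids, table):
    """Document vector = mean of the static vectors of its tokens."""
    dim = len(table[0])
    return [sum(table[t][i] for t in token_ids) / len(token_ids) for i in range(dim)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sgd_step(token_ids, table, target, lr):
    """One gradient step on the table; returns the loss before the update."""
    pred = static_doc_vector(token_ids, table)
    dim, n = len(pred), len(token_ids)
    # d(mse)/d(pred_i) = 2 * (pred_i - target_i) / dim
    grad_pred = [2 * (p - t) / dim for p, t in zip(pred, target)]
    # Each token occurrence contributes 1/n to the mean, so it gets grad / n.
    for t in token_ids:
        for i in range(dim):
            table[t][i] -= lr * grad_pred[i] / n
    return mse(pred, target)


# Toy setup (made up): vocabulary of 3 tokens, dimension 2, one document.
table = [[0.0, 0.0] for _ in range(3)]
doc_token_ids = [0, 2]
target_embedding = [1.0, -1.0]  # stands in for a CodeRankEmbed target

losses = [sgd_step(doc_token_ids, table, target_embedding, lr=0.5) for _ in range(50)]
print(losses[0], losses[-1])
```

The loss shrinks on every step here because the toy problem has a single document; a real run would batch over many rows and use an optimizer from a deep learning framework.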
Strengths
Contains 600,000 total rows of data.
Includes a balanced 100,000 rows for each of 6 programming languages.
Embeddings are produced by a specific, named model (nomic-ai/CodeRankEmbed).
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count is known, but the file size and storage format are not stated.
Description metadata is limited; actual data quality requires manual inspection after download.
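Because the columns are undocumented, a first inspection step could be to summarize the fields of one downloaded row. The helper below is a minimal sketch of that idea; the sample row and its column names (`document`, `embedding`, `language`) are hypothetical placeholders, not the dataset's confirmed schema.

```python
def describe_row(row):
    """Return a short type/size summary for each field of a single row."""
    summary = {}
    for name, value in row.items():
        if isinstance(value, str):
            summary[name] = f"str chars={len(value)}"
        elif isinstance(value, (list, tuple)):
            inner = type(value[0]).__name__ if value else "empty"
            summary[name] = f"{type(value).__name__}[{inner}] len={len(value)}"
        else:
            summary[name] = type(value).__name__
    return summary


# Hypothetical row; the real column names must be checked after download.
sample_row = {
    "document": "def add(a, b):\n    return a + b",
    "embedding": [0.1, -0.2, 0.3],
    "language": "python",
}
row_summary = describe_row(sample_row)
for field, info in row_summary.items():
    print(f"{field}: {info}")
```

Running the same helper on a handful of real rows should show which field holds the embedding vector and what its dimensionality is.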
Provenance
Source
CornStack
Collection Method
Created with Tokenlearn; embeddings were produced by the nomic-ai/CodeRankEmbed model.
Freshness
Last updated 2026-05-06 03:42:07.
License is unknown; terms of use must be verified before application.