Tokenlearn Cornstack Docs Coderankembed V2: Code Embeddings for 6 Languages

Name: Tokenlearn Cornstack Docs Coderankembed V2: Code Embeddings for 6 Languages
Creator: minishlab
Published: 2026-05-05T19:06:55
Keywords: Machine Learning, Code Retrieval, Tabular, Programming Languages

Description

minishlab/tokenlearn-cornstack-docs-coderankembed-v2 contains 600,000 rows of mean token embeddings for code documents, generated by the nomic-ai/CodeRankEmbed model. The dataset was created with Tokenlearn for training Model2Vec models on code retrieval tasks. It includes code from CornStack across 6 programming languages, with 100,000 rows per language.
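As a rough illustration of what a "mean token embedding" is, the sketch below averages per-token vectors into one vector per document. It assumes the usual convention of skipping padded positions through an attention mask; how the dataset's vectors were actually pooled is not documented on this page, and the function name and toy arrays are invented for the example.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per document, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, dim)
    attention_mask:   shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token count so an all-padding row does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy input: one document with two real tokens and one padded position.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
pooled = mean_pool(tokens, mask)
```

The padded third token is excluded, so the result is the average of the first two token vectors only.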

Use Cases

Training code retrieval models based on pre-computed mean token embeddings.
Distilling static embedding models using the provided CodeRankEmbed targets.
Benchmarking code representation learning across 6 programming languages.
Fine-tuning embedding models for semantic search within codebases.
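The distillation use case can be pictured with a toy example: fit one static vector per token so that the mean of a document's token vectors approximates its target embedding. The sketch below uses ordinary least squares on synthetic data. It is an illustration of the idea only, not the training procedure Tokenlearn or Model2Vec actually uses, and every name in it is hypothetical.

```python
import numpy as np

def fit_static_embeddings(token_ids_per_doc, targets, vocab_size):
    """Least-squares fit of a token table E so that mean(E[tokens]) ~ target."""
    bag = np.zeros((len(token_ids_per_doc), vocab_size))
    for row, ids in enumerate(token_ids_per_doc):
        for tok in ids:
            # Each token contributes 1/len(doc), so bag @ E is the mean vector.
            bag[row, tok] += 1.0 / len(ids)
    table, *_ = np.linalg.lstsq(bag, targets, rcond=None)
    return table

def embed(token_ids, table):
    """Static embedding of a document: the mean of its token vectors."""
    return table[token_ids].mean(axis=0)

# Synthetic stand-in for (tokenized document, teacher embedding) pairs.
rng = np.random.default_rng(0)
true_table = rng.normal(size=(5, 3))
docs = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 0], [0, 2, 4]]
targets = np.stack([true_table[d].mean(axis=0) for d in docs])

table = fit_static_embeddings(docs, targets, vocab_size=5)
approx = np.stack([embed(d, table) for d in docs])
```

With real data the targets would be the dataset's CodeRankEmbed vectors and the token ids would come from a tokenizer; a gradient-based fit would replace the dense least-squares solve at that scale.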

Strengths

Contains 600,000 total rows of data.
Includes a balanced 100,000 rows for each of 6 programming languages.
Embeddings are produced by a specific, named model (nomic-ai/CodeRankEmbed).
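The stated balance can be checked after download with a simple count per language. The sketch below assumes each row is a dict-like record with a field named "language"; that field name is a guess, since the page does not document the columns, and the sample rows and language labels are invented.

```python
from collections import Counter

def language_counts(rows, language_field="language"):
    """Count rows per language; the field name is an assumption."""
    return Counter(row[language_field] for row in rows)

def is_balanced(counts, expected_per_language):
    """True when every language has exactly the expected number of rows."""
    return all(n == expected_per_language for n in counts.values())

# Invented sample rows standing in for the real records.
sample_rows = [
    {"language": "python"},
    {"language": "go"},
    {"language": "python"},
    {"language": "go"},
]
counts = language_counts(sample_rows)
balanced = is_balanced(counts, expected_per_language=2)
```

On the full dataset the same check would use expected_per_language=100000 and should report 6 distinct keys if the description is accurate.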

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is known, but file size and storage format are not documented.
Description metadata is limited; actual data quality requires manual inspection after download.
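Because the columns are undocumented, a first step after download is to summarize what each field holds. The sketch below inspects a sample of dict-like rows and records the value types and sequence lengths seen per column. The column names "text" and "embedding" in the sample are placeholders, not the dataset's confirmed schema.

```python
def summarize_columns(rows, sample_size=100):
    """Collect value types and list lengths per column from the first rows."""
    summary = {}
    for index, row in enumerate(rows):
        if index >= sample_size:
            break
        for name, value in row.items():
            info = summary.setdefault(name, {"types": set(), "lengths": set()})
            info["types"].add(type(value).__name__)
            if isinstance(value, (list, tuple)):
                # A single consistent length suggests a fixed-size embedding.
                info["lengths"].add(len(value))
    return summary

# Placeholder rows; real rows would come from the downloaded files.
sample_rows = [
    {"text": "def f(): pass", "embedding": [0.1, 0.2, 0.3]},
    {"text": "fn main() {}", "embedding": [0.4, 0.5, 0.6]},
]
report = summarize_columns(sample_rows)
```

A column whose values are all lists of one length is a likely candidate for the embedding field, and that length gives the embedding dimension.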

Provenance

Source: CornStack
Collection Method: Created with Tokenlearn; embeddings produced by nomic-ai/CodeRankEmbed model.
Freshness: Last updated 2026-05-06 03:42:07.

License: unknown. Verify the terms of use before using the data.

Quality Score

Overall: C42
Description: 51
Source: 41
Reputation: 39
Access: 26

Community

Downloads: 11
Likes: 1
Views: 0

Dataset Info

Author: minishlab
Created: May 5, 2026
Updated: May 6, 2026
Last synced: Jun 3, 2026
