600,000 code documents across six programming languages provide training targets for static embedding distillation. The dataset was created by minishlab using Tokenlearn for training Model2Vec models on code retrieval. Mean token embeddings were produced by the nomic-ai/CodeRankEmbed model.
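As a rough illustration of what a "mean token embedding" target is, the sketch below averages per-token vectors into a single document vector. The function name `mean_pool`, the padding mask, and the toy vectors are assumptions made for illustration; they are not taken from Tokenlearn or CodeRankEmbed, and the dataset's actual pooling details may differ.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero.

    Hypothetical helper: shows the idea of a mean token embedding,
    not the code used to build the dataset.
    """
    if len(token_vectors) != len(mask):
        raise ValueError("token_vectors and mask must have the same length")
    # Keep only real tokens; mask == 0 marks padding (an assumption here).
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to average")
    dim = len(kept[0])
    # Average each dimension over the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy 2-dimensional token vectors; the last one stands in for padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
target = mean_pool(tokens, mask=[1, 1, 0])
print(target)
```

In a distillation setup of this kind, a vector like `target` would serve as the regression target that a static embedding model learns to reproduce for the same document.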
The license is not specified, which may restrict usage.