This dataset contains 2.6 million individual lines of Python 3 source code extracted from the CodeSearchNet repository. Each entry is a standalone, syntactically valid code snippet of at most 125 characters, stored in a single 'text' column.
Use Cases
- Evaluate variational autoencoder (VAE) latent spaces by measuring the percentage of decoded 'text' strings that are syntactically valid Python.
- Train character-level or subword-level language models on the 'text' column for code completion tasks.
- Benchmark greedy decoding algorithms by measuring how often their output, generated from the provided 'text' samples, parses into a valid Python AST.
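
The validity measurements mentioned above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the dataset authors' evaluation code: the helper names `is_valid_python` and `validity_rate` and the sample strings in `decoded` are hypothetical, and the result depends on the grammar of the Python version running the check.

```python
import ast


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses into a Python AST."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions
        return False


def validity_rate(snippets) -> float:
    """Percentage of snippets that parse; 0.0 for an empty input."""
    snippets = list(snippets)
    if not snippets:
        return 0.0
    valid = sum(is_valid_python(s) for s in snippets)
    return 100.0 * valid / len(snippets)


# Hypothetical decoder outputs, made up for illustration only
decoded = ["x = 1", "print('hi')", "def f(:", "return x +"]
rate = validity_rate(decoded)
```

Note that `ast.parse` only checks syntax, so a snippet can parse successfully and still fail at runtime, for example by referencing an undefined name.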
Strengths
- Large scale: 2.6 million rows.
- All entries are guaranteed to be parsable into a Python 3 Abstract Syntax Tree (AST).
- Strict length constraints with a maximum of 125 characters per 'text' entry.
- Data is stored as one JSON object per line, each with a single 'text' key.
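
The properties listed above (one JSON object per line, a 'text' key, a 125-character limit, AST parsability) could be spot-checked along these lines. This is a sketch under stated assumptions: the helpers `iter_text` and `check_entry` are hypothetical, and the two sample rows are made up and held in an in-memory buffer rather than read from the actual data files.

```python
import ast
import io
import json

MAX_CHARS = 125  # length limit stated in the dataset description


def iter_text(lines):
    """Yield the 'text' value from each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["text"]


def check_entry(text: str) -> bool:
    """True if the entry respects the length limit and parses as Python."""
    if len(text) > MAX_CHARS:
        return False
    try:
        ast.parse(text)
    except (SyntaxError, ValueError):
        return False
    return True


# In-memory stand-in for a data file; these rows are invented examples
sample = io.StringIO(
    '{"text": "import os"}\n'
    '{"text": "x = [i for i in range(3)]"}\n'
)
entries = list(iter_text(sample))
all_ok = all(check_entry(t) for t in entries)
```

To run the same check on a real file, an open file handle can be passed to `iter_text` in place of the `io.StringIO` buffer, since both yield one line per iteration.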