A collection of labeled source code snippets in three or more programming languages, including C++, Java, and Python. Each snippet is labeled as either vulnerable or secure, to support training models for automated security auditing.
Use Cases
- Train a binary classifier to distinguish between secure and vulnerable code using the label and source_code fields.
- Evaluate the cross-language generalization of security models using the C++, Java, and Python subsets.
- Fine-tune a code-specific language model for vulnerability detection using the provided labeled examples.
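
The cross-language evaluation use case above can be sketched as a leave-one-language-out split with per-language accuracy. This is a minimal illustration, not the dataset's official loader. The `label` and `source_code` fields come from the description above, while the `language` field, the label values `"vulnerable"` and `"secure"`, the sample records, and the helper names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records. Only `label` and `source_code` are named in the
# dataset description; the `language` field and the label strings are
# assumptions made for this sketch.
records = [
    {"language": "C++", "label": "vulnerable", "source_code": "char buf[8]; strcpy(buf, input);"},
    {"language": "C++", "label": "secure", "source_code": "std::string buf = input;"},
    {"language": "Java", "label": "vulnerable", "source_code": 'stmt.execute("SELECT * FROM t WHERE id=" + id);'},
    {"language": "Java", "label": "secure", "source_code": 'ps = conn.prepareStatement("SELECT * FROM t WHERE id=?");'},
    {"language": "Python", "label": "vulnerable", "source_code": "eval(user_input)"},
    {"language": "Python", "label": "secure", "source_code": "ast.literal_eval(user_input)"},
]


def leave_one_language_out(rows, held_out):
    """Split rows into (train, test); test contains only the held-out language."""
    train = [r for r in rows if r["language"] != held_out]
    test = [r for r in rows if r["language"] == held_out]
    return train, test


def accuracy_by_language(rows, predict):
    """Return {language: accuracy} for a predict(source_code) -> label callable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in rows:
        total[r["language"]] += 1
        if predict(r["source_code"]) == r["label"]:
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Train on C++ and Java, evaluate on Python.
train, test = leave_one_language_out(records, "Python")

# Placeholder baseline that always predicts "secure"; swap in a trained model.
scores = accuracy_by_language(test, lambda code: "secure")
```

Holding out one language at a time keeps the test language entirely unseen during training, which is what a generalization claim needs. A random split across all languages would instead measure in-distribution performance.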
Strengths
- Includes labeled source code examples for C++, Java, and Python.
- Labels each code snippet as vulnerable or secure, based on the presence of security vulnerabilities.
- Provides a compact data structure for cross-language vulnerability benchmarking.