The Code Security Vulnerability Dataset is a curated collection of 175,419 code samples labeled across 31 classes: 30 Common Weakness Enumeration (CWE) types plus a 'safe' category. Labels are mapped to OWASP Top 10 2021 categories. The dataset is split into training, validation, and test sets and covers seven languages: C, C++, Python, JavaScript, Java, PHP, and Go.
Use Cases
- Training vulnerability classification models on the 31 labeled classes (30 CWE types plus 'safe').
- Benchmarking detection algorithms across the seven covered languages (C, C++, Python, JavaScript, Java, PHP, and Go).
- Mapping detected vulnerabilities to OWASP Top 10 2021 categories for security risk assessment.
- Evaluating model performance on the predefined train/validation/test splits.
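As a rough illustration of the OWASP mapping use case, the sketch below groups predicted labels by OWASP Top 10 2021 category. The mapping table is a small subset taken from OWASP's published CWE lists, not from this dataset: the dataset's actual 30 CWE labels, its label string format (assumed here to be "CWE-<id>" and "safe"), and the helper name owasp_summary are all assumptions.

```python
from collections import Counter

# Illustrative subset only. The dataset's real label set and label format
# are not documented here, so these keys are assumptions.
CWE_TO_OWASP_2021 = {
    "CWE-22": "A01:2021 Broken Access Control",
    "CWE-327": "A02:2021 Cryptographic Failures",
    "CWE-79": "A03:2021 Injection",
    "CWE-89": "A03:2021 Injection",
    "CWE-798": "A07:2021 Identification and Authentication Failures",
    "CWE-502": "A08:2021 Software and Data Integrity Failures",
    "CWE-918": "A10:2021 Server-Side Request Forgery (SSRF)",
}


def owasp_summary(predicted_labels):
    """Count predictions per OWASP category, skipping 'safe' samples."""
    counts = Counter()
    for label in predicted_labels:
        if label == "safe":
            continue
        # Labels missing from the table are bucketed as "unmapped".
        counts[CWE_TO_OWASP_2021.get(label, "unmapped")] += 1
    return dict(counts)


# "CWE-000" is a made-up label used to exercise the "unmapped" bucket.
summary = owasp_summary(["CWE-89", "safe", "CWE-79", "CWE-22", "CWE-000"])
print(summary)
```

If the dataset ships its own CWE-to-OWASP mapping column, that column should take precedence over a hand-written table like this one.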
Strengths
- Contains 175,419 total code samples, providing a substantial corpus for model training.
- Includes samples from seven programming languages: C, C++, Python, JavaScript, Java, PHP, and Go.
- Provides predefined splits of 140,335 training, 17,542 validation, and 17,542 test samples (roughly 80/10/10).
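The split sizes can be sanity-checked against the stated total with a few lines of arithmetic. This is only a sketch using the counts quoted above; the split names and the helper split_fractions are illustrative choices, not identifiers from the dataset.

```python
# Counts as stated in this summary; split names are assumed.
SPLITS = {"train": 140_335, "validation": 17_542, "test": 17_542}
TOTAL = 175_419


def split_fractions(splits):
    """Return each split's share of the combined sample count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


fractions = split_fractions(SPLITS)

# The three splits should account for every sample in the corpus.
assert sum(SPLITS.values()) == TOTAL

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
```

The counts sum to the stated total, and the shares come out to approximately 80% / 10% / 10%.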
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Last updated 2026-04-23 15:21:30 (time zone not stated); check the source for newer revisions before use.
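Because column-level documentation is absent, a first pass after download might simply tabulate which fields appear and which value types they hold. The sketch below does this over a list of dict records; the rows, the field names (code, label, language), and the helper infer_schema are hypothetical placeholders, since the real column names are not documented.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the sorted list of Python type names seen."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical rows for illustration only; not taken from the dataset.
sample_rows = [
    {"code": "strcpy(buf, src);", "label": "CWE-120", "language": "C"},
    {"code": "print('ok')", "label": "safe", "language": "Python"},
    {"code": "eval(x)", "label": None, "language": "JavaScript"},
]

schema = infer_schema(sample_rows)
print(schema)
```

Fields that show more than one type (such as a label that is sometimes missing) are the ones most worth checking by hand before training.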
Provenance
- Source: Hugging Face