Description
Over 1.1 billion lines of code collected from GitHub repositories, focusing on interdisciplinary fields like biology, chemistry, and materials science. The dataset, created by liuhangbiao, is organized into 178 domain-specific topics across approximately 115 GB of data. It was last updated on 2026-04-05.
Use Cases
Training domain-specific code generation models on a large-scale collection of scientific code.
Analyzing coding patterns and practices in fields like biology and chemistry, using the dataset's topical organization.
Pre-training or fine-tuning models for code summarization or documentation on the repository source code.
Studying the intersection of software engineering and scientific research through the interdisciplinary code samples.
Strengths
Contains over 1.1 billion lines of code, providing substantial volume for model training.
Covers 178 distinct domain topics, offering breadth across scientific fields.
Sourced from GitHub repositories, providing real-world, practical code examples.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; data quality must be assessed by manual inspection after download.
Row count is unknown, which makes it harder to judge whether the dataset suits a given task.
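Because the columns are undocumented and the row count is not published, a first pass after download could tally rows and observed value types per field. The following is a minimal sketch using only the Python standard library. It assumes the records have already been loaded as dictionaries. The function name summarize_fields and the sample field names (text, topic, size) are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]], max_examples: int = 2) -> dict:
    """Count rows and, per field, tally value types and keep a few example values."""
    row_count = 0
    type_counts = defaultdict(Counter)  # field name -> Counter of type names
    examples = defaultdict(list)        # field name -> first few observed values
    for record in records:
        row_count += 1
        for name, value in record.items():
            type_counts[name][type(value).__name__] += 1
            if len(examples[name]) < max_examples:
                # Truncate long strings (e.g. file contents) so the summary stays readable.
                examples[name].append(value[:60] if isinstance(value, str) else value)
    return {
        "rows": row_count,
        "fields": {
            name: {"types": dict(type_counts[name]), "examples": examples[name]}
            for name in type_counts
        },
    }


# Hypothetical records: the real field names are not documented.
sample = [
    {"text": "import numpy as np", "topic": "chemistry", "size": 18},
    {"text": "def fold(seq): ...", "topic": "biology", "size": None},
]
summary = summarize_fields(sample)
print(summary["rows"])                     # 2
print(summary["fields"]["size"]["types"])  # {'int': 1, 'NoneType': 1}
```

Since the function accepts any iterable, it can be fed a generator that streams records from disk, which matters at roughly 115 GB. Mixed types in a field (such as int and NoneType above) are a quick signal that the field needs closer manual inspection.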
Provenance
Source
GitHub repositories
Collection Method
Collected and organized from public repositories.
Freshness
Last updated 2026-04-05 16:42:01; freshness should be verified.
License is unknown and should be verified before use.