Description
A multi-format, process-centric code dataset for training LLM agents. It was empirically validated on 2026-04-10: fine-tuning a model on version 1.7 with 108,000 training samples for 3 epochs produced a +0.681 delta on the ProcessFlow-Eval benchmark. It was authored by caiovicentino1 and last updated on 2026-04-11.
Use Cases
Fine-tuning LLMs for process-centric code generation based on the dataset's multi-format structure.
Benchmarking LLM agent performance on code-related tasks with the ProcessFlow-Eval benchmark.
Training models to understand and generate code sequences for workflow automation, as suggested by the 'process-centric' description.
Strengths
Empirically validated on 2026-04-10, showing a +0.681 ProcessFlow-Eval delta.
Training involved 108,000 samples over 3 epochs, indicating a substantial training corpus.
Validation passed three gates, with no HumanEval regression and a reported perplexity improvement of -4.62 nats.
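As a rough sanity check on the training scale quoted above, the number of sample presentations is the sample count multiplied by the epoch count. The sketch below computes that figure, plus an optimizer step count for a batch size that is purely an assumption: the listing does not report a batch size, and the function name `training_volume` is hypothetical.

```python
import math


def training_volume(num_samples: int, epochs: int, batch_size: int) -> dict:
    """Return sample presentations and optimizer steps, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return {
        "sample_presentations": num_samples * epochs,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# 108,000 samples and 3 epochs are the figures quoted in the listing.
# batch_size=32 is a hypothetical value chosen only for illustration.
volume = training_volume(num_samples=108_000, epochs=3, batch_size=32)
print(volume)
```

Only `sample_presentations` (324,000) follows directly from the listed figures; the step counts change with whatever batch size was actually used.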
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it harder to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
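Because column documentation and row count are both missing, a first pass after download is usually a quick profile of the records. The following is a minimal sketch using only the standard library, assuming the rows have already been loaded as a list of dictionaries. The helper `profile_records` and the sample column names (`task`, `steps`, `format`) are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Summarize row count, per-column value types, and missing keys."""
    type_counts = defaultdict(Counter)  # column -> Counter of value type names
    present = Counter()                 # column -> number of rows containing it
    for row in records:
        for column, value in row.items():
            present[column] += 1
            type_counts[column][type(value).__name__] += 1
    total = len(records)
    return {
        "num_rows": total,
        "columns": {
            column: {
                "types": dict(type_counts[column]),
                "missing": total - present[column],
            }
            for column in type_counts
        },
    }


# Hypothetical rows for illustration only; the real columns are undocumented.
sample = [
    {"task": "build", "steps": ["lint", "test"], "format": "json"},
    {"task": "deploy", "steps": None, "format": "yaml"},
    {"task": "rollback", "format": "json"},
]

report = profile_records(sample)
print("rows:", report["num_rows"])
for name, info in report["columns"].items():
    print(name, info)
```

Mixed types or a high missing count in a column are a reasonable cue to inspect those rows by hand before using the field for training.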
Provenance
Source
Hugging Face
Freshness
Last updated 2026-04-11 00:45:04.
License is unknown; terms of use must be verified on the dataset page.