This dataset contains approximately 3.12 terabytes of preprocessed data aggregated from multiple embodied AI sources, including Open X-Embodiment and the UMI Community. Created by OpenEAI, it is formatted for Visual-Language-Action (VLA) model pretraining and was last updated on February 25, 2026.
Use Cases
- Pretraining VLA models on multimodal data aggregated from embodied AI sources.
- Benchmarking embodied AI algorithms on a unified, large-scale dataset.
- Developing cross-domain robotic skills using data from multiple community sources.
- Reducing storage costs during model training by using pre-compressed images.
Strengths
- Large scale: approximately 3.12 terabytes of data.
- Data is preprocessed into a common format compatible with a specific dataset loader.
- Images have been compressed to reduce storage costs.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, which makes suitability harder to assess before download.
- Description metadata is limited; actual data quality requires manual inspection after download.
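Because the schema and row count are undocumented, a first inspection pass after download might simply inventory what is on disk. The following is a minimal sketch under stated assumptions, not the dataset's official tooling: it assumes the data has been downloaded to a local directory, and the helper name `summarize_tree` is hypothetical. It makes no assumption about the file formats inside the dataset; it only groups files by extension.

```python
from collections import Counter
from pathlib import Path


def summarize_tree(root):
    """Count files and total bytes per file extension under `root`.

    `summarize_tree` is a hypothetical helper for illustration; it is not
    part of any loader shipped with the dataset.
    """
    counts = Counter()  # number of files per extension
    sizes = Counter()   # total bytes per extension
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Normalize case so ".JPG" and ".jpg" are grouped together.
            suffix = path.suffix.lower() or "<none>"
            counts[suffix] += 1
            sizes[suffix] += path.stat().st_size
    return counts, sizes


# Example usage (the path is an assumption; replace it with the real download location):
# counts, sizes = summarize_tree("/data/my_local_copy")
# for suffix, n in counts.most_common():
#     print(f"{suffix}\t{n} files\t{sizes[suffix] / 1e9:.3f} GB")
```

The per-extension totals give a rough picture of which file types dominate the download and where to start when inferring field semantics.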
Provenance
- Source: Aggregated from multiple embodied AI sources, including Open X-Embodiment and the UMI Community.
- Collection Method: Source data preprocessed and unified into a common format.
- Time Range: Not specified.
- Freshness: Last updated 2026-02-25 02:12:11; freshness should be verified.
- Geography: Not specified.
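One way to check freshness is to compute how long ago the recorded last-updated timestamp was. This is a small sketch, assuming the timestamp is in UTC (the card does not state a time zone) and follows the `YYYY-MM-DD HH:MM:SS` layout; the function name `age_in_days` is hypothetical.

```python
from datetime import datetime, timezone


def age_in_days(last_updated, now=None):
    """Return whole days elapsed since `last_updated`.

    Assumes `last_updated` is a UTC timestamp formatted as
    'YYYY-MM-DD HH:MM:SS'; the time zone is an assumption, since the
    dataset card does not specify one.
    """
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    updated = updated.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - updated).days


# Example usage with the timestamp recorded on this card:
# print(age_in_days("2026-02-25 02:12:11"))
```

Comparing the result against the hosting page's current modification date shows whether the card's timestamp is still accurate.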