Articles and Papers

A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

By combining the Ramblr Data Engine with NVIDIA Cosmos Transfer 2.5, we create a scalable, production-ready, end-to-end video multiplication pipeline that turns raw video into annotation-rich training data.

The Ramblr Data Engine, which combines low-level annotations with natural language instructions, generates training data tailored to domain-specific tasks and minimizes the need for prompt engineering or manual data creation.

Rendering pixel-accurate semantic segmentation masks over arbitrary-length video in real-time web applications presents significant technical challenges in synchronization, performance, and memory management.

Detailed and versatile annotations, structured activity and process understanding, integration with VLMs, and instruction generation together open up new opportunities for real-world robotic workflows and training.

Training our detection architectures on the full Coffee-Making 101 dataset, and on a more specific subset, demonstrates that high-quality annotations result in high-accuracy object detectors.

Ramblr's XGBoost-based activity predictor combines a powerful video encoder with a lightweight boosted-trees classifier, enabling fast, high-performance activity prediction for domain-specific applications.

Ramblr's GLobal OBject Embedder (GLOBE) leverages binary masks to isolate and encode object regions, enabling robust global identification across time and transformations.
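
To make the idea of mask-based isolation concrete, here is a minimal sketch in plain Python. It is not GLOBE's implementation, which this page does not describe. The function names `isolate_object` and `mean_color_embedding` are hypothetical, and the mean-colour "embedding" is only a toy stand-in for a learned encoder. The sketch shows how a binary mask can restrict both the crop and the encoding to the object's pixels, so that the background does not influence the result.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[int]]  # 1 = object pixel, 0 = background


def isolate_object(image: Image, mask: Mask) -> Image:
    """Crop the image to the mask's bounding box and zero out background pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return []  # empty mask: nothing to isolate
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [
        [image[r][c] if mask[r][c] else (0, 0, 0) for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


def mean_color_embedding(image: Image, mask: Mask) -> Tuple[float, ...]:
    """Toy stand-in for a learned encoder: average colour over masked pixels only."""
    pixels = [image[r][c] for r, row in enumerate(mask) for c, m in enumerate(row) if m]
    if not pixels:
        raise ValueError("mask selects no pixels")
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


# Hypothetical 3x3 image: a grey background with a two-pixel object in the middle row.
background = (10, 10, 10)
image = [
    [background, background, background],
    [background, (200, 100, 0), (100, 200, 0)],
    [background, background, background],
]
mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]

crop = isolate_object(image, mask)
embedding = mean_color_embedding(image, mask)
```

Because only masked pixels contribute, changing the background leaves `embedding` unchanged. That is the property this toy example is meant to illustrate.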
