Ramblr Technologies and Research
Activity Prediction with XGBoost
Ramblr's XGBoost-based activity predictor combines a powerful video encoder with a lightweight boosted trees classifier, enabling fast, high-performance activity prediction for domain-specific applications.
Key Contributions and Technical Innovations
Hybrid Architecture for Activity Prediction
We introduce a simple architecture that couples a state-of-the-art video encoder (InternVideo2) with a lightweight XGBoost classifier. This approach efficiently classifies video features into domain-specific activities, minimizing training time and data requirements.
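A minimal sketch of that pipeline, with a stand-in for the encoder: in practice, features from a (lightly fine-tuned) InternVideo2 would replace the placeholder `encode_clip`, and the synthetic data and hyperparameters below are illustrative only.

```python
# Sketch of the encoder + boosted-trees pipeline (illustrative, not Ramblr's code).
# `encode_clip` fakes a frozen/fine-tuned video encoder so the script runs end to end.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
FEATURE_DIM = 768                      # assumed embedding size
FRAME_DIM = 32 * 32 * 3                # toy flattened-frame size
PROJ = rng.standard_normal((FRAME_DIM, FEATURE_DIM)) / np.sqrt(FRAME_DIM)

def encode_clip(clip: np.ndarray) -> np.ndarray:
    # Placeholder for the real encoder: mean-pool frames, then a fixed projection.
    return clip.mean(axis=0) @ PROJ

# Synthetic data: 200 clips of 8 flattened frames each, 5 activity classes.
clips = [rng.standard_normal((8, FRAME_DIM)) for _ in range(200)]
labels = rng.integers(0, 5, size=200)

X = np.stack([encode_clip(c) for c in clips])
X_train, X_test = X[:160], X[160:]
y_train, y_test = labels[:160], labels[160:]

# Lightweight boosted-trees head on top of the video features.
clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)
print("held-out accuracy:", (clf.predict(X_test) == y_test).mean())
```

Because only the small tree ensemble (and optionally a light encoder fine-tune) is trained, turnaround stays short even with limited labelled clips.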
Efficient Fine-Tuning for Domain-Specific Tasks
Our method enables quick-turnaround training for closed-vocabulary activity prediction. By lightly fine-tuning the encoder and training a simple classifier, we achieve competitive performance in data-starved environments, making it ideal for industrial applications.
Significant Improvement in Activity Prediction
The proposed model achieves an 85.8% F1 score on a closed-domain benchmark, improving over the baseline method by more than 40%, demonstrating the method's effectiveness for training efficient, specialized activity predictors.

GLOBE
Ramblr's GLobal OBject Embedder (GLOBE) leverages binary masks to isolate and encode object regions, enabling robust global identification across time and transformations.
Key Contributions and Technical Innovations
Object-Centric Fingerprints for Deep Video Understanding
We design GLOBE to learn temporally consistent object-centric embeddings by extracting features precisely aligned to object regions using binary masks. This region-guided approach enables the model to capture identity-preserving, stable object representations from video.
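One way to read "features precisely aligned to object regions" is mask-averaged pooling over a dense feature map; the sketch below assumes that mechanism, and its function names and shapes are illustrative rather than GLOBE's actual code.

```python
# Illustrative mask-guided pooling: average a dense feature map over a binary
# object mask to obtain one embedding per object (assumed mechanism).
import torch
import torch.nn.functional as F

def masked_object_embedding(feat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """feat_map: (C, H, W) dense features; mask: (H0, W0) binary object mask."""
    c, h, w = feat_map.shape
    # Resize the mask to the feature-map resolution.
    m = F.interpolate(mask[None, None].float(), size=(h, w), mode="nearest")[0, 0]
    weights = m / m.sum().clamp(min=1.0)           # normalized spatial weights
    emb = (feat_map * weights).sum(dim=(1, 2))     # (C,) mask-averaged embedding
    return F.normalize(emb, dim=0)                 # unit norm for cosine comparisons

# Toy usage: a random 256-channel feature map and a square object mask.
feat = torch.randn(256, 32, 32)
mask = torch.zeros(128, 128)
mask[40:90, 30:80] = 1
obj_emb = masked_object_embedding(feat, mask)
print(obj_emb.shape)  # torch.Size([256])
```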
Generalized Contrastive Learning for Temporal Object Consistency
We extend SimCLR with a multi-positive contrastive loss that clusters multiple views of the same object across time, promoting identity-preserving, temporally robust representations despite occlusions, motion, and appearance changes.
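A minimal sketch of such a multi-positive loss in the SupCon style, where every embedding sharing an object identity is treated as a positive for the others; the exact formulation GLOBE uses may differ.

```python
# Illustrative multi-positive contrastive loss: all views of the same object id
# are positives for each other (SupCon-style); assumed, not GLOBE's exact loss.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(emb: torch.Tensor, obj_ids: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """emb: (N, D) object embeddings; obj_ids: (N,) integer identity labels."""
    z = F.normalize(emb, dim=1)
    sim = z @ z.t() / temperature                      # (N, N) scaled cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))    # drop self-pairs from the softmax
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (obj_ids[:, None] == obj_ids[None, :]) & ~self_mask
    # Average the log-probability over every positive of each anchor that has one.
    per_anchor = -(log_prob.masked_fill(~pos, 0.0)).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return per_anchor[pos.any(dim=1)].mean()

# Toy usage: 8 embeddings covering 3 object identities seen at different times.
emb = torch.randn(8, 128)
ids = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0])
print(multi_positive_contrastive_loss(emb, ids))
```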
Significant Gains in Temporal Object Re-identification
GLOBE achieves a 15.13% improvement in F1 score over the DINOv2 baseline in a realistic object matching and clustering task, demonstrating strong performance for re-identification and tracking applications.
Research Projects
Show2Instruct envisions AI systems that respond to spoken, context-specific queries – such as “Do all the windows in this room meet the requirements of the BIM specifications and accessibility standards?” – with real-time, visually grounded answers from an AI-powered device.
In the context of a construction site, this could mean smart glasses providing real-time answers to complex compliance questions. In collaboration with industry and academic partners, we are developing multimodal AI agents that integrate large language models with computer vision to enable real-time, context-aware reasoning – demonstrating how generative AI can transform decision-making and workflows in the construction industry.
Our focus is on developing a new generation of human-machine interfaces by architecting a scalable platform for context-aware, multimodal instruction-following systems that perform reliably in high-noise, high-variability environments like construction sites.
Key Contributions and Technical Innovations:
Multimodal Fusion
Integration of LLMs with visual scene understanding models to enable grounded, contextual responses (see the fusion sketch after this list).
Real-Time Interaction
Development of low-latency inference pipelines suitable for edge deployment (e.g., AR glasses).
Domain-Specific Reasoning
Fine-tuning models on construction-specific tasks (e.g., BIM compliance checks, safety standard verification).
Human-Centered Interfaces
Design of natural language-driven workflows to reduce cognitive load and streamline decision-making.
Scalable Architecture
Modular system design enabling deployment across diverse sites and tasks.
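As a hedged sketch of the multimodal fusion idea above: detector output is serialized into a prompt alongside the relevant specification excerpt, and an LLM answers the grounded question. The dataclass, helper, and stub `llm` callable are illustrative assumptions, not the project's API.

```python
# Hypothetical fusion sketch: serialize visual scene understanding output into an
# LLM prompt to answer a grounded compliance question. All names are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Detection:
    label: str          # e.g. "window"
    width_m: float      # measured width in metres
    height_m: float     # measured height in metres

def answer_compliance_query(question: str, detections: List[Detection],
                            spec: str, llm: Callable[[str], str]) -> str:
    # Ground the question in the detected scene and the relevant spec excerpt.
    scene = "\n".join(f"- {d.label}: {d.width_m:.2f} m x {d.height_m:.2f} m"
                      for d in detections)
    prompt = (f"Site specification excerpt:\n{spec}\n\n"
              f"Detected objects in view:\n{scene}\n\n"
              f"Question: {question}\nAnswer concisely, citing the objects above.")
    return llm(prompt)

# Toy usage with a stub LLM.
detections = [Detection("window", 0.90, 1.20), Detection("window", 0.75, 1.20)]
spec = "Accessible windows must be at least 0.80 m wide."
print(answer_compliance_query("Do all windows meet the width requirement?",
                              detections, spec, llm=lambda p: f"(stub LLM, {len(p)} chars)"))
```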
MuvAko is a forward-looking research initiative based in Saxony, Germany, and focused on leveraging generative AI and spatial computing to develop multimodal, context-aware assistance systems for virtual and mixed reality.
At its core, MuvAko explores how these cutting-edge technologies enhance the user experience and add measurable value across digital sales processes. In collaboration with our partners, we are addressing key research challenges such as:
- How can spatially-aware AI systems capture and interpret multimodal context in dynamic real-time e-commerce scenarios?
- How can recommendation engines be personalized using real-time user data?
- How can context-aware AI assistants provide meaningful support within immersive shopping environments?
Beyond technical exploration, MuvAko aims to push the boundaries of usability, personalization, and accessibility in AI-driven product presentation and delivery – laying the foundation for the next generation of immersive, intelligent e-commerce platforms.
Key Contributions and Technical Innovations:
Real-Time Multimodal Context Sensing
Fusion of spatial, visual, and behavioral data to guide AI decision-making.
Adaptive Personalization Engines
Dynamic tailoring of recommendations based on user interactions and spatial cues (see the re-ranking sketch after this list).
Intelligent Virtual Assistants
Context-aware agents that assist users throughout immersive shopping journeys.
Enhanced Usability in XR Interfaces
Streamlined interaction models for intuitive engagement in mixed reality.
Scalable Architecture
Modular system design for integration into diverse digital commerce platforms.
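As an illustration of the adaptive personalization described above, the following sketch re-ranks a product list by blending interaction-history affinity with a spatial cue such as gaze dwell time; the field names and weights are assumptions for the sketch, not the MuvAko system.

```python
# Illustrative context-aware re-ranking: blend interaction history with spatial
# cues (e.g. gaze or proximity in the XR scene). All names/weights are assumed.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Product:
    pid: str
    category: str

def rerank(products: List[Product],
           interaction_affinity: Dict[str, float],   # per-category affinity from past behavior
           gaze_dwell_s: Dict[str, float],           # seconds the user looked at each product
           w_affinity: float = 0.6, w_gaze: float = 0.4) -> List[Product]:
    def score(p: Product) -> float:
        return (w_affinity * interaction_affinity.get(p.category, 0.0)
                + w_gaze * gaze_dwell_s.get(p.pid, 0.0))
    return sorted(products, key=score, reverse=True)

# Toy usage: the rug the user keeps looking at moves to the top of the list.
catalog = [Product("sofa-01", "sofa"), Product("lamp-07", "lighting"), Product("rug-03", "rug")]
ranked = rerank(catalog,
                interaction_affinity={"sofa": 0.8, "lighting": 0.3},
                gaze_dwell_s={"rug-03": 2.5, "sofa-01": 0.4})
print([p.pid for p in ranked])
```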
REACT
What if machines could truly understand their surroundings – and respond in real time?
Large Language Models (LLMs) have revolutionized human-computer interaction by making complex knowledge accessible through natural language, bridging the gap between humans and machines. Yet current multimodal large language models (MLLMs) still fall short when it comes to real-world awareness: they often lack an understanding of when and where something is happening, context that is essential for interacting reliably with the real world. That’s why we’re building REACT: a fine-tuned MLLM designed to understand spatial and temporal context by analyzing video and sensor data. Our goal is to enable machines to follow and understand complex human instructions in dynamic environments.
We’re putting REACT to the test with a real-world scenario: a mobile robot that takes a coffee order directly from a person, operates the coffee machine, and serves the drink – even as conditions around it change. By combining cutting-edge AI with real-time contextual understanding, REACT takes human-machine interaction to the next level.
Key Contributions and Technical Innovations
Spatiotemporal Contextualization
Fine-tuning MLLMs to interpret spatiotemporal sequences from multimodal inputs.
Real-Time Reasoning
Enabling fast, context-aware decision-making in dynamic environments.
Grounded Instruction Following
Translating natural language into executable steps tied to the environment.
Sensor + Video Fusion
Integrating first- and third-person video with real-world sensor data for richer scene understanding (see the fusion-and-planning sketch after this list).
Autonomous Task Execution
Demonstrating end-to-end autonomy in a physical human-interaction scenario.
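A hypothetical end-to-end sketch of the fusion-and-planning loop referenced above: timestamped frame descriptions and sensor readings are merged into one prompt for an MLLM, whose reply is parsed into executable steps. The `mllm` callable, the JSON plan format, and the step names are assumptions, not REACT's actual interface.

```python
# Hypothetical sketch: fuse timestamped video-frame descriptions and sensor readings
# into an MLLM prompt, then parse the reply into executable robot steps.
import json
from typing import Callable, Dict, List

def build_context(frame_notes: List[Dict], sensor_log: List[Dict]) -> str:
    # Merge visual and sensor events into a single time-ordered scene timeline.
    events = sorted(frame_notes + sensor_log, key=lambda e: e["t"])
    return "\n".join(f"[t={e['t']:.1f}s] {e['source']}: {e['event']}" for e in events)

def plan_steps(instruction: str, context: str, mllm: Callable[[str], str]) -> List[Dict]:
    prompt = (f"Scene timeline:\n{context}\n\n"
              f"Instruction: {instruction}\n"
              'Reply with a JSON list of steps, e.g. [{"action": "...", "target": "..."}].')
    return json.loads(mllm(prompt))

# Toy usage with a stubbed model reply standing in for the fine-tuned MLLM.
frame_notes = [{"t": 1.2, "source": "ego-camera", "event": "person points at coffee machine"}]
sensor_log = [{"t": 0.8, "source": "water-level", "event": "tank 80% full"}]
stub_reply = ('[{"action": "navigate", "target": "coffee machine"}, '
              '{"action": "press", "target": "espresso button"}]')
steps = plan_steps("Make an espresso for the person who just ordered.",
                   build_context(frame_notes, sensor_log), mllm=lambda p: stub_reply)
for step in steps:
    print(step["action"], "->", step["target"])
```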