Summary
Large Language Models (LLMs) have revolutionized human-computer interaction by making complex knowledge accessible through natural language, bridging the gap between humans and machines. Current multimodal LLMs (MLLMs), however, still fall short when it comes to real-world awareness. They often lack an understanding of when and where something is happening, which is essential for interacting reliably with the real world. That’s why we’re building REACT: a fine-tuned MLLM designed to understand spatial and temporal context by analyzing video and sensor data. Our goal is to enable machines to understand and follow complex human instructions in dynamic environments.
We’re putting REACT to the test with a real-world scenario: a mobile robot that takes a coffee order directly from a person, operates the coffee machine, and serves the drink, even as conditions around it change. By combining a fine-tuned MLLM with real-time contextual understanding, REACT aims to make human-machine interaction reliable in changing real-world settings.
Goals
Spatiotemporal Contextualization
Fine-tuning MLLMs to interpret spatiotemporal sequences from multimodal inputs.
Real-Time Reasoning
Enabling fast, context-aware decision-making in dynamic environments.
Grounded Instruction Following
Translating natural language into executable steps tied to the environment.
Sensor + Video Fusion
Integrating first- and third-person video with real-world sensor data for richer scene understanding.
Autonomous Task Execution
Demonstrating end-to-end autonomy in a physical human-interaction scenario.
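To make the Sensor + Video Fusion goal more concrete, here is a minimal sketch of one common preprocessing step: pairing each video frame with the sensor reading closest to it in time. This is an illustration under stated assumptions, not REACT's actual pipeline. The names `SensorReading`, `nearest_reading`, and `align`, the scalar sensor value, and the 50 ms tolerance are all hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class SensorReading:
    t: float      # timestamp in seconds (assumed to share a clock with the video)
    value: float  # hypothetical scalar measurement, e.g. a distance


def nearest_reading(readings, t, max_gap=0.05):
    """Return the reading closest in time to t, or None if the closest
    one is more than max_gap seconds away. readings must be sorted by t."""
    times = [r.t for r in readings]
    i = bisect_left(times, t)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = readings[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda r: abs(r.t - t), default=None)
    if best is None or abs(best.t - t) > max_gap:
        return None
    return best


def align(frame_times, readings, max_gap=0.05):
    """Pair each frame timestamp with its nearest sensor reading (or None)."""
    return [(t, nearest_reading(readings, t, max_gap)) for t in frame_times]


if __name__ == "__main__":
    readings = [SensorReading(0.00, 1.0), SensorReading(0.10, 2.0), SensorReading(0.20, 3.0)]
    for frame_t, reading in align([0.02, 0.09, 0.50], readings):
        print(frame_t, reading)
```

Frames with no reading inside the tolerance are paired with `None` rather than with a stale value, so a downstream model can treat missing sensor context explicitly.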