GLOBE (GLobal OBject Embedder)
Introduction
In dynamic visual environments, the ability to consistently recognize and track individual objects over time is a fundamental challenge. While modern vision models like DINOv2 have revolutionized image-level understanding, they fall short when it comes to learning region-specific representations that uniquely identify objects based on their spatial extent in a frame.
To address this gap, we introduce GLobal OBject Embedder (GLOBE), a deep learning model designed to learn object-centric fingerprints from RGB video data. Unlike traditional image encoders, GLOBE leverages binary masks to isolate and encode object regions, enabling robust identification across time and transformations.
Below, we give an overview of GLOBE's architecture and training strategy, highlighting how it learns discriminative, temporally consistent embeddings for object regions.
Methods
GLOBE is designed to learn object-centric embeddings from RGB video data using binary masks as spatial priors. The model architecture and training strategy are tailored to ensure that embeddings of the same object instance across time are close in the embedding space, while those of different object instances are well separated.
At the core of GLOBE is a DINOv2 image encoder (Oquab et al., 2023), which extracts high-level visual features from each video frame. These features are passed through a trainable projection layer to adapt them for region-level representation learning. A visual sampler (You et al., 2023) then uses a binary mask to extract object-specific features from the projected feature map, producing a fixed-dimensional embedding vector.
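To make the pipeline concrete, the following is a minimal PyTorch sketch of the forward pass, not the actual implementation: the visual sampler (You et al., 2023) is approximated here by masked average pooling over the projected patch grid, and the backbone interface, feature dimensions, and the choice to freeze the backbone are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobeSketch(nn.Module):
    """Illustrative forward pass: DINOv2 patch features -> projection -> mask pooling."""

    def __init__(self, backbone, feat_dim=768, embed_dim=256):
        super().__init__()
        self.backbone = backbone                     # e.g. a DINOv2 ViT returning patch tokens
        self.proj = nn.Linear(feat_dim, embed_dim)   # trainable projection layer

    def forward(self, frames, masks):
        # frames: (B, 3, H, W) RGB frames; masks: (B, 1, H, W) binary object masks.
        with torch.no_grad():                        # backbone kept frozen (assumption)
            tokens = self.backbone(frames)           # (B, P, feat_dim) patch tokens
        feats = self.proj(tokens)                    # (B, P, embed_dim)

        B, P, D = feats.shape
        h = w = int(P ** 0.5)                        # assumes a square patch grid
        # Downsample each mask to the patch grid, then average features inside the object.
        m = F.interpolate(masks.float(), size=(h, w), mode="nearest").flatten(1)  # (B, P)
        pooled = (feats * m.unsqueeze(-1)).sum(1) / m.sum(1, keepdim=True).clamp(min=1)
        return F.normalize(pooled, dim=-1)           # one unit-norm embedding per frame-mask pair
```

Mask pooling is only a stand-in here: the Ferret-style visual sampler is richer, sampling and aggregating point features within the region rather than simply averaging.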
Training is supervised with ground-truth binary masks that delineate object regions across video frames. Each training step proceeds as follows: a set of objects is randomly sampled from a video; for each object, a subset of frames in which it is visible is selected; and the corresponding frame-mask pairs are processed by GLOBE to produce embeddings. These embeddings are optimized with a custom SimCLR contrastive loss that generalizes the original formulation to support multiple positive samples per object, encouraging embeddings of the same object across different frames to cluster together while pushing apart embeddings of different objects. This strategy enables GLOBE to learn robust, temporally consistent object fingerprints that are invariant to appearance changes, occlusion, and motion.
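The loss is described only at a high level above, so the sketch below follows the common multi-positive generalization of SimCLR (as in supervised contrastive learning): every pair of embeddings sharing an object ID is treated as a positive, and the temperature value is an assumption.

```python
import torch

def multi_positive_nt_xent(embeddings, object_ids, temperature=0.1):
    """SimCLR-style loss with multiple positives per object (illustrative sketch).

    embeddings: (N, D) L2-normalized embeddings from sampled frame-mask pairs.
    object_ids: (N,) integer ID of the object each embedding came from.
    """
    sim = embeddings @ embeddings.T / temperature          # (N, N) scaled cosine similarities
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-similarity

    pos_mask = (object_ids[:, None] == object_ids[None, :]) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average the log-probability over each anchor's positives, then over anchors.
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()                    # ignore anchors with no positive
```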
Figure 1: Overview of GLOBE Architecture.
The model takes as input a sequence of N video frames and corresponding instance binary masks. Each frame is processed by a DINOv2 image encoder to produce feature maps, which are passed through a trainable projection layer. The visual sampler uses the binary masks to extract region-aligned features, producing N object embeddings of dimension D. These embeddings are optimized using a custom SimCLR-based contrastive loss.
Figure 2: Illustration of the Contrastive Learning Objective in GLOBE.
GLOBE leverages ground-truth object tracks, provided as binary masks over video frames, to define contrastive supervision. Each track corresponds to a single object instance observed across time. Positive pairs are formed from embeddings of the same object in different frames, while negative pairs are constructed from embeddings of different objects. The model is trained to produce temporally consistent representations by minimizing the distance between positive pairs and maximizing the distance between negative ones. This encourages the learned embedding space to group together temporally distributed observations of the same object while separating different instances.
Results
To evaluate the quality of the learned object embeddings, we designed a matching and clustering task that simulates real-world object re-identification challenges. The evaluation proceeds as follows:
- Temporal Splitting: For each object in a video, we artificially split its visible timeline into sub-segments, effectively treating different portions of its trajectory as distinct instances.
- Embedding Extraction: For each segment, we sample a set of frame-mask pairs and compute their embeddings.
- Matching: To determine whether two segments belong to the same object, we compare their embeddings and take the strongest similarity between any cross-segment pair. If this exceeds a predefined threshold, the segments are considered a match (see the sketch after this list).
- Clustering Evaluation: The predicted matches are compared to the ground truth object identities to assess clustering performance.
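A minimal sketch of the matching rule, assuming unit-norm embeddings (so dot products equal cosine similarities); the function name and default threshold are illustrative:

```python
import torch

def segments_match(emb_a, emb_b, threshold=0.9):
    """Decide whether two track segments belong to the same object.

    emb_a: (Na, D), emb_b: (Nb, D) unit-norm embeddings sampled from each segment.
    A match is declared if the strongest cross-segment similarity exceeds the threshold.
    """
    sims = emb_a @ emb_b.T          # (Na, Nb) pairwise cosine similarities
    return sims.max().item() > threshold
```

The resulting pairwise match decisions can then be grouped into predicted clusters (e.g. via connected components) and scored against the ground-truth identities.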
Matching similarity threshold | >0.95 | >0.9 | >0.8
DINOv2 | 0.644 ± 0.024 | 0.675 ± 0.020 | 0.694 ± 0.020
GLOBE | 0.773 ± 0.016 | 0.798 ± 0.016 | 0.799 ± 0.017
Table 1: F1 scores (mean ± std) at different matching similarity thresholds.
As a baseline, we use DINOv2 features aggregated by averaging the feature-map values within the object mask. GLOBE embeddings significantly outperform this baseline, achieving a 15.13% relative improvement in F1 score.
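For reference, this baseline aggregation is just masked average pooling of raw DINOv2 patch features, with no trainable projection; a minimal sketch, under the same patch-grid assumption as above:

```python
import torch

def dinov2_baseline_embedding(tokens, mask):
    """Baseline: average raw DINOv2 patch features inside the object mask.

    tokens: (P, D) patch features for one frame; mask: (P,) binary mask already
    resampled to the patch grid (a preprocessing assumption).
    """
    m = mask.float()
    return (tokens * m[:, None]).sum(0) / m.sum().clamp(min=1)
```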
Conclusion
We introduced GLOBE, a novel framework for learning object-centric embeddings from RGB video data using binary masks as spatial priors. By leveraging a DINOv2 backbone and a contrastive learning strategy tailored for object-level consistency, GLOBE learns temporally stable and discriminative representations of individual objects. Our evaluation demonstrates that GLOBE significantly outperforms the baseline in object matching and clustering, achieving a 15.13% relative improvement in F1 score over DINOv2-based features.
These results highlight GLOBE’s potential for advancing video understanding tasks such as object re-identification, tracking, and segmentation. Future work may explore extending GLOBE to multi-view camera streams, integrating motion cues, or adapting it for real-time applications.
References
Oquab, Maxime, et al. "DINOv2: Learning robust visual features without supervision." arXiv preprint arXiv:2304.07193 (2023).
You, Haoxuan, et al. "Ferret: Refer and ground anything anywhere at any granularity." arXiv preprint arXiv:2310.07704 (2023).