
Activity Prediction with XGBoost
High performance, low-parameter activity prediction for domain-specific applications
Introduction
One of the biggest challenges in the field of video understanding is the detection of activities (“taking cup”, “cutting tomatoes”, “tightening screw”) being performed in a clip, especially in videos with a lot of movement in the field of view and objects changing in the scene, such as egocentric videos.
Existing methodologies rely on powerful architectures for video encoding and decoding, yet their performance is limited and their complexity is very high. This worsens under production-level, domain-specific conditions, where most models will be data-starved but performance must remain competitive.
To alleviate these drawbacks, we present a simple method for quickly improving the performance of video activity prediction in closed-vocabulary settings, which has shown good generalization and ease of use.
Methods
The most popular methods for activity prediction rely on video transformers to extract complex features from groups of frames in a temporal sequence [1, 2, 3], given that single frames generally do not carry enough information to determine which activity is being performed. These features are then used in two main ways:
- As inputs to a multimodal embedder that brings them into a common space with text, from which activities can be decoded with an open vocabulary (i.e., the activity label is extracted directly from the embedding space; see the sketch after this list).
- As inputs to a decoder trained for the particular vocabulary of activities being performed.
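As a minimal, hedged sketch of the open-vocabulary route, the snippet below scores a single clip embedding against text embeddings of candidate labels using cosine similarity; the embeddings themselves are assumed to come from some multimodal encoder and are represented here as plain NumPy arrays.

```python
import numpy as np

def open_vocab_decode(clip_embedding: np.ndarray,
                      label_embeddings: np.ndarray,
                      labels: list) -> str:
    """Return the label whose text embedding is closest to the clip embedding.

    clip_embedding: (D,) clip vector from a multimodal video encoder (assumed given).
    label_embeddings: (M, D) text embeddings of the candidate activity labels.
    labels: list of M activity strings.
    """
    # Normalize both sides so the dot product is a cosine similarity.
    clip = clip_embedding / np.linalg.norm(clip_embedding)
    texts = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    scores = texts @ clip
    return labels[int(np.argmax(scores))]
```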
While an open-vocabulary approach is the ideal, these methods generally trail in performance because of the complexity of the activity space and the quality of the embeddings, even in the domain of daily activities. Closed-vocabulary models, on the other hand, perform better, but are by definition limited by their vocabulary, and expanding it means retraining at least the decoder layers with new, usually scarce, data.
Our proposed methodology aims to increase the efficiency of the process by relying on the excellent performance of large encoders in extracting general features from video, and combining it with a very light, boosted-trees architecture (XGBoost [4]) that requires comparatively little data, compute, and time to train in a closed-vocabulary setting.
In particular, our architecture uses the state-of-the-art InternVideo2 [3] model as a video encoder, which receives video segments of a defined length. The encoder’s attention pooling layer was lightly fine-tuned using an auxiliary linear output layer that was later removed. The encoder then embeds video segments into a lower-dimensional space, and a boosted-trees classifier maps these embeddings to output categories corresponding to the closed, domain-specific vocabulary of choice. Finally, the outputs are post-processed with simple heuristics to enforce temporal consistency.
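A minimal sketch of the classification stage, assuming the clip embeddings have already been extracted with the fine-tuned encoder and saved to disk; the file names and hyperparameters below are illustrative, not the exact configuration used in our experiments.

```python
import numpy as np
from xgboost import XGBClassifier

# Assumed precomputed inputs: one InternVideo2 embedding and one activity label per clip.
# X_train: (N_clips, D) float array, y_train: (N_clips,) int array of class ids in [0, M).
X_train = np.load("train_embeddings.npy")  # hypothetical file names
y_train = np.load("train_labels.npy")

clf = XGBClassifier(
    objective="multi:softprob",  # per-clip probabilities over the M activities
    n_estimators=300,            # illustrative hyperparameters
    max_depth=6,
    learning_rate=0.1,
)
clf.fit(X_train, y_train)

# Per-clip class probabilities for a new video, shape (N_clips, M).
X_test = np.load("test_embeddings.npy")
probs = clf.predict_proba(X_test)
```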
Figure 1: Diagram of the architecture used for activity prediction.
N clips of P frames each from a given video are encoded into N feature vectors by the fine-tuned InternVideo2 encoder. These features are then classified by a light, boosted-trees classifier directly into M activities. Since predictions are generated at the clip level but real activities usually span several clips, we rely on this fact to post-process the signal, reducing noise before outputting the final activities for a given video segment.
Figure 2: Illustration of the activity prediction output.
The X-axis shows the progression in frames through an egocentric video of making coffee with a capsule-based machine. The Y-axis shows the probability of each activity happening. As can be seen in the individual frames displayed (exact frame number in the upper right corner), the predicted activities correctly identify what is happening in the video.
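As a rough illustration of the kind of clip-level post-processing described in Figure 1 (not necessarily the exact heuristic we use), the sketch below median-filters the per-clip probabilities over time and merges consecutive clips with the same predicted label into activity segments.

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_and_segment(probs: np.ndarray, window: int = 5):
    """probs: (N_clips, M) per-clip activity probabilities from the classifier.

    Returns a list of (start_clip, end_clip, class_id) segments.
    """
    # Median-filter each class probability along time to suppress single-clip noise.
    smoothed = median_filter(probs, size=(window, 1))
    labels = smoothed.argmax(axis=1)

    # Merge runs of consecutive clips that share the same label into segments.
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, int(labels[start])))
            start = i
    return segments
```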
Results
The presented methodology shows competitive performance on closed-vocabulary activity prediction in a daily-life, closed-domain setting (“making coffee”).
The evaluation of activity prediction is nuanced, since the start and end of an activity are hard to define in a general way. Hence, we based our evaluation on the F1 score, looking at how well we recover the activities annotated in the ground truth at the segment level. In particular, the F1 score combines precision (the proportion of predicted activities that are correct) with recall (the proportion of ground-truth activities that are recovered), which gives a good single-metric view of the quality of the prediction. Since the score depends on how predicted and ground-truth segments are matched, we present the results at three different matching thresholds, measured using their Intersection over Union (IoU).
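The sketch below shows one way to compute this metric, greedily matching predicted segments to ground-truth segments of the same class when their temporal IoU reaches the threshold; the exact matching protocol behind the numbers below may differ in its details.

```python
def temporal_iou(a, b):
    """a, b: (start, end) intervals in frame or clip indices."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def f1_at_iou(pred, gt, thr):
    """pred, gt: lists of (start, end, class_id) segments. Greedy one-to-one matching."""
    matched, tp = set(), 0
    for p in pred:
        for i, g in enumerate(gt):
            if i in matched or p[2] != g[2]:
                continue
            if temporal_iou(p[:2], g[:2]) >= thr:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```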
Finally, to validate the methodology, we also compared the predictions to ablations where the encoder was not fine-tuned, where a linear layer was used instead of the boosted trees, or both.
The results of the experiments are shown below for all the activity classes in our closed-domain dataset. As can be observed, the best performance is obtained when both of the proposed enhancements are combined.
| Model | F1@IoU=0.1 | F1@IoU=0.3 | F1@IoU=0.5 |
| --- | --- | --- | --- |
| Baseline Linear | 0.609 | 0.609 | 0.505 |
| Baseline XGBoost | 0.750 | 0.733 | 0.630 |
| Finetuned Linear | 0.762 | 0.757 | 0.661 |
| Finetuned XGBoost | 0.858 | 0.855 | 0.804 |
Conclusion
The method presented here is a promising approach for the quick-turnaround training of performant, domain-specific activity predictors. Importantly, this method is compatible with the scarce, proprietary and very valuable data usually available in the industrial domain.
While the methodology is still a closed-vocabulary approach that will lack flexibility in environments that are too dynamic, it performs extremely well in more defined applications, which again fits many industrial settings.
Additionally, the method will continue to improve with the development of better video encoders, and with the possibility of fine-tuning the full encoder with domain-specific data instead of only selected layers. These points make the proposed method an attractive alternative for video activity prediction in production applications.
References
[1] Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[2] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., ... & Qiao, Y. (2023). VideoMAE V2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[3] Wang, Y., Li, K., Li, X., Yu, J., He, Y., Chen, G., ... & Wang, L. (2024). InternVideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision.
[4] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.