Referring Multi-Object Tracking

Dongming Wu¹*

Wencheng Han²*

Tiancai Wang³

Xingping Dong⁴

Xiangyu Zhang³,⁵

Jianbing Shen²⁺

¹Beijing Institute of Technology ²SKL-IOTSC, CIS, University of Macau

³MEGVII Technology ⁴School of Computer Science, Wuhan University

⁵Beijing Academy of Artificial Intelligence

*Equal contribution. ⁺Corresponding author.

Abstract

Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to predict an arbitrary number of referent objects in videos. To push RMOT forward, we construct a benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture, TransRMOT, to tackle the new task in an online manner; it achieves impressive detection performance and outperforms other counterparts.

Refer-KITTI

Examples from Refer-KITTI. It provides high-diversity scenes and high-quality annotations referred to by expressions.

Labeling example from our dataset. The turning action is labeled with only two clicks on bounding boxes, one at the starting frame and one at the ending frame. The intermediate frames are then labeled automatically and efficiently with the help of unique identities.
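The two-click labeling described above can be illustrated with a minimal sketch. It assumes that every object already carries a unique identity with per-frame boxes (as in KITTI tracking annotations), so a click pair only needs to record the identity and the start and end frames. The names `ClickPair` and `propagate_labels`, and the layout of `tracks`, are hypothetical and are not taken from the Refer-KITTI tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickPair:
    """Two annotator clicks on one tracked object (hypothetical structure)."""
    track_id: int     # unique identity of the clicked object
    start_frame: int  # frame of the first click (action begins)
    end_frame: int    # frame of the second click (action ends)


def propagate_labels(clicks, tracks):
    """Expand click pairs into per-frame referent boxes.

    `tracks` maps track_id -> {frame_index: (x, y, w, h)}; this layout is assumed.
    Returns {frame_index: [(track_id, box), ...]} for one expression.
    """
    labels = {}
    for click in clicks:
        boxes = tracks[click.track_id]
        for frame in range(click.start_frame, click.end_frame + 1):
            # The object may be absent in some frames (e.g. occlusion); skip those.
            if frame in boxes:
                labels.setdefault(frame, []).append((click.track_id, boxes[frame]))
    return labels


# Toy data: two identities with existing per-frame boxes.
tracks = {
    7: {0: (10, 20, 30, 40), 1: (12, 20, 30, 40), 2: (14, 21, 30, 40), 3: (16, 22, 30, 40)},
    9: {1: (100, 50, 20, 20), 2: (102, 50, 20, 20)},
}
clicks = [
    ClickPair(track_id=7, start_frame=1, end_frame=3),
    ClickPair(track_id=9, start_frame=2, end_frame=2),
]
labels = propagate_labels(clicks, tracks)
```

In this sketch the annotator supplies only the identity and two frame indices per object; every intermediate box is looked up from the existing track rather than drawn by hand.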

TransRMOT

The overall architecture of TransRMOT. It is an online cross-modal tracker with four essential parts: feature extractors, a cross-modal encoder, a decoder, and a referent head. The feature extractors embed the input video and the corresponding language query into feature maps and vectors. The cross-modal encoder models a comprehensive visual-linguistic representation via efficient fusion. The decoder takes the visual-linguistic features, the detect queries, and the track queries as inputs and updates the query representations. The referent head then uses the updated queries to predict the referred objects.
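The data flow through the four parts described above can be sketched as a per-frame loop. This is not the TransRMOT implementation: every component is passed in as a plain callable, the toy stand-ins below replace the neural modules with dictionary lookups, and carrying the updated queries forward as the next frame's track queries is an assumption made here to illustrate online operation. All function names are hypothetical.

```python
def run_online(frames, expression, visual_extractor, text_extractor,
               encoder, decoder, referent_head, detect_queries):
    """Process frames one at a time and return the referred objects per frame."""
    text_feat = text_extractor(expression)       # language query -> text features
    track_queries = []                           # no tracked objects before frame 0
    outputs = []
    for frame in frames:
        visual_feat = visual_extractor(frame)    # frame -> visual features
        fused = encoder(visual_feat, text_feat)  # visual-linguistic fusion
        # The decoder updates fresh detect queries and carried-over track queries.
        updated = decoder(fused, detect_queries + track_queries)
        outputs.append(referent_head(updated))   # keep objects matching the text
        track_queries = updated                  # assumption: reused in next frame
    return outputs


# --- Toy stand-ins (not neural networks) ---------------------------------
def toy_visual(frame):
    return frame                                 # frame: {object_id: color}

def toy_text(expression):
    return set(expression.lower().split())

def toy_encoder(visual_feat, text_feat):
    # Per-object flag: does the object's color appear in the expression?
    return {oid: color in text_feat for oid, color in visual_feat.items()}

def toy_decoder(fused, queries):
    # Track queries are (object_id, flag) pairs; detect queries are None slots.
    tracked = [q[0] for q in queries if q is not None and q[0] in fused]
    free_slots = sum(q is None for q in queries)
    new = [oid for oid in fused if oid not in tracked][:free_slots]
    return [(oid, fused[oid]) for oid in tracked + new]

def toy_referent_head(updated):
    return [oid for oid, is_referred in updated if is_referred]


frames = [
    {1: "red", 2: "white"},
    {1: "red", 2: "white", 3: "red"},
    {2: "white", 3: "red"},
]
outputs = run_online(frames, "the red cars", toy_visual, toy_text,
                     toy_encoder, toy_decoder, toy_referent_head,
                     detect_queries=[None, None, None])
```

With these toy inputs, object 1 is reported while it is visible, object 3 is picked up by a detect query when it enters, and object 2 is tracked but never reported because it does not match the expression.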

Publication

Referring Multi-Object Tracking

Dongming Wu, Wencheng Han, Tiancai Wang, Xingping Dong, Xiangyu Zhang, Jianbing Shen

[Paper] [Code] [Dataset] [Bibtex]

Results

Citation

@InProceedings{wu2023referring,
title={Referring Multi-Object Tracking},
author={Wu, Dongming and Han, Wencheng and Wang, Tiancai and Dong, Xingping and Zhang, Xiangyu and Shen, Jianbing},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14633--14642},
year={2023},
}

Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.