I. Introduction
Recent years have witnessed an explosion of video big data, especially surveillance video [1], which is becoming the largest source of big data in the digital universe. In particular, according to a prediction from NVIDIA [2], there will be 1 billion cameras deployed in 2020, producing extremely high-volume data in real time. Such large-scale visual data are of paramount significance for social security, smart cities and intelligent manufacturing. However, it is usually impractical to rely on manpower alone to exploit video big data, especially in safeguarding scenarios, which require real-time monitoring and rapid response. Recent advances in computer vision have substantially improved the performance of visual analysis and enabled its use in practical application scenarios. Under these circumstances, efficient management of large-scale visual data is highly desirable.