I. Introduction
Pedestrian detection that aims to predict a series of bounding boxes enclosing pedestrians for a given image, as a particular branch of general object detection, has been attracting more and more interests in both academia and industry. Driven by the surge of convolutional neural networks (CNN) [1], many CNN-based pedestrian detection approaches have been proposed to boost the performance [2]–[12]. However, these methods all assume that the training and test images have the same distribution, limiting the generalization of the proposed methods. As shown in Figure 1, using the detector inferred on the Caltech dataset [13] obtains a worse detection result on the CityPersons dataset [14].
Samples from the Caltech (a) and CityPersons (b) dataset. Since there exists an obvious visual difference between two datasets, merely applying the detector trained on the Caltech dataset generates many false detection results on the CityPersons, e.g., red and blue bounding boxes represent the positive and negative results.