Cascaded Multi-Task Learning of Head Segmentation and Density Regression for RGBD Crowd Counting

In this paper we propose a novel regression-based RGBD crowd counting method. Compared with previous RGBD crowd counting methods, which mainly exploit depth cues to facilitate person/head detection, our approach adopts density map regression and is more robust to severe occlusion in densely crowded scenes. We develop a cascaded depth-aware counting network that jointly performs head segmentation and density map regression. Our network explicitly feeds the depth map at each stage so that depth cues are sufficiently exploited. The multi-task strategy allows the network to explicitly attend to foreground regions of a crowd scene and improves density regression. To generate the ground truth of head segmentation and density map, we propose a head scale estimation method based on a basic geometric assumption and the camera projection function. Experiments on two public RGBD crowd counting benchmarks, the ShanghaiTechRGBD dataset and the MICC dataset, show that the proposed method achieves new state-of-the-art results on both. Further, our method can be easily extended to RGB datasets and achieves comparable performance on the WorldExpo'10 and UCF-QNRF datasets.


I. INTRODUCTION
Single image crowd counting aims to estimate the overall person number in a crowded image. It has attracted significant attention in computer vision community during the past years [3]- [6]. Accurately estimating crowd counts of a scene has many applications in real-world scenarios [7]- [9]. For example, the statistics of passenger flow in subway stations are important for scheduling subway trains. The crowd count in a busy street plays an important role in public safety and pedestrian management.
Conventional crowd counting methods [6], [10]-[12] usually estimate crowd counts from RGB images or videos. RGB crowd counting methods can be categorized into detection based methods [13]-[16] and regression based methods [3], [4], [17]-[19]. The former treat each human instance in the crowd as an individual object and exploit an object detection framework to tackle the problem, while the latter usually extract low-level features of the scene and apply regressors to regress overall crowd counts or density maps. Although recent progress [11], [20], [21] shows significant improvement in RGB crowd counting, the problem itself is still challenging due to severe occlusion, perspective distortion and complex scene backgrounds.

During the past years, depth sensors have become increasingly popular, and many works propose to exploit depth information to improve crowd counting [2], [22]-[25]. Most existing RGBD crowd counting methods utilize depth information to facilitate detection. However, detection based crowd counting methods are less robust to dense crowded scenarios with severe occlusion, and usually lead to underestimation when people's heads are small [2]. To tackle this problem, Lian et al. [2] proposed a density map regression guided detection method. They first utilize a regressor to estimate a density map, which is used as a probability prior to facilitate detection. Although their results show that the density map indeed improves detection, the proposed method suffers from several drawbacks. First, the depth cues are not explicitly fed into the regression network, which restricts the performance of density regression and further affects the detection performance. Second, despite the guidance of the density map, detecting crowd instances is still challenging. The results of [2] show that the crowd counting performance of its detection network is worse than that of its simple regression network.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhenbao Liu.

FIGURE 1. RGB images (a) and their corresponding depth maps (c) in ShanghaiTechRGBD dataset. Depth maps are not perfect and contain invalid regions. We generate a bounding box for a head using the depth map if the head has a valid depth value, as suggested by [1], [2]. As shown in (a), many heads are not assigned valid depth values (yellow boxes). These heads are usually located in the far regions of an image and hence are small and dense, so their overall number is large. We enlarge those head regions with null depth values in (b).
In this work, we propose a simple but effective method for regression based RGBD crowd counting. Inspired by pose estimation methods [26], [27], we propose a cascaded network for RGBD crowd counting. To sufficiently exploit depth information, we explicitly feed the depth map into the network multiple times. In addition to density map regression at each stage, we also predict a segmentation mask of crowd heads. The segmentation mask indicates foreground regions that the density regressor should attend to. Finally, we utilize the depth map to generate the ground truth of density and segmentation. In this way, depth information is used both in the input of the network and in the ground truth generation, and hence is sufficiently exploited.
Specifically, we develop a cascaded depth-aware counting network (Cascaded-DCNet) to jointly estimate the segmentation mask and density map. Our network consists of two stages. The first stage takes the image and depth map to estimate an initial segmentation and density map, and the second stage fuses the features, depth map and initial predictions to refine them. By explicitly predicting a foreground mask, our network is able to focus on head regions and estimate density better.
To conduct multi-task learning on head segmentation and density map regression, we need to generate ground truth. An ideal ground truth of head segmentation would label each pixel according to whether it belongs to a head. However, labeling each pixel for head segmentation is labor-intensive and impractical, since many small heads consist of only a few pixels. As the original annotation of the crowd counting task consists of the pixel position of each head's center, we instead estimate the scale of each annotated head and then put a circle-like mask at the head center to generate the head segmentation. In a pinhole camera system, each person's head radius is roughly inversely proportional to its depth [1], [2]. This means that if we can get the depth of each annotated head, we are able to estimate the corresponding scale. However, we notice that this is usually not practical in real-world scenarios since depth maps are not perfect. As shown in Fig. 1, many pixels around heads have invalid depth values. In the ShanghaiTechRGBD dataset, 38.9% of the annotated heads have no valid depth value. To address this issue, we propose a head depth refinement method that adopts the geometric assumption that all the heads of a crowd lie on a 3D plane and leverages the camera projection function to refine the depth values at the annotated head pixels, which are further used to estimate head scales.
We utilize the estimated head scales to generate a segmentation mask of each person's head. The segmentation masks are then fused to generate a union segmentation of heads, which indicates the foreground regions of a crowd image. We also utilize the estimated head scale to generate a scale-aware density map, in which each head's location is convolved with a scale-aware Gaussian kernel. The density map encodes perspective information, which captures the scale variance of heads in the crowd scene.
We evaluate our approach on two public RGBD crowd counting benchmarks, ShanghaiTechRGBD dataset [2] and MICC dataset [22]. The results show that our method achieves new state-of-the-art on both datasets and validate the effectiveness of our proposed method. We further extend our method to RGB datasets. Results on WorldExpo'10 dataset [28] and UCF-QNRF dataset [29] show our method achieves comparable performance. We summarize our contributions as follows:
• We propose a new cascaded depth-aware counting network for regression based RGBD crowd counting. The depth map is explicitly fed into the network multiple times to extract depth information sufficiently.
• We propose a multi-task learning strategy for head segmentation and density map regression. Our network first estimates a head segmentation and then regresses density map based on the estimated segmentation. In this way our network is able to focus on foreground regions and estimate density better.
• We propose a novel ground truth generation method for head segmentation and density map. We first refine the depth values of the annotated heads according to camera projection function and basic geometric assumption, and then utilize the refined head depth map to estimate head scales, which are further used to generate head segmentation and scale-aware density map.
• Our method achieves new state-of-the-art on two public RGBD crowd counting benchmarks.

II. RELATED WORK

A. RGB CROWD COUNTING
Existing RGB crowd counting methods are mainly divided into detection based crowd counting and regression based crowd counting.

1) DETECTION BASED CROWD COUNTING
Detection based methods assume that a crowd is composed of individual objects and treat crowd counting as an object/person detection problem. Early works [13]-[16] design hand-crafted features to perform person detection, but they are not robust to severe occlusion or large scale variation in cluttered environments or densely crowded scenes. Although recent deep network based object detectors [30], [31] show impressive performance on object detection, they still perform worse than regression based methods on crowd counting [2].

2) REGRESSION BASED CROWD COUNTING
Starting from the pre-deep-learning era, regression based methods [17]-[19], [32]-[34] usually first segment foreground regions and extract various low-level features, and then utilize a regression model, such as ridge regression [18] or Gaussian process regression (GPR) [17], to estimate the crowd count. In the deep learning era, the crowd counting problem is formulated as a density map regression problem [11], [21], [35]-[39]. Zhang et al. [28] proposed a patch based crowd counting method using a CNN. Zhang et al. [4] first proposed a multi-column CNN (MCNN), in which different columns tackle heads of different sizes. Sam et al. [3] improved MCNN and proposed a switchable module that classifies the crowd density of each patch and assigns it to the corresponding regressor. Sindagi and Patel [12] proposed a top-down and bottom-up multi-level fusion mechanism to fuse features for crowd counting. CSRNet [21] stacks dilated convolutions after VGGNet [40]. Yan et al. [41] proposed a novel convolution operator that is based on an estimated perspective map.

Our method follows density map regression methods. Previous density regression based works [12], [21], [39], [41] usually first extract image/patch features using a backbone network (e.g. VGG16 [40]), and then perform density regression. Our model has a similar structure. However, the input of our network has two sources: the RGB image and the depth map. The cascaded architecture and multi-task strategy also make our method different from most regression based methods [12], [21], [36], [41].

B. RGBD CROWD COUNTING
To better estimate crowd counts, several works have explored the RGBD crowd counting. Most of these works focus on exploiting depth information to improve person/head detection of a crowd scene. Bondi et al. [22] utilized the depth information to estimate a crowd segment and further localize head candidates. However, the system is not end to end. Song et al. [42] proposed a detection proposal network for depth image based on Faster RCNN [30]. Zhang et al. [23] proposed an unsupervised method to estimate locations of heads based on depth image with vertical view. However, the method assumes the head regions are always closest to the camera compared with other body parts, and hence cannot be generalized to general crowd scenarios. Fu et al. [24] proposed to detect head-shoulder jointly based on template matching to improve robustness. However, in a dense crowd scenario, a person's shoulder is usually occluded. Xu et al. [25] utilized depth map to segment the image into two regions: a far-view region and a near-view region. A density regression module is used to tackle far-view crowd counting and an object detection module is used to tackle near-view crowd counting. However, depth information is not explicitly used for each region's estimation.
Lian et al. [2] proposed a density map guided detection network for joint crowd head detection and density map regression. However, the performance of their detection module does not surpass that of the regression module. Meanwhile, the depth map is not explicitly fed into the regression module and hence is not sufficiently used. In contrast, our model leverages a cascaded architecture and explicitly fuses the depth map twice. We further adopt a multi-task learning strategy on head segmentation and density regression, both of which are supervised by depth guided ground truths. By exploiting depth information sufficiently, our method is more robust to large scale variation and heavy occlusion in dense crowd scenarios.

FIGURE 2. Overview of our proposed cascaded depth-aware counting network. The first stage takes the RGB image and depth map to generate an initial segmentation probability and density map, and the second stage combines the depth map, estimated predictions and their features to refine the estimations. We feed the depth map at each stage to sufficiently exploit depth cues. The multi-task learning on head segmentation and density map regression allows the network to attend to foreground regions of heads and improve density regression. Each convolution has a kernel size of 3 × 3; a dilated convolution has dilation rate = 2. Each max pooling has a kernel size of 2 × 2 with stride = 2. Density maps and segmentation masks are at 1/8 of the original resolution due to the pooling operations.

III. OUR METHOD

A. FORMULATION
Given an image $I \in \mathbb{R}^{H \times W \times 3}$ with $N$ heads annotated at $x = \{x_1, \dots, x_N\}$, where $x_i \in \mathbb{R}^2$ denotes the pixel location of the $i$-th head, and a depth map $D \in \mathbb{R}^{H \times W}$, we aim to design a network $F$ that performs the mapping $\{I, D\} \xrightarrow{F} N$. As directly predicting $N$ is highly non-linear, following prior regression based methods [3], [4], we first predict a density map $d \in \mathbb{R}^{h \times w}$ indicating the person density at each pixel and then integrate it over the image/RoI, where $h$ and $w$ are the height and width reduced by the downsampling operations. In addition to the density map, we also leverage our network to estimate a head segmentation which indicates the foreground mask of a crowd image. The overall problem formulation becomes:

$$\{d, s\} = F(I, D; \Theta) \quad (1)$$

where $s \in [0, 1]^{h \times w}$ denotes the estimated head segmentation probability, $F$ is our network and $\Theta$ denotes the parameters. In the following subsections we first describe our network architecture, and then introduce the ground truth generation of the density map and segmentation mask. Finally we describe our loss function. In the following equations, '+' denotes element-wise addition, '*' denotes convolution, and '·' denotes scalar product.

Fig. 2 shows an overview of our proposed cascaded network, which consists of two stages. The first stage takes the original image $I$ and its corresponding depth map $D$ to generate an initial segmentation probability and density map; the second stage combines the depth map, the estimated predictions and their features to refine the estimations. Both stages take the depth cue as input and exploit depth information through learned convolution filters. Below we describe each stage in detail:

1) FIRST STAGE
We utilize three convolution layers and two max-pooling layers to extract image features from the RGB image, and utilize another two convolution layers to extract depth features from the depth map. The image features and depth features are fused by a concatenation operation. Such a two-stream strategy allows the network to extract image cues and depth cues independently in the shallow layers; it thus avoids the conflict caused by the domain gap between the depth distribution and the RGB distribution, and can be more efficient. Then we use several convolution operations and a pooling operation to generate the initial predicting feature $\phi_0$, which is used to predict a segmentation probability $s_0$:

$$s_0 = g_0(\phi_0) \quad (2)$$

where $g_0$ indicates a convolution operation. The segmentation probability indicates the pixels of head regions, to which a density map regressor should attend. Hence it can be used as an attention to enhance features. We thus fuse $s_0$ with $\phi_0$ to generate the initial density map $d_0$: a second convolution layer embeds $s_0$ into a feature space with the same dimension as $\phi_0$, and $f_0$, a simple two-layer convolutional network, regresses $d_0$ from the fused feature.

2) SECOND STAGE
The second stage refines the results of the first stage. We first concatenate $s_0$, $d_0$ and $D_{1/8}$, where $D_{1/8}$ indicates the depth map at 1/8 of the image resolution, use two convolution layers to embed the concatenated predictions, and then add the initial predicting feature $\phi_0$ of the first stage to generate the refinement feature:

$$\phi_{\mathrm{ref}} = f_{\mathrm{ref}}(s_0 \oplus d_0 \oplus D_{1/8}) + \phi_0 \quad (4)$$

where $f_{\mathrm{ref}}$ indicates the two embedding convolution layers, $\oplus$ indicates the concatenation operation, and $\phi_{\mathrm{ref}}$ is the refinement feature. The refinement feature is fed through three dilated convolution layers to extract the second predicting feature $\phi_1$. The dilation is used to enlarge the receptive field. Similar to the first stage, $\phi_1$ is first used to generate a segmentation mask:

$$s_1 = g_1(\phi_1) \quad (5)$$

where $g_1$ indicates a convolution operation. Then we fuse $s_1$ and $\phi_1$ to generate the second density map $d_1$, using a convolution layer that embeds $s_1$ into a feature space and a two-layer convolutional network $f_1$.
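To illustrate why dilation enlarges the receptive field, the following sketch computes the receptive field of a stack of stride-1 convolutions. It is only an illustration: the helper name `receptive_field` is hypothetical, and the layer settings (three 3 × 3 layers with dilation rate 2) are taken from the network overview rather than from any released code.

```python
def receptive_field(layers):
    """Receptive field (in feature-map pixels) of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs, applied in order.
    """
    rf = 1
    for kernel_size, dilation in layers:
        # A dilated kernel spans dilation * (kernel_size - 1) + 1 pixels,
        # so each stride-1 layer widens the field by dilation * (kernel_size - 1).
        rf += dilation * (kernel_size - 1)
    return rf


# Three ordinary 3x3 convolutions versus three 3x3 convolutions with dilation 2.
plain = receptive_field([(3, 1)] * 3)
dilated = receptive_field([(3, 2)] * 3)
```

Under these assumptions, three dilated layers cover 13 × 13 feature-map positions, against 7 × 7 for three ordinary 3 × 3 layers, with the same number of weights.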

C. GROUND TRUTH GENERATION
To facilitate multi-task learning using our cascaded counting network, we need to generate ground truth for the density map and head segmentation. As the annotation of the crowd counting task only consists of the location of each head's center, we first need to estimate the scale of each head. For segmentation mask generation, we label a pixel as foreground if its distance to a head annotation is smaller than that head's radius. For density map generation, we encode the head scale into the density map. Below we first describe the head scale estimation method, and then introduce the segmentation generation and density map generation.

1) HEAD SCALE ESTIMATION
As suggested in [1] and [2], for an object with fixed physical size (e.g. a head), its image size is inversely proportional to the depth of the object, by similar triangles. The ratio is determined by the focal length of the camera, and we assume it is fixed across the images of a given RGBD dataset. Hence, if we know the depth of an annotated head, we know its scale. However, although a depth map is provided, it does not always have valid/accurate values across all image pixels. For example, in the ShanghaiTechRGBD dataset, the depth map is generated by stereo matching, which is not very robust to simple textures such as heads/hair. Furthermore, its depth map has a valid range of 0 to 20 meters, which does not cover the common crowd area in an image of an outdoor scene. This motivates us to find a way to estimate/refine the depth of heads without valid/accurate values.

Assume there is a set of heads located at $X = \{X_1, \dots, X_N\}$, where $X_i \in \mathbb{R}^3$ indicates the physical 3D coordinate of the $i$-th head in the camera coordinate system. Since a crowd scene contains many people, we can simply assume that each person has the same height and that the heads lie on a plane. We denote the height of the camera above the head plane as $H \in \mathbb{R}$ and the unit normal vector of the plane as $n \in \mathbb{R}^3$; then we have:

$$n^T X_i = H \quad (7)$$

For a standard pinhole camera, we have the projection function:

$$D(x_i)\, x_i = K X_i \quad (8)$$

where $D \in \mathbb{R}^{H \times W}$ represents the depth map of the entire image, $x_i \in \mathbb{R}^3$ is the projected pixel position on the image in homogeneous representation, $K \in \mathbb{R}^{3 \times 3}$ is the intrinsic parameter matrix of the camera, and $D(x_i) \in \mathbb{R}$ is the normalization term indicating the depth of $X_i$. From Eq. 8 we have $X_i = K^{-1} D(x_i)\, x_i$, and thus from Eq. 7 we have

$$\frac{1}{H}\, n^T K^{-1} x_i = \frac{1}{D(x_i)} \quad (9)$$

We denote $W = \frac{1}{H} n^T K^{-1} \in \mathbb{R}^{1 \times 3}$, which is a fixed vector across the image representing the relation between $x_i$ and $D(x_i)$.

We can use those $\{x_i\}$ with valid depth values to estimate $W$, and use the inferred $W$ to estimate the depth of those $\{x_i\}$ without valid depth. For a head located at $x_i$ with valid depth, we denote $q_i = D(x_i) \cdot x_i \in \mathbb{R}^3$. Then for $N$ heads with valid depth, we have

$$W q_i = 1, \quad i = 1, \dots, N \quad (10)$$

We denote $Q = [q_1, \dots, q_N] \in \mathbb{R}^{3 \times N}$ and $E = [1, \dots, 1] \in \mathbb{R}^{1 \times N}$; then we can find the best $W$ that approximates Eq. 10:

$$W^{*} = \arg\min_{W} \lVert W Q - E \rVert_2^2 \quad (11)$$

which has the closed form solution $W = E Q^T (Q Q^T)^{-1}$. As $N$ is sufficiently large for a crowd, $Q Q^T$ is generally invertible. We solve for $W$ in this way, and for any pixel $x$ in the image we have $D(x) = 1/(W x)$, denoting the depth value if $x$ is the center of a head, and the head radius

$$r(x) = \alpha W x \quad (12)$$

where $\alpha$ is the head radius parameter. Fig. 3 shows two examples of our estimated head scales.
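The head depth refinement described above (fit W from heads with valid depth so that W q_i ≈ 1, then predict depth and radius at every annotated head) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_head_plane` and `refine_depth_and_radius` are hypothetical, pixel coordinates are assumed to be given as (u, v), and the least-squares problem is solved with `np.linalg.lstsq` rather than an explicit matrix inverse.

```python
import numpy as np


def fit_head_plane(pixels, depths):
    """Least-squares estimate of W (shape (3,)) with W @ (D(x_i) * x_i) ~= 1.

    pixels: (N, 2) head centers (u, v) that have a valid depth value.
    depths: (N,) depth values at those centers.
    """
    pixels = np.asarray(pixels, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Homogeneous pixel coordinates x_i = (u, v, 1).
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    # Columns of Q are q_i = D(x_i) * x_i, so Q has shape 3 x N.
    Q = (depths[:, None] * homog).T
    E = np.ones(Q.shape[1])
    # Solve Q^T W^T = E^T in the least-squares sense.
    W, *_ = np.linalg.lstsq(Q.T, E, rcond=None)
    return W


def refine_depth_and_radius(W, pixels, alpha=5.0):
    """Depth D(x) = 1 / (W x) and head radius r(x) = alpha * W x at each pixel."""
    pixels = np.asarray(pixels, dtype=float)
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    inv_depth = homog @ W
    return 1.0 / inv_depth, alpha * inv_depth
```

In use, `fit_head_plane` would be called only on annotated heads with valid depth, and `refine_depth_and_radius` on all annotated heads, including those whose raw depth is missing. At least three heads that are not collinear in the image are needed for the fit to be well posed.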

2) HEAD SEGMENTATION
Having estimated the head scale for each annotated head, we can utilize it to generate a head segmentation mask by masking a circle around each head center. The segmentation mask is not perfect compared with a human-annotated segmentation, but it provides the foreground regions that a network should focus on. We use a uniform kernel $u_r(x)$, whose pixels equal 1 inside a circle of radius $r$ and 0 elsewhere:

$$u_r(x) = \begin{cases} 1, & \lVert x \rVert_2 \le r \\ 0, & \text{otherwise} \end{cases} \quad (13)$$

Then our head segmentation mask is the union of the per-head circles:

$$m(x) = \max_{i = 1, \dots, N} u_{r_i}(x - x_i) \quad (14)$$

where $r_i = \alpha W x_i$ indicates the head radius of the $i$-th head and $u_{r_i}(x)$ is the scale-aware uniform kernel. Examples of our head segmentation mask are shown in Fig. 3.
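As a rough illustration of the mask generation described above, the sketch below marks every pixel that falls inside at least one head circle. The function name `head_segmentation_mask` and the (column, row) ordering of the centers are assumptions made for this example.

```python
import numpy as np


def head_segmentation_mask(shape, centers, radii):
    """Binary mask equal to 1 inside any head circle and 0 elsewhere.

    shape: (height, width) of the mask.
    centers: iterable of (col, row) head centers.
    radii: iterable of head radii, one per center.
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.uint8)
    for (cx, cy), r in zip(centers, radii):
        # Pixels whose distance to the head center is at most r.
        inside = (cols - cx) ** 2 + (rows - cy) ** 2 <= r ** 2
        # Overlapping circles simply stay at 1, which gives the union.
        mask[inside] = 1
    return mask
```

Since the network predicts at 1/8 of the input resolution, the centers and radii would presumably be scaled by 1/8 before calling such a function; that scaling is left out here.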

3) DENSITY MAP WITH HEAD SCALE ENCODING
For an image with $N$ heads annotated at $x = \{x_1, \dots, x_N\}$ where $x_i \in \mathbb{R}^2$, we may first convolve a Gaussian kernel at each head to generate the density map:

$$d(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma}(x) \quad (15)$$

where $G_{\sigma}$ is a Gaussian kernel with fixed standard deviation $\sigma$. Eq. 15 is the most commonly used density map generation for existing regression based methods. However, this density map generation method assumes that each person/head is individual on the image, and does not consider the scale variance of heads caused by perspective distortion. A better density map may encode the head scales in the Gaussian kernel, so that a network can more easily capture the head regions from the RGBD image and align them to the density map without scale normalization. For a head at $x_i$, we have estimated its head scale $r_i = \alpha W x_i$ by head scale estimation. Thus we may encode the head scale into the density map, and Eq. 15 changes to:

$$d(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x) \quad (16)$$

where $\sigma_i = \beta \cdot r_i = \beta \cdot \alpha W x_i$.
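The scale-aware density map can be sketched as follows. This is an illustrative version only: the function name `scale_aware_density_map` is hypothetical, and normalizing each Gaussian over the image so that every head contributes exactly one count is a common convention assumed here, not something stated in the text.

```python
import numpy as np


def scale_aware_density_map(shape, centers, radii, beta=0.25):
    """Sum of per-head Gaussians whose standard deviation is beta * radius.

    shape: (height, width); centers: iterable of (col, row); radii: per-head radius.
    """
    h, w = shape
    rows, cols = np.mgrid[0:h, 0:w]
    density = np.zeros((h, w), dtype=float)
    for (cx, cy), r in zip(centers, radii):
        sigma = beta * r
        g = np.exp(-((cols - cx) ** 2 + (rows - cy) ** 2) / (2.0 * sigma ** 2))
        total = g.sum()
        if total > 0:
            # Assumed convention: each head integrates to one over the image.
            density += g / total
    return density
```

With this normalization the sum of the map equals the number of heads, so larger heads produce wider and flatter peaks while the total count is preserved. A kernel that underflows to zero everywhere (a tiny sigma with an off-grid center) is skipped by the guard, which a production version would need to handle differently.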

D. LOSS FUNCTION
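Consider a set of $M$ training samples $\{(I_k, D_k, d_k, m_k)\}$, where $I_k$ is the RGB image, $D_k$ is the depth map, $d_k$ is our generated scale-aware density map and $m_k$ is the segmentation mask. For each sample, our network estimates two density maps $d_0$, $d_1$ and two segmentation probabilities $s_0$, $s_1$. We utilize the Euclidean loss for density map estimation, i.e. the squared Euclidean distance between each estimated density map ($d_0$ and $d_1$) and the ground truth $d_k$, accumulated over the training samples. For segmentation, we use the binary cross entropy (BCE) loss averaged over pixels. We denote

$$\mathrm{BCE}(a, b) = -\frac{1}{hw} \sum_{p} \left[ b_p \log a_p + (1 - b_p) \log (1 - a_p) \right] \quad (18)$$

as the standard BCE loss for input $a$ and target $b$ with spatial resolution $h \times w$, where $p$ indicates the pixel location. Our segmentation loss applies Eq. 18 to both $s_0$ and $s_1$ against the mask $m_k$. The overall loss function is given by:

$$L = L_{den} + \mu L_{seg} \quad (20)$$

where $L_{den}$ and $L_{seg}$ are the density and segmentation losses described above, and $\mu$ is the weight of the segmentation loss, which balances the gradients of segmentation and density map estimation.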
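For concreteness, the per-sample loss might be computed as in the sketch below. Several details are assumptions rather than facts from the text: the helper names, the absence of any normalization constant on the Euclidean term, the equal weighting of the two stages, and the clipping constant `eps` used to keep the logarithms finite.

```python
import numpy as np


def bce(a, b, eps=1e-7):
    """Pixel-averaged binary cross entropy between probabilities a and targets b."""
    a = np.clip(a, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(b * np.log(a) + (1.0 - b) * np.log(1.0 - a)))


def euclidean(pred, target):
    """Squared Euclidean distance between two density maps."""
    return float(np.sum((pred - target) ** 2))


def sample_loss(d0, d1, s0, s1, d_gt, m_gt, mu=5e-4):
    """Density loss on both stages plus mu times the segmentation loss on both stages."""
    density_term = euclidean(d0, d_gt) + euclidean(d1, d_gt)
    seg_term = bce(s0, m_gt) + bce(s1, m_gt)
    return density_term + mu * seg_term
```

Supervising both stages means the first stage receives a direct training signal instead of relying only on gradients passed back through the refinement stage.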

IV. EXPERIMENTS
In this section, we perform experiments to evaluate our proposed method. We first describe the evaluation datasets and evaluation method. We then report the quantitative comparison on two RGBD benchmarks. We also perform ablation studies to validate the effectiveness of our proposed components or strategies. We then report results on RGB datasets. We finally show some qualitative results to demonstrate the efficacy of our framework.

B. EVALUATION METHOD
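Following prior work on crowd counting [4], [12], we use Mean Absolute Error (MAE) and Mean Squared Error (MSE) for evaluation:

$$\mathrm{MAE} = \frac{1}{M} \sum_{i=1}^{M} \lvert N_i - \hat{N}_i \rvert \quad (21)$$

$$\mathrm{MSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (N_i - \hat{N}_i)^2} \quad (22)$$

where $N_i$ represents the ground truth head count of the $i$-th test image, $\hat{N}_i$ is the estimated head count generated by integrating the estimated density map $d$, and $M$ is the number of testing images.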
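A minimal sketch of the evaluation, assuming the convention used in the crowd counting literature in which the reported "MSE" is the root of the mean squared count error. The function names and the optional `roi` mask argument are hypothetical.

```python
import numpy as np


def count_from_density(density, roi=None):
    """Estimated head count: sum of the density map, optionally inside an RoI mask."""
    if roi is not None:
        density = density * roi
    return float(np.sum(density))


def mae_mse(gt_counts, pred_counts):
    """Mean absolute error and root of the mean squared error over test images."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    err = pred - gt
    mae = float(np.mean(np.abs(err)))
    mse = float(np.sqrt(np.mean(err ** 2)))
    return mae, mse
```

For example, ground truth counts of 10 and 20 with predictions of 13 and 16 give errors of 3 and -4, hence an MAE of 3.5 and an MSE of about 3.54.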

C. IMPLEMENTATION DETAILS
We set the Gaussian parameter β = 0.25, the head radius parameter α = 5, and the loss weight µ = 5 × 10^−4. For the ShanghaiTechRGBD dataset, we first resize the images and depth maps to 1280 × 720 to reduce computational complexity. Depth maps are normalized to the range 0 to 255 before being fed into the network. All the segmentation masks and density maps are generated at 1/8 of the original image resolution. Each image is randomly flipped for data augmentation. During training, we use the Adam optimizer [43]. We set the batch size to 4 and the initial learning rate to 2 × 10^−4. We drop the learning rate to 2 × 10^−5 at epoch 50 and to 2 × 10^−6 at epoch 100, and stop training at epoch 150.
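The step schedule above can be written as a small function. This is a sketch only: the function name is hypothetical, and counting epochs from zero (so that the drops take effect at epochs 50 and 100) is an assumption, since the text does not say how epochs are indexed.

```python
def learning_rate(epoch, base_lr=2e-4):
    """Step schedule: base_lr, then 10x smaller from epoch 50, 100x smaller from epoch 100."""
    if epoch < 50:
        return base_lr
    if epoch < 100:
        return base_lr * 0.1
    return base_lr * 0.01


# Output resolution for 1280 x 720 inputs predicted at 1/8 scale: (rows, cols).
output_size = (720 // 8, 1280 // 8)
```

With 1280 × 720 inputs, the 1/8-scale density maps and segmentation masks are 160 × 90 pixels.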

D. RESULTS ON ShanghaiTechRGBD DATASET
We first evaluate our method on the ShanghaiTechRGBD dataset, and the results are shown in Tab. 1. Our final model achieves an MAE of 4.26 and an MSE of 6.27, which is a significant improvement over the current state-of-the-art method [2]. RDNet [2] is a joint detection and regression network whose regression module utilizes a VGG backbone that requires pre-training on ImageNet. In contrast, our proposed network does not need pre-training. We also report a result of CSRNet that only utilizes the depth map as input instead of the RGB image; its performance is not satisfactory. This is because the depth maps are very noisy and many head regions have no valid depth values. Hence, utilizing only depth maps as input is not applicable to such cluttered outdoor scenarios. Our method combines both depth information and the RGB image. It is simple but effective for density map regression because it utilizes depth information sufficiently.

E. EXPERIMENTS ON MICC DATASET
On the MICC dataset, the depth map is generated by Kinect. As the dataset contains indoor scenes, its depth range is much smaller than that of outdoor scenes (ShanghaiTechRGBD). However, we notice that the output depth map of the Kinect sensor still has invalid regions, as shown in Fig. 4, so the head depth refinement process is still required. As the head count of an image in the MICC dataset is much smaller than in ShanghaiTechRGBD, we only estimate the head plane parameters for those images with at least 5 heads annotated. For the rest

F. ABLATION STUDY
We perform ablation studies on the ShanghaiTechRGBD dataset to evaluate the effectiveness of our proposed architecture and strategies. We first perform an ablation study on the cascaded depth-aware architecture to validate the cascaded strategy. We then perform an ablation study on our proposed scale-aware density map generation. We finally perform an ablation study on the joint segmentation and density map regression task.

1) CASCADED DEPTH-AWARE ARCHITECTURE
We perform ablation study for the cascaded architecture, and the results are shown in Tab. 3. Note that all the experiments are performed using fixed gaussian kernel for density map generation. Comparing the first row and second row, third row and fourth row, we can observe that using depth decreases the MAE by 0.33 and 0.19 respectively, demonstrating that our depth fusion mechanism is helpful for crowd counting.
Comparing the first row and third row, second row and fourth row, we can notice that cascaded architecture improves the performance a lot.

2) DENSITY MAP GENERATION
We perform ablation study on different density map generation methods using the Cascaded-DCNet architecture, and the results are shown in Tab. 4. We first compare the density map with fixed gaussian kernel and depth-adaptive density map as proposed in [2] which utilizes raw depth map to estimate head scales(for pixels with invalid depth values, we pad the head scales using nearest neighbor, as in [2]). We can see that using depth-adaptive density map performs better than fixed kernel. We then utilize our proposed head depth refinement method to refine head depths and further estimate head scales, the proposed scale-aware density map generation further boosts the performance by a MAE of 0.13 and a MSE of 0.20.

TABLE 4. Ablation study on different density map generation methods. 'Depth-adaptive' denotes the adaptive kernel using raw depth; 'Scale-aware' denotes the adaptive kernel using head depth refinement.

3) MULTI-TASK LEARNING ON SEGMENTATION AND DENSITY REGRESSION
We perform ablation study on multi-task learning of segmentation and density regression, and show the results in Tab. 5. We use the Cascaded-DCNet architecture, and the density map is generated by proposed scale-aware kernel.
We can see that supervising on segmentation improves the performance by 0.27 MAE and 0.50 MSE, demonstrating that segmentation helps the network to better localize the foreground regions of heads.

4) COMPARISON OF MODEL PARAMETERS
We also perform an ablation study on model parameters to see how model complexity affects the final performance on the ShanghaiTechRGBD dataset. The results are shown in Tab. 6. We vary the parameters in two directions: the feature dimension and the number of layers. We increase the feature dimension from 256 to 512 and 1024, and observe that the performance improves, but the improvement becomes smaller. In the meantime, the number of model parameters increases significantly, from 5.03M to 21.14M and further to 84.52M. The diminishing gain arises because the model tends to overfit as the number of parameters increases. For the number of layers, we add 1, 2, and 4 layers. We observe that the performance does not consistently improve as the number of layers grows. We believe this is because the model overfits easily when it has more layers.
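The roughly fourfold growth in parameters at each doubling of the feature dimension is what one would expect from convolution layers, whose weight count scales with the product of input and output channels. The sketch below checks this for a single 3 × 3 convolution. The helper `conv_params` is hypothetical, and the real network mixes layers of different widths, so the totals reported above are not reproduced exactly.

```python
def conv_params(c_in, c_out, kernel_size=3, bias=True):
    """Number of learnable parameters in one 2D convolution layer."""
    weights = kernel_size * kernel_size * c_in * c_out
    return weights + (c_out if bias else 0)


# One 3x3 layer whose input and output widths equal the feature dimension.
params_256 = conv_params(256, 256)
params_512 = conv_params(512, 512)
params_1024 = conv_params(1024, 1024)
```

Each doubling of the width multiplies the per-layer count by about four, which is in line with the reported growth from 5.03M to 21.14M and then to 84.52M.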

G. EXPERIMENTS ON RGB CROWD COUNTING DATASETS
In this paper, we propose a cascaded depth-aware network for joint head segmentation and density map regression. However, our proposed multi-task learning strategy and cascaded architecture can be easily extended to RGB crowd counting datasets by removing the depth input and the depth-aware ground truth generation. By estimating a segmentation mask, our network is able to attend to foreground regions and facilitate density regression. We evaluate our method on the WorldExpo'10 dataset [28] and the UCF-QNRF dataset [29].

1) RESULTS ON WorldExpo'10 DATASET
We compare the results on the WorldExpo'10 dataset in Tab. 7. Following previous works [4], [28], we utilize the perspective maps provided by the WorldExpo'10 dataset to generate the ground truth density map and head segmentation mask. During testing, we only evaluate the crowd counts within the given Region of Interest (RoI). We can see that our method outperforms other methods in two scenes and achieves average performance comparable with the current state-of-the-art DSSINet [11], which utilizes multi-scale images as inputs and is based on conditional random fields (CRF). Our method is based on a single-scale image and its structure is simple.

2) RESULTS ON UCF-QNRF DATASET
We report the performance on the UCF-QNRF dataset in Tab. 8. As the resolutions of the images vary significantly, we randomly sample 224 × 224 patches to generate training data. Since the UCF-QNRF dataset does not provide depth maps, we utilize k-nearest neighbors to estimate the head scales, which are further utilized to generate the ground truth segmentation and density map. We notice that this dataset is much bigger than the ShanghaiTechRGBD, MICC and WorldExpo'10 datasets. Hence we replace the backbone of the first stage (i.e. the feature extractor of the initial predicting feature $\phi_0$) with the first ten layers of VGG16 and utilize the pre-trained parameters to initialize our model, as many state-of-the-art methods ([11], [53]) utilize VGG16 to extract features. Our method achieves performance comparable with DSSINet, and outperforms other methods. It is worth noting that we only utilize coarse head scales estimated by k-nearest neighbors due to the lack of depth maps. We can expect the performance to improve further if depth maps are used as network input and for generating more precise head scales.
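One plausible reading of the k-nearest-neighbor scale estimate is sketched below: each head's scale is taken as a fraction of its mean distance to the k nearest annotated heads. The values `k=3` and `factor=0.3`, and the function name, are assumptions made for illustration; the text does not give the exact rule that was used.

```python
import numpy as np


def knn_head_scales(centers, k=3, factor=0.3):
    """Per-head scale from the mean distance to the k nearest annotated heads.

    centers: (N, 2) array of head positions. Returns an (N,) array of scales,
    filled with NaN when fewer than two heads are annotated.
    """
    pts = np.asarray(centers, dtype=float).reshape(-1, 2)
    n = len(pts)
    if n < 2:
        return np.full(n, np.nan)  # no neighbor to measure against
    k = min(k, n - 1)
    # Dense pairwise distance matrix: fine for a sketch, costly for very dense images.
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each head's distance to itself
    nearest = np.sort(dist, axis=1)[:, :k]
    return factor * nearest.mean(axis=1)
```

Because the dense distance matrix grows quadratically with the number of heads, a spatial index such as a k-d tree would be a more practical choice for images with thousands of annotations.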

H. QUALITATIVE RESULTS
We show qualitative results of ShanghaiTechRGBD dataset and MICC dataset in Fig. 5. We can observe that segmentation predictions are quite reasonable. By multi-task learning on segmentation and density regression, our cascaded model is robust to heavy occlusion, large scale variance and variance of crowd counts.

V. CONCLUSION
In this paper, we propose a novel cascaded depth-aware counting network for regression based RGBD crowd counting. The proposed network explicitly feeds the depth map at each stage, exploiting depth cues sufficiently. We design a multi-task strategy that jointly estimates head segmentation and density map. Estimating head segmentation allows the network to focus on foreground regions of heads and improves density regression. To generate the ground truth of head segmentation and density map, we first estimate the head scales. As depth maps in existing RGBD datasets usually have invalid/inaccurate regions, we propose a head depth refinement approach to estimate/refine the depth at head locations. The refined head depth map is used to estimate head scales, and further to generate the segmentation mask and density map. Experiments show that our proposed cascaded network outperforms the single-stage network, and that the depth cue indeed helps density map regression. We also encode head scales into the density map, and the results show an improvement. The multi-task learning results show that predicting segmentation helps the network attend to foreground regions and improves performance. Our method achieves new state-of-the-art results on the ShanghaiTechRGBD and MICC datasets. We further extend our method to RGB datasets, where it achieves comparable performance on the WorldExpo'10 and UCF-QNRF datasets.