Part Relational Mean Model for Group Re-Identification

Most current research on pedestrian re-identification (ReID) focuses on single-person ReID. However, people are rarely alone and often walk together in groups, so there is a pressing need to study group ReID (G-ReID). G-ReID is challenging because the appearance of a group varies with changes in its layout and membership. In this paper, we propose a part-based minus-average relational and arithmetic mean descriptor (PRM) algorithm to obtain a robust representation of groups. Based on local features, we design the arithmetic mean descriptor and the minus-average relational descriptor to handle changes in the number of group members and in their relative positions within the group. Moreover, the minus-average relational descriptor also describes the differences in appearance among group members. Considering the rarity of G-ReID datasets and the need to improve the applicability of G-ReID algorithms in real scenarios, we have collected a new dataset called the Bus Rapid Transit (BRT) G-ReID dataset. Extensive experimental results demonstrate the effectiveness of the PRM algorithm: in terms of the rank-1 cumulative matching characteristic (CMC-1), it outperforms state-of-the-art algorithms by 7.5% on the i-LIDS MCTS group dataset and by 19.4% on the Road Group dataset, and it outperforms the baseline by 2.4% on the BRT dataset.


I. INTRODUCTION
Person re-identification (ReID) has attracted considerable attention due to its wide range of applications, such as in security and surveillance. Existing research has focused on re-identifying individuals; however, searching for a certain group of persons has rarely been studied. It is usual for a group of people to walk along a street together. As illustrated in Fig. 1, the same group was captured by cameras at different Bus Rapid Transit (BRT) stations in the city center. In this paper, the research objects of the pedestrian ReID task are defined as a group, such as a couple travelling together, students walking together after school, colleagues who have the same work schedule and parents walking with children after school.
The associate editor coordinating the review of this manuscript and approving it for publication was Hongwei Du.
Unlike individual ReID, the aim of group re-identification (G-ReID) is to associate a certain group with different camera views. In addition to the traditional challenges of ReID, such as low resolutions, pose changes, illumination variations and blurred vision, G-ReID poses some unique challenges [1]. Changes in the number of group members and in the relative positions of members within the group can cause differences in the appearance of the group, which is not conducive to group image matching. Therefore, G-ReID is a more challenging task because of the deformable characteristics of groups.
To solve G-ReID, early researchers used traditional manual design methods [2]-[7] to extract group features. Most existing methods view the input group image as an entire unit and extract global or semiglobal features. However, it is not suitable to treat the group as a whole and extract its global or semiglobal features because the changes in the relative positions of group members can alter the visual content of the group. Because of the powerful ability of convolutional neural networks to describe local features, recent studies have shown excellent performance of some deep learning-based methods in visual recognition tasks [8], [9]. Inspired by these works, a large number of deep learning techniques have been applied to single-pedestrian ReID tasks [10], [11]. Nevertheless, few works have utilized deep learning methods for G-ReID. Deep learning can obtain a good feature summary by gradually summarizing shallow features into deep features. Shallow features represent the local details of objects and deep features express high-level semantic information. In addition, people usually distinguish pedestrians by local features. Therefore, considering the strong recognition of local features and the summary ability of deep features, we use local deep features to describe the outward appearance of a group.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. Challenges of G-ReID. In addition to traditional challenges, such as low resolutions, pose changes, illumination variations and blurred vision, G-ReID has some unique challenges. Changes in the number of group members and in the relative positions of members within the group can cause deformable characteristics of the group appearance. In the figure, there are two groups. One group consists of two women and the other group consists of three men. The group image taken at a Friendly Mall site shows that a woman in red is to the left of another woman in black. The image of the same group captured by another camera at the site of the Academy of Sciences shows that the woman in red is on the right side of the woman in black.
Since the appearance of a group is affected by changes in the number of group members and in the relative positions of the members within the group, the goal is to construct descriptors that are not affected by these deformable characteristics. The part-based minus-average relational and arithmetic mean descriptor (PRM) algorithm is designed for these challenges of G-ReID. Based on local features, we have designed the arithmetic mean descriptor and the minus-average relational descriptor to solve the G-ReID problem caused by changes in the number of group members and their relative positions within the group. Moreover, the minus-average relational descriptor can also be used to describe the differences in the appearance of group members. Additionally, we input the features obtained by the minus-average relational descriptor and the features obtained by the arithmetic mean descriptor into the cross-entropy loss function. We then apply the gradient descent algorithm to optimize the objective function and obtain 12 classifiers to describe group features. Together, these steps form the PRM algorithm for group feature extraction.
The main contributions of this paper include the following: 1) We have proposed a PRM algorithm for G-ReID. 2) Considering the rarity of G-ReID datasets and the need to improve the applicability of the G-ReID algorithm in real-life scenarios, we have collected a new dataset denoted as the BRT G-ReID dataset. Our extensive experimental results demonstrate the effectiveness of the PRM algorithm: in terms of the rank-1 cumulative matching characteristic (CMC-1), it outperforms state-of-the-art algorithms by 7.5% on the i-LIDS MCTS group dataset and by 19.4% on the Road Group dataset, and it outperforms the baseline by 2.4% on the BRT dataset.

II. RELATED WORKS
A. GROUP SEMANTICS
1) MULTIDISCIPLINARY USE OF GROUP SEMANTICS
The research on group semantics involves multidisciplinary fields. Different disciplines that study group semantics have different research perspectives. The studies on group semantics in social humanities [5], [12]-[17] aim to use existing group semantic algorithms to analyse social phenomena and provide technical support for social services. The research on group semantics in the field of computer vision aims to innovate the group semantic algorithm model. To improve the accuracy and efficiency of the subtask, these group semantic models are applied to computer vision subdivision tasks, such as target detection [18], [19], ReID [20], target tracking [21]-[26] and G-ReID [2]-[7].

2) G-ReID USING GROUP SEMANTICS
We summarize the existing G-ReID works as follows. Zheng et al. [4] proposed the center rectangular ring ratio-occurrence descriptor (CRRRO) and block-based ratio-occurrence descriptor (BRO). Cai et al. [5] proposed a covariance descriptor for the appearance matching of group images. The covariance descriptor is a discriminative descriptor that captures both the appearance and statistical properties of image regions. Zhu et al. [6] formulated G-ReID as a patch matching task and proposed to learn an ensemble of "salience channels" that are robust to illumination variations and that can filter out unreliable and noninformative patch matches. Lisanti et al. [3] proposed a novel encoding scheme based on dictionary learning to perform G-ReID. To circumvent the poor detection performance caused by occlusions, Koperski et al. [7] used fixed regions of interest and employed codebook-based visual representations. In terms of the extraction of group features, early researchers used traditional manual design methods. In addition, most existing methods view the input group image as an entire unit and extract global or semiglobal features.

B. LOCAL FEATURE REPRESENTATION
Local features can consider the geometric properties of data [27], [28]. In addition, local features constitute global features, thus, local features can effectively represent the intrinsic structural relationship. Specifically, there is a spatial relationship between local features. The attribute of consistency of spatial information contributes to feature expression [29]. Therefore, we want to apply local feature information to group appearance modeling. Local features include traditional local features and deep local features.

1) ReID USING LOCAL FEATURES
Pedestrian ReID has been performed with traditional local features. Gray and Hai [30] presented an algorithm for performing viewpoint-invariant pedestrian ReID by using the ensemble of localized features (ELF) representation. Bak et al. [20] proposed a new appearance model based on spatial covariance regions extracted from human body parts. The new spatial pyramid scheme was applied to capture the relationships between human body parts to obtain a discriminative human signature. In [20], Farenzena et al. proposed features that model three complementary aspects of the human appearance. Farenzena et al. [31] computed a simple vector of attributes that consists of the pixel coordinates for each pixel of an image. These local descriptors are then turned into Fisher vectors that represent the group image. Ma et al. [32] used a local Fisher discriminant analysis algorithm to achieve pedestrian ReID. Pedagadi et al. [33] proposed an effective feature representation called local maximal occurrence (LOMO) and a subspace and metric learning algorithm called cross-view quadratic discriminant analysis (XQDA). Liao et al. [34] proposed a decision function for verification that can be viewed as a joint model of a distance metric and a locally adaptive thresholding rule. These works use hand-designed descriptors to represent pedestrians. However, hand-designed descriptors are subject to human factors and are less adaptive than learned features.
The research on ReID that uses deep local features includes the following. Chen et al. [35] proposed a polynomial feature map to describe the matching within each subregion and injected all the feature maps into a unified framework. Yao et al. [36] proposed a deep representation learning procedure named the part loss network (PL-Net) to minimize both the empirical classification risk and the representation learning risk. Sun et al. [37] proposed a uniform partition strategy, namely, a part-based convolutional baseline (PCB), that achieves competitive results with state-of-the-art algorithms, which validates it as a strong convolutional baseline for person retrieval. Suh et al. [38] proposed a two-stream network and a bilinear-pooling layer. Each local feature of the part-aligned map is obtained by a bilinear mapping of the corresponding local appearance and body part descriptors. Sun et al. [39] proposed a visibility-aware part model (VPM) that learns to perceive the visibility of regions through self-supervision. The visibility awareness allows the VPM to extract region-level features and compare two images with a focus on their shared regions. These works use deep learning to adaptively learn the local feature weight matrix and perform well on the ReID task.
2) G-ReID USING LOCAL FEATURES
G-ReID works have also involved local region-based descriptors. Zheng et al. [2] proposed a center rectangular ring ratio-occurrence descriptor and a block-based ratio-occurrence descriptor. Lisanti et al. [3] divided the entire group image into uniform small blocks, extracted features based on the small blocks and represented the group image in the form of block sets. However, such methods extract background information at different scales, which adds interference when expressing group appearance. Moreover, treating the group as a whole and extracting its global or semiglobal features may not yield good performance because changes in the relative positions of group members can alter the visual content of the group.
Current evaluations show that the performance of traditional local feature operators is far inferior to the performance of deep feature operators. The main reason is that deep learning has good feature summary ability. Deep learning can gradually summarize shallow features into deep features. Shallow features represent the local details of objects and deep features express high-level semantic information. Considering the above factors, we intend to detect and clip the members of a group to avoid interference from background information. Finally, considering the stability of human body structure information and the advantages of deep learning, we use the PCB network [40] to implement local feature localization and local feature extraction.

C. RESEARCH ON TARGET RELATIONALITY
In real-life scenarios, we often observe changes in the relative positions of group members within a group. We define the changes in the structure of group members as the relationality. Deng et al. [41] developed a new model that allows the encoding of flexible relations between labels. This model introduced hierarchy and exclusion (HEX) graphs and a new formalism. Ding et al. [42] proposed the HEX model to allow for soft or probabilistic relations between labels. Inspired by recent advances in the relational representation learning of knowledge bases and convolutional object detection networks, Zhang et al. [43] proposed a visual translation embedding network (VTransE) for visual relation detection. Chen et al. [44] learned a novel similarity function that consists of multiple subsimilarity measurements, each of which is in charge of a subregion. Li et al. [45] designed a multiscale context-aware network (MSCAN) to learn the powerful features of the full body and body parts and to capture local context knowledge well by stacking multiscale convolutions in each layer. Chen et al. [46] proposed an algorithm that not only improves the learning of global visual features via a supervision of the overall description but also enforces semantic consistencies between the local visual and linguistic features, which is achieved by building global and local image-language associations. Fei et al. [47] proposed a new saliency learning algorithm based on a three-stream convolutional neural network (CNN) that is first presented to learn the distinctive features of the upper body, lower body and global body. Hu et al. [48] proposed an object relation module that processes a set of objects simultaneously through an interaction between their appearance feature and geometry, which allows their relations to be modeled. Huang et al. [49], [50] represented the coupling relations between every two group members according to the differences in their personal features to efficiently signify the co-occurrence of the two members with only their discrepancies.
The above studies have achieved target classification and ReID by enhancing the relationality among the targets. Inspired by these previous works and the efficient expression ability of deep convolution, we first detect the members of a group, clip a single pedestrian according to the detection frame and extract the local features of the single pedestrian. According to the local features of the group members, the minus-average relational and arithmetic mean descriptors are constructed to strengthen the relationality among group members to solve the problem of G-ReID.

III. APPROACH
The proposed framework of this work is shown in Fig. 3 and Fig. 4. The framework consists of a group feature classifier training stage and a group feature matching stage [51]. In the training stage, based on local feature location and extraction, we obtain six arithmetic mean features $h_j^M$ by using the arithmetic mean descriptor and six minus-average relational features $h_j^R$ by using the minus-average relational descriptor. After dimension reduction, we input these feature vectors into the softmax multiobjective classification function in Eq. (3) and Eq. (4) and use the gradient descent algorithm to optimize the cross-entropy loss function in Eq. (5) and Eq. (6) to obtain the weight matrix $W$. The training stage thus yields the 12 feature classifiers of the PRM algorithm. In the matching stage, we extract the feature $h_p^G$ of the probe image and the features $h_g^G$ of the gallery images via the PRM model and calculate the distances between the probe feature and the gallery features to re-identify the group ID of the probe image. The notations used throughout the paper are listed in TABLE 1.

A. DETECTION OF G-ReID
The G-ReID image of the existing database contains multiple pedestrians, and a single pedestrian has no detection frame-labeling information. Therefore, before the G-ReID can work, we must detect the pedestrians and the quality of the pedestrian detection directly affects the accuracy of the G-ReID. The workflow of G-ReID is to first detect the members of the group, crop a single pedestrian picture based on the detection bounding box and design the group features based on the features of the single pedestrian. Considering the efficiency and accuracy of the detection algorithm, we use SSD [52] to detect a single pedestrian in a group. To improve the accuracy of pedestrian detection, we pretrain the SSD model on the INRIA [53] pedestrian datasets.

B. GROUP FEATURE EXPRESSION
Group feature expression is an important stage of the G-ReID task. Good group feature expression can effectively improve the accuracy of G-ReID. G-ReID must solve the problem of the group appearance changes caused by the layout of group members and changes in the number of group members.
To solve these problems, we have proposed the PRM algorithm. The PRM algorithm uses the PCB network structure [40] to implement local feature localization and the extraction of group members. Subsequently, to solve the unique problem of G-ReID, we use an arithmetic mean descriptor and minus-average relational descriptor to work with the local features.

1) LOCAL FEATURE LOCATION AND EXTRACTION
In daily life, people usually distinguish pedestrians by local features, such as differences in hairstyles, facial contours and body weight. Therefore, considering the highly discriminative nature of local features, we use local feature descriptors to describe group images. PCB is a baseline for the person retrieval task [40]. p is a parameter in the baseline algorithm, and its specific meaning is the number of horizontally divided blocks of a detected pedestrian. When p = 6, the pedestrian retrieval task of the baseline algorithm has the highest accuracy. Therefore, parameter p is also set to 6 in our research method. Considering the robust performance of the PCB framework, we use the PCB network structure to realize the local feature location and extraction of a single pedestrian in a group. The network structure of PCB is shown in Fig. 2.
FIGURE 2. PCB feature extraction network structure. The size of a single pedestrian picture is unified into an image with an aspect ratio of 3 : 1 and size of 384 × 128. The tensor T is formed after the ResNet50 convolutional network and T is divided horizontally into 6 bands. The size of the tensor T is 24 × 8 × 3. Then, by using traditional average pooling, T is changed to a column vector g of 1 × 1 × 2,048 dimensions and g is transformed into a 1 × 1 × 256 column vector h by the 1 × 1 convolution kernel. The detected single pedestrian is finally expressed as 6 feature column vectors h.
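As an illustration of the uniform partition, the following minimal Python sketch average-pools a feature map into p horizontal stripes. It is not the PCB implementation: the function name `part_pool`, the nested-list layout and the toy 24 × 8 × 4 tensor are assumptions made for illustration, and the subsequent 1 × 1 convolution that maps each pooled vector g to h is omitted.

```python
def part_pool(feature_map, p=6):
    """Average-pool an H x W x C feature map into p horizontal stripes.

    feature_map is a nested list indexed as [row][col][channel].
    Returns p vectors of length C, one per stripe (illustrative layout).
    """
    height = len(feature_map)
    if height % p != 0:
        raise ValueError("height must be divisible by p")
    rows_per_part = height // p
    channels = len(feature_map[0][0])
    parts = []
    for j in range(p):
        # Rows that belong to the j-th horizontal stripe.
        stripe = feature_map[j * rows_per_part:(j + 1) * rows_per_part]
        cells = [cell for row in stripe for cell in row]
        # Mean over all spatial positions of the stripe, per channel.
        parts.append([sum(cell[c] for cell in cells) / len(cells)
                      for c in range(channels)])
    return parts


# Toy tensor (assumed sizes): 24 rows x 8 columns x 4 channels,
# where every value equals its row index.
toy = [[[float(r)] * 4 for _ in range(8)] for r in range(24)]
g = part_pool(toy, p=6)
print(len(g), len(g[0]))  # 6 4
print(g[0][0], g[5][0])   # 1.5 21.5
```

With p = 6 and 24 rows, each stripe covers 4 rows, so the first stripe averages rows 0 to 3 and the last averages rows 20 to 23.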

2) ARITHMETIC MEAN DESCRIPTOR
The changes in the relative positions of group members lead to variations in the appearance of the same group. Our algorithm aims to design a consistent appearance representation of the same group image regardless of the changes in group member positions. In addition, a change in the number of group members leads to a change in the group image appearance features, which is not conducive to later group image feature matching. Therefore, we must find the descriptor of the group image to solve these problems. The arithmetic mean descriptor proposed in this paper can solve the G-ReID problem caused by changes in the number and relative positions of group members, as shown in Eq. (1).
Based on the location and extraction of local features, we must find an effective method of expressing group appearance features combined with local features. The number of members in a group is variable. If the local feature vectors of the members in a group are simply concatenated, then the dimensions of feature expression in different groups will be different, which is not conducive to feature matching between different groups. If we use the arithmetic mean operator to calculate group features, then the feature dimensions of each group image are all the same. The feature dimensions of different group images are not affected by the number of members in the group, which is conducive to later-stage cosine similarity calculations between group images. Additionally, the group image represented by the arithmetic mean descriptor is robust to the differences in the appearance of the group image caused by changes in the relative positions of group members. Moreover, the dimension of the group appearance feature $h_j^M$ obtained by the arithmetic mean descriptor is fixed and is not related to the change in the number of pedestrians, thereby avoiding the curse of dimensionality caused by the increase in the number of members in the group.
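The two properties discussed above, a fixed output dimension and invariance to member order, can be checked with a small sketch. Eq. (1) is not reproduced in this text, so the code assumes the natural reading of the arithmetic mean descriptor, $h_j^M = \frac{1}{N}\sum_{i=1}^{N} h_{i,j}$, where $h_{i,j}$ (a notation assumed here) is the feature of the j-th part of the i-th of N members. The function name and the toy values are illustrative.

```python
def arithmetic_mean_descriptor(members):
    """Per-part mean over group members (assumed form of Eq. (1)).

    members[i][j] is the feature vector of part j of member i.
    Returns one mean vector per part, independent of the member count.
    """
    n = len(members)
    num_parts = len(members[0])
    dim = len(members[0][0])
    return [[sum(m[j][d] for m in members) / n for d in range(dim)]
            for j in range(num_parts)]


# Toy members with 2 parts and 2-dimensional part features (assumed values).
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[3.0, 6.0], [5.0, 8.0]]
c = [[5.0, 1.0], [1.0, 0.0]]

two = arithmetic_mean_descriptor([a, b])
swapped = arithmetic_mean_descriptor([b, a])
three = arithmetic_mean_descriptor([a, b, c])

print(two)                              # [[2.0, 4.0], [4.0, 6.0]]
print(two == swapped)                   # True: member order does not matter
print(len(three) == len(two),
      len(three[0]) == len(two[0]))     # True True: dimension is fixed
```

Simple concatenation of the member features would instead grow with the number of members and change whenever two members swap positions.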

3) MINUS-AVERAGE RELATIONAL DESCRIPTOR
We use the minus-average relational descriptor $h_j^R$ to describe the differences in the appearance of group members, as shown in Eq. (2). The descriptor subtracts the corresponding local visual features of the group members and calculates the mean. Therefore, the minus-average relational descriptor can better describe the appearance differences of the members in a group and reflect the relationship between group members. In addition, the minus-average relational descriptor can also handle the variation in the group appearance caused by changes in the number and relative positions of the group members. The mechanism is the same as that of the arithmetic mean descriptor because it uses the same mean operation.
Accordingly, the minus-average relational descriptor not only has the advantages of the arithmetic mean descriptor but also can describe the appearance differences of group members, which contribute to the correct judgement of the G-ReID system.
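A possible reading of the minus-average relational descriptor is sketched below. Eq. (2) is not reproduced in this text, so the exact form is an assumption: the code takes the element-wise absolute difference between the corresponding part features of every unordered pair of members and averages over the pairs. The absolute value, the use of unordered pairs and the zero vector returned for a single-member group are all choices made for this sketch, not details confirmed by the paper.

```python
from itertools import combinations


def minus_average_descriptor(members):
    """Mean absolute pairwise difference per part (assumed form of Eq. (2)).

    members[i][j] is the feature vector of part j of member i.
    """
    num_parts = len(members[0])
    dim = len(members[0][0])
    pairs = list(combinations(range(len(members)), 2))
    if not pairs:
        # Assumption: a single-member group has no pairwise differences.
        return [[0.0] * dim for _ in range(num_parts)]
    return [[sum(abs(members[s][j][d] - members[t][j][d]) for s, t in pairs)
             / len(pairs) for d in range(dim)]
            for j in range(num_parts)]


# Toy members with 2 parts and 2-dimensional part features (assumed values).
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[3.0, 6.0], [5.0, 8.0]]
c = [[5.0, 1.0], [1.0, 0.0]]

rel_ab = minus_average_descriptor([a, b])
rel_abc = minus_average_descriptor([a, b, c])
rel_cab = minus_average_descriptor([c, a, b])

print(rel_ab)              # [[2.0, 4.0], [2.0, 4.0]]
print(rel_abc == rel_cab)  # True: member order does not matter
```

Under this assumption the descriptor keeps the fixed dimension and order invariance of the mean operation while encoding how much the members differ from one another.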

4) PRM G-ReID ALGORITHM
The PRM algorithm is an integration of the technical strategies described above. Based on local feature location and extraction, we obtain six arithmetic mean features $h_j^M$ through the arithmetic mean descriptor and six minus-average relational features $h_j^R$ through the minus-average relational descriptor. We reduce the dimension by using the fully connected layer and obtain the 1 × 200 dimensional vectors $\hat{h}_j^R$ and $\hat{h}_j^M$. These feature vectors are subsequently input into the softmax multiobjective classification function shown in Eq. (3) and Eq. (4). Next, we input the probability value into the cross-entropy loss function in Eq. (5) and Eq. (6) and use the gradient descent algorithm to optimize the cross-entropy loss function and calculate the weight matrices $W^R$ and $W^M$. This completes the PRM algorithm. In the group feature expression step, we train 6 arithmetic mean classifiers and 6 minus-average relational classifiers to classify group images. In this way, the PRM algorithm not only solves the problem of the group appearance changes caused by the changes in the number and relative positions of group members but also describes the relations among the group members. Fig. 3 illustrates the architecture of the PRM algorithm.
FIGURE 3. Architecture of the PRM algorithm. After entering a group of images, we use the SSD detection algorithm pretrained on the INRIA dataset to detect the pedestrians. Subsequently, we crop the detected pedestrians, set the image size to 384 × 128 and set the image ratio to 3 : 1. We place the detected single pedestrian into the PCB network structure to achieve local feature localization and feature extraction. Specifically, the detected single pedestrian is divided evenly into six blocks and the local feature information for these six blocks is extracted to form six column vectors h with a size of 1 × 1 × 256. We perform the arithmetic mean and minus-average relational operations on h to obtain the arithmetic mean descriptor $h^M$ and the minus-average relational descriptor $h^R$. We reduce the dimension using the fully connected layer and obtain the 1 × 200 dimensional vectors $\hat{h}^M$ and $\hat{h}^R$. Then we input these features into the softmax multiobjective classification functions to obtain 200 probability values. Next, we input these probability values into the cross-entropy loss function and use the gradient descent algorithm to optimize and obtain the weight matrix $W^M$ of the arithmetic mean descriptor and the weight matrix $W^R$ of the minus-average relational descriptor. Finally, the entire process of the PRM algorithm has been completed.
where $\hat{h}_j^R$ and $\hat{h}_j^M$ are obtained from $h_j^R$ and $h_j^M$, respectively, by the dimension-reducing fully connected layer. $P(\hat{y} = k \mid \hat{h}_j^R)$ is the probability that the minus-average relational feature $\hat{h}_j^R$ of the j-th part of a group image belongs to the k-th category. $P(\hat{y} = k \mid \hat{h}_j^M)$ is the probability that the arithmetic mean feature $\hat{h}_j^M$ of the j-th part of a group image belongs to the k-th category.
$x$ is a single group image and $\chi$ is the set of all training sample images. $L_j^R(x)$ is the loss of the j-th part of one sample that corresponds to the minus-average relational descriptor. $L_j^M(x)$ is the loss of the j-th part of one sample that corresponds to the arithmetic mean descriptor. $L_j(x)$ is the sum of the total loss of the j-th part of all samples.
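The classification step can be illustrated with a small numerical sketch of softmax and cross-entropy. Eqs. (3) to (6) are not reproduced in this text, so the code uses the standard definitions and should be read as an assumption about their form. The function names, the use of K = 3 categories instead of 200 and the unweighted sum over the 12 part classifiers are illustrative choices. The logits stand in for the scores a part classifier would produce from its weight matrix.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true group label."""
    return -math.log(softmax(logits)[label])


def total_part_loss(relational_logits, mean_logits, label):
    """Sum of the losses of all part classifiers for one sample.

    Assumption: the relational and mean part losses are added
    without weights.
    """
    return sum(cross_entropy(z, label)
               for z in relational_logits + mean_logits)


# Toy example: 6 relational and 6 mean classifiers, K = 3 categories,
# all logits equal, so every classifier predicts 1/3 for the true label.
uniform = [[0.0, 0.0, 0.0] for _ in range(6)]
loss = total_part_loss(uniform, uniform, label=1)
print(abs(loss - 12 * math.log(3)) < 1e-9)  # True

# A confident, correct classifier gives a much smaller loss.
print(cross_entropy([0.0, 10.0, 0.0], 1) < cross_entropy([0.0, 0.0, 0.0], 1))  # True
```

Gradient descent on such a summed loss would update each part classifier with respect to its own term.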

C. GROUP FEATURE MATCHING
The research on G-ReID can be divided into two stages. The first stage is feature extraction and classifier training and the second stage is feature matching. In the first stage, the group descriptor is formed by concatenation, where G(·) represents the global descriptor for the concatenation between R and M and ⊕ is the concatenation operator. The test process is shown in Fig. 4. First, we extract a picture from each sample in the test set to form the gallery set. Second, the remaining sample images in the test set constitute the probe set. In this way, there is only one image of each group in the gallery set and several images of each group in the probe set. Third, we extract a picture from the probe set and input it into the PRM model to obtain the feature vector of the probe group image. Next, we input all pictures in the gallery set into the PRM model and obtain all the group feature vectors of the gallery set. The cosine similarity in Eq. (9) is then used to compare the feature vector $h_p^G$ of the probe image with all the image feature vectors $h_g^G$ in the gallery set. Because the cosine similarity considers the direction of a vector, it is widely used in the feature matching of pedestrian ReID. The greater the cosine similarity, the more similar the two images are. We use the cumulative matching characteristic (CMC) as the evaluation index of the PRM model.
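The matching stage can be sketched as follows. Eq. (9) is not reproduced in this text, so the code uses the standard cosine similarity. The function names, the order of concatenation (relational parts first, then mean parts) and the toy vectors are assumptions made for illustration.

```python
import math


def concat_descriptor(relational_parts, mean_parts):
    """Concatenate all part vectors into one group descriptor (assumed order)."""
    return [x for part in relational_parts + mean_parts for x in part]


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_gallery(probe, gallery):
    """Return gallery group IDs sorted from most to least similar."""
    return sorted(gallery,
                  key=lambda gid: cosine_similarity(probe, gallery[gid]),
                  reverse=True)


# Toy descriptors (assumed values): g1 points in the same direction
# as the probe, g3 partially overlaps it and g2 is orthogonal to it.
probe = [1.0, 0.0, 1.0]
gallery = {"g1": [2.0, 0.0, 2.0], "g2": [0.0, 1.0, 0.0], "g3": [1.0, 1.0, 0.0]}
ranking = rank_gallery(probe, gallery)
print(ranking)  # ['g1', 'g3', 'g2']
```

Because only the direction of the vectors matters, `g1` ranks first even though its norm differs from that of the probe.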

IV. BRT DATASET
The BRT dataset is used to assess the G-ReID. Because the sampling location of the G-ReID dataset is the transfer station of the city's BRT system, the G-ReID dataset is denoted as the BRT dataset. The BRT dataset expands the research scale of G-ReID research from small-scale extensions of blocks and schools to large-scale urban centers, which is beneficial for the spatial visualization of G-ReID results and understanding group behavior. We describe aspects such as the necessity of BRT dataset collection, the BRT dataset overview and sampling scheme, the ground truth and the challenges of the BRT dataset.

A. NECESSITY OF BRT DATASET COLLECTION
First, the BRT dataset is a G-ReID dataset, and a large number of current pedestrian ReID datasets are single-pedestrian ReID datasets. In real life, people often travel together and their travel activities have social attributes. The task of G-ReID is different from that of single-pedestrian ReID. The research on G-ReID has unique difficulties caused by variations in the number and positions of group members. Second, in the BRT G-ReID dataset, the sampling points cover 5 urban administrative areas in the city center. In previous research, the sampling point referred only to a local space area, such as an airport transfer hall or a corner of a campus. The BRT dataset brings many benefits for future G-ReID research. On the one hand, the BRT dataset is convenient for spatiotemporal feature modeling and matching, which is helpful for improving the efficiency and accuracy of G-ReID. On the other hand, the spatial visualization of G-ReID results is conducive to mining and understanding the potential semantic

2) BRT DATASET SAMPLING SCHEME
The project team arranged 13 groups that consisted of 6-8 people in each group. Each group had a sampling route. The team members combined freely to ensure the sufficient diversity and quantity of the group images. During the sample collection process, we recorded the time spent in the station and the station name. In addition, group members transferred to different bus routes at least once. Fig. 6 shows a schematic diagram of the group sampling. The groups appear at different blue and red sites. The blue and red lines indicate the sampling routes.

C. GROUND TRUTH OF THE BRT DATASET
To simplify the research, we do not consider the spatial layout of the cameras in the BRT station when we divide the dataset. We label pedestrians collected by different types of cameras at the same site with the same group label. The dataset is divided into a training set and a test set. The training set contains 200 groups with a total of 1,870 images. The test set also contains 200 groups with a total of 1,340 images. The total number of images in the training and test sets is 3,210. We extract one picture from each group of the test set to form the gallery set, and the remaining images in the test set constitute the probe set. In this way, the gallery set has only one image of each group, while the probe set has several images of each group. We statistically compare BRT with two existing G-ReID datasets, namely, i-LIDS MCTS [7] and Road Group [58], in TABLE 2.
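The gallery and probe construction described above can be sketched in a few lines. The function name, the dictionary layout and the random choice of the gallery image are assumptions for illustration; the text only states that one picture per group goes to the gallery set.

```python
import random


def split_gallery_probe(test_set, seed=0):
    """Put one image per group into the gallery and the rest into the probe set.

    test_set maps a group ID to the list of its image names.
    The gallery image is picked at random (an assumption of this sketch).
    """
    rng = random.Random(seed)
    gallery, probe = {}, []
    for group_id, images in test_set.items():
        pick = rng.randrange(len(images))
        gallery[group_id] = images[pick]
        probe.extend((group_id, image)
                     for k, image in enumerate(images) if k != pick)
    return gallery, probe


# Hypothetical test set with two groups.
toy_test_set = {"A": ["a1.jpg", "a2.jpg", "a3.jpg"], "B": ["b1.jpg", "b2.jpg"]}
gallery_set, probe_set = split_gallery_probe(toy_test_set)
print(len(gallery_set), len(probe_set))  # 2 3
```

Applied to the BRT test set of 200 groups and 1,340 images, such a split would give 200 gallery images and 1,140 probe images.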

D. CHALLENGES OF THE BRT DATASET
The BRT dataset is collected at a real scene and has practical application value. The real scene is complex and changeable. The collected BRT dataset is subject to interference from natural factors, such as changes in the light, illumination and imaging color. It is also influenced by human factors, such as group members blocking each other, changes in affiliations of the group members, etc. The challenges of the BRT dataset are summarized in 9 aspects, as shown in Fig. 7.

V. EXPERIMENTAL RESULTS
In this section, we introduce the experiments from the following five aspects: the evaluation protocol; implementation details and experimental setup; pretraining the SSD detector; experiments on the BRT dataset; and experiments on the i-LIDS MCTS and Road Group datasets.

A. EVALUATION PROTOCOL
Pedestrian ReID is essentially a retrieval or ranking task. The basic goal is to use the algorithm model to calculate the distance between the probe image and all images in the gallery set. Then, according to the distance, we obtain a sorted list. The Rank 1 accuracy and the cumulative matching characteristic (CMC) curve are commonly used indicators to evaluate, quantify and verify pedestrian ReID algorithms. Rank 1 can be regarded as the traditional classification accuracy. However, in actual application scenarios, by solely depending on the ReID algorithm, we cannot achieve a very high Rank 1 value. Therefore, we cannot truly complete the pedestrian identity consistency matching task across cameras. In this case, having the algorithm return a sorted list is a more practical application of ReID. The user selects the correct matching result from the first N objects on the list. If N is much smaller than the size of the probe set, the ReID algorithm can greatly reduce the labor cost and improve the efficiency of the matching task. The mathematical formula is defined as follows.
$$\mathrm{CMC}(N) = \sum_{n=1}^{N} r(n)$$
where r(n) represents the probability that the n-th element on the sorted list is consistent with the identity of the target to be queried. With N as the abscissa and CMC(N) as the ordinate, the CMC curve can be drawn and it is easy to find that CMC(1) is the same as Rank 1.
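As a hedged illustration of this protocol, the sketch below computes CMC(N) as the running sum of r(n). It assumes a single-shot gallery (one correct match per probe, as in the BRT test split); the function name `cmc_curve` and the toy match flags are hypothetical.

```python
def cmc_curve(ranked_matches, max_rank):
    """Compute [CMC(1), ..., CMC(max_rank)] from per-probe match flags.

    ranked_matches[i][n] is True when the (n+1)-th gallery entry in the
    sorted list of probe i has the same identity as probe i.
    """
    num_probes = len(ranked_matches)
    # r[n]: fraction of probes whose first correct match is at rank n+1.
    r = [0.0] * max_rank
    for flags in ranked_matches:
        for n, is_match in enumerate(flags[:max_rank]):
            if is_match:
                r[n] += 1.0 / num_probes
                break
    # CMC(N) is the running sum of r(1), ..., r(N).
    cmc, running = [], 0.0
    for value in r:
        running += value
        cmc.append(running)
    return cmc


# Hypothetical toy case: 4 probes whose correct match sits at ranks 1, 1, 2, 4.
toy_matches = [
    [True, False, False, False],
    [True, False, False, False],
    [False, True, False, False],
    [False, False, False, True],
]
cmc = cmc_curve(toy_matches, max_rank=3)
```

In this toy case two of the four probes are matched at Rank 1, so CMC(1) is 0.5, and the curve can only stay flat or rise as N grows.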

B. IMPLEMENTATION DETAILS AND EXPERIMENTAL SETUP
The backbone network of PCB is ResNet50 [8]. The batch size is set to 32 and the person images are resized to 384 × 128 as inputs. The PRM algorithm is trained for 60 epochs with a base learning rate of 0.05. After 30 epochs of training, the learning rate decays to 0.005. Because the i-LIDS MCTS dataset does not have a training set, the authors of [4] treat the entire i-LIDS MCTS dataset as the test set. Therefore, to ensure that the evaluation ground truths are the same, we use the G-ReID model trained on the BRT dataset and test it directly on the i-LIDS MCTS dataset. We use the labeled detection boxes of the Road Group dataset to obtain the information of each single person. Next, we execute the PRM algorithm on the Road Group dataset. The final result is obtained by averaging the results of 10 random splits. We use the cumulative matching characteristics (CMC) as the evaluation metric.
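The learning-rate schedule stated above can be written as a simple step function. This is only a sketch of the stated hyperparameters: the function name, the zero-based epoch index and the expression of the decay as a factor of 0.1 are assumptions, and no particular training framework is implied.

```python
def learning_rate(epoch, base_lr=0.05, decay_epoch=30, decay_factor=0.1):
    """Step schedule: base_lr for the first decay_epoch epochs, then decayed.

    Epochs are counted from 0 (an assumption), so epochs 0-29 use 0.05 and
    epochs 30-59 use 0.005.
    """
    return base_lr if epoch < decay_epoch else base_lr * decay_factor


# Learning rate for each of the 60 training epochs.
schedule = [learning_rate(epoch) for epoch in range(60)]
```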

C. PRETRAINING THE SSD DETECTOR
The BRT dataset does not provide bounding-box annotations. Therefore, before the pedestrian ReID work is carried out, the members of the group must be detected. To improve the pedestrian detection accuracy of the SSD, we pretrain the SSD detector on pedestrian datasets, such as PRW [56] and INRIA [53]. INRIA is currently the most widely used static pedestrian detection database and provides original pictures and corresponding annotation files. These high-resolution images come from GRAZ-01 and Google. The PRW pedestrian ReID dataset is an extension of the Market-1501 dataset. The test results are shown in Fig. 8. Pretraining the SSD model on the INRIA pedestrian dataset is more helpful for the G-ReID task because the INRIA dataset is similar to the BRT dataset and includes pedestrians with complete bodies, while the PRW pedestrian dataset is different and contains many cropped images at different scales.

D. EXPERIMENTS ON THE BRT DATASET

1) EVALUATION COMPONENT OF THE PRM ALGORITHM ON THE BRT DATASET
Because the final PRM algorithm combines several strategies, namely detector pretraining, PCB local feature localization and extraction, and group feature descriptors, we must identify the optimal combination of these strategies to ensure the best performance of the PRM algorithm. We test and compare the components of the PRM algorithm on the BRT dataset. TABLE 3 shows the test results for the components of the algorithm on the BRT dataset, where R-k (k = 1, 5, 10) denotes the Rank-k accuracy (%). The PRM algorithm builds on the baseline algorithm by adding some innovative elements. For the baseline algorithm, we choose the ResNet50 [8] and PCB [37] networks. ResNet50 has low complexity and good performance: it was the basis of the winning entries in the ILSVRC [57] and COCO 2015 [54] competitions, ranking first in detection, localization and segmentation tasks. The PCB algorithm is a recognized baseline in the field of ReID. We evaluate and compare the PRM algorithm with the existing baseline algorithms on the BRT dataset. The R50B algorithm does not perform detection and inputs the entire picture into the ResNet50 network to extract group features. The PCB algorithm does not perform detection and sends the entire picture directly to the PCB network to extract group features. As shown in TABLE 3, the G-ReID algorithm that uses the PCB network performs better than the one that uses R50B, because the ResNet50 feature extraction network is not specialized for a particular classification target: the targets in its images are people, animals, vehicles and other objects. By contrast, the PCB baseline algorithm is designed for single-person ReID. Therefore, we choose the PCB network as the basis for the design of the PRM algorithm. The PRM algorithm adds relational descriptors to the baseline PCB algorithm.
The SSD-PCB algorithm detects pedestrians by using an SSD detector without pretraining and inputs the detected pedestrians into the PCB network to extract group features. A lower pedestrian detection rate leads to a lower ReID accuracy rate. SSD-INR-PCB+M denotes the strategy of using the INRIA dataset to pretrain the SSD pedestrian detection model and using the arithmetic mean descriptor to extract group features. The SSD-INR-PCB+M+U algorithm uses the SSD pedestrian detection model pretrained on the INRIA dataset, the arithmetic mean descriptor and the max relational descriptor to extract group features. In the max relational descriptor, we first calculate the difference between the local features and then select the maximum feature value. PRM (SSD-INR-PCB+M+R) denotes the strategy of using the INRIA dataset to pretrain the SSD pedestrian detection model and using the minus-average relational descriptor and the arithmetic mean descriptor to extract features. Theoretically, the max relational descriptor mainly attends to the salient line features of the group, for example, the overall outline information of a pedestrian. By contrast, the minus-average relational descriptor and the arithmetic mean descriptor mainly concentrate on the overall information of the group, which is richer than the contour information. Therefore, the expressive ability of the minus-average relational descriptor and the arithmetic mean descriptor is stronger than that of the max relational descriptor. PRM performs better than SSD-INR-PCB+M, which indicates that the minus-average relational descriptor is an effective additional component. The final evaluation results show that the PRM algorithm performs the best.
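To make the difference between the three group descriptors concrete, the sketch below computes them on toy part features. It is an illustration under stated assumptions, not the authors' implementation: each member is represented as a list of part feature vectors (as PCB would produce), the relational descriptors are assumed to use absolute differences between every unordered pair of members, and the minus-average descriptor is assumed to average those differences. All function names and numbers are hypothetical, and the relational descriptors require at least two detected members.

```python
from itertools import combinations


def arithmetic_mean_descriptor(members):
    """Average each part feature over all group members."""
    count = len(members)
    num_parts, dim = len(members[0]), len(members[0][0])
    return [
        [sum(m[p][d] for m in members) / count for d in range(dim)]
        for p in range(num_parts)
    ]


def _pairwise_abs_differences(members):
    """Absolute part-wise differences for every unordered member pair."""
    return [
        [[abs(x - y) for x, y in zip(part_a, part_b)]
         for part_a, part_b in zip(a, b)]
        for a, b in combinations(members, 2)
    ]


def minus_average_relational_descriptor(members):
    """Assumed form: subtract member features pairwise, then average."""
    return arithmetic_mean_descriptor(_pairwise_abs_differences(members))


def max_relational_descriptor(members):
    """Subtract member features pairwise, then keep the element-wise maximum."""
    diffs = _pairwise_abs_differences(members)
    num_parts, dim = len(diffs[0]), len(diffs[0][0])
    return [
        [max(diff[p][d] for diff in diffs) for d in range(dim)]
        for p in range(num_parts)
    ]


# Hypothetical group of 3 members, each with 2 parts of dimension 2.
member_a = [[1.0, 2.0], [0.0, 4.0]]
member_b = [[3.0, 2.0], [2.0, 0.0]]
member_c = [[2.0, 5.0], [4.0, 2.0]]
group = [member_a, member_b, member_c]
reordered = [member_c, member_a, member_b]

mean_desc = arithmetic_mean_descriptor(group)
rel_desc = minus_average_relational_descriptor(group)
max_desc = max_relational_descriptor(group)
```

All three descriptors have a fixed size regardless of how many members are detected, and none of them depends on the order in which members are listed, which is the property needed to cope with changes in group size and layout.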

2) VISUALIZATION
The visualization of the PRM model is presented for the evaluation on the BRT dataset. To express the algorithm performance intuitively, we show a typical test-phase case in Fig. 9. The image in the blue box, taken from the probe set, is the group image to be queried during the test phase. The image in the red box, taken from the gallery set, is the group image that was correctly matched during the test phase. The matching process calculates the cosine similarity between the blue-box image and all the group images in the gallery set and then sorts the results. The higher the cosine similarity, the more similar the two images are. After sorting the calculated similarity values, the image with the highest similarity is named Rank 1 and the image with the second-highest similarity is named Rank 2. The sorted results are displayed from left (Rank 1) to right (Rank 5). The visualization also demonstrates the performance of the algorithm: Fig. 9 shows cases in which the PRM model consistently retrieves the target group from the gallery set on the BRT dataset.
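The matching step can be sketched as a cosine-similarity ranking over the gallery. This is a minimal illustration with hypothetical names and toy two-dimensional features; real group descriptors would be much longer, and the sketch assumes that no feature vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(probe_feature, gallery_features):
    """Return gallery ids sorted from most to least similar to the probe."""
    scored = [
        (cosine_similarity(probe_feature, feature), gallery_id)
        for gallery_id, feature in gallery_features.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [gallery_id for _, gallery_id in scored]


# Hypothetical probe and gallery features.
probe_feature = [1.0, 0.0]
gallery_features = {
    "g_a": [0.0, 1.0],
    "g_b": [2.0, 0.1],
    "g_c": [1.0, 1.0],
}
ranking = rank_gallery(probe_feature, gallery_features)
```

The first entry of `ranking` corresponds to Rank 1 in Fig. 9, the second to Rank 2, and so on.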

a: VISUALIZATION OF THE CLASS ACTIVATION MAP
We visualize the class activation maps (CAM) in Fig. 10 by using Grad-CAM [58]. The visualized group images were taken at different sites of the BRT, and the group members in the images exhibit mutual occlusion and relative position changes. We use the PCB baseline algorithm to process the entire group image to obtain the CAM activation map (b). We use the PRM algorithm to detect the group members and construct relations for the detected group members to obtain the CAM activation map (c). Compared with the PCB baseline algorithm, the proposed method has a higher activation in the same discriminative area. The PRM algorithm can more effectively capture the relationship information between group members. The PRM algorithm can also show independent hotspots on individual group members and thermal transitions between group members. This indicates that the proposed method can focus on more discriminative cues.

b: THE T-SNE VISUALIZATION OF GROUP FEATURES
To determine whether the PRM descriptors are effective in group classification from the perspective of feature dimensionality reduction, we use the PCB, SSD-INR-PCB+M and PRM algorithms to extract group features and then use the t-SNE algorithm to visualize these extracted features. The t-SNE algorithm [59] creates a single map that reveals structures at many different scales. This is particularly important for high-dimensional data that lie on several different but also related low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. We randomly selected 6 groups of data and used the t-SNE algorithm to perform a visualization on the training set. The PCB algorithm does not perform detection but rather sends the entire picture directly to the PCB network to extract group features. The test results show that the arithmetic mean descriptor and the minus-average relational descriptor are effective at group classification tasks. Moreover, the performance of the PRM algorithm based on the two descriptors is the best. The PRM algorithm can pull the images from the same identity closer while pushing different identities away from each other.

E. EXPERIMENTS ON THE I-LIDS MCTS AND ROAD GROUP DATASET
In this section, we introduce a comparison between the PRM model and the state-of-the-art algorithms for G-ReID on the i-LIDS MCTS dataset [54] and Road Group dataset [55].

1) DATASET DESCRIPTION
As shown in Fig. 12, the i-LIDS MCTS dataset [54] was captured by the multicamera CCTV network in the airport arrival hall. The dataset was captured from two non-overlapping camera views. A total of 64 groups were extracted and 274 images were cropped. For most groups, four images are available that are from different camera views or from the same camera but are captured at different locations and different times. The capture of these images was subject to large variances in light and occlusion.
The Road Group dataset [55] consists of 162 group pairs taken from a 2-camera view of a crowded road scene. The bounding box coordinates of a total of 1,099 pedestrians are also provided. The Road Group dataset includes severe object occlusions and large variations in group layout.

2) COMPARISON WITH STATE-OF-THE-ART METHODS
To evaluate the G-ReID performance, we compare the PRM model with the following state-of-the-art methods: CRRRO-BRO [2]; Covariance [5]; and PREF [3]. In these previous methods, the group features were designed by hand. The CRRRO-BRO descriptor attempts to obtain a stable representation against a relative position change between the couple, and the BRO descriptor is robust to changes in noncenter-rotation. CRRRO-BRO therefore achieves decent performance on the i-LIDS MCTS dataset, in which most groups contain two pedestrians. The principle of the covariance descriptor [5] is to measure the similarity of two groups by calculating the difference between their covariance matrices. Because the calculation of the covariance matrix is based on local pixel values, it is highly susceptible to interference from background information. Consequently, the performance of the covariance descriptor is limited. PREF (pooling residuals of encoded feature) [3] uses a feature dictionary to express single-person features and then transfers them for group appearance coding. The effect of PREF is limited because changes in group appearance are more complicated than changes in individual pedestrian appearance. As shown in TABLE , the accuracy of the G-ReID is higher with the PRM model than with the state-of-the-art algorithms because the PRM algorithm addresses changes in the number and relative positions of the members within the group. Therefore, PRM is valid for most group datasets, regardless of whether the groups in the dataset contain 2 pedestrians or multiple pedestrians.

VI. CONCLUSION
In this paper, we have proposed the PRM algorithm. First, it is based on local features that are conducive to expressing the internal structure of the human body. Second, the arithmetic mean descriptor and the minus-average relational descriptor solve the G-ReID problem caused by changes in the number and relative positions of group members. Third, the minus-average relational descriptor can describe the differences in the appearance of the group members. The PRM algorithm has a simple structure, and the experiments show that it performs effectively on the G-ReID task. In addition, considering the rarity of G-ReID datasets and the requirement to improve the applicability of the G-ReID algorithm in real-life scenarios, we have contributed the BRT G-ReID dataset.
In the open world, G-ReID must work with more complex scenarios and more diverse problems. The PRM algorithm is a supervised learning method based on a group dataset with label information. However, there are very few datasets for G-ReID with label information. Therefore, future work can consider using semi-supervised learning and transfer learning methods to address variations in group appearance. Moreover, in real-life scenarios, considerable multi-source information is available for G-ReID. For example, the geographic spatio-temporal information is also a beneficial constraint condition. Therefore, in the next stage, we will integrate spatio-temporal constraint information with image appearance information to address the challenges of G-ReID.