FeatAug-DETR: Enriching One-to-Many Matching for DETRs With Feature Augmentation

One-to-one matching is a crucial design in DETR-like object detection frameworks. It enables the DETR to perform end-to-end detection. However, it also faces challenges of lacking positive sample supervision and slow convergence speed. Several recent works proposed the one-to-many matching mechanism to accelerate training and boost detection performance. We revisit these methods and model them in a unified format of augmenting the object queries. In this paper, we propose two methods that realize one-to-many matching from a different perspective of augmenting images or image features. The first method is One-to-many Matching via Data Augmentation (denoted as <italic>DataAug-DETR</italic>). It spatially transforms the images and includes multiple augmented versions of each image in the same training batch. Such a simple augmentation strategy already achieves one-to-many matching and surprisingly improves DETR's performance. The second method is One-to-many matching via Feature Augmentation (denoted as <italic>FeatAug-DETR</italic>). Unlike <italic>DataAug-DETR</italic>, it augments the image features instead of the original images and includes multiple augmented features in the same batch to realize one-to-many matching. <italic>FeatAug-DETR</italic> significantly accelerates DETR training and boosts detection performance while keeping the inference speed unchanged. We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and <inline-formula><tex-math notation="LaTeX">$\mathcal {H}$</tex-math><alternatives><mml:math><mml:mi mathvariant="script">H</mml:mi></mml:math><inline-graphic xlink:href="fang-ieq1-3381961.gif"/></alternatives></inline-formula>-Deformable-DETR. Without extra training data, <italic>FeatAug-DETR</italic> shortens the training convergence periods of Deformable-DETR (Zhu et al. 
2020) to 24 epochs and achieves 58.3 AP on COCO <monospace><bold>val2017</bold></monospace> set with Swin-L as the backbone.


I. INTRODUCTION
Object detection is a fundamental task in computer vision, which predicts bounding boxes and categories of objects in an image. In the past several years, deep learning has achieved significant success in object detection, and tens of classic object detectors have been proposed. These classic detectors are mainly based on convolutional neural networks and include one-stage detectors [2]-[4] and two-stage detectors [5]-[8]. One-to-many label assignment is the core design of the classic detectors, where each ground-truth box is assigned to multiple predictions of the detector. With such a one-to-many matching scheme, these frameworks require human-designed non-maximum suppression (NMS) for post-processing and cannot be trained end-to-end. Recent works aim to improve the one-to-one matching mechanism in DETR [11]. One-to-one matching allows DETR to discard the human-designed NMS for post-processing. In addition, [9] showed that extra one-to-many matching supervision can lead to faster and better convergence. In the recent Group-DETR [9] and Hybrid Matching [10], extra object queries are used to formulate one-to-many matchings that provide additional supervision for better training DETRs.
The one-to-many matching mechanism associates each ground truth object with multiple object queries. The object queries interact with the image feature maps containing the objects via cross-attention in the DETR decoder. The one-to-one and one-to-many matchings therefore implicitly conduct assignments between queries and the object features from the spatial feature maps. The state-of-the-art Group-DETR and Hybrid Matching enhance one-to-one matching by augmenting extra object queries and inputting them into the matching module. They achieve impressive results in accelerating convergence and boosting detection performance.
Our initial observation highlights that a straightforward yet appropriately designed data augmentation scheme (DataAug-DETR) can implicitly accomplish one-to-many matching and surprisingly improve DETR performance. By integrating multiple spatially augmented versions of a single image in a single batch, the same objects can be assigned to distinct queries across the various augmented images. These one-to-many assignments can greatly enhance detection performance.
Given that DETR queries accumulate information from image feature maps, we propose approximating the effect of spatial image augmentation by applying spatial augmentation directly to the feature maps, thereby avoiding repeatedly forwarding different versions of the same image through the vision backbone. We thus propose feature augmentation (FeatAug-DETR) for DETR, which spatially shifts and flips feature maps and arranges different versions of a feature map in the same batch, thereby assigning the same object queries to different objects after feature augmentation. This method is a simple yet effective way to enhance DETR performance.
We conduct extensive experiments to evaluate the efficiency and effectiveness of DataAug-DETR and FeatAug-DETR. As a plug-and-play approach, our proposed modules can be easily integrated into different DETR variants. FeatAug-DETR significantly accelerates convergence and also improves the performance of various DETR detectors, including DAB-DETR [19], Deformable-DETR [1], and H-Deformable-DETR [10]. FeatAug-DETR helps Deformable-DETR [1] with the Swin-Large backbone [20] reach 55.8 AP with the 1× training schedule (12 epochs), which is 1.3 AP higher than without our method (54.5 AP), while keeping the inference FLOPs unchanged. Moreover, FeatAug-DETR is compatible with H-Deformable-DETR [10], as our feature augmentation realizes one-to-many matching from a different perspective.
In summary, our contributions are as follows:
• We propose DataAug-DETR, which includes multiple spatially augmented versions of each image in the same training batch and implicitly realizes one-to-many matching.
• We further propose feature augmentation (FeatAug-DETR) for DETR. It augments the feature maps from the vision backbone and significantly accelerates training compared with DataAug-DETR.
• When integrating FeatAug-DETR into Hybrid Matching [10], our method achieves 58.3 AP on COCO val2017 with the Swin-L backbone and 24 training epochs, surpassing the state-of-the-art performance by 0.5 AP.

II. RELATED WORK

A. Classic Object Detectors
Modern object detection models are divided into two categories: one-stage detectors and two-stage detectors. The one-stage detectors [3], [4] predict the positions of objects relying on anchors. The two-stage detectors [6], [7] first generate region proposals and then predict the object positions w.r.t. the proposals. Both families are anchor-based methods, in which the predefined anchors play an important role. These classic object detectors also need hand-designed operations such as NMS for post-processing, which prevents them from being optimized end-to-end.

B. Label Assignment in Classic Object Detectors
Assignment between ground truth objects and training samples is a widely investigated topic in classic object detectors [2], [3], [6], [8], [21]. Anchor-based detectors [21], [22] utilize Intersection-over-Union (IoU) for label assignment: an anchor is assigned to the ground truth box with the maximum IoU when that IoU exceeds a predefined threshold. Anchor-free detectors [23], [24] utilize spatial and scale constraints when selecting positive points. Follow-up works [25], [26] propose improvements in a similar direction. The above label assignment methods in classic object detectors are all one-to-many matching, which assigns several object predictions to each ground truth box. Such methods require NMS for post-processing, which makes the detectors hard to train in an end-to-end manner.

C. Detection Transformer
Carion et al. [11] proposed the Detection Transformer (DETR), which introduces the Transformer architecture into the object detection field. They use bipartite matching to implement a set-based loss and make the framework end-to-end. These designs remove hand-crafted components such as anchors and NMS in previous classic object detectors. However, DETR still suffers from slow convergence and relatively low performance. Also, DETR uses only a single-scale image feature, losing the benefit of multi-scale features that have been proven effective in previous works. Follow-up works proposed improvements to relieve these problems effectively. [27] designed an encoder-only DETR without using a decoder. Anchor-DETR [17] utilizes a designed anchor architecture in the decoder to help accelerate training. Conditional-DETR [18] learns a conditional spatial query to help each cross-attention head attend to a band containing a distinct region; it significantly shortens the training schedule of DETR from 500 epochs to 50 or 108 epochs. Efficient DETR [28] selects the top-K positions from the encoder's predictions to enhance the decoder queries. Dynamic Head [29] proposed a dynamic decoder to focus on important regions from multiple feature levels. Deformable-DETR [1], [30], [31] proposed deformable attention to replace the original attention mechanism in DETR, which makes utilizing multi-scale image features feasible in the DETR architecture. The follow-up DAB-DETR [19] uses 4-D box coordinates as queries and updates the boxes layer by layer in the Transformer decoder. Later, DN-DETR [14] and DINO [32] use a bounding box denoising operation and further shorten the training period to the 3× standard training schedule (36 epochs).

D. One-to-one Matching
Carion et al. [11] use bipartite matching to implement one-to-one matching between the object queries and the ground truth bounding boxes. The pipeline computes a pair-wise matching cost based on the class predictions and the similarity between the predicted and ground truth boxes. Then the Hungarian algorithm [33] is applied to find the optimal assignment. This design enables DETR to be trained end-to-end for object detection without the hand-designed NMS operation.
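The matching step described above can be sketched as follows. This is a simplified illustration, not DETR's exact cost: the real framework uses a focal-loss-based classification cost plus L1 and GIoU box costs, while here we use an L1 box distance and a plain negative class probability with placeholder weights.

```python
# Toy sketch of DETR-style one-to-one matching: build a cost matrix between
# predictions and ground truths, then run the Hungarian algorithm to obtain
# the optimal bipartite assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_boxes, gt_boxes, pred_scores, gt_labels,
          box_weight=5.0, cls_weight=2.0):
    """Return (pred_idx, gt_idx) minimizing a class + L1-box matching cost.

    pred_boxes: (N, 4) predicted boxes; gt_boxes: (M, 4) ground truth boxes.
    pred_scores: (N, num_classes) class probabilities; gt_labels: (M,) labels.
    """
    # L1 distance between every prediction and every ground truth box.
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    # Classification cost: negative predicted probability of the gt class.
    cls_cost = -pred_scores[:, gt_labels]
    cost = box_weight * box_cost + cls_weight * cls_cost
    pred_idx, gt_idx = linear_sum_assignment(cost)  # Hungarian algorithm
    return pred_idx, gt_idx
```

With many more predictions than ground truths (e.g., 300 queries vs. fewer than 50 objects), the rectangular assignment leaves most queries unmatched, which is exactly the sparse positive supervision discussed next.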
However, the number of ground truth boxes in each image is relatively small (almost always fewer than 50) compared with the number of predictions, which is generally larger than 300. The positive sample supervision provided by one-to-one matching is therefore relatively sparse [9], which results in a slow convergence speed for the DETR framework. This problem can be effectively relieved by designing one-to-many matching methods.

E. One-to-Many Matching
Recently, several works [9], [10] discussed the sparse positive supervision problem of one-to-one matching and proposed one-to-many matching methods for DETR-like frameworks. Group-DETR [9] decouples the positives into multiple independent groups and keeps only one positive per ground truth object in each group. Hybrid Matching [10] combines the original one-to-one matching branch with auxiliary queries that use a one-to-many matching loss during training. The principles of the two methods are similar: they provide extra object queries in the decoder to produce more positive supervision. Both methods effectively accelerate convergence and boost performance for DETR.
Our method shares the same principle of generating more positive supervision. But unlike Group-DETR or Hybrid Matching, which introduce extra object queries, we augment the images in a batch or the features from the backbone to realize this goal. Our method achieves clear performance improvements in DETR training. Through experiments, we show that our method obtains further performance gains when applied together with previous methods (such as Hybrid Matching).

III. METHOD

A. A Brief Revisit of DETR and One-to-Many Matching
In DETR, an input image $I$ is processed by the backbone $B$ to obtain the feature map $F \in \mathbb{R}^{\hat{C} \times \hat{H} \times \hat{W}}$. A stack of self-attention and feed-forward networks in the DETR encoder transforms $F$ to obtain the enhanced feature maps. The Transformer decoder utilizes cross-attention to guide a series of object queries $Q \in \mathbb{R}^{N \times C}$ to aggregate information from the feature maps and generate object predictions $P$, where $N$ denotes the number of object queries and $C$ is the query vector dimension. The predictions include the normalized location predictions $P_l \in \mathbb{R}^{N \times 4}$ and class predictions $P_c \in \mathbb{R}^{N \times (N_{cls}+1)}$, in which $N_{cls}$ denotes the number of classes. The one-to-one matching strategy is adopted to associate the predictions $P$ with the ground truth objects $G$.
An object query set $Q = \{q_1, q_2, \ldots, q_N\} \in \mathbb{R}^{N \times C}$ is input into the Transformer decoder. The queries aggregate information from the image features $F$ through the cross-attention operations in the Transformer decoder $D(\cdot)$, which has $L$ Transformer layers and outputs object predictions. The predictions of each decoder layer are denoted as $P^1, P^2, \ldots, P^L$, respectively. A bipartite one-to-one matching between the object predictions and the ground truths $G$ is conducted at each layer. The process is formulated as
$$P^l = D_l(P^{l-1}, F), \quad \mathcal{L}_{\text{one2one}} = \sum_{l=1}^{L} \mathcal{L}_{\text{Hungarian}}(P^l, G), \quad (1)$$
where $D_l(\cdot)$ denotes the $l$-th layer of the decoder, $\mathcal{L}_{\text{one2one}}$ denotes the one-to-one matching loss, and $\mathcal{L}_{\text{Hungarian}}$ denotes the Hungarian matching (bipartite matching) loss at each layer. Since each prediction $p^l_i \in P^l$ is generated from the object query $q_i$, the matchings between ground truths and predictions can be viewed as matchings between the ground truths $G$ and the object queries $Q$.
In order to better supervise DETR and accelerate its training, one-to-many matching schemes were proposed in [9], [10]. Group-DETR [9] and Hybrid Matching [10] augment extra object queries $\hat{Q}$ and introduce a one-to-many matching loss, which achieves significant performance gains. Similar to the formulation of one-to-one matching, the general formulation of one-to-many matching can be defined as
$$\mathcal{L}_{\text{one2many}} = \sum_{l=1}^{L} \sum_{k=1}^{K} \mathcal{L}_{\text{Hungarian}}(P^l_k, \hat{G}_k), \quad (2)$$
where $K$ denotes the number of Hungarian matching groups. In each matching group, the predictions corresponding to the $k$-th group of augmented queries $\hat{Q}_k$ are denoted as $P^l_k$, and the augmented ground truths are denoted as $\hat{G}_k$.
In one-to-many matching, the predictions and ground truths are augmented to generate different groups of positive supervision. Here we discuss the designs of Group-DETR [9] and Hybrid Matching [10] following the above unified formulation (Eq. (2)).
1) Group-DETR: Group-DETR utilizes $K$ separate groups of object queries $Q_1, \ldots, Q_K$ and generates $K$ groups of predictions for each training image. One-to-one matching with the same set of ground truths $G$ is applied to each group of predictions respectively. The process can be formulated as
$$\mathcal{L}_{\text{one2many}} = \sum_{l=1}^{L} \sum_{k=1}^{K} \mathcal{L}_{\text{Hungarian}}(P^l_k, G). \quad (3)$$
In Group-DETR, multiple groups of object queries are matched to the same set of ground truths. This augmentation of the object queries helps to boost the model performance.
2) Hybrid Matching: Hybrid Matching uses a second group of object queries, which contains $T$ object queries $\hat{Q} = \{\hat{q}_1, \hat{q}_2, \ldots, \hat{q}_T\}$. The second group of queries applies one-to-many matching with repeated sets of ground truths $\hat{G} = \{\hat{G}_1, \hat{G}_2, \ldots, \hat{G}_K\}$, where $\hat{G}_1 = \hat{G}_2 = \cdots = \hat{G}_K = G$. The process of Hybrid Matching can be formulated as
$$\mathcal{L} = \sum_{l=1}^{L} \mathcal{L}_{\text{Hungarian}}(P^l, G) + \sum_{l=1}^{L} \sum_{k=1}^{K} \mathcal{L}_{\text{Hungarian}}(\hat{P}^l_k, \hat{G}_k), \quad (4)$$
where $\hat{P}^l_k$ denotes the $l$-th layer predictions corresponding to the $k$-th group of auxiliary queries. Group-DETR and Hybrid Matching both augment the object queries $Q$ to facilitate training. Considering the unified formulation of one-to-many matching (Eq. (2)), both methods conduct one-to-many matching by augmenting the query set $Q$ but ignore the possibility of jointly augmenting the image features $F$ and the ground truths $G$.

B. One-to-many Matching via Data Augmentation
Our key observation is that one-to-many matching can also be implemented by augmenting the image features $F$ and the ground truths $G$ in Eq. (2). We explore conducting spatial augmentation on each image multiple times and including the augmented versions in the same batch. We experimentally validate that spatially augmented versions of the same image lead to different query-ground truth assignments.
We conduct a pilot study on the COCO train2017 dataset, where we augment every image twice with random flipping and cropping. The random flipping and cropping operations follow the same data augmentation as in DETR [11]. Such augmented image pairs are then input to a trained Deformable-DETR whose parameters are fixed. The Deformable-DETR has 300 object queries and operates in a one-stage manner. We observe that 95.9% of the corresponding objects in the two augmented images are assigned to different object queries. The remaining 4.1% of objects that are assigned to the same object queries are mostly located at the same relative positions in the two augmented images. More specifically, ground truth bounding boxes whose matched queries are unchanged between the two augmented images have a high average IoU of 78.2%. This pilot study shows that, by spatially augmenting each training image in a proper way, the objects can be assigned to different object queries. It is therefore reasonable to take advantage of image spatial transformations that modify $F$ and $G$ to achieve one-to-many matching for DETRs.
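The pilot-study statistic can be sketched as follows, assuming we have already recorded, for each tracked object, the index of the query it was matched to in each augmented view. The function name and the dictionary representation are illustrative, not from the paper's code.

```python
def changed_query_fraction(match_view1, match_view2):
    """Fraction of objects whose matched query differs between two views.

    match_view1 / match_view2: dict mapping object id -> matched query index,
    obtained by running Hungarian matching on each augmented view separately.
    """
    common = set(match_view1) & set(match_view2)
    if not common:
        return 0.0
    changed = sum(match_view1[o] != match_view2[o] for o in common)
    return changed / len(common)
```

In the paper's study this fraction is 95.9% over COCO train2017, i.e., flipping and cropping almost always reroute an object to a different query.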
The above pilot study shows that one-to-many matching can be achieved via spatial data augmentation. We propose DataAug-DETR, which conducts spatial image augmentation on each training image and includes the augmented versions in the same training batch. We adopt the default data augmentation scheme of DETR and Deformable-DETR, which includes 50% random horizontal flipping and 50% random cropping, followed by random resizing to several pre-defined sizes with the short edge in [480, 800]. Assume that each image is spatially augmented $N$ times in each training iteration. We denote the data augmentation operation as $T_n(\cdot)$, where $n$ denotes the $n$-th random data augmentation of an image. The $N$ versions of the image pass through the image feature backbone $B$ in DETR and generate $N$ image features. Note that the data augmentation is also applied to the ground truth labels $G$ and generates $N$ versions of labels $\{\hat{G}_n = T_n(G)\}_{n=1}^{N}$. The augmentation process can be formulated as
$$\hat{F}_n = B(T_n(I)), \quad \hat{G}_n = T_n(G), \quad n = 1, \ldots, N. \quad (5)$$
Then, by applying the bipartite matching to each augmented image individually, the matching process in Eq. (2) becomes
$$\mathcal{L} = \sum_{n=1}^{N} \sum_{l=1}^{L} \mathcal{L}_{\text{Hungarian}}\big(D_l(\hat{P}^{l-1}_n, \hat{F}_n), \hat{G}_n\big), \quad (6)$$
where $D_l(\cdot)$ denotes the $l$-th Transformer decoder layer in DETR. Note that in the default setting of our DataAug-DETR, we use one set of object queries $Q$ shared across all images. During the Hungarian matching of different augmented versions of an image, the ground truth objects tend to be matched to different object queries.
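The batch construction of DataAug-DETR can be sketched as follows, assuming for simplicity that horizontal flipping is the only augmentation (the paper also uses random cropping and resizing). Boxes are in normalized $(c_x, c_y, w, h)$ format, so only the center-x coordinate changes under a flip; the function names are illustrative.

```python
import numpy as np

def hflip(image, boxes):
    """Horizontally flip an image and its boxes.

    image: (C, H, W) array; boxes: (M, 4) normalized (cx, cy, w, h).
    """
    flipped = image[:, :, ::-1].copy()
    fboxes = boxes.copy()
    fboxes[:, 0] = 1.0 - fboxes[:, 0]  # mirror the center-x coordinate
    return flipped, fboxes

def build_dataaug_batch(image, boxes, n_aug=2):
    """Return n_aug (image, boxes) views of one image for the same batch.

    Each view gets its own Hungarian matching against its own transformed
    ground truths, so the same object tends to match different queries.
    """
    views = [(image, boxes)]
    for _ in range(n_aug - 1):
        views.append(hflip(image, boxes))
    return views
```

With the default N = 2, each training batch of 16 slots holds 8 distinct images, each appearing in two augmented versions.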

C. One-to-many Matching via Feature Augmentation
In DataAug-DETR, data augmentation is applied to each image multiple times. The $N$ augmented versions of the same image are all encoded by the feature backbone, whose computation cost is considerable as the backbone is generally heavy.
Since we only perform simple spatial transformations on each training image, the features of the spatially transformed images can also be approximated by conducting the spatial transformations directly on the feature maps $F$ of each image $I$. In this way, each image only goes through the heavy feature backbone once and can still obtain multiple spatially augmented feature maps $\hat{F}_1, \ldots, \hat{F}_N$. This strategy is much more efficient than DataAug-DETR, and we name it FeatAug-DETR. The process can be formulated as
$$\hat{F}_n = E(T_n(F)), \quad n = 1, \ldots, N, \quad (7)$$
where $E(\cdot)$ denotes the Transformer encoder of DETR and $T_n(\cdot)$ denotes conducting the $n$-th spatial augmentation on the feature map $F$. We perform feature augmentation on the output features of the backbone, before the Transformer encoder $E(\cdot)$, which we experimentally found to achieve better performance. The detailed experiment on the choice of the feature augmentation position can be found in Section IV-G2.
After the augmented features $\hat{F}_n$ are obtained, a matching process similar to that of DataAug-DETR is applied:
$$\mathcal{L} = \sum_{n=1}^{N} \sum_{l=1}^{L} \mathcal{L}_{\text{Hungarian}}\big(D_l(\hat{P}^{l-1}_n, \hat{F}_n), \hat{G}_n\big). \quad (8)$$
For the specific feature augmentation operation $T_n(\cdot)$, we investigate feature map flipping and/or cropping.
1) Feature Map Flipping: Horizontal flipping is performed on the feature map $F$ (denoted as FeatAug-Flip), which is formulated as
$$\hat{F}_1 = F, \quad \hat{F}_2 = \text{Flip}(F), \quad (9)$$
where $\text{Flip}(\cdot)$ denotes horizontal flipping along the width dimension, and the ground truth boxes are flipped accordingly. After applying FeatAug-Flip, the two augmented feature maps $\hat{F}_1$ and $\hat{F}_2$ are forwarded to the follow-up modules of DETR, and two separate Hungarian matchings are conducted for the two feature maps in the same training batch following Eq. (2).
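FeatAug-Flip can be sketched on a single backbone feature map as follows. Flipping the map along its width axis stands in for encoding a flipped image, so the normalized ground-truth center-x must be mirrored to stay consistent; this is an illustrative simplification, not the paper's implementation.

```python
import numpy as np

def feataug_flip_batch(feature, boxes):
    """Return the two views used by FeatAug-Flip for one image.

    feature: (C, H', W') backbone feature map.
    boxes: (M, 4) normalized (cx, cy, w, h) ground truth boxes.
    Returns [(original_feature, boxes), (flipped_feature, flipped_boxes)];
    each view is matched independently via the Hungarian algorithm.
    """
    flipped = feature[:, :, ::-1].copy()  # flip along the width dimension
    fboxes = boxes.copy()
    fboxes[:, 0] = 1.0 - fboxes[:, 0]     # mirror center-x for the targets
    return [(feature, boxes), (flipped, fboxes)]
```

Unlike DataAug-DETR, the backbone runs once per image; only the cheap flip is duplicated.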
2) Feature Map Cropping: Besides the flip operation, random cropping on the feature map is tested (denoted as FeatAug-Crop).The process of FeatAug-Crop is similar to that of FeatAug-Flip but replaces flipping with feature cropping.
The cropping and resizing hyperparameters are the same as those of DETR's original image-level cropping and resizing scheme. Since some state-of-the-art DETR-like frameworks are based on Deformable-DETR, which utilizes multi-scale features from the backbone, our FeatAug-Crop augments both single-scale and multi-scale feature maps. Deformable-DETR utilizes multi-scale feature maps of 1/8, 1/16, and 1/32 of the original resolution. In order to crop the features of the three scales consistently, we use RoIAlign [8] to crop and resize the same region across the three scales. The feature augmentation for single-scale features is the same but only conducts the augmentation on one scale.
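The aligned multi-scale crop can be sketched as follows. A real implementation would use RoIAlign (e.g., `torchvision.ops.roi_align`) to crop and resize with bilinear interpolation; integer slicing here only illustrates how the same normalized region is taken at every scale.

```python
import numpy as np

def crop_multiscale(feats, region):
    """Crop the same normalized region from every feature scale.

    feats: list of (C, H_s, W_s) maps, e.g. at strides 8, 16, and 32.
    region: (x0, y0, x1, y1) with coordinates normalized to [0, 1].
    """
    x0, y0, x1, y1 = region
    out = []
    for f in feats:
        _, h, w = f.shape
        # Same relative box at each scale keeps the crops spatially aligned.
        out.append(f[:, int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)])
    return out
```

Because the region is specified in normalized coordinates, the three crops cover the same image content despite their different resolutions.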
When cropping the features with RoIAlign, we find that the cropped features $\hat{F}$ become blurry due to the bilinear interpolation in RoIAlign. This causes a domain gap between the original features $F$ and the cropped features $\hat{F}$. When DETR is trained with both types of region features, its detection performance deteriorates. We propose to use extra feature projectors on the cropped features to narrow this domain gap. Given the three-scale features from the backbone, three individual projectors are adopted to transform the cropped features $\hat{F}$, each of which includes a 1 × 1 convolution layer and a group normalization layer [34]. The projectors are only applied to $\hat{F}$ during training, while the original features $F$ are still processed by the original detection head. The cropped feature projectors mitigate the domain gap produced by RoIAlign and avoid the performance reduction.
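One projector can be sketched numerically as follows, using NumPy as a stand-in for a framework's 1 × 1 convolution and GroupNorm layers; the weight shapes and group count are illustrative, not the paper's exact configuration.

```python
import numpy as np

def projector(x, weight, bias, num_groups=32, eps=1e-5):
    """1x1 convolution followed by GroupNorm, as used on cropped features.

    x: (C_in, H, W) cropped feature map; weight: (C_out, C_in); bias: (C_out,).
    C_out must be divisible by num_groups.
    """
    # A 1x1 convolution is a per-pixel linear map over the channel dimension.
    y = np.einsum('oc,chw->ohw', weight, x) + bias[:, None, None]
    # GroupNorm: normalize each group of channels over (channels, H, W).
    c = y.shape[0]
    g = y.reshape(num_groups, c // num_groups, *y.shape[1:])
    mean = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(y.shape)
```

During training this transform is applied only to the cropped features; the original features bypass it, so inference is unchanged.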
3) Combining Flipping and Cropping: In FeatAug-Flip and FeatAug-Crop, the two versions of feature maps of each image, which include the original and the augmented features, are forwarded through the DETR detection head.It is straightforward to combine the above flipping and cropping operations for feature augmentation.We name this combined version FeatAug-FC.When applying FeatAug-FC, three versions of each image's feature maps, i.e. the original, flipped, and cropped feature maps, are input into the DETR head.
DataAug-DETR and FeatAug-DETR augment the feature maps in the same training batch to realize one-to-many matching for DETRs from a new perspective. Both methods improve detection performance, and FeatAug-DETR also significantly accelerates DETR training.

IV. EXPERIMENTS

A. Dataset and Implementation Details
The experiments are conducted on the COCO 2017 object detection dataset [35]. The dataset is split into train2017 and val2017, which contain 118k and 5k images, respectively. There are 7 instances per image on average, and up to 63 instances in a single image in the training set. We report the standard average precision (AP) results on COCO val2017. The DETR frameworks are tested with ResNet-50 [36], Swin-Tiny, and Swin-Large [20] backbones. The ResNet-50 and Swin-Tiny backbones are pretrained on ImageNet-1K [37] and Swin-Large is pretrained on ImageNet-22K [37].
We use the L1 loss and GIoU [38] loss for box regression, and the focal loss [21] with α = 0.25, γ = 2 for classification. Following the setting of DETR [11], we apply auxiliary losses after each decoder layer. Similar to Deformable-DETR [1], we add extra intermediate losses after the query selection module, with the same components as for each decoder layer. We adopt the loss coefficients 2.0 for the classification loss, 5.0 for the L1 loss, and 2.0 for the GIoU loss, which is the same as [10].
Each tested DETR framework is composed of a feature backbone, a Transformer encoder, a Transformer decoder, and two prediction heads for boxes and labels. All Transformer weights are initialized with Xavier initialization [39]. In the experiments, we use 6 layers for both the Transformer encoder and decoder. The hidden dimension of the Transformer layers is 256. The intermediate size of the feed-forward layers in the Transformer blocks is 2048, which follows the settings of [10]. The MLP networks for box and label predictions share the same parameters across different decoder layers. We use 300 object queries in the decoder. We use the AdamW [40] optimizer with a weight decay of 10^-4. We use an initial learning rate of 2 × 10^-4 for the Deformable-DETR head and a learning rate of 2 × 10^-5 for the backbone, which is the same as in [1]. A 1/10 learning rate drop is applied at the 11th, 20th, and 30th epochs for the 12-, 24-, and 36-epoch settings, respectively. The model is trained without dropout. The training batch size is 16 and the experiments are run on 16 NVIDIA V100 GPUs. During validation, we select the 100 predicted boxes and labels with the largest classification logits for evaluation by default.
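The two-group learning-rate schedule described above can be sketched as follows; the function is a plain restatement of the stated hyperparameters (0-indexed epochs), not code from the paper.

```python
def lr_at_epoch(epoch, total_epochs, head_lr=2e-4, backbone_lr=2e-5):
    """Step schedule: a 1/10 drop at epoch 11 / 20 / 30 for the
    12- / 24- / 36-epoch settings, applied to both parameter groups.

    Returns (head_lr, backbone_lr) for the given (0-indexed) epoch.
    """
    drop_epoch = {12: 11, 24: 20, 36: 30}[total_epochs]
    scale = 0.1 if epoch >= drop_epoch else 1.0
    return head_lr * scale, backbone_lr * scale
```

In a framework such as PyTorch this would typically be realized with two optimizer parameter groups (head vs. backbone) and a step scheduler.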

B. Main Results
Our DataAug-DETR and FeatAug-DETR methods are compatible with most DETR-like frameworks. The results on the COCO val2017 set are shown in Table I. DAB-DETR-DC5-FeatAug-FC denotes applying FeatAug-FC on top of DAB-DETR [19] with R50 dilated C5-stage image features [41], Deformable-DETR w/ tck.-FeatAug-FC denotes FeatAug-FC on top of Deformable-DETR with tricks, and H-Deformable-DETR-FeatAug-Flip denotes FeatAug-Flip on top of Hybrid Matching Deformable-DETR. We experimentally found that, when integrated with Hybrid Matching, FeatAug-Flip achieves better performance than FeatAug-FC, as shown in Table II. Thus, we report the results of Hybrid Matching with FeatAug-Flip and the others with FeatAug-FC.
In the table, we also report the performance of some previous representative DETR variants. The compared methods include single-scale and multi-scale detectors. The Deformable-DETR-based ones, which utilize multi-scale features, generally achieve better performance than the single-scale detectors.
We compare the performance gain of our DataAug-DETR and FeatAug-DETR on top of the baselines. DataAug-DETR is evaluated on top of Deform-DETR w/ tck. [1]. FeatAug-DETR is evaluated on DAB-DETR [19], Deform-DETR w/ tck., and H-Deformable-DETR [10]. On the single-scale DAB-DETR with the ResNet-50 backbone, FeatAug-FC improves the performance by 1.4 AP with 50 training epochs and reaches 47.1 AP, which makes this single-scale detector perform better than the multi-scale Deformable-DETR [1] (46.9 AP). The AP_M and AP_L of DAB-DETR-DC5-FeatAug-FC exceed those of Deformable-DETR by large margins (0.8 AP and 2.7 AP, respectively). (In Table I, "tricks" refers to the tricks described in Section IV-A, and † denotes keeping 300 instead of 100 predictions for evaluation.) Hybrid Matching (denoted as H) [10] also tackles the one-to-many matching problem and achieves state-of-the-art performance. We also test our FeatAug-DETR on top of Hybrid Matching with the Swin-Large backbone and achieve 58.3 AP, which is 0.5 AP higher than the previous state-of-the-art DINO. In the experiments, we adopt the default hyperparameter settings of Hybrid Matching [10].

C. Non-spatial vs. Spatial Transformation for DataAug-DETR
As discussed in Section III-B, the assignment of object queries to ground truths is sensitive to position changes. In other words, when an augmentation changes the relative positions of objects, these objects almost always match different queries compared to the original image/feature. Our proposed flipping and cropping augmentations are both spatial transformations that change relative object positions. There are also widely used data augmentation methods that do not change the objects' relative positions, such as random image resizing. In this section, we test applying only random image resizing in our DataAug-DETR method, i.e., resizing each image randomly several times within the same batch. We compare this non-spatial transformation with the Deform-DETR w/ tck. baseline and our DataAug-DETR with default settings. In the experiments, the number of augmentations per image is N = 2.
As shown in Table III, the performance of applying only image resizing in DataAug-DETR is even worse than the baseline and our proposed default DataAug-DETR setting. The model converges more slowly and reaches worse performance than the baseline. This experiment shows that spatial transformations such as flipping and cropping are crucial to the effectiveness of DataAug-DETR. It also justifies the flipping and cropping feature augmentations in FeatAug-DETR.

D. Convergence Analysis of DataAug-DETR
In the following analysis experiments, unless otherwise specified, we test our proposed method and evaluate its different designs on top of Deform-DETR w/ tck. with the ResNet-50 backbone and treat it as our experiment baseline. When keeping the same batch size of 16 as the baseline, DataAug-DETR can be viewed as changing the order of the training data compared with the ordinary training pipeline. Here we investigate the convergence speed of DataAug-DETR with AP-epoch curves.
As shown in Figure 3 (left) and Table IV, applying DataAug-DETR improves the final converged performance.The baseline converges with 36 epochs and achieves 49.0 AP.Further training hurts the model as the performance drops when trained for 48 epochs.After applying DataAug-DETR with augmentation times N = 2, its performance at 36 epochs is 0.8 AP higher than the baseline.The performance continues to improve with a 48-epoch training scheme.It is also observed that DataAug-DETR slightly slows down convergence in early epochs.
We also investigate the augmentation number N of DataAug-DETR. Figure 3 (right) shows that convergence is slower with a larger augmentation number, while the final performance saturates for N ≥ 2. Thus, we adopt N = 2 as the default setting unless otherwise specified.

E. Comparison of Different Feature Augmentation Operators
We compare the performance of the proposed FeatAug-Flip, FeatAug-Crop, and FeatAug-FC on the Deform-DETR w/ tck. baseline. The results are listed in Table V.
As shown in the table, all FeatAug-DETR variants outperform the baseline. The results show that FeatAug-Flip converges faster than FeatAug-Crop, while their converged performance is similar. FeatAug-FC provides a further performance improvement over FeatAug-Flip and FeatAug-Crop, while it also requires extra training time per epoch. Thus, FeatAug-Flip is suitable for models with relatively small backbones (e.g., R50 and Swin-Tiny), while FeatAug-FC is preferred when training with large backbones (e.g., Swin-Large).

G. Ablation Studies
1) Evaluation on one-stage Deformable-DETR: We mainly evaluate our method based on Deform-DETR w/ tck., which includes tricks that boost performance. In order to show that our methods are independent of these tricks and can still improve various DETR variants, we also test our method on top of the one-stage Deformable-DETR, i.e., its original version. The results are shown in Table VII.
Our method accelerates its convergence speed.FeatAug-FC trained for 36 epochs is 1.4 AP better than the one-stage Deformable-DETR trained for 50 epochs.With the same 50 training epochs, FeatAug-FC reaches 46.3 AP, which is 1.8 AP better than the baseline.
2) Input feature of FeatAug-DETR: In our FeatAug-DETR, we augment the image features from the backbone and input the augmented features into the Transformer encoder. However, in the Deform-DETR w/ tck. baseline, the features after the Transformer encoder could also be chosen for augmentation. To analyze which feature is better as the input of FeatAug-DETR, we test both, and the results are shown in Table VIII. FeatAug-DETR on the features after the Transformer encoder is denoted as FeatAug-Encoder, and the augmentation on the features directly from the backbone is denoted as FeatAug-Default.
As shown in the table, feature augmentation on the encoder feature actually hurts the performance and is even worse than the baseline. The Transformer encoder enhances the image features with positional information; if the feature augmentation is applied after the Transformer encoder, the spatial transformation becomes inconsistent with the already-encoded positional information, which degrades performance.

3) Cropped feature projectors in FeatAug-DETR: In FeatAug-Crop and FeatAug-FC, we adopt projectors after the cropped features to mitigate the domain gap caused by RoIAlign. This domain gap is visualized in Figure 5. Here we ablate removing the individual projectors, i.e., using the same projector on both the original and cropped features. The results are shown in Table IX. Using dedicated projectors for cropped features in FeatAug-Crop performs 0.4 AP better than sharing one projector between cropped and original features. The performance gain is more obvious on small and large objects, where AP S increases by 1.4 and AP L by 1.3.
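The separate-projector design can be illustrated with a minimal sketch. This is not the paper's code: we stand in for RoIAlign with a simple nearest-neighbor crop-and-resize, and we assume the projector is a 1×1 convolution (a per-pixel linear map over channels); both choices are illustrative assumptions:

```python
import numpy as np

def crop_resize(feat, roi, out_hw):
    """Nearest-neighbor stand-in for RoIAlign: crop roi = (y0, y1, x0, x1)
    from a (C, H, W) feature map and resize the crop to out_hw."""
    y0, y1, x0, x1 = roi
    crop = feat[:, y0:y1, x0:x1]
    _, h, w = crop.shape
    oh, ow = out_hw
    ys = (np.arange(oh) * h / oh).astype(int)   # nearest source rows
    xs = (np.arange(ow) * w / ow).astype(int)   # nearest source columns
    return crop[:, ys][:, :, xs]

def project(feat, weight):
    """1x1-conv projector: the same (D, C) linear map applied at every pixel."""
    return np.einsum('dc,chw->dhw', weight, feat)

C = 8
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, 16, 16))
# Separate projectors: one for original features, one for cropped features,
# so the cropped branch can learn to compensate for the resampling blur.
w_orig = rng.standard_normal((C, C)) * 0.1
w_crop = rng.standard_normal((C, C)) * 0.1
orig_out = project(feat, w_orig)
crop_out = project(crop_resize(feat, (2, 12, 2, 12), (16, 16)), w_crop)
```

The point of the dedicated `w_crop` is that it only ever sees resampled features, so in principle it can absorb their blur, whereas a shared projector must average over both domains.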

H. Visualization of Position-Aware Object Queries
The pilot study in Section III-A shows that spatial augmentation changes the query-object assignments. We further visualize the bounding boxes that each object query predicts. Here we test a trained Deformable-DETR without our DataAug-DETR or FeatAug-DETR. For each query, we record all output bounding box predictions that are matched with ground-truth objects and visualize the {center x, center y, width, height} of these predicted boxes. We visualize four queries (#11, #27, #145, and #238) in Figure 6.
As shown in Figure 6, the predicted centers of each object query are almost always located around a fixed position, which means that a spatial transformation that changes the bounding boxes' relative positions on the feature map (such as flipping) can change the matched object queries. Furthermore, the predicted height and width distributions of Query #11 and Query #27 show that, although the two queries predict similar center positions, their predicted heights and widths are distributed differently: Query #11 always predicts large objects while Query #27 is always in charge of small objects. This shows that the cropping operation, which changes the sizes of the predicted bounding boxes, also varies the query-object matchings.
The above two observations show that the object queries assigned to predicted bounding boxes are highly position-aware. Applying flipping and cropping operations therefore makes different object queries match different ground-truth objects.
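The bookkeeping behind this visualization is simple: for every prediction matched to a ground truth, append its box to the owning query's list, then summarize each query's center and size distribution. A minimal sketch with toy data (the data structures are ours, not the released code):

```python
import numpy as np
from collections import defaultdict

# boxes_per_query[q] collects every matched prediction (cx, cy, w, h) of query q.
boxes_per_query = defaultdict(list)

def record_matches(query_ids, boxes_cxcywh):
    """Record the matched predictions of one image."""
    for q, box in zip(query_ids, boxes_cxcywh):
        boxes_per_query[q].append(box)

# Toy data: query 11 keeps landing near the same center with large sizes,
# while query 27 predicts a similar center but small sizes.
record_matches([11, 27], np.array([[0.31, 0.52, 0.60, 0.70],
                                   [0.30, 0.50, 0.05, 0.08]]))
record_matches([11], np.array([[0.29, 0.49, 0.55, 0.65]]))

# Per-query (mean, std) over (cx, cy, w, h): a position-aware query shows a
# small center std, and a size-specialized query shows a distinct mean w, h.
stats = {q: (b.mean(axis=0), b.std(axis=0))
         for q, b in ((q, np.stack(v)) for q, v in boxes_per_query.items())}
```

Scattering the resulting per-query centers and sizes (as in Figure 6) then exposes the position- and size-specialization of each query directly.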
V. CONCLUSION

In this paper, we enrich the formulation of one-to-many matching for DETRs. We summarize the one-to-many matching mechanisms of Group-DETR [9] and Hybrid Matching [10] as augmenting object queries, and propose augmenting image features to implement one-to-many matching instead. Specifically, we propose the DataAug-DETR method to help DETR-like methods achieve higher performance after convergence. Further, we propose the FeatAug-DETR method, including FeatAug-Flip, FeatAug-Crop, and FeatAug-FC. FeatAug-DETR significantly accelerates DETR training and accomplishes better performance: it improves the performance of Deformable-DETR [1] with different backbones by around 1.0 AP and shortens the training schedule to only 1× or 2× the standard schedule. When applying FeatAug-DETR together with Hybrid Matching [10] on a Swin-Large backbone, we achieve the current state-of-the-art performance of 58.3 AP.

Fig. 2. DataAug-DETR augments images several times and includes multiple augmented versions in the same batch. FeatAug-DETR augments feature maps from the vision backbone multiple times and includes them in the same batch. Our FeatAug-DETR can augment both single-scale and multi-scale feature maps. Only single-scale feature maps are shown here for simplicity.
DataAug-DETR (48 epochs of training) improves the performance by about 1.0 AP compared with the Deform-DETR w/ tck. (36 epochs of training) baseline, while the performance of the baseline trained for 48 epochs degrades. In each epoch of DataAug-DETR, we only train on 1/N of the whole training set, where N is the number of augmentations in DataAug-DETR. Thus, the training time per epoch of DataAug-DETR is the same as that of the baseline. A detailed investigation of the convergence speed of DataAug-DETR is given in Section IV-D.

Fig. 3. Left: Training AP curves of the Deform-DETR w/ tck. baseline and DataAug-DETR with N = 2. Right: Training AP curves of DataAug-DETR with N = 2 and N = 3. Augmenting each image more than two times results in slower convergence and does not benefit the final performance.

Fig. 5. Visualization of one randomly selected channel of the vision backbone's output feature, illustrating the feature domain gap caused by RoIAlign; we adopt projectors after the cropped features to mitigate this gap. Left: original input image. Middle: original feature from the vision backbone. Right: feature after the RoIAlign operation. The augmented feature becomes blurry compared with the original one, and such a domain gap causes detection performance degradation. Here we adopt the 1/8 input-resolution feature map from the R50 backbone trained with Deform-DETR w/ tck.

Fig. 6. We visualize the predicted bounding box position distributions of 4 randomly selected object queries on 10,000 randomly selected images from the COCO train2017 set. The top 4 plots show the predicted center x and center y of the corresponding bounding boxes, while the bottom 4 plots show the predicted heights and widths of the queries.

TABLE I
MAIN RESULTS OF PROPOSED DATA AUGMENTATION AND FEATURE AUGMENTATION ON VARIOUS DETR FRAMEWORKS

TABLE III
THE PERFORMANCE OF DataAug-DETR WITH NON-SPATIAL AUGMENTATIONS

Left: Training AP curves w.r.t. training epochs of the Deform-DETR w/ tck. baseline and that integrated with our FeatAug-DETR. Right: Training AP curves w.r.t. GPU×hours of the Deform-DETR w/ tck. baseline and that integrated with our FeatAug-DETR.

TABLE V
RESULTS OF DIFFERENT FeatAug-DETR METHODS ON DEFORM-DETR W/ TCK.

TABLE IX
ABLATION ON CROPPED FEATURE PROJECTOR IN FeatAug-Crop ON DEFORM-DETR W/ TCK.

The training time per epoch with different backbones is compared in Table VI, which reports the training time of the Deform-DETR w/ tck. baseline and that integrated with FeatAug-Flip; the training time of FeatAug-Crop is similar to that of FeatAug-Flip. Our training process uses 16 NVIDIA V100 GPUs with a batch size of 16. As shown in Table VI, the increase in computation time is invariant to the size of the backbone.