Military Vehicle Object Detection Based on Hierarchical Feature Representation and Refined Localization

Military vehicle object detection in complex environments is the basis for reconnaissance and tracking of weapons and equipment, and is of great significance for informationized and intelligent combat. In response to the poor performance of traditional detection algorithms on military vehicle detection, we propose a military vehicle detection method based on hierarchical feature representation and reinforcement learning-based refinement localization, referred to as MVODM. First, for the military vehicle detection task, we construct a reliable dataset, MVD. Second, we design two strategies to improve the detector: hierarchical feature representation and reinforcement learning-based refinement localization. The hierarchical feature representation strategy helps the detector select the feature layer best suited to the scale of each object, and the reinforcement learning-based refinement localization strategy improves the accuracy of the predicted bounding boxes. Combining these two strategies effectively improves the detector's performance. Finally, experimental results on the self-built dataset show that our proposed MVODM achieves excellent detection performance and can better accomplish the military vehicle detection task.

INDEX TERMS Military vehicle objects, object detection, reinforcement learning, hierarchical feature representation.

NOMENCLATURE
The final concise description of instance i at time step t.
s_t  The state representation of the agent at time step t.
a_t  The action performed by the agent at time step t.
r_1(s_t, s_{t+1})  The reward the agent receives for transferring from state s_t to state s_{t+1} by selecting action a_t.
r_2(T)  The sequence reward the agent receives at the end of the action sequence.

With the development and progress of technology, modern warfare is gradually moving into the era of informationization and intelligence. In future information-based warfare, efficient battlefield situational awareness capability is undoubtedly the key to guaranteeing victory, and the major military powers are actively strengthening research on the related technologies [1]. Battlefield situational awareness includes several tasks such as reconnaissance, surveillance, intelligence, damage assessment, and beacon

(3) We design a refinement localization strategy based on reinforcement learning, which can effectively improve the localization accuracy of the predicted bounding boxes and enhance the detector's performance;

(4) The experimental results on the self-built dataset show that our proposed MVODM has excellent detection performance and is able to better perform the detection task of military vehicles.

The remainder of this paper is organized as follows: we briefly review related work on military target detection in Section 2. Section 3 presents the military vehicle dataset we constructed. In Section 4, we present our MVODM in detail. In Section 5, we conduct extensive experiments and analyze and discuss the experimental results. Finally, Section 6 summarizes the full work.

The task of object detection is to automatically identify and locate the objects to be detected in an image or video. It has long been a hot research topic in the field of computer vision. Traditional object detection methods mainly use hand-designed features to train classifiers, including HOG [9], Haar [10], CSS [11], and LBP [12].

However, these datasets also generally have some problems: (1) firstly, scholars' self-built datasets are seldom publicly available and cannot be used directly; (2) secondly, these datasets usually do not consider the scale of military targets and are not well suited to realistic military target detection tasks; (3) finally, these datasets cover a wide range of military targets that usually differ from the military vehicle detection task of this paper. Therefore, we constructed a new military vehicle dataset (MVD) to better carry out the research work in this paper.

We randomly selected 70% of the images from the different types of military vehicle targets as training set samples and the remaining 30% as test set samples.

High-level feature maps respond strongly to large scale objects, while small scale objects respond strongly in low-level feature maps. Therefore, to address the multi-scale problem in our military vehicle detection task, we use a hierarchical feature representation strategy to generate different levels of feature representations of the objects to be detected and select the best representation through a subsequent reinforcement learning strategy, thereby improving detection performance. Specifically, as shown in the green box in Figure 6, we selected C3, C4, and C5 from ResNet50 to serve as the workspace for the layered feature representation. We perform a 1 × 1 convolution on the output of these three feature layers and feed the results into the RPN and ROI Align to obtain the feature vectors of the military vehicle target proposals at each layer; these are then fed into a fully connected layer that generates the initial military vehicle bounding box predictions, including softmax classification and bounding box regression.

To more accurately localize military vehicle targets in images, we introduce reinforcement learning to further refine the target bounding boxes. Inspired by previous work [34], we use a recurrent neural network-based framework to design our refinement localization strategy. Figure 7 illustrates part of the recurrent process of our refinement localization strategy.
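Before turning to the recurrent refinement process of Figure 7, the layered feature extraction just described can be made concrete with a minimal PyTorch sketch. This is only an illustration under stated assumptions: the module and variable names are ours, the RPN is omitted (proposal boxes are taken as an input), and the 1 × 1 convolutions are assumed to preserve the channel counts of C3, C4, and C5.

```python
import torch
from torch import nn
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import roi_align


class HierarchicalFeatures(nn.Module):
    """Sketch of a hierarchical feature representation: C3/C4/C5 of ResNet-50,
    a 1x1 convolution per layer, and 2x2 ROI Align per proposal."""

    def __init__(self):
        super().__init__()
        # layer2/layer3/layer4 of torchvision's ResNet-50 play the role of C3/C4/C5
        # (weights omitted here; pretrained weights would normally be loaded).
        self.backbone = create_feature_extractor(
            resnet50(weights=None),
            return_nodes={"layer2": "C3", "layer3": "C4", "layer4": "C5"},
        )
        channels = {"C3": 512, "C4": 1024, "C5": 2048}
        # 1x1 convolutions on each selected layer (assumed channel-preserving).
        self.conv1x1 = nn.ModuleDict(
            {name: nn.Conv2d(c, c, kernel_size=1) for name, c in channels.items()}
        )
        self.strides = {"C3": 8, "C4": 16, "C5": 32}  # stride w.r.t. the input image

    def forward(self, images, proposals):
        # images: (N, 3, H, W); proposals: list of N tensors of shape (K_i, 4),
        # boxes in (x1, y1, x2, y2) image coordinates, e.g. produced by an RPN.
        feats = self.backbone(images)
        pooled = {}
        for name, fmap in feats.items():
            fmap = self.conv1x1[name](fmap)
            roi = roi_align(
                fmap, proposals, output_size=(2, 2),
                spatial_scale=1.0 / self.strides[name], aligned=True,
            )
            pooled[name] = roi.flatten(start_dim=1)  # (sum K_i, 4 * channels[name])
        return pooled  # one candidate feature vector per proposal and per layer


if __name__ == "__main__":
    model = HierarchicalFeatures().eval()
    image = torch.randn(1, 3, 512, 512)
    boxes = [torch.tensor([[32.0, 48.0, 180.0, 200.0]])]  # one dummy proposal
    with torch.no_grad():
        out = model(image, boxes)
    print({k: tuple(v.shape) for k, v in out.items()})
    # e.g. {'C3': (1, 2048), 'C4': (1, 4096), 'C5': (1, 8192)}
```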

274
As shown in Figure 7, at each time step t, B_{t-1} represents the bounding box information from the previous time step (when t = 1, B_0 is the original bounding box information), and θ_t(i) represents the feature representation vector of military vehicle instance i at time step t. It should be noted that the size of θ_t(i) varies with the selected feature layer because we use a hierarchical feature representation strategy. In this paper, the numbers of output feature map channels for C3, C4, and C5 are 512, 1024, and 2048, respectively. By applying ROI Align (divided into 2 × 2 bins) to the region of interest,

At each step, the agent receives a reward r_t ∈ R from the environment as feedback. It is important to note that the agent receives a reward for each decision (i.e., each selected action) only during training; at test time, we follow the trained policy and the agent does not receive rewards.
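The channel counts just quoted determine the length of θ_t(i) for each candidate layer. As a quick check (the resulting dimensions are our own arithmetic, not figures quoted from the paper), the 2 × 2 ROI Align bins times the channel count give:

```python
# Dimension of the per-proposal feature vector for each candidate layer:
# 2 x 2 ROI Align bins times the number of channels of C3 / C4 / C5.
roi_bins = 2 * 2
channels = {"C3": 512, "C4": 1024, "C5": 2048}
theta_dims = {layer: roi_bins * c for layer, c in channels.items()}
print(theta_dims)  # {'C3': 2048, 'C4': 4096, 'C5': 8192}
```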

(1) State: As previously described, the state s_t at time step t is a 64-dimensional vector containing the historical state information, the current bounding box information, and the feature representation of the selected layer. s_t is calculated by Equation (2).

(2) Action: As shown in Figure 8, we designed a total of 10 different types of actions in the action set A: 8 coordinate transformation actions and 2 trigger actions. Considering all possible situations during the coordinate transformation, the coordinate transformation actions are subdivided into four types: left-right movement, up-down movement, aspect ratio change, and zoom in/out. The trigger actions are mainly used to select the best feature representation layer at the beginning of the action sequence and to terminate the coordinate transformation at the end of the action sequence. In this paper, we define the bounding box information

The reward r_1(s_t, s_{t+1}) that the agent receives for transferring from state s_t to state s_{t+1} by selecting action a_t is:

At the end of the action sequence, we design an additional sequence reward r_2(T) to evaluate the sequence:

In this paper, we use a total of four evaluation metrics to evaluate the performance of our proposed MVODM: precision (P), recall (R), average precision (AP), and mean average precision (mAP). Let P_T be the number of correctly predicted positive samples, P_F the number of incorrectly predicted positive samples, N_F the number of incorrectly predicted negative samples, and N_T the number of correctly predicted negative samples. The precision is then calculated as:

P = P_T / (P_T + P_F)

Similarly, the recall can be calculated as:

R = P_T / (P_T + N_F)

Average precision is an evaluation metric that combines precision and recall; it reflects the performance of the detection model on each target class, and its value is obtained by calculating the area under the precision-recall (PR) curve. Mean average precision is the mean of the average precision over all target classes and reflects the performance of the detection model over the entire dataset.

In this experiment, we evaluate performance based on the recall at different IoU thresholds, tested on three subsets; the experimental results are shown in Figure 9. From Figure 9, it is clear that, firstly, C1 and C2 perform less well on all subsets, mainly because the shallower feature layers cannot extract enough feature information and their feature representation is weak. Comparatively, C3 is a good starting point for feature representation. Secondly, we observe that C4 performs best on the large-scale subset compared with the other individual layers, while C3 shows better performance on the small-scale subset, which indicates that the higher feature layer (C4) is better activated by large-scale objects, while the lower feature layer (C3) is more suitable for representing small-scale objects. Finally, our hierarchical representation strategy achieves the best performance on every subset. At an IoU threshold of 0.5, it achieves recalls of 96.7% (L), 67.5% (S), and 84.2% (A), improving on the previous best single-layer results by 2.5% (L), 7.3% (S), and 6.3% (A), respectively.

First, we show a partial example of action sequences from our proposed reinforcement learning-based refinement localization process in Figure 11.
As can be seen, only a limited number of action transformations are required to obtain a more appropriate bounding box. Such a bounding box is a further directed (reward-driven) refinement of the original proposal box, which can improve detection performance to some extent. According to the experimental statistics, about 74% of the objects need fewer than 10 actions to complete the coordinate transformation from start to finish, which shows that our strategy is efficient.
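To make the action space concrete, the following minimal Python sketch enumerates the ten actions described above and applies one coordinate transformation to a bounding box. It is an illustration under assumptions of our own: the step size (a fixed fraction of the current box width/height), the function names, and the (x1, y1, x2, y2) box format are not taken from the paper, and the trigger actions are stubbed out; the reward terms r_1 and r_2 are not reproduced here since their exact definitions are not given in this excerpt.

```python
from enum import Enum, auto

STEP = 0.1  # assumed step size: fraction of the current box width/height


class Action(Enum):
    # 8 coordinate transformation actions (left-right, up-down, aspect ratio, zoom)
    MOVE_LEFT = auto()
    MOVE_RIGHT = auto()
    MOVE_UP = auto()
    MOVE_DOWN = auto()
    WIDER = auto()      # aspect ratio change
    TALLER = auto()     # aspect ratio change
    ZOOM_IN = auto()
    ZOOM_OUT = auto()
    # 2 trigger actions
    SELECT_LAYER = auto()  # choose the feature representation layer at the start
    TERMINATE = auto()     # stop the coordinate transformation at the end


def apply_action(box, action):
    """Apply one coordinate transformation to box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    dx, dy = STEP * w, STEP * h
    if action is Action.MOVE_LEFT:
        return (x1 - dx, y1, x2 - dx, y2)
    if action is Action.MOVE_RIGHT:
        return (x1 + dx, y1, x2 + dx, y2)
    if action is Action.MOVE_UP:
        return (x1, y1 - dy, x2, y2 - dy)
    if action is Action.MOVE_DOWN:
        return (x1, y1 + dy, x2, y2 + dy)
    if action is Action.WIDER:
        return (x1 - dx / 2, y1, x2 + dx / 2, y2)
    if action is Action.TALLER:
        return (x1, y1 - dy / 2, x2, y2 + dy / 2)
    if action is Action.ZOOM_IN:
        return (x1 + dx / 2, y1 + dy / 2, x2 - dx / 2, y2 - dy / 2)
    if action is Action.ZOOM_OUT:
        return (x1 - dx / 2, y1 - dy / 2, x2 + dx / 2, y2 + dy / 2)
    return box  # SELECT_LAYER and TERMINATE do not move the box


# Example: two refinement steps on an initial proposal box
box = (100.0, 80.0, 220.0, 180.0)
for a in (Action.MOVE_RIGHT, Action.ZOOM_OUT):
    box = apply_action(box, a)
print(box)  # (106.0, 75.0, 238.0, 185.0)
```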

We show a visual comparison of the detection results of our proposed MVODM and Faster R-CNN in Figure 12. The first and second rows show successful detection cases, which include single and multiple targets, large and small scales, and different battlefield environments. From these two rows, we can see that our MVODM has excellent detection performance and is able to locate and detect the targets better. While