Abstract:
Recently, most transformer-based approaches have achieved considerable success on vision tasks, even better than those with convolution neural networks (CNNs). In this pa...Show MoreMetadata
Abstract:
Recently, most transformer-based approaches have achieved considerable success on vision tasks, even better than those with convolution neural networks (CNNs). In this paper, we present a novel transformer-based model, named detecting text with transformers (DTTR), for scene text detection. In DTTR, a CNN backbone extracts local connectivity features and a transformer decoder captures global context information from a scene text, effectively. In addition, we propose a dynamic scale fusion (DSF) module that can fuse multiscale feature maps dynamically, thus significantly improving the scale robustness and rendering powerful representations for subsequent decoding. Experimental results show that DTTR achieves 0.5% H-mean improvements and 20.0% faster in inference speed than the SOTA model with a backbone of ResNet-50 on MMOCR. Code will be released at: https://github.com/ahsdx/DTTR.
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 04-10 June 2023
Date Added to IEEE Xplore: 05 May 2023
ISBN Information: