Pelee-Text++: A Tiny Neural Network for Scene Text Detection

Scene text detection has become an important field in computer vision due to the increasing number of applications. This is a very challenging problem, as textual elements are commonly found in “noisy” and complex natural scenes. Another issue is the presence of texts in different languages within the same image. State-of-the-art solutions rely on deep neural networks, or even ensembles of them. However, such solutions are associated with “heavy” models that are computationally expensive in terms of memory and storage footprints, which hampers their use in real-time mobile applications. In this work, we introduce Pelee-Text++, a lightweight neural network architecture for multi-lingual multi-oriented scene text detection, especially tailored to run on devices with computational restrictions. Additionally, to the best of our knowledge, this is the first work to evaluate the performance of text detection methods on commercial smartphones. In this scenario, Pelee-Text++ processes 2.94 frames per second and is the only evaluated approach that did not cause memory issues on smartphones, even with an input image of $1024\times 1024$ pixels. Our proposal achieves a promising trade-off between efficiency and effectiveness, with a model size of 27 Megabytes and F-measures of 91.20%, 85.78%, 81.72%, 80.30%, 82.53%, and 66.51% on the ICDAR 2011, ICDAR 2013, ICDAR 2015, MSRA-TD500, ReCTS 2019, and Multi-lingual 2019 datasets, respectively.


I. INTRODUCTION
Nowadays, the detection and recognition of scene text have become important topics in machine learning and computer vision due to the daily use of digital cameras and the huge number of applications related to this field, such as mobile and context-aware services [60], traffic sign detection [10], [11], image retrieval [58], blind person assistance [61], and text translation [53]. In fact, new applications are still emerging, for example, those related to the interpretation of scene textual content (Figure 1). However, both scene text detection and recognition are more challenging than traditional document processing given the presence of different types of text and of complex natural backgrounds, font styles and sizes, blurring, orientations, occlusion, aspect ratios, and perspective projections, among others. Regardless of the application, the text detection task plays an important role in the final recognition result, since this task precedes the recognition step. Thus, the development of a good text detector is paramount to reach efficient and effective scene text recognition systems.
Addressing text detection is difficult since text images have different visual properties depending on their source, for example, born-digital content (e.g., e-mails, advertisements, Web images) or incidental/focused scene text (scene text images taken from wearable cameras or urban captures). Additionally, text can appear with arbitrary orientations and perspective distortion, and be associated with even more challenging scenarios, for example, those related to the presence of different languages in the same scene. In this regard, several public datasets have been built to foster the creation of solutions to these problems, such as SynthText [12], ICDAR 2011 [45], ICDAR 2013 [24], ICDAR 2015 [25], ReCTS 2019 [64], Multi-lingual 2019 [40], and MSRA-TD500 [59].
Given the complexity of the text detection problem, sophisticated supervised approaches have been used in state-of-the-art solutions. Moreover, the detection quality directly impacts the recognition result, i.e., whether a predicted bounding box covers an entire word or just part of it affects the effectiveness of recognition algorithms.
Recently, the edge computing concept has empowered the next generation of machine learning applications [26]. Conversely to cloud computing, edge computing is revolutionizing the way embedded systems are architected by moving complex processing and analysis to end devices (e.g., mobile and wearable devices). Cloud services brought several advantages for machine learning, such as fast computation and almost unlimited storage; however, their throughput and response time are not enough to ensure their use in real-time applications, which are also impacted by latency fluctuations in wide-area networks. Furthermore, one of the biggest concerns in mobile devices is energy consumption, and transferring data over the network implies higher energy usage [5], [26].
The use of tiny neural network models, as part of fully deployed mobile applications, has some advantages for real-time applications [13], [26]: (a) efficiency, in terms of processing time, is one of the most important considerations for real-time mobile applications; (b) local processing avoids bottlenecks on cloud services and fluctuations in network latency, which can have a big impact on the performance of real-time applications and can even increase privacy leaks; (c) energy consumption is also affected when large deep learning models are used.
Regarding text detection, few works use lightweight mobile-oriented neural networks [6], [7], [9]. Some of the approaches based on tiny neural network models [7], [9] are limited to the detection of horizontal text. In this vein, our previous work, Pelee-Text [6], introduced a light architecture for multi-oriented scene text detection, reaching competitive results on several datasets with a model size of 40 Megabytes.
Herein, based on Pelee-Text [6], we propose Pelee-Text++, the result of a comprehensive study of Pelee-Text. We evaluated the impact of each of its main components on efficiency and effectiveness: (i) the different convolutional blocks, both in behavior and in number, (ii) the influence of aspect ratios, and (iii) the impact of different scales of input images. As a result, Pelee-Text++ is an even more compact and simpler neural network architecture for multi-lingual multi-oriented scene text detection, suitable for running on devices with computational constraints.
Pelee-Text [6] and Pelee-Text++ are compact neural networks for text detection that reach a good trade-off between efficiency and effectiveness, becoming promising approaches for mobile applications. Both use the same kinds of convolutional blocks: (i) stem block, (ii) transitional blocks, and (iii) dense blocks. However, the main difference between them is the structure of their backbones. Pelee-Text has a total of 27 blocks (1 stem block, 5 transitional blocks, and 19 dense blocks), while Pelee-Text++ has been compressed to 14 blocks (1 stem block, 5 transitional blocks, and 8 dense blocks). Additionally, Pelee-Text++ uses 1, 2, 3, 1/2, and 1/3 as aspect ratios, cutting off the 5 and 1/5 aspect ratios used by Pelee-Text. The new backbone of Pelee-Text++, along with the use of fewer aspect ratios, greatly improves efficiency and further reduces the model size.
Regarding efficiency and effectiveness, in most scenarios Pelee-Text++ outperforms the results of our previous work. Moreover, in a real mobile scenario and considering the best setup (input image of 300 × 300 pixels), Pelee-Text++ is 1.46× faster than Pelee-Text and, in the worst case (input image of 1024 × 1024 pixels), Pelee-Text++ takes just 41.50% of the processing time of Pelee-Text. Finally, in terms of model size, whereas Pelee-Text weighs 40MB, Pelee-Text++ is only 27MB. Our new proposal was evaluated on six publicly available datasets: ICDAR 2011 [45], ICDAR 2013 [24], ICDAR 2015 [25], MSRA-TD500 [59], ReCTS 2019 [64], and Multi-lingual 2019 [40], obtaining competitive results against state-of-the-art methods. Experimentally, our proposal demonstrated its ability to work over scenarios with different particularities. Pelee-Text++ is at least 2.96 times smaller than state-of-the-art methods, with a processing time of 23.25, 15.06, and 3.65 FPS for its 768, 1024, and multi-scale versions, respectively.
Furthermore, to the best of our knowledge, this is the first study that evaluates the efficiency of several text detection methods in a real mobile scenario. For this, we assessed their performance on smartphones using four different scales of input images. These experiments showed the drawbacks of state-of-the-art methods when a real mobile environment is used. On mobile devices, our proposal is capable of processing 2.94 FPS, being at least 5.5 times faster than CRAFT [1], which is one of the best methods on several datasets and has a model size of 80 Megabytes.
Our contributions can be summarized as:
1) Design of a lightweight neural network architecture to detect multi-lingual multi-oriented scene texts.
2) Evaluation of text detection methods on devices with computational constraints, i.e., performance of well-known text detection methods on commercial smartphones.
The remainder of the paper is organized as follows. In Section II, we provide an overview of related work. Our method, Pelee-Text++, is presented in Section III. Section IV details the adopted experimental protocol. Section V presents and discusses achieved results. Finally, Section VI presents the conclusions and future work.

II. RELATED WORK

A. SCENE TEXT DETECTION
During the last years, Convolutional Neural Networks (CNN) have become promising approaches to deal with several challenging scenarios in scene text localization, such as multi-scale detection [14], oriented text detection [29], text detection in complex backgrounds [12], and arbitrary-shaped text [36].
On the other hand, text instance segmentation approaches have relied on pixel-wise classification to define neighborhood linkages [8] or salient maps [67], and have even proposed a progressive scale expansion algorithm [50] to improve the separation between nearby text instances.
Recently, some approaches have used several branches with the goal of taking advantage of the fusion of bounding box regression and text instance segmentation techniques [32], [35], [39], [52], [63]. Most of these approaches are inspired by a well-known object detector, Mask-RCNN [16]. In general, these methods [18], [31] use multi-scale feature extraction based on Feature Pyramid Networks (FPN) [30], or even a 3D pyramid mask for a better characterization of text instances [32]. Next, these features are fed to several branches, each of which handles a specific task, such as bounding box regression, text instance segmentation, or refinement.
Other works have adopted different text detection strategies. CRAFT [1] and WeText [48], for example, work with character-level annotations. CRAFT [1] predicts a character region score using a Gaussian heatmap and an affinity score of neighboring characters, while WeText [48] proposed a weakly-supervised approach to text detection that works over non-annotated or weakly annotated data, using a graph-based method to define the final results.
For its part, DRRG [65] dealt with text detection using a graph convolutional network. First, local graphs are created based on the linkage between text components. Then, deep relational linking is performed based on the previously discovered local graphs, and a Breadth-First Search is applied to join the linkages for the final prediction. Attention maps have also been used in this area. For instance, GISCA [4] improved text characterization with a Contextual Attention Module (CAM) and a Gradient-Inductive Module (GIM).
Unlike previous works, some studies have fused the detection and recognition tasks. In this type of approach, the recognition branch feeds the detection branch to filter false positives in an end-to-end scheme [27], [34], [38]. Additionally, some approaches have used data augmentation techniques to improve the results and the generalization of text detection methods. In this context, generative models using cross-domain shifts [62] and sampling of sub-regions of text segments through bootstrapping [56] were proposed.
Current solutions have addressed text detection challenges by using deep architectures, such as VGG [46] and ResNet [15], or even ensembles of them, producing models whose size ranges from 80 Megabytes [1] to more than 350 Megabytes [38]. Such solutions are computationally expensive, which makes them unfeasible, in practice, for devices with constraints on memory, computational power, bandwidth, and energy [13]. In this regard, some ''mobile'' CNN architectures have already been proposed [19], [22], [44], [49], [66], i.e., lightweight convolutional neural network architectures specifically designed for mobile devices.
Based on MobileNetV2 [44], MobText [7] is the state of the art on ICDAR 2011 [45], with a model size of 37 Megabytes. Based on the same mobile architecture, Fu et al. [9] proposed a neural network of just 16 Megabytes inspired by the U-Net approach [43]. Nevertheless, these two methods were evaluated only on datasets with horizontal text, since they are based on the typical rectangular bounding box representation, which is not able to capture oriented text.
In the same vein, Xue et al. [37] proposed OctShuffle, which uses a combination of ResNet blocks [15] and Shuffle units [66] to produce a model of 88.79 Megabytes. However, it has problems detecting oriented text. Finally, Pelee-Text [6], which uses PeleeNet [49] as feature extractor, appeared as a mobile-oriented architecture for multi-lingual scene text detection with a model size of 40 Megabytes and promising results on several datasets. Pelee-Text provides a promising trade-off between effectiveness and model size.

B. LIGHTWEIGHT NEURAL NETWORKS
In several computer vision tasks, state-of-the-art approaches have used deep convolutional neural networks to improve results. However, these approaches tend to go deeper, increasing the number of parameters without concern for the number of operations or the final model size. For this reason, it is difficult to use these approaches on devices with computational constraints. To overcome this problem, some works have focused their efforts on the proposal of lightweight neural networks [22], [42], [44], [66].
MobileNets [19], [44] were proposed as neural networks that are efficient in terms of computation time and model size for mobile applications. These approaches are based on depthwise separable convolutions, where a standard convolution is factorized into a depthwise convolution followed by a pointwise (1 × 1) convolution, reducing computation time and the number of parameters without greatly affecting performance. Similarly, ShuffleNet [66] applies a grouping approach to divide channels through point-wise group convolutions. In addition, this approach uses cross-group information flow.
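The factorization mentioned above can be illustrated with a minimal PyTorch sketch (not MobileNet's exact block; the channel widths, normalization, and activation placement here are illustrative assumptions):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A 3x3 convolution factorized into a depthwise and a pointwise step."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        # Pointwise: 1x1 convolution mixes channels and sets the output width.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# A 32->64 channel block costs far fewer multiply-adds than a dense 3x3
# convolution with the same channel widths, which is the source of the savings.
x = torch.randn(1, 32, 300, 300)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 300, 300])
```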
SqueezeNet [23] is another approach focused on reducing the number of parameters with the goal of compressing the final model. SqueezeNet replaces most of the 3 × 3 filters with 1 × 1 filters and reduces the number of input channels to the remaining 3 × 3 filters. In the same vein, Yolo-Lite [22] is a real-time object detector for non-GPU computers, which is a simplified version of the network proposed by Yi and Tian [41]. This very compact network uses just 7 convolutional layers and does not use batch normalization.
Based on neural architecture search (NAS), NASNet [69] was proposed as a compact neural network model. Instead of searching for a complete architecture, the authors reduced the problem to finding the best convolutional layer architecture; the final neural network is then built with exactly this type of convolutional layer.
Taking advantage of the properties of well-known networks [15], [20], [47], PeleeNet was proposed as a feature extractor. This method uses a stem block to improve the feature space of the input and two-way dense layers with different filter sizes with the goal of extracting relevant features for large objects, and its bottleneck layer adapts the input dimension dynamically. Furthermore, the authors built Pelee for object detection using PeleeNet along with an optimized version of SSD [33].

III. PROPOSED METHOD: PELEE-Text++
This section presents our proposal, a lightweight convolutional neural network architecture for multi-oriented multilingual scene text detection. More specifically, our goal is to introduce a competitive and efficient approach more appropriate for devices with computational constraints.

A. OVERVIEW
This study presents Pelee-Text++, which reflects our effort towards the design of a tiny neural network based on recent mobile-oriented architectures originally proposed for object detection. More precisely, the solution introduced in this section is an extension of our previous work, named Pelee-Text [6]. Our architecture is based on the PeleeNet [49] and TextBoxes++ [29] networks. By taking advantage of their best particularities, we propose a faster and lighter architecture for detecting scene text, a more viable solution for constrained processing devices such as smartphones and tablets.
PeleeNet [49] is a neural network architecture proposed for mobile devices. It is a variant of DenseNet [20], and its main goal is to work under strict memory and computational constraints. Furthermore, PeleeNet, along with an optimized version of the Single Shot MultiBox Detector (SSD) [33], was proposed for object detection. Unlike the original SSD, it does not use the 38 × 38 feature map; instead, two distinct scales of prior boxes are applied over the 19 × 19 feature map.
Regarding bounding box regression approaches specific to text detection, TextBoxes++ [29] was proposed as an end-to-end convolutional neural network to detect arbitrarily-oriented word bounding boxes. To predict the image regions that contain text, the authors used VGG-16 as feature extractor and a modified SSD, adapting some layers in order to detect text with longer aspect ratios. At the end, a non-maximum suppression (NMS) procedure is applied to filter the final outputs.
As a result of combining the best of both architectures, we propose a tiny neural network specifically designed for scene text detection. In natural scenes, multiple challenging scenarios emerge, such as different font styles, blurring, orientations, and image projections, among others. With text appearing at different orientations and under particular projections, the typical rectangular bounding boxes are not enough, making the use of quadrilaterals necessary.
For that reason, our network predicts quadrilaterals with points $P_n = (x_n, y_n)$ in clockwise order $(P_1, P_2, P_3, P_4)$, with $P_1$ being the top-left point. Each predicted quadrilateral is classified as containing text or background. Moreover, to cover most text regions, we densify the prior boxes using vertical offsets. Furthermore, we use a simplified version of SSD [33] with different scales of bounding boxes over the 19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1 feature maps.
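To make the quadrilateral representation and the vertical densification concrete, the following Python sketch shows one plausible way to enumerate densified priors and express them as clockwise quadrilaterals; the helper names, the number of offsets, and the normalized-coordinate convention are illustrative assumptions, not the paper's implementation:

```python
def densify_vertical(prior_box, steps=2):
    """Replicate a prior box at fractional vertical offsets so stacked,
    crowded text lines still have a nearby starting box.
    prior_box is (cx, cy, w, h) in normalized image coordinates."""
    cx, cy, w, h = prior_box
    boxes = []
    for k in range(steps + 1):
        offset = (k / (steps + 1)) * h  # shift the center downwards inside the cell
        boxes.append((cx, min(cy + offset, 1.0), w, h))
    return boxes

def box_to_quad(box):
    """Turn an axis-aligned prior (cx, cy, w, h) into the clockwise
    quadrilateral (P1..P4, P1 = top-left) used as the regression start point."""
    cx, cy, w, h = box
    x1, y1 = cx - w / 2, cy - h / 2
    return [(x1, y1), (x1 + w, y1), (x1 + w, y1 + h), (x1, y1 + h)]

priors = densify_vertical((0.5, 0.3, 0.2, 0.05), steps=2)
print(len(priors), box_to_quad(priors[0]))
```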

B. ARCHITECTURE
An overview of our architecture is presented in Figure 2. Our feature extractor is composed of five stages. In Stage 0, we improve the characterization by increasing the number of channels of the input image. Then, Stages 1 to 4 are based on two blocks of Two-Way Dense Layers with 3 × 3 convolutions, whose goal is to look for useful features to describe text regions. Unlike DenseNet [20], and inspired by PeleeNet, our network manages the channel expansion with two convolutional paths, each one working with half of the channels. Finally, a transitional convolutional block keeps the discriminability of the features without impacting the number of channels between stages.
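A minimal PyTorch sketch of the two-way dense layer idea follows; the bottleneck width, growth rate, and the exact depth of each branch are illustrative assumptions rather than the published configuration:

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k, pad=0):
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=pad, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TwoWayDenseLayer(nn.Module):
    """Two convolutional paths, each producing half of the growth channels,
    concatenated with the input (DenseNet-style) so later stages keep access
    to earlier features while the channel expansion stays cheap."""
    def __init__(self, in_channels, growth_rate=32, bottleneck=64):
        super().__init__()
        half = growth_rate // 2
        # Branch A: 1x1 bottleneck followed by a single 3x3 (small receptive field).
        self.branch_a = nn.Sequential(conv_bn_relu(in_channels, bottleneck, 1),
                                      conv_bn_relu(bottleneck, half, 3, pad=1))
        # Branch B: 1x1 bottleneck followed by two 3x3 (larger receptive field).
        self.branch_b = nn.Sequential(conv_bn_relu(in_channels, bottleneck, 1),
                                      conv_bn_relu(bottleneck, half, 3, pad=1),
                                      conv_bn_relu(half, half, 3, pad=1))

    def forward(self, x):
        return torch.cat([x, self.branch_a(x), self.branch_b(x)], dim=1)

x = torch.randn(1, 128, 19, 19)
print(TwoWayDenseLayer(128)(x).shape)  # torch.Size([1, 160, 19, 19])
```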
Furthermore, extra convolutional blocks from SSD [33], corresponding to the 5 × 5, 3 × 3, and 1 × 1 feature maps, are added after Stage 4. As in Pelee-Text [6], our network has six text-specific layers designed to locate and define the final bounding boxes. These layers are built with 3 × 5 kernels, given that text covers long continuous regions; this makes them well suited to detect text, which usually has a longer aspect ratio than objects in traditional detection, and also handles textual elements with some orientation and/or projection. They perform bounding box prediction and binary bounding box classification (text or background) at the same time $(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4, \text{confidence})$.
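The joint regression/classification output can be sketched as a single 3 × 5 convolution per source feature map. The head below is a hypothetical illustration: the number of priors per location is arbitrary, and it emits two softmax class scores rather than the single confidence value listed above, which is one common way to pair the prediction with a softmax confidence loss:

```python
import torch
import torch.nn as nn

class TextPredictionHead(nn.Module):
    """Sketch of a text-specific layer: a 3x5 kernel (tall-and-wide, suited to
    elongated text) predicts, for every prior box at every location,
    8 quadrilateral offsets plus 2 class scores (text / background)."""
    def __init__(self, in_channels, num_priors):
        super().__init__()
        outputs_per_prior = 8 + 2  # (x1, y1, ..., x4, y4) offsets + class scores
        self.conv = nn.Conv2d(in_channels, num_priors * outputs_per_prior,
                              kernel_size=(3, 5), padding=(1, 2))

    def forward(self, feature_map):
        out = self.conv(feature_map)                 # N x (priors*10) x H x W
        n, _, h, w = out.shape
        out = out.permute(0, 2, 3, 1).reshape(n, h * w, -1, 10)
        return out[..., :8], out[..., 8:]            # quad offsets, class logits

head = TextPredictionHead(in_channels=256, num_priors=10)
offsets, logits = head(torch.randn(1, 256, 19, 19))
print(offsets.shape, logits.shape)  # (1, 361, 10, 8) (1, 361, 10, 2)
```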
Our method is based on the regression of offsets, taking as a starting point a set of prior boxes predefined for each of the six layers. For this, before prediction, the outputs of five layers pass through residual blocks. Those layers are the last layer of Stages 3 and 4 along with the three layers of the simplified version of SSD. The feature map from the last layer of Stage 3 is used twice, with different scales of prior boxes.
Furthermore, we use some aspect ratios (1, 2, 1/2, 3, and 1/3) with the goal of increasing the number of default prior boxes. Applying these aspect ratios, we produce larger horizontal boxes and, more importantly, vertical prior boxes are generated. This initial set of prior boxes is the base for the regression of multi-oriented quadrilaterals. Additionally, for vertically crowded text regions, we increase the number of prior boxes by applying a dense vertical distribution in order to minimize the number of text elements lost in this type of scenario. For guiding the training, we use the loss function from [33], which involves the confidence ($L_{conf}$) and localization ($L_{loc}$) losses:
$$L(p, c, l, g) = \frac{1}{N}\left(L_{conf}(p, c) + \alpha\, L_{loc}(p, l, g)\right),$$
where $p$ are the estimated bounding boxes, $c$ is the confidence of being text, $l$ is the predicted localization, $g$ corresponds to the ground truth, $N$ is the number of matched boxes, and $\alpha$ is the weight for $L_{loc}$. Furthermore, we use the smooth L1 loss for $L_{loc}$ and the softmax loss for $L_{conf}$.

Additionally, during testing, we exploit a multi-scale procedure [6], [29], [34], which uses input images with four different sizes: 384, 768, 1024, and 1536. In Table 1, we can see the heatmaps produced by the bounding box source layers with those four scales as inputs. Each of these feature maps captures complementary information about textual elements of different sizes and exploits the relation between different scales and bounding box sizes. The smaller scales (384, 768) allow the detection of larger objects, while the larger scales (1024, 1536) are able to capture small textual elements. At the end, before presenting the final predictions, a cascade NMS based on the Intersection over Union (IoU) is performed to discard overlapped bounding boxes from the four scales, using a threshold of IoU ≥ 0.1.
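A simplified Python sketch of the multi-scale merging step is shown below. It assumes a greedy keep-highest-score strategy and reduces the quadrilaterals to axis-aligned rectangles for the IoU computation, both of which are simplifying assumptions rather than the exact cascade NMS of the method:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_scales(per_scale_detections, iou_thr=0.1):
    """Merge detections from the 384/768/1024/1536 passes: keep the
    highest-scoring box and drop later boxes overlapping it with IoU >= iou_thr."""
    candidates = sorted([d for dets in per_scale_detections for d in dets],
                        key=lambda d: d["score"], reverse=True)
    kept = []
    for det in candidates:
        if all(iou(det["box"], k["box"]) < iou_thr for k in kept):
            kept.append(det)
    return kept

scale_a = [{"box": (10, 10, 120, 40), "score": 0.92}]
scale_b = [{"box": (12, 11, 118, 42), "score": 0.80},
           {"box": (200, 50, 260, 80), "score": 0.75}]
print(merge_scales([scale_a, scale_b]))  # the duplicate of the first word is dropped
```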
Currently, neural networks are the state-of-the-art solutions for scene text detection. However, these solutions are more concerned with effectiveness than efficiency; some works even fuse two or more deep architectures to improve results. In terms of Megabytes (MB), they produce models ranging from 80MB [1] to more than 350MB [38]. In this vein, one of the main points of our proposal is its compact architecture, with a weight of only 27MB and 7 million parameters, being at least 2.96× smaller than its counterparts and thus a very promising architecture for mobile applications.

C. ABLATION STUDY
The proposed architecture was built as the result of an ablation study on our previous approach, Pelee-Text [6]. The objective is to discover the impact of each component and propose a new, lighter neural network architecture for scene text detection. Our main concern is to improve efficiency without hurting effectiveness too much. In this vein, the impact of four main components is evaluated: (i) the stem block, responsible for improving the characterization of the image; (ii) the dense blocks, which look for useful features; (iii) the transitional blocks, which keep the discriminability between stages; and (iv) the residual blocks, which improve the feature representation before the bounding box prediction. Therefore, we cut off one component at a time to evaluate its influence on our network (see Table 2).
As a result, we could see that the stem and residual blocks are very important parts of the feature extractor, and that the number of dense blocks in the different stages influences both the results and the model size. Finally, we selected the architecture built in the sixth setup based on its trade-off between F-measure and model size compared to our baseline, Pelee-Text [6]. Our final architecture has a model size of 27 Megabytes, reaching a model compression of 32.5% compared to the baseline with a drop of just 0.38 percentage points in F-measure. More importantly, as we will explain later, this architecture brings a huge advantage over Pelee-Text when evaluated in a real mobile environment.

IV. EXPERIMENTAL SETUP
In this section, we describe the datasets, metrics, and protocols used to assess the effectiveness of our method and its counterparts.
A. DATASETS AND METRICS
The SynthText [12] and MLT 2019 datasets were used for pre-training since some datasets have few images for training. Table 3 shows details about each dataset, such as the number of images, text orientation, and the languages that appear in its images.
Regarding the metrics, the effectiveness of our method is measured in terms of Recall (R), Precision (P), and F-measure (F1). More precisely, we evaluated our results using the evaluation tool made publicly available by the International Conference on Document Analysis and Recognition (ICDAR) 1 for each of its competitions. For the ICDAR 2013 dataset, we used the ICDAR13 metric, whereas for the MSRA-TD500 dataset, the ground truth is represented in quadrilateral format $(P_1, P_2, P_3, P_4)$. Furthermore, efficiency takes into account the processed Frames per Second (FPS), as well as memory and storage footprints.
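For reference, the F-measure reported throughout the paper is the standard harmonic mean of Precision and Recall; the values in the small example below are purely illustrative:

```python
def f_measure(precision, recall):
    """F1 = harmonic mean of precision and recall, as reported in the tables."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Illustrative values only (not results from the paper).
print(round(f_measure(0.90, 0.82), 4))  # 0.8581
```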

B. TRAINING PROTOCOL
For the experiments, we followed a four-stage training scheme in which difficult cases were not considered, i.e., text instances with the transcription ''###'' in the ground truth. First, we pre-trained our models in two stages using the SynthText (Stage 1) and MLT 2019 (Stage 2) datasets with an input image size of 384; then, two fine-tuning stages were performed on each dataset with input image sizes of 384 and 768, preparing our network for multi-scale detection. For the 384 and 768 training stages, we used batch sizes of 48 and 20, respectively. Parameter values, such as the learning rate, negative ratio, $L_{loc}$ weight ($\alpha$), and weight decay, were defined empirically through several experiments on the training set.
We pre-trained our models for 10 epochs on SynthText and 200 epochs on MLT 2019. Moreover, we used Stochastic Gradient Descent (SGD) to optimize our network and the ''Xavier'' technique for initializing the weights. The learning rate, negative ratio, $L_{loc}$ weight ($\alpha$), weight decay, and momentum were $5 \times 10^{-3}$, 3, 0.8, $5 \times 10^{-4}$, and 0.9, respectively. In Stage 3, we trained for about 300 epochs on almost all datasets; the exception was MSRA-TD500, whose bounding boxes have the particularity of covering text lines instead of words, so we trained for 600 epochs.
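A minimal PyTorch sketch of this pre-training optimizer setup is given below; the placeholder module stands in for the actual network, and the fine-tuning stages use different values, as detailed in Table 4:

```python
import torch

model = torch.nn.Conv2d(3, 8, 3)  # placeholder module; the real network goes here

# "Xavier" initialization of the weight tensors.
for p in model.parameters():
    if p.dim() > 1:
        torch.nn.init.xavier_uniform_(p)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=5e-3,            # initial learning rate
                            momentum=0.9,       # momentum
                            weight_decay=5e-4)  # weight decay

alpha = 0.8        # weight of the localization loss L_loc in the total loss
neg_pos_ratio = 3  # ratio between negatives and positives during matching
```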
Additionally, depending on the stage and dataset, we used different values for parameters, such as learning rate, steps for learning rate decay, L loc weight, ratio between the negatives and positives, and number of iterations. Table 4 shows the parameter values used during the fine-tuning stages.
For testing, we discarded the predictions with a detection score less than 0.5, 0.6, 0. All experiments were conducted on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz with 12 cores, 64GB of RAM, an Ubuntu 64-bit OS, and two GeForce GTX 1080 Ti GPUs.

C. MOBILE PROTOCOL
In order to evaluate the efficiency of our proposal against some of the state-of-the-art methods on mobile devices, we executed experiments over three smartphones: Motorola Moto G6 Play, Asus Zenfone 5, and Xiaomi Mi 9T. These smartphones have different specifications, which are detailed in Table 5. As we can see, they have different RAM capacities, ranging from 3GB to 6GB, and different kinds of processors and clock speeds (from 1.4GHz to 2.2GHz).
Moreover, to assess the efficiency of each method, we used the test set of ICDAR 2015 that contains 500 images with a variety of natural scenes. We computed the mean time on the whole test set, and we used four different scales of images (300, 384, 768, and 1024) to evaluate memory footprint and performance in each one of these scenarios.
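A small sketch of how the mean per-image latency (and its inverse, FPS) can be computed over the test set follows; the helper and the dummy callable are hypothetical, standing in for the on-device model:

```python
import time

def mean_inference_time(run_model, images):
    """Average per-image latency (seconds) over a test set; FPS is its inverse.
    run_model is any callable performing a single forward pass."""
    times = []
    for img in images:
        start = time.perf_counter()
        run_model(img)
        times.append(time.perf_counter() - start)
    mean = sum(times) / len(times)
    return mean, 1.0 / mean

# Dummy model and inputs standing in for the network and the 500 test images.
dummy = lambda img: sum(img)
mean_s, fps = mean_inference_time(dummy, [[0.0] * 1000] * 500)
print(f"{mean_s:.6f}s per image, {fps:.1f} FPS")
```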
For this, we built a basic Android app for running different neural network models. Our implementation was based on three Github repositories corresponding to Caffe, 2 PyTorch, 3 and Tensorflow 4 implementations.

V. RESULTS AND DISCUSSION
This section presents the results of our proposal against several state-of-the-art methods, considering three setups: results on datasets containing English text (Section V-A), multilingual text (Section V-B), and experiments related to efficiency aspects on mobile devices (Section V-C). We also collected information about the model size and/or number of parameters of the baseline methods; that information was taken from their respective papers or the authors' official GitHub repositories.
Some text detection approaches are adapted from or inspired by well-known object detectors [38]. In the vein of mobile-oriented neural network architectures, Liu et al. [7] proposed MobText, a text detector based on MobileNetV2 [44]. Their proposal achieved the best trade-off between efficiency and efficacy compared to Yolov3 [42] and SqueezeDet [54]. For this reason, we used MobText [7] as a baseline to compare our approach against mobile-oriented text detection methods.

A. DETECTING ENGLISH TEXT
We evaluated our proposal in three scenarios containing English text: born-digital images (ICDAR 2011), natural scenes with horizontal or near-horizontal text (ICDAR 2013), and natural scenes with arbitrarily-oriented text (ICDAR 2015).
First, we evaluated our proposal on ICDAR 2011, which contains digitally created images with some JPEG artifacts, making this dataset a very different scenario compared to datasets collected from natural scenes. As we can see in Table 6, Pelee-Text++_MS obtained an F-measure of 91.20%, outperforming most of the methods, including our previous version, Pelee-Text, and placing only behind MobText [7], which is the state of the art on this set of images.
On ICDAR 2013, a dataset of natural scene images containing horizontal or near-horizontal text, Pelee-Text++ reached competitive results, with F-measures of 79.72% and 85.78% for its 768 and multi-scale versions, respectively, surpassing Pelee-Text with a model 13 Megabytes lighter (see Table 7). Despite having a lower F-measure than state-of-the-art methods, such as CRAFT [1], MaskTextSpotter [38], PMTD [32], and FTPN [31], our proposal obtained a good trade-off between efficacy and model size. Figure 4a shows the balance between effectiveness and model size compared to state-of-the-art methods.
Compared to the best methods, our proposal is 12.96× smaller than MaskTextSpotter [38] and 2.96× smaller than CRAFT [1]. On the other hand, focusing on methods with light models, MobileNetv2+UNet [9] (16MB) and MobText [7] (37MB) have F-measures 9.78 percentage points lower than Pelee-Text++ (27MB). Although Pelee-Text++ has a competitive result, we still have to work on the missing cases, with special focus on the detection of text in images with blurring and occlusion (see Figure 3b).
ICDAR 2015 is a dataset that introduced new variants of text orientation in natural scenes, i.e., vertical text and text with visual projections. This dataset works with quadrilaterals instead of the typical rectangular bounding boxes. For that reason, MobileNetv2+UNet and MobText have not been tested on this dataset: as these methods use the typical rectangular representation, they are not tailored to the localization of oriented text. FCN [67] (57MB) and OctMLT [37] (88.79MB) are approaches with models relatively smaller than most of the methods. They obtained F-measures of just 54.00% and 69.60%, which are 27.72 and 12.32 percentage points less than Pelee-Text++.
Pelee-Text++ outperformed Pelee-Text on ICDAR 2015 and reached competitive results against state-of-the-art approaches (FOTS [34], PMTD [32], and PDR [27]); our method is 6.3× and 4.99× smaller than PMTD and FOTS, respectively. Figure 4b shows the effectiveness of each method versus its model size; our proposal is a promising approach considering the trade-off between efficacy and model size. Pelee-Text++ was very close to the best methods regarding Precision; nevertheless, it still misses words, as the Recall score shows. Our model had problems detecting vertical word cases, and when it does detect them, very low confidence values are assigned (see Figure 3b).
Finally, as we can observe in Table 7 and Table 8, almost all of the methods are based on the VGG-16 [46] and ResNet [15] networks, which lead to heavy models in terms of disk usage. Furthermore, some of them use these networks along with a Feature Pyramid approach [30], and even create several branches to improve text detection without concern for their model size, hampering their use on devices with computational constraints.
On the other hand, Pelee-Text++ is a promising tiny neural network architecture that reached competitive results using a light architecture favorable for mobile devices, as Figure 3a shows. Additionally, our single-scale versions with inputs of 768 and 1024 achieved good results, running at 23.25 and 15.06 FPS, respectively, while our multi-scale version (384, 768, 1024, and 1536) ran at 3.65 FPS.

B. DETECTING MULTI-ORIENTED MULTI-LINGUAL TEXT
We have evaluated Pelee-Text++ on ReCTS [64], MSRA-TD500 [59], and MLT 2019 [40], which are more challenging scenarios than the ICDAR 2013 and ICDAR 2015 datasets. ReCTS and MSRA-TD500 contain English and Chinese text, and they are line-level datasets, i.e., their bounding boxes cover lines of text instead of single words. On the other hand, MLT 2019 contains images with text in ten different languages: Chinese, Japanese, Korean, English, French, Arabic, Italian, German, Bangla, and Hindi (Devanagari). This multi-lingual task also takes into account punctuation and math symbols present in several images.

Table 9 shows the effectiveness of diverse methods on MSRA-TD500. Pelee-Text++ achieved very promising results, placing very close to DRRG [65], which is the best approach on this dataset. The Precision obtained by Pelee-Text++ is comparable to DRRG, but its Recall needs to be improved. This dataset has several vertical text cases, a scenario in which our method has problems detecting text and/or assigning high confidence values. Nonetheless, as we can see in Figure 4c, in terms of disk usage the model of DRRG is 5.7× larger than Pelee-Text++. On the ReCTS [64] dataset, our method obtained an F-measure of 82.53%. However, there are no works in the literature providing results on this dataset; thus, we do not have information to compare our proposal against other approaches in terms of effectiveness and model size.
Our proposal reached a very promising trade-off between efficacy and model size. We can notice that it was able to detect both English and Chinese text, as depicted in Figure 3. Experimental results showed that Pelee-Text++ outperformed well-known methods, such as PixelLink [8] and TextSnake [36], not only in effectiveness but also in terms of model size. Concerning the best methods and their model sizes, CRAFT [1] and DRRG [65] obtained better results than Pelee-Text++; nevertheless, their models are 2.96 and 5.7 times larger than our proposal.
Nowadays, MLT 2019 [40] is one of the most challenging datasets for multilingual text detection, but there are not many papers reporting the performance of methods on this dataset. For this reason, Table 10 shows results taken from the competition dashboard, considering only the methods with an associated paper. As we can see, LOMO [63] is the state-of-the-art method with an F-measure of 83.59%, followed by PMTD [32], CASCADE-RCNN [3], CRAFT [1], and PSENet [50] with 82.53%, 78.38%, 70.86%, and 65.83% F-measure, respectively. It is worth mentioning that CASCADE-RCNN [3] was cited on the competition site as the associated paper for one of the results; the submission, however, does not have a well-explained description of the modifications made to this network for detecting multilingual text.
Furthermore, with respect to the remaining approaches, as mentioned before, all those methods are based on deep architectures, such as VGG and ResNet. For instance, LOMO (without information about its model size) and PMTD (170MB) use ResNet-50 along with FPN, whereas CRAFT (80MB) and PSENet (110MB) use VGG-16.
Additionally, LOMO and PMTD, the top-2 methods on this dataset, use several branches for specific tasks in order to improve results, which directly impacts their model size. In contrast, Pelee-Text++, with a model size of 27MB, obtained better results than PSENet (4.07× larger) and came close to CRAFT (2.96× larger). As described before, our network still has to be improved regarding the detection of vertically aligned text. One possible research venue concerns the use of a different training protocol specifically tailored to this multi-lingual scenario.

C. RESULTS ON MOBILE DEVICES
This section presents the results of our proposal against five methods on three smartphones (Motorola Moto G6 Play, Asus Zenfone 5, and Xiaomi Mi 9T). Details about the smartphones and the app are described in Section IV-C.
For comparison purposes, we used the models available on the authors' official GitHub repositories. Along with our proposal, the methods considered for evaluation were: Pelee-Text [6], MobText [7], TextBoxes++, 5 CRAFT, 6 and PSENet. 7 These methods were implemented using different frameworks: Pelee-Text, Pelee-Text++, and TextBoxes++ use Caffe, MobText was implemented in TensorFlow, whereas CRAFT and PSENet use PyTorch. These experiments only considered the processing time of each model to generate its outputs, without taking into account the pre- or post-processing procedures of each approach.

Table 11 presents the processing time and standard deviation of each model on the different smartphones when processing an image at four different scales: 300 × 300, 384 × 384, 768 × 768, and 1024 × 1024. As we can see, Pelee-Text++ was the only method that ran on all three smartphones without causing memory issues at all four scales. Moreover, it had the fastest inference time in all scenarios, except for an image of size 300 × 300, where the best was MobText. As explained in previous sections, MobText is a network specifically designed to work with an input image size of 300 × 300. Additionally, this method has limitations in scenarios with oriented text, where a quadrilateral representation is needed.
TextBoxes++, the method with the largest model size (133MB) in this experiment, was not able to run with image sizes of 768 and 1024 on any of the three smartphones, causing memory problems. On the Motorola Moto G6 with 3GB of RAM, Pelee-Text, CRAFT, and PSENet caused memory issues with an image size of 1024. Compared to CRAFT, one of the state-of-the-art methods on the datasets evaluated in this work, Pelee-Text++ was 7.44×, 7.89×, and 8.76× faster using input sizes of 300, 384, and 768, respectively.
On the Asus Zenfone 5 with 4GB of RAM, all the methods ran faster. In this environment, Pelee-Text, CRAFT, and PSENet could perform inference even with an image size of 1024. With respect to the inference time on the Moto G6, Pelee-Text++ had a time reduction of 77.77% on the 300-scale image, 41% on the 384 scale, 48.09% on the 768 scale, and 65% with an image size of 1024. Furthermore, considering the best time of each method, our proposal was 1.46×, 5.94×, 4.83×, and 10.29× faster than Pelee-Text, CRAFT, PSENet, and TextBoxes++, respectively. In contrast, considering the worst time of each one, which occurs when an input image of 1024 is used, Pelee-Text++ took just 41.50% of the processing time of Pelee-Text, 14.40% of CRAFT's inference time, and 10.48% of PSENet's.

Figure 5 shows the performance of the assessed methods on the Xiaomi Mi 9T, which has 6GB of RAM. Even on this smartphone, TextBoxes++ could not complete its inference when the input image was 768 or 1024 because of memory issues. On the other hand, MobText reached the best time with an input image of 300, but it was tested only in this scenario because it was specifically designed for this input size.
Additionally, we can see how the gap between tiny and large models increases with a bigger input image. Without considering MobText and TextBoxes++, because of the issues mentioned above, the gap between Pelee-Text++ (best time) and the worst time was 2.48 seconds for an image of size 300, while for 1024 there is a huge difference of 14.32 seconds.
It is difficult to compare methods implemented with different frameworks and programming languages, but the experiments performed on three commercial smartphones showed a big efficiency gap between the approaches based on large deep learning models and our proposal. Finally, based on these results and on the effectiveness obtained by the methods on the evaluated datasets, Pelee-Text++ seems to be a very promising architecture for text detection, being a light, fast, and competitive approach compared to state-of-the-art methods. Pelee-Text++, with a model size of 27MB, was capable of processing 2.94 FPS in a real mobile setup, being at least 5.5 times faster than CRAFT, which is one of the best methods on several datasets.

VI. CONCLUSION
Unlike other works in the text detection field, which have been increasing the model size of their approaches by adopting very deep architectures or even fusing several task-specific branches, we presented a very promising lightweight neural network for scene text detection, with special focus on measuring the performance of our proposal against state-of-the-art methods on mobile devices.
We proposed Pelee-Text++, a mobile-oriented convolutional neural network specifically designed for detecting text, which usually has a longer aspect ratio. One of the main problems is to capture text with orientations that are not fully covered by rectangular bounding boxes; for that reason, our approach uses quadrilaterals instead. The experimental results over six well-known datasets involving born-digital images, horizontally/vertically aligned text, and multi-lingual setups in real scene images show the great performance of Pelee-Text++ in terms of effectiveness and efficiency. Our network obtained competitive results against state-of-the-art methods and showed a very promising trade-off between effectiveness and model size. On GPU, it runs at 23.25, 15.06, and 3.65 FPS for our 768, 1024, and multi-scale versions, respectively. More importantly, our proposal demonstrated its efficiency on three smartphones, being faster than well-known methods. Pelee-Text++ has a model size of just 27 Megabytes and processes 2.94 FPS on a smartphone, being at least 5.5 times faster than CRAFT, which is one of the best methods in the text detection area.
Concisely, we demonstrated that our approach is more appropriate for restrictive computing scenarios. Pelee-Text++ delivers outstanding results in terms of processing time and model size reduction, becoming a promising approach to avoid disk usage and RAM consumption problems on mobile devices. Despite the competitive results reached by Pelee-Text++ on arbitrarily-oriented (horizontal and vertical) and multi-oriented text, one limitation of our proposal is that Pelee-Text++ is not capable of working well on datasets containing irregular and curved text because of its nature, i.e., the prediction of quadrilaterals, which do not fit well in these types of scenarios. We intend to address these issues in future work.
Additionally, future research efforts will focus on smart data augmentation strategies, for instance rotation and cropping, among others, for detecting the missing cases, such as text in images with blurring and occlusion, and (near-)vertical textual elements. Furthermore, we plan to propose training and testing protocols for multi-lingual scenarios, and to evaluate the impact of model compression methods for obtaining an even more compact neural network without loss of effectiveness.