Detecting and Removing Text in the Wild

Scene text removal is a challenging task that aims to erase wild text regions that include text strokes and their ambiguous boundaries, such as embossing, shade, or flare. The challenging issues raised in the wild are not completely addressed by the existing methods. To address these issues, we propose a new loss function for blending two tasks in a new network structure that depicts wild text regions in a soft mask and selectively inpaints them into a sensible background. The proposed loss function aids the learning of two seemingly separate tasks in a synergistic way via the soft mask to achieve remarkable performance in scene text removal. We validate our method through qualitative and quantitative comparisons, and region-wise analysis, showing that our method outperforms existing methods.


I. INTRODUCTION
Scene text removal is the total removal of text from an image as if there were no text originally. Thus, it aims to erase text and restore a proper background. Scene text removal has been widely used to protect sensitive private information, such as names and addresses, and has been utilized for other computer vision tasks, such as text replacement for language swaps. Unlike image inpainting [2]-[6], in which an explicit mask of the inpainting region is given, scene text removal requires identifying the inpainting region. Therefore, capturing the location of text in an image is necessary for scene text removal. However, text in the wild comes in various types of fonts, may be irregularly shaped, and exhibits varying levels of illumination as well as perspective distortion, making it challenging to localize. In this paper, such text is referred to as wild text.
Early text removal studies [9]-[13] utilized two separate techniques: text stroke segmentation [14]-[16] and image inpainting [2], [3], [17]-[19]. However, these were commonly used for removing subtitles or captions, which are printed in a straight direction on a simple background. Recent deep-learning-based methods [1], [20] have adopted a deep neural network that erases wild text in a data-driven way without explicit text localization. These methods gather pseudo-background data as the ground truth by manually erasing the original text in an image, and they are trained more effectively than the early approaches [9]-[13]. However, these text removal methods still produce failure cases: un-erased text regions and falsely erased non-text regions, as shown in Figure 1 (cases 1 and 2). Furthermore, wild text can be more complicated, as seen in Figure 1 (case 3), where ambiguous text boundaries may appear around the text stroke, such as three-dimensional embossing and surrounding shade. These deep-learning-based methods [1], [20] also fail to localize such challenging wild text.

FIGURE 1. (b) EnsNet [1] results and (c) our results. Each case represents one of three typical failures in (b): 1. un-erased text regions, 2. falsely erased non-text regions, and 3. more complicated wild text left un-erased.
The objective of our paper is to tackle the challenges and mitigate the ambiguous text boundary problems in localizing and inpainting the wild text regions. To this end, instead of utilizing an existing single net structure, we devise a two-task blending structure that performs wild text segmentation and selective background inpainting at the same time. In addition, we propose a blending loss function to train the two-task structure in an end-to-end synergistic way. With the blending structure and loss function, our model can localize ambiguous text boundaries inclusively in a soft mask and erase the wild texts completely.
In the proposed loss function, the part for wild text segmentation is designed to produce a soft mask that indicates wild text regions. The part for selective background inpainting is designed to restore a visually plausible background only in the region where the wild text is erased. The soft mask plays a key role in linking the two loss functions for end-to-end training and blends the input image and the predicted background to produce a text-removed output. Through the experiments, we show that our method outperforms the state-of-the-art methods [1], [20] in terms of image quality and degradation of text detection [21], [22].
Our contributions are summarized as follows:
• We propose a two-task blending structure that performs both segmentation of a wild text area and restoration of a plausible background in that area.
• We design an end-to-end loss function to train the two-task blending structure in a synergistic way.
• The proposed model effectively handles ambiguous text boundaries, such as shade, embossing, and flare.

II. RELATED WORK
Early text-removal approaches [9]-[13] focused on erasing subtitles and captions in digital-born images. They used edge-based [16], connected-component-analysis-based [15], or stroke-filter-based [14] methods to segment the text stroke. However, these stroke segmentation algorithms fail to localize wild text in real scenes. Cascaded image inpainting algorithms [2], [3], [17]-[19] were then able to restore the background only within the segmented text strokes. Since these approaches naively combine independent text stroke segmentation and image inpainting algorithms, they fail to remove text in wild scenes, which is the ultimate goal.
To overcome the problem of each function operating separately, deep-learning-based methods emerged that learn scene text removal in a data-driven way. Scene-Text-Eraser [20] made the first attempt to erase text from real scenes using a deep neural network. To solve the data-deficiency problem of scene text removal, that is, the fact that non-text versions of real data do not exist, Scene-Text-Eraser generated 229 pseudo-background images as the ground truth from the ICDAR13 [7] dataset by masking the original images with the given stroke annotations and inpainting them with a third-party algorithm [2]. Then, Scene-Text-Eraser used pairwise input images and pseudo-background images to train a deep neural network that produces text-removed images. However, the scene text removal performance relied heavily on the heuristics of the pseudo-background generation, which followed the aforementioned naive approaches [9]-[13]. Although Scene-Text-Eraser dilated the given text stroke mask to cover ambiguous text boundaries during pseudo-background generation, it still could not deal with more complex wild text. Furthermore, the 229 generated samples were not enough to train a deep network.
The data-deficiency problem of scene text removal mainly arises because there is no dataset that annotates wild text. Therefore, EnsNet [1] simply used human labor to remove text from images. EnsNet generated 1,000 pseudo-background images by manually erasing text from the ICDAR17-MLT [23] dataset. However, the 1,000 image pairs were still not enough to train a deep network, so they obtained an additional 8,000 pairwise samples by synthesizing text on random background images using the SynthText [24] engine. They trained a deep neural network to regress an output to the pairwise background ground truth, in a similar way to Scene-Text-Eraser [20]. Therefore, wild text is implicitly localized and replaced with a plausible background. Through a ResNet [25]-based framework and a multi-scale regression loss, EnsNet achieved reasonable scene text removal performance. However, EnsNet falsely removed non-text regions or left text in complicated scenes due to its poor text localization performance, as shown in Figures 1 and 2. EnsNet may have focused more on how to design a regression-based inpainting loss function to generate a plausible background and less on how to localize text in an image. Furthermore, the data-deficiency problem remained, as their pseudo-background generation is inefficient when a model needs to generalize to different domains of text. For example, EnsNet [1] trained its model on linearly oriented alphanumeric text, but to generalize the method to multi-lingual and curved text, additional costly annotation efforts are required to generate image pairs with and without such text. Other notable recent scene text removal works [26], [27] use an additional text segmentation step and a GAN [28]-based inpainting approach. These methods also achieve good-quality scene text removal, but a quantitative comparison is limited because they target different datasets.

FIGURE 2. (b) Text-removed results of EnsNet [1] and (c) text-removed results of ours on the ICDAR13 [7] and Total Text [8] datasets.
In summary, the previous methods [1], [20] have focused on generating pseudo-background images as the ground truth and using a sole inpainting loss to remove wild text without explicit text localization. Unlike previous methods, we do not incur high annotation costs on pseudo-background generation but rather use a synthetic text dataset with cost-free background images and a real scene-text dataset [7], [8] with text annotations. To tackle the poor text localization performance of previous methods [1], [20], we devise a model that explicitly localizes challenging wild text and includes ambiguous text boundaries in a wide variety of real scenes. Therefore, we introduce a novel architectural solution with a blending loss, which jointly trains wild text segmentation and selective background inpainting.

III. PROPOSED METHOD
The existing methods EnsNet [1] and Scene-Text-Eraser [20] adopt a U-net [29] structure for scene text removal, which we refer to as single-net models in this paper (Figure 3 (a)) for convenience. Single-net models are trained with a sole inpainting loss function, which regresses the output to the background ground truth (GT). In contrast to the single-net models, as shown in Figure 3 (b), we propose a blending structure of wild text segmentation and selective background inpainting to enhance localization and restore a plausible background in wild text regions.

A. NETWORK STRUCTURE
Our model has two modules for wild text segmentation and selective background inpainting. Both modules share features and receive skip connections [29] from the backbone network. Let $I_{in}$ be the given input image containing text. From the backbone feature of $I_{in}$, the wild text segmentation module predicts a soft mask $\hat{M}$, which indicates regions to be removed with continuous values between 0 and 1. The selective background inpainting module predicts a background image $\hat{I}_{bg}$ to be filled into the removed regions. The distinguishing aspect of this work is that the blending module constructs the text-removed image $\hat{I}_{out}$ by blending $I_{in}$, $\hat{M}$, and $\hat{I}_{bg}$ as

$\hat{I}_{out} = \hat{M} \otimes \hat{I}_{bg} + (1 - \hat{M}) \otimes I_{in},$   (1)

where $\otimes$ denotes element-wise multiplication. As the blending operation is differentiable, the back-propagated gradients from $\hat{I}_{out}$ flow into the segmentation and inpainting modules and the backbone network. Such a blending approach has been effective in other object removal or editing tasks [30], [31]. Pumarola et al. [31] proposed a face-editing framework that produces attention and color masks to compose the final face-edited output from the input face image. This framework does not need to locate faces, as its input is a full face image. The face-editing locations, such as eyes and lips, are not available as ground truth and are therefore computed by the network via an attention mechanism. In contrast, wild text segmentation requires accurate text detection and the inclusion of ambiguous text regions. While text locations are annotated as strokes or boxes, ambiguous text regions such as embossing, surrounding shade, and glowing flare have no ground truth annotation. Therefore, we divide the mask $\hat{M}$ into clear and ambiguous regions to cope with both situations, when ground truth is available and when it is not. We elaborate on wild text segmentation in the following section.
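To make the blending operation in Eq. (1) concrete, the following is a minimal PyTorch-style sketch (our own illustration, not the authors' released code); the tensor names i_in, m_hat, and i_bg_hat are assumed placeholders.

```python
# Sketch of the blending step in Eq. (1): the predicted soft mask selects the
# inpainted background inside text regions and passes the input image through
# everywhere else.
import torch

def blend(i_in: torch.Tensor, m_hat: torch.Tensor, i_bg_hat: torch.Tensor) -> torch.Tensor:
    """i_in, i_bg_hat: (B, 3, H, W) images; m_hat: (B, 1, H, W) soft mask in [0, 1]."""
    return m_hat * i_bg_hat + (1.0 - m_hat) * i_in

# The composition is element-wise and differentiable, so gradients from the
# blended output flow back into both the segmentation and inpainting heads.
i_in = torch.rand(2, 3, 512, 512)
m_hat = torch.rand(2, 1, 512, 512, requires_grad=True)
i_bg_hat = torch.rand(2, 3, 512, 512, requires_grad=True)
i_out = blend(i_in, m_hat, i_bg_hat)
```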

1) WILD TEXT SEGMENTATION
Figure 4 illustrates a typical case of wild text, transcribed as ''SOPRA'' with a surrounding glowing flare. One may think of simply training the predicted mask $\hat{M}$ to match the text stroke mask $M$ with a binary cross entropy loss and inpainting the localized text area. Letting $M$ and $\hat{M}$ be the stroke GT and its prediction, respectively, the binary cross entropy loss $L_s$ for stroke segmentation is given by

$L_s = -\frac{1}{N} \sum_{p} \left[ M_p \log \hat{M}_p + (1 - M_p) \log (1 - \hat{M}_p) \right],$   (2)

where $p$ indicates the pixel position and $N$ is the number of pixels in the mask map. When the segmentation module is trained with the binary cross entropy loss $L_s$ to predict only the text stroke and the inpainting module inpaints the localized region of $\hat{M}$ with an inpainting loss, the two modules are trained separately; we refer to this as 'Separate training' in Figure 4 (a). Although separate training predicts a reasonably fine text stroke in $\hat{M}$, it fails to remove ambiguous text boundaries, which remain in the output $\hat{I}_{out}$. Thus, the word 'SOPRA' is still readable in the text-removed output $\hat{I}_{out}$.
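For reference, here is a minimal sketch of the plain stroke loss $L_s$ in Eq. (2), as used by this separate-training baseline (and later for pretraining); this is our own PyTorch-style illustration, and the tensor names are assumptions.

```python
# Pixel-wise binary cross entropy between the predicted soft mask and the
# stroke ground truth, averaged over all pixels (Eq. (2)).
import torch
import torch.nn.functional as F

def stroke_loss(mask_logits: torch.Tensor, m_stroke: torch.Tensor) -> torch.Tensor:
    """mask_logits: raw segmentation output; m_stroke: binary stroke GT; both (B, 1, H, W)."""
    m_hat = torch.sigmoid(mask_logits)               # soft mask in (0, 1)
    return F.binary_cross_entropy(m_hat, m_stroke)   # L_s
```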
Accordingly, the blending loss function is proposed to jointly train the segmentation and inpainting modules for wild text segmentation. As shown in Figure 4 (b), joint training enables localizing ambiguous text boundaries inclusively in $\hat{M}$, and wild text is completely removed in $\hat{I}_{out}$. Details of the loss function are covered in a later section.

2) SELECTIVE BACKGROUND INPAINTING
The single-net models [1], [20] directly output the text-removed image as $\hat{I}_{out}$ (see Figure 3 (a)) and are trained with a sole inpainting loss function, which regresses the output to the background GT. That is, the single-net models unnecessarily reconstruct non-text regions of the input along with restoring the background in text regions.
On the other hand, our selective background inpainting module outputs $\hat{I}_{bg}$ (see Figure 3 (b)), with which the text regions indicated by $\hat{M}$ are selectively inpainted with restored background. Therefore, there is no need for redundant input reconstruction; non-text regions are bypassed from the input image.

B. TRAINING SCHEME
1) PRETRAINING WITH TEXT STROKE SEGMENTATION TASK
We pretrain the backbone network and the wild text segmentation module with the text stroke segmentation task as prior knowledge for scene text removal. For the pretraining, we use the stroke segmentation loss $L_s$ defined in Eq. (2).

2) BLENDING LOSS FOR END-TO-END TRAINING
After the pretraining phase, we train the two-task blending structure by minimizing the following end-to-end objective over the two task outputs:

$L = \lambda_m L_m(\hat{M}) + \lambda_i L_i(\hat{M}, \hat{I}_{bg}),$   (3)

where $L_m$ is the loss function for explicit text stroke segmentation, and $L_i$ is the loss function for ambiguous text boundary inclusion and selective background inpainting. Because $\hat{M}$ is included in both loss terms, our model can be trained so that $\hat{M}$ becomes a soft mask that includes ambiguous text boundaries. We use the weighting parameters $\lambda_m$ and $\lambda_i$ to balance the corresponding losses. Through the end-to-end learning, the wild text segmentation module is trained to produce a soft mask $\hat{M}$, which then guides the selective background inpainting module to produce an appropriate $\hat{I}_{bg}$. That is, the wild text segmentation module receives two kinds of supervision, from $L_m(\hat{M})$ and from $L_i(\hat{M}, \hat{I}_{bg})$; this is the major difference from the other methods [1], [20], whose models are trained with a sole image inpainting loss. As shown in Figure 5, additional information ($M$, $\tilde{M}$, and $I_{bg}$) is needed to construct the loss, but it is omitted in Eq. (3) for simplicity.
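The dual supervision of the soft mask can be illustrated with a small PyTorch-style sketch; the stand-ins below (plain BCE for Eq. (4), plain L1 for Eq. (5)) and all tensor names are our simplifications, not the full losses defined later.

```python
# End-to-end objective of Eq. (3): L = lambda_m * L_m + lambda_i * L_i.
# The soft mask receives gradients from both terms: directly through L_m and,
# via the blended output of Eq. (1), through L_i.
import torch
import torch.nn.functional as F

b, h, w = 2, 256, 256
mask_logits = torch.zeros(b, 1, h, w, requires_grad=True)    # segmentation head output
i_bg_hat = torch.rand(b, 3, h, w, requires_grad=True)        # inpainting head output
i_in, i_bg = torch.rand(b, 3, h, w), torch.rand(b, 3, h, w)  # input image, background GT
m_stroke = (torch.rand(b, 1, h, w) > 0.9).float()            # toy stroke GT

m_hat = torch.sigmoid(mask_logits)                           # predicted soft mask
i_out = m_hat * i_bg_hat + (1.0 - m_hat) * i_in              # Eq. (1)
loss_m = F.binary_cross_entropy(m_hat, m_stroke)             # stands in for Eq. (4)
loss_i = F.l1_loss(i_out, i_bg)                              # stands in for Eq. (5)
loss = 1.0 * loss_m + 1.0 * loss_i                           # Eq. (3) with lambda_m = lambda_i = 1
loss.backward()                                              # mask_logits is supervised by both terms
```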
The loss functions $L_m$ and $L_i$ are applied to different regions. As illustrated in Figure 6, we distinguish the masks $M$ and $\tilde{M}$ and the regions $A$ and $A^c$. $M$ and $\tilde{M}$ denote the ground truth binary masks for the text stroke and for the text box covering the stroke, respectively. The positive pixels in $M$ and $\tilde{M}$ are encoded by one, and the others by zero. The text box in $\tilde{M}$ is annotated so that $M \otimes (1 - \tilde{M}) = 0$, i.e., the text stroke must lie inside the text box. The region $A$ is defined as the set of pixels that belong to the text stroke or to clear background, that is, $A = \{p \mid M_p = 1 \vee \tilde{M}_p = 0\}$, where $p$ is a pixel position and $\vee$ is the logical OR operation. Then $A^c$ becomes the ambiguous text region, which includes the pixels outside the text stroke but inside the text box. The reason for defining $M$, $\tilde{M}$, and $A$/$A^c$ is to handle ambiguous text regions (shade, embossing, flare, etc.), which is the challenging problem addressed in this paper.
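A short sketch (assumed PyTorch; tensor names are ours) of the region split in Figure 6 follows.

```python
# A pixel belongs to region A if it is text stroke (M_p = 1) or clear
# background (box mask = 0); the pixels inside the text box but outside the
# stroke form the ambiguous region A^c.
import torch

def region_masks(m_stroke: torch.Tensor, m_box: torch.Tensor):
    """m_stroke: binary stroke GT M; m_box: binary text-box GT; both (B, 1, H, W)."""
    a = (m_stroke > 0.5) | (m_box < 0.5)    # region A: stroke OR clear background
    return a.float(), (~a).float()          # A and the ambiguous region A^c
```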
Using $A$, the loss $L_m$ in Eq. (4) takes the role of learning the mask $\hat{M}$ over the region $A$ (text stroke and clear background, excluding the ambiguous region). Because there is no ground truth that annotates the ambiguous text region, $L_m$ in Eq. (4) cannot supervise the prediction of $\hat{M}$ in the region $A^c$. For the remaining area $A^c$ (the ambiguous text region), we further design the loss $L_i$ in Eq. (5) to shape the predicted mask $\hat{M}$.
$L_m$ (Explicit Text Stroke Segmentation): Since there is no ground truth that annotates the ambiguous text region, $L_m$ is designed not to penalize pixels in region $A^c$. We use the binary cross entropy on each pixel of $A$ as the loss, i.e.,

$L_m = -\frac{1}{N_A} \sum_{p \in A} \left[ M_p \log \hat{M}_p + (1 - M_p) \log (1 - \hat{M}_p) \right],$   (4)

where $N_A$ is the number of valid pixels in $A$ for normalization. Note that the second term in Eq. (2) penalizes positive mask predictions on every non-stroke pixel, including the ambiguous region $A^c$, whereas $L_m$ leaves $A^c$ unconstrained.

$L_i$ (Ambiguous Text Boundary Inclusion and Selective Background Inpainting): While single-net methods (Figure 3 (a)) restore the background of the entire image region, our proposed structure (Figure 3 (b)) and the loss function $L_i$ applied to $\hat{I}_{out}(\hat{M}, \hat{I}_{bg})$ train our model to predict a mask $\hat{M}$ that indicates where an edit is needed and a background image $\hat{I}_{bg}$. With the output $\hat{I}_{out}$ blended from $\hat{M}$ and $\hat{I}_{bg}$ by Eq. (1), the loss function $L_i$ induces the background $\hat{I}_{bg}$ to be inpainted selectively on the positive regions of $\hat{M}$. We utilize the text box GT $\tilde{M}$ to mask the gradients back-propagated by the loss $L_i$ so that background restoration is trained only inside $\tilde{M}$. To remove text textures completely, not only the text stroke but also the ambiguous text boundaries should be covered by the predicted mask $\hat{M}$. Since there is no explicit ground truth annotation for the ambiguous text region, the text-free background ground truth is utilized to train our model to output a background without text textures. Therefore, the ambiguous text region in area $A^c$ is included in the mask $\hat{M}$ for complete text removal in $\hat{I}_{out}$. $L_i$ is designed with the $L_1$ loss $L_{L1}$, the style loss $L_{stl}$ [32], the perceptual loss $L_{perc}$ [33], and the total variation loss $L_{tv}$ [33]:

$L_i = \lambda_{L1} L_{L1} + \lambda_{stl} L_{stl} + \lambda_{perc} L_{perc} + \lambda_{tv} L_{tv}.$   (5)

The $L_1$ loss $L_{L1}$ minimizes the pixel distance between the output $\hat{I}_{out}$ and the ground truth $I_{bg}$ in the text box regions $\tilde{M}$ as

$L_{L1} = \left\| \tilde{M} \otimes (\hat{I}_{out} - I_{bg}) \right\|_1.$   (6)

Because the $L_1$ loss alone could result in blurry output, the perceptual loss $L_{perc}$ is adopted for contextual image restoration, and the style loss $L_{stl}$ is adopted for better texture restoration. $L_{perc}$ and $L_{stl}$ use the ImageNet [34] pretrained VGG16 [35] model as the feature extractor $\phi(\cdot)$ and compute feature distances between the output and the background GT. The perceptual loss $L_{perc}$ is defined by

$L_{perc} = \sum_{k} \left\| \phi_k(\hat{I}_{out}) - \phi_k(I_{bg}) \right\|_1,$   (7)

which drives $\hat{I}_{out}$ to have semantic features similar to those of $I_{bg}$. The style loss $L_{stl}$ is defined by

$L_{stl} = \sum_{k} \frac{1}{N_{D_k}} \left\| \phi_k(\hat{I}_{out})^{\top} \phi_k(\hat{I}_{out}) - \phi_k(I_{bg})^{\top} \phi_k(I_{bg}) \right\|_1,$   (8)

where $N_{D_k}$ denotes the size of the $k$-th feature map. The style loss causes $\hat{I}_{out}$ to have a style similar to that of $I_{bg}$. We extract feature maps at the {1, 2, 3}-th pooling layers of the VGG16 model for $L_{perc}$ and $L_{stl}$. The total variation loss $L_{tv}$ encourages spatial smoothness by minimizing the distance among adjacent pixels $(x, y)$ of the output image $\hat{I}_{out}$ as

$L_{tv} = \sum_{x, y} \left\| \hat{I}_{out}^{x, y+1} - \hat{I}_{out}^{x, y} \right\|_1 + \left\| \hat{I}_{out}^{x+1, y} - \hat{I}_{out}^{x, y} \right\|_1.$   (9)
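The composite loss of Eqs. (5)-(9) can be sketched as follows (assumed PyTorch); `vgg_feats` stands for a callable returning the VGG16 pooling features mentioned above, and it, the variable names, and the exact normalization are assumptions rather than the authors' implementation.

```python
# L_i: an L1 term restricted to the text-box GT region, plus perceptual,
# style (Gram-matrix), and total-variation terms on the blended output.
import torch
import torch.nn.functional as F

def gram(feat: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)        # (B, C, C) Gram matrix

def inpaint_loss(i_out, i_bg, m_box, vgg_feats,
                 l_l1=6.0, l_stl=120.0, l_perc=0.05, l_tv=0.1):  # weights quoted in the text
    loss_l1 = F.l1_loss(m_box * i_out, m_box * i_bg)             # Eq. (6): L1 inside the text box
    loss_perc, loss_stl = 0.0, 0.0
    for fo, fb in zip(vgg_feats(i_out), vgg_feats(i_bg)):        # VGG16 pooling features
        loss_perc = loss_perc + F.l1_loss(fo, fb)                # Eq. (7): perceptual distance
        loss_stl = loss_stl + F.l1_loss(gram(fo), gram(fb))      # Eq. (8): style distance
    loss_tv = (i_out[..., :, 1:] - i_out[..., :, :-1]).abs().mean() \
            + (i_out[..., 1:, :] - i_out[..., :-1, :]).abs().mean()  # Eq. (9): total variation
    return l_l1 * loss_l1 + l_stl * loss_stl + l_perc * loss_perc + l_tv * loss_tv
```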
C. IMPLEMENTATION DETAILS
ResNet34 [25] has been used as the backbone network of both our model and the re-implemented EnsNet [1] for a fair comparison. We adopt four skip connections from the backbone network to the decoding features of the wild text segmentation and selective background inpainting modules. We use bilinear up-sampling followed by convolutional layers with kernel = 3, pad = 1, and stride = 1 to upscale features in the segmentation module. For the selective background inpainting module, we use gated convolution [5], which is well known to be effective for inpainting irregularly shaped regions. Wild text in an image is usually irregularly shaped, so the gated convolution is much more effective than the standard convolution for this task. The gated convolution is composed of two convolution operations, $f_{gate}$ and $f_{feat}$, that output the gating weights and the features with the same dimension, respectively. The gated convolution is defined as

$C(x) = \phi(f_{feat}(x)) \otimes \sigma(f_{gate}(x)),$

where $C$ is the gated convolution operation, $\phi$ is an activation function, $\sigma$ is the sigmoid function, and $x$ is an input. Exponential linear units [37] are used as the activation functions. We use transposed convolutional layers with kernel = 4, pad = 1, and stride = 2 for upsampling. While ResNet34 uses batch normalization (BN) [38] for feature normalization, instance normalization (IN) [39] is known to be powerful in image-to-image translation tasks, such as style transfer [40], domain transfer [41], and image inpainting [4]. Similarly, we observe that IN outperforms BN in scene text removal models. Therefore, we use IN instead of BN after every convolutional layer in every model. In gated convolution blocks, IN is applied after the element-wise multiplication. Both weighting values $\lambda_m$ and $\lambda_i$ are empirically set to 1. $\lambda_{L1}$, $\lambda_{stl}$, $\lambda_{perc}$, and $\lambda_{tv}$ are set to 6, 120, 0.05, and 0.1, respectively, in the same way as in [6].
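A minimal sketch (assumed PyTorch; layer sizes are illustrative) of a gated convolution block as described above:

```python
# Two parallel convolutions produce features and gating weights; the gate is
# squashed with a sigmoid, and instance normalization is applied after the
# element-wise product, with ELU as the activation.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, stride: int = 1, pad: int = 1):
        super().__init__()
        self.f_feat = nn.Conv2d(c_in, c_out, k, stride, pad)
        self.f_gate = nn.Conv2d(c_in, c_out, k, stride, pad)
        self.act = nn.ELU()                    # exponential linear unit
        self.norm = nn.InstanceNorm2d(c_out)   # IN applied after the element-wise product

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.act(self.f_feat(x)) * torch.sigmoid(self.f_gate(x)))
```

For example, GatedConv(64, 64)(torch.rand(1, 64, 128, 128)) returns a feature map of the same spatial size, with the gate able to suppress responses in regions that should pass through unchanged.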
While single-net models rely only on pairwise data of images with text and their pseudo-backgrounds, the proposed model, a two-headed architectural solution with the blending loss, additionally utilizes real-world datasets with wild text. Half of each training batch for the proposed model consists of images from the ICDAR13 [7] and Total Text [8] datasets, trained with $L_m$ to localize challenging wild text in real scenes. The other half of the batch consists of pairwise data from our synthetic dataset, trained with the complete blending loss. Single-net models use a full batch of pairwise data. All of the implemented models are trained with a batch size of 16 and an image resolution of 512 × 512, with random crops and scale augmentation with a ratio from 0.5 to 3.

IV. EXPERIMENTS
In this section, we describe the datasets, evaluation metrics, and experimental results.

A. DATASETS
The ICDAR13 [7] dataset is one of the most popular scene-text detection benchmarks. The dataset has 229 training images and 233 test images, including text stroke and text box annotations.
The Total Text [8] dataset is a scene-text dataset for dealing with various orientations of text. It contains 1,255 training images and 500 test images. Total Text provides text stroke, text-bounding polygon, and text box annotations.
The SynthText++ dataset is a synthetic scene-text dataset we generated from SynthText [24] to provide richer annotations. SynthText is widely used in text detection [21], [22] and text recognition [42]. We generated the synthetic images by rendering text on text-free images and split them into 93,888 training images and 483 test images without sharing background images. We modified the SynthText engine to additionally produce text stroke annotations. SynthText++ provides scene-text images with text stroke, text box, and background GT.
The EnsNetSet dataset is the dataset introduced in EnsNet [1], which includes 8,000 training images and 800 test images. Since its text stroke and text box annotations have not been released, we used this dataset only to evaluate and compare the scene text removal models.
The SynthText++ dataset was used to train single-net models. The ICDAR13, Total Text, and SynthText++ datasets were used to train our models. The test portion of every dataset was used for the evaluation of all models.

B. EVALUATION METRICS
We used two metrics for evaluation based on the previous works [1], [20]. First, we applied text detection methods to text-removed images and original images from the ICDAR13 and Total Text datasets. We measured the performance degradation of the detectors to estimate the degree of text erasing. Two popular text detectors (EAST [21] and CRAFT [22]) were used on the ICDAR13 dataset. CRAFT was used on the Total Text dataset because it detects irregularly shaped text in the form of polygons.
Second, to measure the image quality of text-removed images, Mean Squared Error (MSE) and Structural Similarity (SSIM) [43] were used. MSE measures the mean squared error between two images, with the lower MSE values indicating better text removal performance. SSIM measures the mean structural similarity index between the two images, with the higher SSIM values indicating better text removal performance.
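As a reference for how these scores can be computed, here is a small sketch using NumPy and scikit-image's structural_similarity (our assumption of tooling; argument names may differ across library versions).

```python
# MSE (lower is better) and SSIM (higher is better) between a text-removed
# output and the background ground truth, both as uint8 RGB arrays (H, W, 3).
import numpy as np
from skimage.metrics import structural_similarity

def mse(img_a: np.ndarray, img_b: np.ndarray) -> float:
    return float(np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2))

def ssim(img_a: np.ndarray, img_b: np.ndarray) -> float:
    return float(structural_similarity(img_a, img_b, channel_axis=2, data_range=255))
```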

C. EXPERIMENTAL RESULTS
In Table 1, we report the performance degradation of text detection on text-removed images. Lower detection performance means better scene text removal performance; thus, this measure represents how well text is erased by each scene text removal method. The experiment was conducted on the ICDAR13 and Total Text datasets with two text detectors: EAST [21] and CRAFT [22]. We first evaluated text detection performance on the original images with text. Then, we evaluated text detection performance on the text-removed images produced by each scene text removal method and measured the degree of performance degradation. Notably, recall is an important metric for measuring the amount of detectable text remaining in the text-removed outputs.

TABLE 1. Text detection performance degradation. The two text detectors, EAST [21] and CRAFT [22], are evaluated on the text-removed results of text removal models (Scene-Text-Eraser [20], EnsNet [1], and ours) using the ICDAR13 [7] and Total Text [8] datasets. The CRAFT performance of EnsNet is evaluated from our re-implemented EnsNet. Lower text detection performance means better text removal performance. 'Ours (separate training)' denotes when the text stroke segmentation and background inpainting tasks are trained separately. 'Ours' is when the two tasks are jointly trained with the blending loss. Figure 4 shows how joint training effectively removes wild text. See Section IV-C2 for further discussion.

TABLE 2. Measurement based on image quality. We measured the MSE and SSIM scores of the scene text removal models using the EnsNetSet dataset. The scores of the existing methods are reported from the original EnsNet [1] paper. EnsNet* shows the reproduced results from our re-implementation.

1) COMPARISONS TO SINGLE-NET MODELS
In Table 1, 'Input Images' represents the original performance of the text detectors. 'Pix2Pix' [36], 'Scene-Text-Eraser' [20], and 'EnsNet' [1] are other scene text removal methods. 'Single-net' is the single-net structure shown in Figure 3 (a), which is trained with a sole inpainting loss function. 'Ours' is the proposed two-task blending structure with the novel end-to-end blending loss. 'Ours' outperformed every single-net model, including Scene-Text-Eraser [20], Pix2Pix [36], EnsNet [1], and our baseline 'Single-net', on both datasets and with both detectors. The results verify that our method has superior scene text removal performance compared to the state-of-the-art methods. To train a single-net model on scene-text data, one must manually erase all text in the images to generate the pseudo-background GT. This manual text-erasing process with photo-editing tools such as Photoshop requires heavy, professional human labor; thus, EnsNet was trained on only 1,000 images from the ICDAR17 text detection dataset, which results in poor text localization performance. Meanwhile, our method with $L_m$ (Eq. (4)) can easily utilize a wide variety of real data with text stroke and text box annotations, which are popular and already available annotations for text detection. Therefore, our method is more scalable and easier to generalize to different domains of text than single-net models.

Table 2 shows the scene text removal performance on the EnsNetSet dataset [1]. Two metrics (MSE and SSIM) were used to measure the image quality of text-removed images. We re-implemented EnsNet trained on SynthText++, denoted as EnsNet*, for a fair comparison with 'Single-net' and 'Ours'. Our method outperforms our baseline 'Single-net', the existing state-of-the-art methods [1], [20], [36], and the re-implemented EnsNet* on all metrics.

2) THE SUPERIORITY OF JOINT TRAINING OVER SEPARATE TRAINING
To demonstrate the effectiveness of the joint training scheme, we introduce another variant, 'Ours (separate training)' (see Figure 4 (a) and Table 1). 'Ours (separate training)' has a network structure identical to 'Ours', but its segmentation and inpainting modules are trained separately by replacing $L_m$ (Eq. (4)) of 'Ours' with $L_s$ (Eq. (2)) and blocking the back-propagation path from $\hat{I}_{out}$ to $\hat{M}$. As shown in Figure 4 (a), 'Ours (separate training)' does not erase ambiguous text boundaries, leaving clearly legible text in the output. Accordingly, 'Ours (separate training)' shows the worst performance in Table 1, leaving the most detectable text.
Note that the proposed two-task structure with separate training exhibits worse performance than the single-net model: the single-net model is trained to produce the text-removed image directly, whereas the separately trained two-task structure erases only the text stroke region and does not erase the ambiguous region. Thus, the benefit of the two-task structure cannot be achieved without the proposed blending operation in our loss function.

3) QUALITATIVE COMPARISONS
For qualitative comparisons, we display our results on challenging scene-text images from the ICDAR13 and Total Text datasets in Figure 7. Our method was compared with the state-of-the-art method EnsNet [1]. The results in Figure 7 show that our model outperforms EnsNet on challenging real scenes with wild text by training on scene-text data with text location annotations and the explicit text stroke segmentation loss function $L_m$ (Eq. (4)).

FIGURE 7. Qualitative results. The figure shows input images ($I_{in}$) and text-removed images ($\hat{I}_{out}$) of EnsNet [1] and ours on the ICDAR13 and Total Text datasets. We also provide the image difference ($|I_{in} - \hat{I}_{out}|$) to visualize where the input image is modified.
In the next section, we perform region-wise analysis to verify whether our method outperforms scene text removal models based on the single-net approach.

D. REGION-WISE ANALYSIS
In order to detect damage to non-text regions, it is necessary to divide the image into regions when measuring image quality. Denoting $Q$ as an image quality metric (either MSE or SSIM), the EnsNet [1] paper evaluated image quality only on the full image, $Q(\hat{I}_{out}, I_{bg})$, as shown in Table 2. Text usually occupies a small portion of an image, so this score largely depends on the non-text area that occupies most of the image. Using the regions denoted in Figure 8, we further measured and reported image quality in two more regions, the regions inside and outside the text boxes of $\tilde{M}$. The image quality evaluations of the additional regions are defined as

Non-text regions: $Q\big((1 - \tilde{M}) \otimes \hat{I}_{out}, (1 - \tilde{M}) \otimes I_{bg}\big)$,

Text regions: $\frac{1}{n} \sum_{i=1}^{n} Q(\hat{I}_{out,i}, I_{bg,i})$,

where $\hat{I}_{out,i}$ and $I_{bg,i}$ are the $i$-th text box regions, identically cropped and warped into rectangular images from $\hat{I}_{out}$ and $I_{bg}$, and $n$ is the number of text boxes in the image. The evaluation on text regions indicates how accurately a model localizes text regions and how plausible the generated background is. The evaluation on non-text regions indicates how robust a model is to text-like textures and how accurately it reconstructs the input. We evaluated the image quality of the two additional regions only on the SynthText++ dataset, because the EnsNetSet dataset does not provide text box annotations.
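A simplified sketch (assumed NumPy) of this region-wise evaluation is given below; `metric` stands for MSE or SSIM, `box_mask` for the text-box GT, and `boxes` for axis-aligned boxes (x0, y0, x1, y1). The axis-aligned crops are a simplification of the cropped-and-warped rectangular regions described above.

```python
# Non-text score: mask out all text boxes before applying the metric.
# Text score: average the metric over the cropped text-box regions.
import numpy as np

def nontext_score(metric, i_out: np.ndarray, i_bg: np.ndarray, box_mask: np.ndarray) -> float:
    keep = (box_mask < 0.5)[..., None]       # (H, W, 1): pixels outside every text box
    return metric(i_out * keep, i_bg * keep)

def text_score(metric, i_out: np.ndarray, i_bg: np.ndarray, boxes) -> float:
    scores = [metric(i_out[y0:y1, x0:x1], i_bg[y0:y1, x0:x1])
              for (x0, y0, x1, y1) in boxes]
    return float(np.mean(scores))
```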
TABLE 3. Region-wise image quality analysis. We measured the MSE and SSIM scores on the SynthText++ dataset in three regions: the full-image, text, and non-text regions. 'Pretrain' means that a model is pretrained with stroke segmentation data as prior knowledge.

Table 3 shows the performance of our variants when evaluated on the SynthText++ dataset. We evaluated the image quality metrics in three different regions: full-image, text, and non-text regions. We investigated the effectiveness of pretraining with the text stroke segmentation task, referred to as 'Pretrain' (see Section III-B1), and of the two-task blending structure, referred to as 'Ours'. We present 'Single-net', 'Ours w/o Pretrain', 'Ours', and, for reference, 'EnsNet' [1]. We found that 'Ours' outperformed 'Ours w/o Pretrain', which implies that the prior knowledge from text localization provides useful clues for scene text removal. While the image quality improves gradually in the text region from 'Single-net' to 'Ours', there is a greater improvement in image quality between 'Single-net' and 'Ours w/o Pretrain' in the full-image and non-text regions. This shows that the single-net models are poor at reconstruction in non-text regions, whereas 'Ours' is robust to text-like textures, as it bypasses non-text regions from the input with explicit wild text localization and selective background inpainting. Figure 9 shows the qualitative results of EnsNet and our method, with an image difference map to visualize where the input image is modified. While EnsNet fails to detect complex text and erases false-positive text-like textures, our method captures and removes only the text, filling it with a plausible background.

V. CONCLUSION
To address wild-text-related problems in scene text removal, we have designed a novel loss function with a two-task blending structure. The elaborately designed loss function enables the segmentation task to include the ambiguous text boundaries that appear frequently in the wild and enables the inpainting task to selectively restore a plausible background only in wild text regions. The proposed method has additional merit in that a wide variety of text location data can be easily utilized in training for robust wild text localization. As validated in the experiments, the proposed method shows outstanding capability in removing wild text regions and outperforms the existing methods. In the future, the proposed method could be efficiently utilized for various applications in the wild.

JIN YOUNG CHOI (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in control and instrumentation engineering from Seoul National University, Seoul, South Korea, in 1982, 1984, and 1993, respectively. From 1984 to 1989, he was with the Project of TDX Switching System, Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea. From 1992 to 1994, he was with the Basic Research Department, ETRI, where he was a Senior Member of the Technical Staff involved in neural information processing systems. Since 1994, he has been with Seoul National University, where he is currently a Professor with the School of Electrical Engineering. From 1998 to 1999, he was a Visiting Professor with the University of California at Riverside, Riverside, CA, USA. He is also with the Automation and Systems Research Institute, the Engineering Research Center for Advanced Control and Instrumentation, and the Automatic Control Research Center, Seoul National University. His current research interests include adaptive and learning systems, visual surveillance, motion pattern analysis, object detection and tracking, and pattern learning and recognition.