A Scene-Text Synthesis Engine Achieved Through Learning From Decomposed Real-World Data

Scene-text image synthesis techniques that aim to naturally compose text instances on background scene images are very appealing for training deep neural networks due to their ability to provide accurate and comprehensive annotation information. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces using rules derived from real-world observations. Some of these studies have proposed generating scene-text images through learning; however, owing to the absence of a suitable training dataset, unsupervised frameworks have been explored to learn from existing real-world data, which might not yield reliable performance. To ease this dilemma and facilitate research on learning-based scene text synthesis, we introduce DecompST, a real-world dataset prepared from some public benchmarks, containing three types of annotations: quadrilateral-level BBoxes, stroke-level text masks, and text-erased images. Leveraging the DecompST dataset, we propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts the suitable regions for text embedding, after which TAANet adaptively adjusts the geometry and color of the text instance to match the background context. After training, those networks can be integrated and utilized to generate the synthetic dataset for scene text analysis tasks. Comprehensive experiments were conducted to validate the effectiveness of the proposed LBTS along with existing methods, and the experimental results indicate the proposed LBTS can generate better pretraining data for scene text detectors. Our dataset and code are made available at: https://github.com/iiclab/DecompST.


I. INTRODUCTION
Deep neural networks have demonstrated remarkable success in the field of scene text detection and recognition, yet their performance heavily depends on the quantity and quality of the labeled training data. However, manual collection and labeling of images are costly in terms of both time and resources, so automatic data generation is desirable. The image synthesis technique that composes text instances on background images offers a cost-effective and scalable alternative to manual annotation, and this approach has attracted increasing interest in the computer vision community.
Various approaches have been investigated in the development of generation engines for synthetic scene-text images. Initially, based on the observation of real-world data, a set of sophisticated rules was proposed to guide the design of generation engines. Gupta et al. [1] and Zhan et al. [2] generated synthetic text images from two-dimensional (2D) background images based on different strategies such as region selection, text warping, and text color matching. Liao et al. [3] and Long et al. [4] further proposed rendering text on the surface of models in three-dimensional (3D) virtual worlds using Unreal Engine. Although realistic occlusions, perspectives, and illuminations can be realized in 3D engines, there is still a gap between the virtual and real worlds. To eliminate heuristic rules and complex setups, Yang et al. [5] proposed a learning-based method consisting of a location module and an appearance module. The location module employs a conditional variational auto-encoder (cVAE) [6] to learn the distribution of text locations directly from the original scene-text image and corresponding text bounding boxes (BBoxes). During training, the cVAE takes a scene-text image as input, whereas during inference, a pure background image is used as input. The "condition" therefore differs between training and inference, which is inconsistent and may limit performance.

Fig. 1. Concept of our proposal. We first decomposed the real-world scene-text image into a text layer and a background layer. Next, we applied data augmentation to perturb the geometry and color of the text layer. Then, we proposed a Learning-based Text Synthesis Engine to recompose the two layers back to their original natural relationship, so that the engine can learn the complicated layout and appearance of text instances from real-world scene images.

This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 19K12033 and Grant 22H00540.
The authors are with the Graduate School of Engineering, Tohoku University, Sendai, 980-8579, Japan. (E-mail: tzm@dc.tohoku.ac.jp, tomo@tohoku.ac.jp, machi@ecei.tohoku.ac.jp)
In this study, we aim to address the challenge of inadequate training data and facilitate learning-based text synthesis methods. To this end, we propose the DecompST dataset, which enables the decomposition of real-world scene text images into pure background images and pure text instances.
These decomposed data can be utilized to train robust neural networks to learn the complicated layout and appearance of text instances in real-world scene images. The overall concept is illustrated in Fig. 1. Building upon the DecompST dataset, we propose a Learning-Based Text Synthesis engine (LBTS) that mainly includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts suitable regions from the background images for text embedding. TAANet then adaptively changes the perspective and color of the synthetic text instance to match the background. Once the networks have been effectively trained, an integrated data generation pipeline can be built to produce a scalable volume of synthetic data, which can subsequently be utilized as training data for various scene-text analysis tasks.
The main contributions of our study are summarized as follows:
• We introduce the DecompST dataset, which is able to decompose real-world scene-text images into separate pure background images and text instances, for the training of learning-based scene-text synthesis methods.
• We propose a learning-based scene-text image synthesis engine (LBTS) that consists of a text location proposal network and a text appearance adaptation network, to generate realistic synthetic scene-text images.
• The quality of our generated dataset and other existing synthetic datasets is evaluated by the performance of a baseline text detector. The experimental results demonstrate that our method can generate better pretraining data for scene text detectors than other state-of-the-art methods.

The structure of this paper is organized as follows. Section II reviews related studies on scene-text detection, image synthesis, and data augmentation for scene-text analysis. Section III provides details about the proposed DecompST dataset. Section IV introduces the proposed method, including the flow of data preprocessing and the structure of the two networks. In Section V, we evaluate and compare our proposed method with related synthetic datasets based on experimental results. Finally, concluding statements are presented in Section VI.

II. RELATED WORK

A. Scene Text Detection
With the rise of deep learning, scene text detection has been dramatically reshaped and facilitated, showing promising performance compared to traditional manual feature engineering algorithms [7]-[11]. Recent learning-based scene text detection methods have been inspired by general object detection and image segmentation methods, and can be roughly categorized into regression-based and segmentation-based methods. Regression-based methods aim to predict the bounding boxes of text instances directly. TextBoxes [12] modified the anchors in SSD [13] to handle text with various aspect ratios. CTPN [14] combines the framework of Faster R-CNN [15] with a recurrence mechanism to predict contextual and dense fixed-width proposals of text. RRPN [16] proposes a rotation region proposal based on Faster R-CNN to bound arbitrarily oriented text with rotated rectangles. EAST [17] proposes a simplified detection pipeline that directly regresses rotated rectangles or quadrangles of text without using anchors. LOMO [18] improved the performance of EAST on long text and arbitrarily shaped scene text by iteratively refining the preliminary proposals and considering the geometric properties of scene text.
Segmentation-based methods usually first extract text from the segmentation map and then compute the text bounding boxes by post-processing. Zhang et al. [19] integrated semantic labeling using FCN and MSER for pixel-level multi-oriented text detection. Mask TextSpotter [20] was inspired by the framework of Mask R-CNN [21] and performed character-level instance segmentation for each alphabetic character; thus, it has the ability to detect and recognize irregular text. TextSnake [22] proposed a novel and flexible representation of arbitrarily shaped text and predicted heat maps of text centerlines, text regions, radii, and orientations to extract text instances. PSENet [23] gradually expanded small text kernels to complete shapes using multiple segmentation maps to effectively split close text instances. Liao et al. [24] proposed a differentiable binarization (DB) module in a simple segmentation network to perform binarization. CRAFT [25] exploited the affinity between characters in the form of a heat map and proposed a weakly supervised framework to estimate character-level ground truths in existing real word-level datasets. ACE [26] proposed to evolve the key points of the horizontal bounding box towards the corner points to detect arbitrarily oriented objects or text.

B. Image Synthesis
Inserting foreground objects into a background image is one of the most common image synthesis approaches for generating a photo-realistic composite image, which may face inconsistency problems between the foreground and background in the geometry and appearance domains. To solve these inconsistency problems, many subtasks have been investigated, such as object placement, image blending, image harmonization, and shadow generation. Before the deep-learning era, many researchers explored automated image blending and harmonization. These methods transfer the color from one image to another based on the low-level statistics of the images, such as color distribution or histograms [27]-[29], gradient-domain information [30]-[33], and multi-scale statistical features [34], among others.
With the emergence of neural networks, more challenging tasks have been investigated. ST-GAN [35] seeks the geometric realism of image compositing by integrating a generative adversarial network (GAN) and spatial transformer networks (STNs) [36] to warp the foreground object in an iterative fashion. SF-GAN [37] combines an STN and CycleGAN [38] to perform geometry transformation and appearance domain translation concurrently with an end-to-end trainable network. Benefiting from the designed structure, SF-GAN can also achieve synthesis realism in both geometry and appearance spaces without using paired training data. GCC-GAN [39] was proposed to address geometric and color consistency in composite images by integrating four subnetworks: a transformation network, a refinement network, a discriminator network, and a segmentation network. In the transformation network, not only are the parameters of the transformation matrix predicted, but the parameters of linear color transformation that control the contrast and brightness are also predicted simultaneously. Tsai et al. [40] introduced an end-to-end image harmonization network with a shared encoder and two decoders, where the learned semantic information was used to facilitate harmonization. Inspired by AdaIN [41], Ling et al. [42] treated image harmonization as a background-to-foreground style transfer problem and proposed a plug-and-play region-aware adaptive instance normalization (RAIN) module that explicitly formulates the visual style from the background and adaptively applies it to the foreground.

C. Data Augmentation for Scene Text Analysis
The text synthesis technique, which involves inserting text instances into scene background images, was initially investigated as a data augmentation approach for the training of scene text detection and recognition models. Later, synthetic datasets were utilized as important training data for other tasks such as scene text segmentation [43], [44], scene text erasing [45], [46], and scene text editing [47], [48].
Wang et al. [49] generated a character-centered synthetic image to train a character-level scene-text recognition model. Jaderberg et al. [50] generated a word-centered synthetic dataset using a set of predefined random processes, including font selection and rendering, bordering/shadowing and coloring, layer composition, projective distortion, blending, and noise addition. SF-GAN [37] was trained without paired data because of its unsupervised pipeline, which can also be applied in text synthesis tasks to generate patch-level synthetic text images. Yim et al. [51] further analyzed existing synthesis techniques [1], [50] and integrated the effective parts into a new generation engine for scene text recognition tasks. These methods generate text-centered images, whose applications are limited.
Gupta et al. [1] first attempted to synthesize text in the wild to generate the SynthText dataset, which is beneficial for training scene-text detection models. The SynthText engine finds suitable text embedding regions in the background image following a set of rules that consider semantic segmentation maps and depth maps, and it renders text instances with color selection, perspective distortion, and Poisson blending [30] according to the local background information. Zhan et al. [2] exploited saliency-guided "semantic coherent" image synthesis by leveraging the annotations of semantic segmentation maps and visual saliency maps. They also designed an adaptive text appearance mechanism to determine the color and brightness of texts by matching a list of pairs, each of which includes the HoG feature of the background and the LAB-space statistics of text, gathered from real scene-text images. Yang et al. [5] proposed a learning-based, data-driven text synthesis engine by dividing the text synthesis into two sub-tasks: 1) determining the location of text and 2) making the appearance of the inserted text more realistic. A conditional variational autoencoder [6], [52] was utilized to learn the distribution of text locations from real-world data, and a masked Cycle-GAN [38] was proposed to translate the appearance of synthetic images to the real-data domain. In contrast to rendering text in 2D static images, Liao et al. [3] and Long et al. [4] render the text and the scene as a whole in 3D virtual worlds using Unreal Engine. In this way, real-world variations, including complex yet correct perspective distortions, various lighting conditions, and occlusions, can be realized in the synthesized scene text images.
In terms of learning-based methods for synthesizing scene-text images, our method is closely related to the method proposed in [5]. Their approach samples latent vectors from the prior distribution and feeds them to a cVAE to directly output the affine transformation parameters, which are used to globally transform the location and perspective of text instances. However, owing to the direct use of scene text images and the corresponding text BBoxes for training, the "condition" of the cVAE differs between training and inference, which may lead to unsatisfactory performance. Our proposed DecompST dataset can address this problem by providing data pairs of text-erased images and original text BBoxes.
Another closely related method is presented in [37], which can concurrently achieve realism in both geometry and appearance spaces without supervision by employing an innovative network structure. In addition, that method can generate patch-level synthetic text images for scene-text recognition tasks. In contrast to their work, our proposed method is a fully supervised image synthesis method that leverages the DecompST dataset, aiming to train more robust networks to generate image-level synthetic scene-text images specifically for the text detection task.

III. DECOMPST DATASET
We introduce a dataset called DecompST, which consists of quadruplets of original scene-text images, text BBoxes, text-erased images, and stroke-level text masks. With this dataset, real-world scene-text images can be decomposed into pure background images and text instances, as shown in Fig. 2. Those components can be utilized to train a robust network to learn the complicated layout and appearance of text instances in real-world scene images.

A. Image Collection
All the images in our dataset were collected from several public real-world scene text detection benchmarks, including the ICDAR-2015 [53], MLT-2019 [54], and TextSeg [55] datasets. The ICDAR-2015 [53] and MLT-2019 [54] datasets are classic benchmarks for scene text detection. The TextSeg [55] dataset, on the other hand, specifically focuses on scene text segmentation. It provides comprehensive annotations encompassing quadrilateral BBoxes at both word and character levels, along with pixel-level text masks. We opted to use the TextSeg dataset because its manually labeled, high-quality pixel-level text masks align with our requirements for stroke-level text masks. For each dataset, we collected both the training and validation sets, but we only selected the Latin and Chinese parts of the MLT-2019 [54] dataset and the scene-image part of the TextSeg [55] dataset.

B. Annotation Details
This section provides a detailed description of the annotation process applied to create the DecompST dataset. For each text instance in the collected images, our goal was to obtain the corresponding text-erased patch and stroke-level text mask. Since the text instances in images are already labeled by BBoxes, we utilized a word-level scene-text-erasing method [46] to erase each text instance individually and generate text-erased images. To obtain the stroke-level text masks of the ICDAR-2015 [53] and MLT-2019 [54] datasets, we employed the stroke mask prediction module (SMPM) in [46] to extract the pixel-level text mask. However, as the original SMPM was designed to predict a dilated text mask, we retrained the SMPM using the same synthetic dataset [46], but with original-size text masks as ground truth. Subsequently, this retrained SMPM was utilized to predict text masks that precisely fit the text instances. Given that predictions made by neural networks can sometimes be imperfect, it is necessary to manually label the quality of the predicted results.
Our labeling criteria for text-pixel images focused on the readability of text and the integrity of the text mask. As for text-erased images, we assessed the quality based on the effectiveness of text erasure and the restoration of the background. During the annotation process, the annotators checked the text-pixel image and text-erased image of each text instance and labeled the quality of each as 1 or 0, where 1 indicated good and 0 indicated bad. Only text instances that received 1 for both images were considered valid data, and all other data were deemed invalid. For the TextSeg dataset, because accurate pixel-level text masks were provided, all text masks were labeled as 1, and we only assessed the quality of the text-erased image, assigning a label of 1 or 0.
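The validity rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout and the field names `mask_ok` and `erase_ok` are hypothetical and are not taken from the released dataset format.

```python
def select_valid_instances(annotations):
    """Keep only text instances whose text mask AND text-erased patch were both labeled 1.

    `annotations` is a list of dicts; the keys "mask_ok" and "erase_ok" are
    hypothetical names for the two binary quality labels.
    """
    return [a for a in annotations if a["mask_ok"] == 1 and a["erase_ok"] == 1]


# Example: only the first instance passes both quality checks.
records = [
    {"id": 0, "mask_ok": 1, "erase_ok": 1},
    {"id": 1, "mask_ok": 1, "erase_ok": 0},
    {"id": 2, "mask_ok": 0, "erase_ok": 1},
]
valid = select_valid_instances(records)
```

For TextSeg-style samples, where the mask label is always 1, the same rule reduces to checking the text-erased label alone.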
Finally, the DecompST dataset contains 4585 images with 16017 valid text instances with corresponding text-erased images, stroke-level text masks, and quadrilateral bounding boxes, as summarized in Table I. Visual samples of annotated instances from the DecompST dataset are presented in Fig. 3.

IV. METHODOLOGY
In this section, we present our proposed learning-based text synthesis (LBTS) method, which mainly consists of two subnetworks: the text location proposal network (TLPNet) and the text appearance adaptation network (TAANet), as illustrated in Fig. 4. More concretely, during training, given a text-erased image, TLPNet first predicts suitable regions for text embedding. Then, a perturbed text layer is added, and TAANet adaptively adjusts the perspective and color of the perturbed text layer to restore its original natural appearance. After training, we can feed the two networks with unseen background images and plain text images to generate synthetic scene-text images. Further details regarding the network structure, training process, and inference strategy are presented in the following subsections.

A. Text Location Proposal Network
1) Data Preprocessing in Training:
Undoubtedly, the regions within the original BBoxes can be regarded as the ground truth of the text region for learning. Furthermore, we consider that the feasible region for text embedding could be extended if the background shares a similar pattern in a neighboring area, especially in the case of scene text that usually appears in relatively plain regions, such as billboards, walls, and signs. To identify the regions that have a similar appearance to the text-erased regions, we adopted the concept of the appearance descriptor and appearance distance from InstaBoost [56] to measure the appearance similarity between text-erased regions and all other regions within an image. The appearance descriptor $D(\cdot)$ is a combination of three weighted regions $R_i$ of each valid text instance in the text-erased image, which is related to the corresponding text location. Given $(p_x, p_y)$ as the center of the instance, $R_1$ denotes the region of the stroke-level mask, and $R_2$ and $R_3$ are the dilated contours of the stroke-level mask with different scales ($R_2$ is the inner contour). $w_i$ is the weight coefficient of $R_i$, and $w_1 > w_2 > w_3$ is defined to emphasize the higher similarity around the inner neighboring areas of the original text instance. Fig. 5(b) shows some examples of visualizations of the descriptor's regions $R_i$ and weights $w_i$. Next, given a target text appearance descriptor $D_t(p_{tx}, p_{ty})$, we assess the appearance similarity between the appearance descriptor of each pixel in the text-erased image and $D_t$ using the appearance distance. The appearance distance $d_{D_t}(x, y)$ for a given pixel $(x, y)$ is conditioned on $D_t$. In its formulation, BBOX is the area inside the original text BBox, $I(x, y)$ denotes the RGB value of the text-erased image at pixel coordinates $(x, y)$, and $\Delta$ is the Euclidean distance. The result of $\Delta$ is counted as infinity if $(x_s, y_s)$ is outside the boundary of the text-erased image.
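The descriptor-based distance described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the distance is a weighted sum of Euclidean RGB differences between corresponding pixels of the three regions, in the spirit of InstaBoost [56]. The function name `appearance_distance`, the representation of each region as a list of offsets from the descriptor center, and the sample weights are hypothetical, and the role of the BBOX term is not modeled.

```python
import math


def appearance_distance(image, regions, weights, target_center, candidate):
    """Weighted appearance distance between a target descriptor and a shifted copy.

    image:         2D list of (r, g, b) tuples, indexed as image[y][x]
    regions:       [R1, R2, R3], each a list of (dx, dy) offsets from the center
    weights:       [w1, w2, w3] with w1 > w2 > w3
    target_center: (p_tx, p_ty), center of the original text instance
    candidate:     (x, y), pixel at which the descriptor is re-centered
    """
    height, width = len(image), len(image[0])
    tx, ty = target_center
    cx, cy = candidate
    total = 0.0
    for offsets, w in zip(regions, weights):
        for dx, dy in offsets:
            xs, ys = cx + dx, cy + dy
            # A shifted sample outside the image makes the distance infinite.
            if not (0 <= xs < width and 0 <= ys < height):
                return math.inf
            # Euclidean distance between the two RGB values, weighted by w_i.
            total += w * math.dist(image[ty + dy][tx + dx], image[ys][xs])
    return total


# Toy example: a uniform 5x5 image with one differently colored pixel.
img = [[(10, 10, 10)] * 5 for _ in range(5)]
img[3][2] = (13, 14, 10)
regions = [[(0, 0)], [(1, 0), (-1, 0)], [(2, 0), (-2, 0)]]
weights = [3.0, 2.0, 1.0]
d_near = appearance_distance(img, regions, weights, (2, 2), (2, 3))
d_out = appearance_distance(img, regions, weights, (2, 2), (0, 0))
```

Evaluating this function at every pixel of the text-erased image would give a distance map of the kind used in the next step.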
By gathering the appearance distance of each pixel conditioned on the target text instance, we construct the target text appearance distance map $H_d^t$. Let $H(x, y)$ denote the value of a map $H$ at pixel coordinates $(x, y)$; consequently, $H_d^t(x, y) = d_{D_t}(x, y)$. We generate the corresponding appearance consistency heatmap $H_a^t$ by applying a normalization function (Eq. 3) to every pixel of $H_d^t$. Here, $d_{max}$ is the maximum finite value in $H_d^t$; during the calculation of Eq. 3, infinite values are set to $d_{max}$.
For each text instance in an image, we calculate the corresponding appearance consistency heatmaps and combine them into $H_a$, where $k \in W$ is the index of the text instance and $W$ denotes the set of valid text instances in the text-erased image. Up to this point, the appearance consistency heatmap $H_a$ only takes into account the color similarity between patches of valid text instances and other patches in a text-erased image. Therefore, $H_a$ is redundant and lacks semantic information. To address this limitation, we propose a further processing method for $H_a$ that incorporates the semantic information provided by the edge map. First, we compute the difference between the heatmap $H_a$ and the Sobel edge map. This operation can divide $H_a$ using edge information, but it may also disrupt the original BBOX regions. To ensure that the original BBOX regions are completely preserved in the result, we use an operation that yields the heatmap $H_e$. In this operation, $I$ is the text-erased image, Sobel is the Sobel edge detection operation, $\lambda$ is the weight required to balance the segmentation degree, and $H_{BBOX}$ is a heatmap in which pixels inside the valid text BBoxes are set to 1.0 and all other pixels are set to 0.
Then, the heatmap $H_e$ is further segmented by thresholding with a constant threshold $T$ to obtain $H_t$. In our implementation, $T$ and $\lambda$ were set to 0.75 and 5.0, respectively. Next, we compute all connected components in $H_t$ and mark them as $S_j$, where $j$ is the index of each segmented region. We filter out small regions and regions that do not contain a high appearance consistency score in $S_j$ to ensure that the final text insertion regions are extensions of the BBOX regions. Finally, we set the values of pixels inside the remaining $S_j$ to 1 and inpaint the small holes to generate the final heatmap $H_f$ as the ground truth for the training of TLPNet. The processing flow of the appearance consistency heatmap is shown in Fig. 5. Through our preprocessing, BBox-based text regions are extended into semantic-based ones by considering the similarity of the regions' appearance.
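The edge-aware segmentation steps above can be sketched as follows. This is a simplified illustration under explicit assumptions, not the authors' code: it assumes that $H_e$ takes the pixel-wise maximum of the edge-subtracted heatmap and $H_{BBOX}$, that thresholding uses `>=`, that components are 4-connected, and that a component is kept when it is large enough and touches an original BBox pixel (a stand-in for the "high appearance consistency score" criterion). Hole inpainting is omitted, and `segment_heatmap` and `min_area` are hypothetical names.

```python
from collections import deque


def segment_heatmap(h_a, edge, h_bbox, lam=5.0, thresh=0.75, min_area=2):
    """Return a binary map of candidate text regions from 2D lists of equal size."""
    rows, cols = len(h_a), len(h_a[0])
    # Assumed form of H_e: subtract weighted edges, then restore BBox pixels.
    h_e = [[max(h_a[y][x] - lam * edge[y][x], h_bbox[y][x]) for x in range(cols)]
           for y in range(rows)]
    # Thresholding gives the binary map H_t.
    h_t = [[1 if h_e[y][x] >= thresh else 0 for x in range(cols)] for y in range(rows)]

    h_f = [[0] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            if h_t[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first search collects one 4-connected component S_j.
            queue, component = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and h_t[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep components that are large enough and extend an original BBox.
            touches_bbox = any(h_bbox[cy][cx] == 1 for cy, cx in component)
            if len(component) >= min_area and touches_bbox:
                for cy, cx in component:
                    h_f[cy][cx] = 1
    return h_f


# Toy example: an edge at x = 3 separates the BBox side from the right side.
h_a = [[0.9, 0.9, 0.9, 0.9, 0.9, 0.9]]
edge = [[0.0, 0.0, 0.0, 0.1, 0.0, 0.0]]
h_bbox = [[1, 1, 0, 0, 0, 0]]
h_f = segment_heatmap(h_a, edge, h_bbox)
```

In the toy example, the region grows from the two BBox pixels to the neighboring similar pixel but stops at the edge, which mirrors the intended extension of BBox regions into visually consistent neighbors.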
2) Network Structure of TLPNet: Given a background image $I_{bg}$, TLPNet aims to segment the mask of the text region $H_f$, which is suitable for text embedding. We adopted the segmentation head of DB [24] and used ResNeXt-50 [57] as the backbone for our TLPNet, which is illustrated in Fig. 6. During training, we used a binary cross-entropy (BCE) loss and a DICE loss, in which $S$ and $T$ represent the prediction and ground truth of the mask image, respectively, and $N$ denotes the total number of pixels in the image. $\hat{H}_f$ and $H_f$ are the prediction and ground truth of TLPNet, respectively. $\lambda_0$ is set to 10 in our implementation.
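For readers who want a concrete reference, the two loss terms can be sketched in plain Python using their standard textbook definitions. This is a hedged illustration only: the exact Dice variant used by the authors is not recoverable from the text, and the assumption that $\lambda_0$ scales the Dice term (rather than the BCE term) is ours. The names `bce_loss`, `dice_loss`, and `tlpnet_loss` are hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over N pixels (pred in [0, 1], target in {0, 1})."""
    total = 0.0
    for s, t in zip(pred, target):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += t * math.log(s) + (1 - t) * math.log(1 - s)
    return -total / len(pred)


def dice_loss(pred, target, eps=1e-7):
    """Standard Dice loss: 1 - 2|S ∩ T| / (|S| + |T|)."""
    intersection = sum(s * t for s, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def tlpnet_loss(pred, target, lam0=10.0):
    """Assumed combination: BCE plus lam0 times Dice (the weighting is an assumption)."""
    return bce_loss(pred, target) + lam0 * dice_loss(pred, target)


# Flattened toy masks.
perfect = tlpnet_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
uncertain = bce_loss([0.5, 0.5], [1, 0])
```

The Dice term measures region overlap and is less sensitive to the imbalance between text and non-text pixels than BCE alone, which is a common reason for combining the two.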

B. Text Appearance Adaptation Network
We consider that the realism of text appearance has two aspects: proper perspective and harmonious color that align with the background context. To address this, our TAANet comprises 1) a geometry transformation module (GTM) and 2) a color harmonization module (CHM), as illustrated in Fig. 7. For the GTM, there are three inputs: a patch-level plain text image $P_{pt}$, a background image $I_{bg}$, and a reference rectangle Rect.

1) Data Preprocessing in Training: The first step of preprocessing is to cut out the target text instance from the text-pixel image and apply a perspective transformation to warp the target text instance into a rectangular one based on its quadrilateral-BBox annotation, so that we obtain a horizontal text instance without perspective. Subsequently, we augment the data by randomly altering the aspect ratio of the rectangular BBox and jittering the center of the rectangular BBox, to further perturb the geometric relationship between the target text instance and the background. Next, the text pixels of the target text instance are clustered into only two or three colors using K-means to remove color information and noise. In addition, we augment the data by jittering the color of the text in the HSL space. Finally, to reduce the interdependence between text instances within an image, other text instances are randomly erased in the background image. The entire process flow is shown in Fig. 8.
Based on the aforementioned processing, we can obtain the reference rectangle Rect, the patch-level plain text image $P_{pt}$, the background image $I_{bg}$, and the ground truth of the transformation matrix $A_n$ using the following operations. Rect is a square box centered on the processed target text instance. Using Rect, the target text instances before and after processing are cropped, resized, and padded to create $P_{pt}$ and the text image before processing, $P_{before}$. $A_n$ is computed based on the transformed BBox in $P_{pt}$ and the original BBox in $P_{before}$. Moreover, $P_{pt}$ is a five-channel image with RGB channels $P_{ptRGB}$, an alpha channel $P_{ptA}$, and a BBox-level mask channel of the text region $P_{ptBM}$, as shown in Fig. 9. $P_{ptBM}$ is utilized as additional information during network training, which will be discussed in a later section. Finally, $I_{bg}$ is generated by composing the remaining text in the processed text-pixel image and the text-erased image.
2) Geometry Transformation Module (GTM): The first step of the GTM is to feed Rect into a spatial transformer module (STM) [58] and generate a transformation matrix $A_m$ parameterized by $\theta_m$. $A_m$ is used to warp and pad the patch-level plain text image $P_{pt}$ into the plain-text image $I_{pt}$. Then, $I_{pt}$ and the background image $I_{bg}$ are concatenated and fed into the localization network (ResNet-34 [59]) to regress the parameters $\theta_n$ of the homography transformation matrix $A_n$. Once the transformation matrices $A_m$ and $A_n$ are obtained, they are applied to $P_{pt}$ to sample the transformed text image $I_{tt}$. In the GTM, $A_m$ is used to determine the coarse location and scale of the text based on Rect, and $A_n$ is used to transform the local perspective of the text instance. The transformation is a 2D perspective transformation $T_\theta$ applied to each pixel $G_i$ of a regular grid $G$, which is the same as the grid in $P_{pt}$. Therefore, $G_i = (x_i^{pt}, y_i^{pt})$, which are the coordinates of $P_{pt}$, and $(x_i^{tt}, y_i^{tt})$ are the corresponding coordinates in the warped grid that defines the sample points.
The transformed text image $I_{tt}$ is then produced by $S$, the differentiable bilinear sampler [36], which computes the pixel values of $I_{tt}$ by interpolating the corresponding neighboring pixels in $P_{pt}$.
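The two generic building blocks mentioned here, a perspective (homography) mapping of grid coordinates and bilinear sampling, can be sketched as below. This is a minimal illustration rather than the GTM itself: the way $A_m$ and $A_n$ are composed is not specified in the recoverable text, so the sketch only shows a single 3×3 matrix applied to one point, and a single-channel image. `warp_point` and `bilinear_sample` are hypothetical helper names.

```python
import math


def warp_point(h, x, y):
    """Apply a 3x3 homography (nested lists, row-major) to the point (x, y)."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w  # divide by the homogeneous coordinate


def bilinear_sample(img, x, y):
    """Bilinearly interpolate a 2D image (img[y][x]) at a real-valued location.

    Neighbors that fall outside the image contribute zero (zero padding).
    """
    rows, cols = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xi < cols and 0 <= yi < rows:
                value += wx * wy * img[yi][xi]
    return value


# A grid point is mapped by the homography, then the source image is sampled there.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shrink = [[1, 0, 0], [0, 1, 0], [0, 0, 2]]
patch = [[0, 10], [20, 30]]
center_value = bilinear_sample(patch, 0.5, 0.5)
```

Because both steps are built from additions and multiplications, gradients can flow from the sampled image back to the matrix parameters, which is what makes this kind of sampler usable inside a trainable network.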
After obtaining the transformed text image $I_{tt}$ and the background image $I_{bg}$, we can compose them to obtain $I_{comp}$. In this composition, $\circ$ denotes the Hadamard product, and $I_{ttRGB}$ and $I_{ttA}$ are the RGB channels and pixel-level alpha channel of $I_{tt}$.
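A composition of this kind can be illustrated with standard alpha blending. The sketch below assumes the usual form in which each output pixel equals the text color times alpha plus the background color times one minus alpha, which is consistent with the symbols defined above but is our assumption about the exact expression. `alpha_composite` is a hypothetical name.

```python
def alpha_composite(text_rgb, text_alpha, background):
    """Blend a text layer over a background, pixel by pixel.

    text_rgb, background: 2D lists of (r, g, b) tuples with values in [0, 1]
    text_alpha:           2D list of alpha values in [0, 1]
    """
    composed = []
    for row_rgb, row_alpha, row_bg in zip(text_rgb, text_alpha, background):
        composed.append([
            tuple(a * t + (1 - a) * b for t, b in zip(fg, bg))
            for fg, a, bg in zip(row_rgb, row_alpha, row_bg)
        ])
    return composed


# One row with three pixels: opaque text, half-transparent text, no text.
red, blue = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)
out = alpha_composite([[red, red, red]], [[1.0, 0.5, 0.0]], [[blue, blue, blue]])
```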
During training, we introduce three loss functions to stabilize the training of the geometry transformation module: a local L1 loss, a global region loss, and an adversarial loss. We use a robust smooth-L1 loss [60] as the local L1 loss, which directly restricts the output of the localization network from a numerical perspective. Here, $\hat{A}_n$ and $A_n$ represent the prediction and ground truth of the localization network output, respectively. The region loss employs the DICE loss in Eq. 8 to guide the transformed text toward a higher overlapping rate from the view of the region, and we globally apply it at the stroke-mask level and the BBox-mask level in the image (Eq. 15). Here, $\hat{I}_{ttA}$ and $\hat{I}_{ttBM}$ are generated by transforming $P_{ptA}$ and $P_{ptBM}$ using the matrices $A_m$ and $\hat{A}_n$. $I_{ttA}$ and $I_{ttBM}$ are the corresponding ground truths that can be easily generated from stroke-level text masks and text BBoxes.
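As a reference for the local term, the commonly used smooth-L1 definition can be written as follows. This sketch uses the standard piecewise form (quadratic below a threshold `beta`, linear above it) and averages over the matrix entries; the threshold value and the averaging are assumptions, since the exact expression is not recoverable from the text.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth-L1 loss between two flat lists of matrix parameters."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta         # linear for large errors
    return total / len(pred)


# Small errors are penalized quadratically, large ones linearly.
small = smooth_l1([0.5], [0.0])
large = smooth_l1([2.0], [0.0])
```

The linear tail keeps a few badly predicted matrix entries from dominating the gradient, which is the usual motivation for preferring it over a plain squared error.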
Adversarial training with GANs [35], [37], [39] has been proven beneficial for the training of STNs, so we adopt it in our implementation. However, we do not directly use the source image $I_s$ as the "real image" in adversarial training because $I_s$ is realistic in both the geometry and color spaces. Instead, we generate $I_{comp}$ by warping $I_{pt}$ using $A_n$, which only achieves realism in the geometry domain. This $I_{comp}$ is treated as a "real image" during the training of the GTM. In the adversarial loss, $I_{comp}$ and $I_{ttA}$ are concatenated as the inputs of the discriminator. $\lambda_1$ and $\lambda_2$ are set to 50 and 10, respectively, in our experiment.
3) Color Harmonization Module (CHM): We treat this text-color-changing task as an image-harmonization problem. We employ the region-aware adaptive instance normalization (RAIN) module [42] in a UNet-like architecture by adding RAIN modules after the convolutional layers in the decoding stage. RAIN normalizes the foreground features and aligns the normalized features with a scale and bias computed from the background features. In our task, we expect it to transfer the style of the background to the text instances, maintaining harmony between the text and the background. Given an input feature batch F ∈ R^{C×H×W} and a resized foreground (text) mask M ∈ R^{H×W}, RAIN(·) is formulated in terms of µ(·) and σ(·) ∈ R^C, the channel-wise mean and standard deviation of the foreground or background features, computed independently across the spatial dimensions for each channel, where ∘ denotes the Hadamard product.
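A minimal NumPy sketch of the idea behind RAIN is given below, assuming a binary foreground mask and a single feature map, and omitting the learnable parameters and any background-branch normalization of the original module [42]. The helper names `masked_stats` and `rain` are hypothetical.

```python
import numpy as np


def masked_stats(feat, mask, eps=1e-5):
    """Channel-wise mean and std of feat (C, H, W) over pixels where mask (H, W) is 1."""
    area = mask.sum() + eps
    mean = (feat * mask).sum(axis=(1, 2)) / area  # shape (C,)
    var = (((feat - mean[:, None, None]) * mask) ** 2).sum(axis=(1, 2)) / area
    return mean, np.sqrt(var + eps)


def rain(feat, fg_mask, eps=1e-5):
    """Re-style foreground (text) features with background statistics."""
    bg_mask = 1.0 - fg_mask
    mu_f, sd_f = masked_stats(feat, fg_mask, eps)
    mu_b, sd_b = masked_stats(feat, bg_mask, eps)
    # Normalize with foreground statistics, then scale and shift with background statistics.
    fg_out = (feat - mu_f[:, None, None]) / sd_f[:, None, None]
    fg_out = fg_out * sd_b[:, None, None] + mu_b[:, None, None]
    # Hadamard products with the masks leave the background features untouched.
    return fg_out * fg_mask + feat * bg_mask
```

After this operation, the foreground features share the channel-wise mean of the background, which is the property the CHM relies on to harmonize the text color with its surroundings.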
In addition, we adopt an adversarial training method. In the adversarial loss, λ_3 is set to 5 in the experiment.
4) Inference Pipeline: After TLPNet and TAANet are trained, they can be integrated into a generation pipeline to produce synthetic data. The inference process of our method is illustrated in the lower section of Fig. 4. Given a background image, we first use TLPNet to predict the text regions in the form of heatmaps. Subsequently, we randomly sample a reference rectangle whose overlap rate with the text regions is higher than 70%. At the same time, a plain text patch image with a size of 256 × 256 is generated by randomly selecting the font, text, and color. Then, the reference rectangle, plain text patch image, and background image are passed through TAANet to produce a synthetic text image. Finally, post-processing applies various effects to the text, including shadows, 3D effects, texture, and blurring. When composing multiple text instances within one background image, we discard overlapping and small text instances. When semantic information is available, as in the COCO dataset [61], the synthesis can be refined by discarding text that extends beyond the boundaries of the semantic segmentation, allowing text instances to be synthesized specifically on designated objects.
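The reference-rectangle sampling step can be sketched as simple rejection sampling. This is an assumed procedure rather than the exact implementation: the heatmap threshold, the minimum rectangle size, the number of trials, the axis-aligned rectangle, and the definition of the overlap rate as the fraction of rectangle pixels inside the predicted region are all illustrative choices.

```python
import numpy as np


def sample_reference_rect(heatmap, rng, min_overlap=0.7, thresh=0.5,
                          min_size=32, max_tries=1000):
    """Rejection-sample an axis-aligned rectangle lying mostly inside text regions.

    Returns (x0, y0, x1, y1), or None when no candidate passes within max_tries.
    """
    region = np.asarray(heatmap) >= thresh  # binarized text-region prediction
    h, w = region.shape
    if h < min_size or w < min_size:
        return None
    for _ in range(max_tries):
        rw = int(rng.integers(min_size, w + 1))
        rh = int(rng.integers(min_size, h + 1))
        x0 = int(rng.integers(0, w - rw + 1))
        y0 = int(rng.integers(0, h - rh + 1))
        # Overlap rate: fraction of rectangle pixels inside the predicted region.
        overlap = region[y0:y0 + rh, x0:x0 + rw].mean()
        if overlap > min_overlap:
            return x0, y0, x0 + rw, y0 + rh
    return None
```

Returning None when no rectangle qualifies lets the caller skip background images for which no sufficiently large text region is predicted.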

V. EXPERIMENT
A. Implementation Details
1) Training Configurations: Our implementation is based on the PyTorch framework. For the training of TLPNet, we used the DecompST and SCUT-EnsText [62] datasets to generate the training data pairs, obtaining approximately 7900 pairs in total. The input size of TLPNet was set to 768 × 768, and the batch size was 12 on an Nvidia GeForce RTX 3090 GPU. We employed the Adam [63] optimizer with β = (0.5, 0.9); the learning rate started at 0.0002 and was decayed to nine-tenths of its value every 20 epochs during training. During the training of TAANet, GTM and CHM were trained separately. This is because we adopted the L1 loss during the training of CHM, which is essential for effectively constraining the color of the output. The input size of TAANet was also 768 × 768, and the training batch sizes for GTM and CHM were set to 20 and 10, respectively, on a single Nvidia GeForce RTX 3090 GPU. The optimizer was the same as that used for TLPNet, and the discriminators' learning rate started from 0.0004, with the same decay rate as that in TLPNet.
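For concreteness, the step-decay schedule described above can be written as a one-line function. This is a sketch under the assumption that epochs are 0-indexed and that the decay is applied exactly at multiples of 20 epochs; the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=2e-4, gamma=0.9, step=20):
    """Learning rate at a 0-indexed epoch when it is multiplied by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


if __name__ == "__main__":
    # Generator-side schedule (base 0.0002) and discriminator schedule (base 0.0004).
    for epoch in (0, 19, 20, 40):
        print(epoch, step_decay_lr(epoch), step_decay_lr(epoch, base_lr=4e-4))
```

In PyTorch, the same behavior would typically be obtained with `torch.optim.Adam(..., betas=(0.5, 0.9))` together with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.9)`, although the exact scheduler used in our code may differ.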
2) Inference Configurations: In the preparation stage, we collect the ingredients needed for synthesis, including background images, fonts, and a lexicon. The background images were collected from the COCO dataset [61] and the Places2 dataset [64]. To ensure that the images closely resembled real scene images, we excluded images with labels related to natural landscapes. Additionally, we filtered the selected image sets using CRAFT [25] and DB [24] to remove any images with prominent text. Ultimately, we amassed a collection of approximately 200,000 background images. Furthermore, we gathered around 2000 fonts and compiled a lexicon by combining the MJ dataset [50] and the ST dataset [1]. Our LBTS dataset was generated on a machine with a single GeForce RTX 3080 GPU, an AMD Ryzen 7 3700X CPU @ 3.6 GHz, and 32 GB of RAM. The TLPNet model has 24.7M parameters, while the TAANet model has 38.5M parameters (21.4M for GTM and 17.1M for CHM). The inference times of TLPNet on a single image and TAANet on one text instance are approximately 21 ms and 81 ms (11 ms for GTM and 70 ms for CHM), respectively.
Fig. 10 shows some generated samples from our LBTS dataset. We observed that TLPNet tends to predict text regions in relatively flat areas, especially in regions with quadrilateral shapes. This tendency may stem from the bias in the training data, where most text instances appear on signs, walls, or billboards. In addition, the geometry and color relationships between the text and background are reasonably aligned by TAANet. The text perspective accurately follows the boundaries of the text regions, and the text color is appropriately balanced, being neither obtrusive nor excessively dull.

B. Evaluation Metrics and Datasets
1) Evaluation Metrics: To verify the effectiveness of different text synthesis methods, a common approach is to train the same text detector on different synthesized datasets and evaluate the trained detectors on the test sets of several real datasets. Better detector performance indicates higher-quality training data, implying a better text synthesis strategy. Following previous works [2], [4], synthetic datasets are evaluated from two perspectives: 1) as independent training data for detection models, to assess whether synthetic datasets can substitute for real-world datasets; and 2) as pretraining data to initialize text detectors, where pretrained models fine-tuned with real-world data usually perform better than models trained from scratch with real-world data.
In our experiments, we selected EAST [17] and DB [24] as the baseline text detectors for the comparison experiments. Both were previously state-of-the-art methods and are among the most commonly used algorithms for text detection. In the implementation of EAST, ResNet-50 [59] was used as the backbone, and all the models were trained on two RTX 2080Ti GPUs with a batch size of 28. For DB, we trained DB-ResNet-50 [24] on one RTX 3090 with a batch size of 20. The performance metrics of the text detector, recall (R), precision (P), and F-score (F), were calculated under the ICDAR2015 evaluation protocol [53] on all evaluation datasets.
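Once detections have been matched to ground-truth instances, the three reported metrics follow directly from the match counts. The sketch below assumes the matching itself has already been performed by the evaluation protocol; the function name and the example counts are hypothetical and do not correspond to any result in our tables.

```python
def detection_scores(num_matched, num_detections, num_ground_truth):
    """Return (recall, precision, F-score) from detection match counts."""
    precision = num_matched / num_detections if num_detections else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    denom = precision + recall
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2.0 * precision * recall / denom if denom else 0.0
    return recall, precision, f_score


if __name__ == "__main__":
    # Hypothetical counts: 80 correct detections out of 100 predictions and 160 ground truths.
    r, p, f = detection_scores(80, 100, 160)
    print(f"R={r:.4f} P={p:.4f} F={f:.4f}")
```

Because the F-score is a harmonic mean, a detector with high precision but low recall (or vice versa) is penalized, which is why we report the best F-score as the summary metric.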
2) Synthetic Dataset:
• Oxford SynthText Dataset (ST) [1] is a large-scale synthetic text dataset that consists of about 850,000 images. It is created from about 8000 background images and 1200 fonts. 10,000 data pairs were randomly sampled from this dataset to compose ST-10k.
• Verisimilar Image Synthesis Dataset (VISD) [2] contains 10,000 images synthesized from background images collected from the COCO dataset [61].
• UnrealText (UT) [4] initially consists of about 728,000 images in English/Latin. However, we discovered that some of these images either do not contain text or are partially black, potentially due to rendering failure or incorrect camera positioning. To ensure data quality, we filtered out the images without annotations and those in which more than two-thirds of the pixels are completely black. As a result, approximately 670,000 images remained, and we also randomly sampled 10,000 images to form UT-10k for our experiment.
3) Real-world Dataset:

C. Comparison with State-of-the-Art Methods
To verify the effectiveness of the proposed text synthesis engine, we conducted evaluation experiments comparing our generated LBTS dataset with those of recent state-of-the-art approaches [1], [2], [4]. First, we standardized the size of each synthetic dataset to 10k images to ensure a fair comparison. We trained EAST on each synthetic dataset for 200,000 steps, followed by fine-tuning on the corresponding real-world training set for an additional 200,000 steps. The performance of EAST was evaluated on the validation set of each real dataset every 1000 steps, and the best F-scores are recorded in Table II. Across all evaluation benchmarks, when synthetic datasets were used as independent training data, EAST trained on VISD-10k achieved the highest F-score and recall, and EAST trained on UT-10k achieved higher precision. However, when we fine-tuned the pretrained EAST with real-world data, our LBTS-10k dataset outperformed all other synthetic datasets, improving the F-score over VISD-10k by 0.31%, 0.95%, and 0.36% on the IC13, IC15, and MLT17 datasets, respectively.
We also trained DB in a similar manner to compare the quality of the synthetic datasets. DB was first pretrained on each synthetic dataset for 100,000 steps and then fine-tuned on the IC15 or Total-Text dataset for another 1200 epochs. During training, we validated the model on the corresponding test set every 2000 steps, and Table III presents the best F-scores obtained. Using DB as the baseline detector yielded results similar to those obtained with EAST. When synthetic datasets were used as independent training data, VISD-10k achieved the highest F-score on both the IC15 and Total-Text datasets. However, after the DB model pretrained on synthetic data was further fine-tuned with real data, our LBTS-10k dataset obtained a higher F-score than the other datasets. Compared with DB trained from scratch, DB pretrained with our dataset gained 2.4% and 1.83% in F-score on IC15 and Total-Text, respectively. Furthermore, compared with previous state-of-the-art datasets, we observed an improvement of 0.53% in F-score on IC15, while achieving competitive performance on the Total-Text dataset.
To verify the robustness of each synthetic dataset, three random 10k subsets were extracted from each full-size dataset. These sampled 10k datasets were then used to conduct the evaluation experiments on IC15 using DB. The average F-measures for ST-10k, VISD-10k, UT-10k, and LBTS-10k were 86.24, 86.11, 86.39, and 86.78, respectively. The corresponding variances in F-measure were 0.017, 0.004, 0.019, and 0.015, indicating that our LBTS datasets achieve consistently high performance across multiple samples. Fig. 11 displays some visual comparisons of baseline detectors with and without LBTS pretraining. Pretrained models effectively reduce detection errors and exhibit enhanced robustness in handling complex text instances.
To the best of our knowledge, this is the first report that highlights the performance discrepancy resulting from the use of synthetic datasets during the pretraining and fine-tuning stages. In our view, synthetic datasets play different roles when employed as independent training data or as pretraining data. When text detectors are trained solely on synthetic datasets and evaluated on real datasets, the performance of the text detector indicates, to a certain extent, the level of entangled "realism" between the synthetic dataset and real data. We believe that the realism of text encompasses multiple dimensions, such as text appearance, distribution, font, lighting conditions, and background image types. Both existing methods and our proposed LBTS approach impose constraints on the generated synthetic data in these dimensions to approximate the real-world domain. These constraints are usually divided into several rules and steps based on prior knowledge. The "realism" mentioned here denotes the degree of entangled "realism" achieved under these constraints.
However, when synthetic datasets serve as pretraining data, we hypothesize that dataset diversity becomes more crucial than "realism". The work in [68] is an extreme case showing that models can be pretrained well without natural images. Synthetic data with greater diversity may enable convolutional layers to learn distinctive representations. The model weights corresponding to these representations are activated and reinforced if they are beneficial during the fine-tuning phase, which may help prevent the model from becoming trapped in local minima during gradient descent. We consider that the learning mechanism implemented in our LBTS engine introduces a greater degree of diversity than rule-based methods, resulting in our generated data performing better as pretraining data.
In addition, we created mixed synthetic datasets from different synthetic datasets to determine whether data generated by different synthesis methods could play complementary roles during the training of the scene text detector. EAST was trained using the same configuration as in the above experiment, and the evaluation results are summarized in Table IV. Without using real data, EAST achieved the best F-score when trained on VISD-5k + UT-5k, which was higher than the results obtained with VISD-10k or UT-10k individually. However, this synergistic effect disappeared when the mixed data served as pretraining data. The performance of EAST trained on VISD-5k + UT-5k + Real falls roughly between that achieved with UT-10k + Real and that achieved with VISD-10k + Real, and thus does not surpass the better of the two.
A similar approximately linear relationship can also be found in other mixed datasets, including those involving LBTS. On the other hand, when the mixed data serve as pretraining data, we found that EAST trained with VISD-5k + LBTS-5k + Real or UT-5k + LBTS-5k + Real performed better than that trained with synthetic data from a single source, such as VISD-10k + Real or UT-10k + Real.
Finally, we generated 100k synthetic images to test the scalability of our LBTS. We compared LBTS-100k with the full-size ST [1] and UT [4] datasets. We trained EAST for 300,000 steps on the different full-size datasets; the other configurations were the same as in the above experiments. The evaluation results of EAST are presented in Table V. We observed that the performance of EAST improved as the amount of generated data increased. Furthermore, EAST trained on LBTS-100k + Real achieved competitive performance compared with that trained on ST-850k + Real and UT-670k + Real.

D. Ablation Study
In this section, we investigate the effectiveness of different settings of the proposed data-generation engine, focusing on the text location proposal network (TLPNet), geometry transformation module (GTM), color harmonization module (CHM), and post-processing. The evaluation results on the ICDAR2015 dataset for EAST trained on the datasets generated by the different configurations are reported in Table VI.
• Text Location Proposal Network: Given a background image, TLPNet aims to propose suitable regions for text embedding, which are usually relatively plain areas, as depicted in Fig. 10. To investigate its significance, we conducted an ablation study in which we replaced the output of TLPNet with an image whose pixel values are all set to 1. This means that text can appear at any location within the background image. The evaluation results, presented in Table VI, show that TLPNet improves the quality of the generated synthetic data whether they serve as the sole training data or as pretraining data.
• Geometry Transformation Module: To assess the importance of the GTM, we replaced this module in our generation engine with a random transformation matrix generator. However, employing a completely random matrix generator is not advisable, as it heavily distorts the text instances, resulting in extremely unrealistic results. For this reason, we adopted the random transformation matrix generator from a word-level SynthText engine [46] to reasonably transform the perspective of text instances, at least at the patch level. From Table VI, we first observe that data generated with the GTM serve as better independent training data and pretraining data for the text detector, which reveals the importance of the GTM in the synthesis engine. Second, when synthetic data are used solely for training, there is a substantial performance gap between the datasets generated w/o GTM and ALL. Nevertheless, this gap diminishes significantly when real data are incorporated for fine-tuning. This phenomenon further supports the conclusion drawn in the last subsection that lower performance of the pretrained model does not necessarily lead to low performance of the fine-tuned model.
• Color Harmonization Module: To evaluate the advantages of the CHM, the color-deciding process in our engine was replaced with that of the SynthText engine [1], where the text color is determined by referencing a learned dictionary based on the local statistical information of the background. The performance of EAST decreased when our CHM was removed.
• Post-processing: To confirm the contribution of the post-processing in our engine, we generated a dataset without applying post-processing and evaluated its quality. Table VI suggests that our post-processing techniques can enhance data diversity and improve the overall quality of the generated data.

E. Discussion
Based on our comprehensive experimental results, although we cannot explicitly determine the specific type of data that benefits the training of text detectors, we can summarize several findings that prior studies have not addressed. First, we discovered that the performance of a text detector trained on both synthetic and real data is not strictly positively correlated with that trained only on synthetic data, even if the performance gap of the synthetic data is large. Second, the integration of different synthetic datasets generally improves the performance of the text detector; however, the extent of improvement differs based on the utilization of the mixed synthetic datasets. When using mixed synthetic data as independent training data, better performance can be achieved than that of datasets from a single source. However, when real data are involved in fine-tuning, the performance of the mixed data fails to surpass the best performance achieved by the single-source dataset.
Our generation engine has several limitations. First, the performance of TAANet, especially the GTM, is heavily influenced by the results of TLPNet. There is a gap between the training data and inference data of TLPNet: text-erased images usually have relatively large and flat areas with strong leading lines, such as the edges of signage or billboards, whereas the inference data are usually more diverse. A poor prediction of the text region often results in a final output that looks unsatisfactory to human observers, because the GTM struggles to reasonably transform the perspective of text instances when leading lines are missing in the background image. Second, the forward process of our proposed TAANet operates on one text instance at a time; thus, our method does not model the relationships between text instances. We opted to discard text instances that were too close to or intersected with other texts, as it is uncommon for text to overlap in the real world. However, this workaround usually leads to a disorganized layout of text instances and reduces generation efficiency. We believe that a unified training and generation structure may improve the generation results, and we expect future studies to address these problems for learning-based scene-text image synthesis tasks.

VI. CONCLUSION
In this study, we first propose a new scene text dataset called DecompST, which can decompose real-world scene-text images into pure background images and pure text instances using text-erased images and stroke-level masks. Leveraging the DecompST dataset, we introduce a learning-based scene-text image synthesis engine, termed LBTS, which comprises a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet is a segmentation network capable of predicting suitable regions for text embedding. It is trained with data pairs of text-erased images and masks of text regions, where the text regions were extended from GT BBoxes based on appearance similarity and boundary information. TAANet consists of a geometry transformation module and a color harmonization module. These components can adaptively adjust the perspective and color of the synthetic text instance to ensure compatibility with the background. By combining our trained TLPNet and TAANet, we have developed a synthetic scene-text image generation engine and verified the effectiveness of our generated dataset using two popular baseline text detectors. Comprehensive experiments demonstrated the effectiveness of our proposed method in generating pretraining data for scene text detection.

Fig. 2. Given the original image and corresponding text BBoxes, we decompose real-world scene-text images into pure background images and text instances using a text-erased image and stroke-level text mask.

Fig. 3. Some image samples from our proposed DecompST Dataset. The first row contains the original images with text BBoxes, where valid text instances are marked in green BBoxes and invalid ones are in red BBoxes. The second row is our generated text-erased images. The third row is the stroke-level text masks. The fourth row is the text-pixel images masked by stroke-level masks.

Fig. 4. Pipeline of our proposed Learning-based Text Synthesis Engine (LBTS). It mainly consists of two networks: text location proposal network (TLPNet) and text appearance adaptation network (TAANet). Given a background image, TLPNet predicts suitable regions for text embedding. Then, TAANet aligns the geometric and color relationship between the synthetic text instance and the background. We trained our proposed networks on decomposed real-world data and applied them in the synthesis domain to generate synthetic scene-text images.

Fig. 5. Flow of data preprocessing. (a) Text-erased image I and BBox regions. (b) Visualization of R_i and w_i. R_i refers to the corresponding region of each text instance, and brighter regions in (b) indicate higher w_i. (c) Appearance consistency heatmap H_a. (d) Edge-based segmented heatmap H_e. (e) Final generated heatmap H_f (H_f is overlaid on I for a better view). The red regions are treated as GT during the training of TLPNet. Note that the original H_a, H_e, and H_f are gray-scale images; we visualize them as heatmaps in this figure.

Fig. 6. Structure of the text location proposal network. Given a background image I_bg, TLPNet aims to segment the text region, which should be as close as possible to the heatmap H_f.

Fig. 8. Flow of the preprocessing of training data in TAANet. The blue and red dashed boxes are the same reference rectangle Rect but in the images before and after the processing to show the clipping regions to obtain patch text images.

Fig. 9. Illustration of A_n and P_pt. The images in blue box P before and red box P_pt are obtained by cropping from the dashed boxes with the same color in Fig. 8.

Fig. 10. Several sample images generated by our proposed synthesis engine. In each image pair, the left column displays the text regions predicted by TLPNet, and the right column shows our synthesized image.

Fig. 11. Visual comparisons of baseline detectors using pretraining. (a) Detection results of EAST trained from scratch. (b) Detection results of EAST pretrained with our LBTS dataset. (c) Detection results of DB trained from scratch. (d) Detection results of DB pretrained with our LBTS dataset. Zoom in for the best view.

TABLE I
NUMBER OF IMAGES AND VALID TEXT INSTANCES FROM DIFFERENT SOURCE DATASETS.
real-world scene images. We have made this dataset publicly available and hope that it can motivate more learning-based scene text synthesis methods to generate high-quality synthetic training data for scene text detection and recognition tasks.

TABLE II
COMPARISON BETWEEN PREVIOUS SYNTHETIC DATASETS AND OUR LBTS DATASET ON THE ICDAR2013, ICDAR2015, ICDAR2017MLT DATASETS USING EAST AS THE BASELINE DETECTOR. R: RECALL, P: PRECISION, F: F-SCORE, REAL: THE CORRESPONDING TRAINING SET OF THE EVALUATION

TABLE IV
QUALITY COMPARISON BETWEEN DIFFERENT MIXED SYNTHETIC DATASETS ON ICDAR2013, ICDAR2015, ICDAR2017MLT DATASETS USING EAST AS THE BASELINE DETECTOR. R: RECALL, P: PRECISION, F: F-SCORE, REAL: THE CORRESPONDING TRAINING SET OF THE EVALUATION

TABLE V
QUALITY COMPARISON BETWEEN DIFFERENT FULL-SIZE SYNTHETIC DATASETS ON ICDAR2013, ICDAR2015, ICDAR2017MLT DATASETS USING EAST AS THE BASELINE DETECTOR. R: RECALL, P: PRECISION, F: F-SCORE, REAL: THE CORRESPONDING TRAINING SET OF THE EVALUATION