Two-Step CNN Framework for Text Line Recognition in Camera-Captured Images

In this paper, we introduce an “on the device” text line recognition framework that is designed for mobile or embedded systems. We consider per-character segmentation as a language-independent problem and individual character recognition as a language-dependent one. Thus, the proposed solution is based on two separate artificial neural networks (ANN) and dynamic programming instead of employing image processing methods for the segmentation step or an end-to-end ANN. To satisfy the tight constraints on memory size imposed by embedded systems and to avoid overfitting, we employ ANNs with a small number of trainable parameters. The primary purpose of our framework is the recognition of low-quality images of identity documents with complex backgrounds and a variety of languages and fonts. We demonstrate that our solution shows high recognition accuracy on natural datasets even when trained on purely synthetic data. We use the MIDV-500 and Census 1961 Project datasets for text line recognition. The proposed method considerably surpasses the algorithmic method implemented in Tesseract 3.05, the LSTM method (Tesseract 4.00), and the unpublished method used in the ABBYY FineReader 15 system. Also, our framework is faster than other compared solutions. We show the language-independence of our segmenter in an experiment with Cyrillic, Armenian, and Chinese text lines.


I. INTRODUCTION
Smartphones, tablet computers, and other mobile devices gain more and more popularity each day. Applications for such devices include government and commercial services that often require entering data from printed documents. Yet text entry on modern touch-based keyboards is error-prone and time-consuming [1], [2]. Thus, several solutions appeared in recent years [3]-[6] for optical text recognition in images that are captured using mobile devices. These systems can be classified into two groups: client-server solutions, which transfer images to a ''cloud'' and require an internet connection, and ''on the device'' methods that perform the recognition process without data transmission.
The associate editor coordinating the review of this manuscript and approving it for publication was Ah Hwee Tan.
VOLUME 8, 2020
Recognition of identity documents is a specific case since they contain sensitive personal information, and any application should guarantee the security of personal data. Consequently, it seems reasonable to say that the ''on the device'' methods are better suited to their processing than the ''cloud'' solutions since the former present fewer security issues. The tight constraints on computational power and memory size imposed by embedded and mobile systems limit the resources available and render many approaches unusable. Also, any noticeable delay while executing a smartphone application can affect user experience.
Moreover, such a system faces a lot of challenges. Unlike invoices and other official papers, identity documents often have complex backgrounds. Besides, they usually have specific surfaces. For example, ID documents can be laminated, and, consequently, highlights can appear in images captured using a smartphone. In addition, the process of image acquisition tends to introduce many distortions [7]-[9], which make methods designed for scanned document images unsuitable for camera-captured ones [10]. Fig. 1 presents samples of camera-captured ID images from the MIDV-500 dataset [11], which demonstrate typical input images for a recognition application on a mobile device. As a result, any competent approach to the embedded recognition of IDs should employ rather sophisticated methods and be efficient in a system with stringent resources.
The process of ID recognition can be divided into a number of steps, such as document identification and location, zone extraction and rectification, per-field segmentation, field recognition, language model postprocessing, and result acquisition [6]. In our study, we consider the field recognition step, which includes text line detection and recognition (Fig. 2), and focus on the text line recognition part.
As identity documents have a predefined structure, the algorithms for straight line identification [12] are used for text line detection in a text field image. The vertical position of a text line can be specified in different ways that are common for both printed and handwritten text lines [13]. In our experiments, we suppose that the approximate coordinates of the baseline and cap line (Fig. 3) are found beforehand, and we get them as input.
In this paper, we present a method for text line recognition that can be used as a part of an on-the-premises recognition system for various printed documents including but not limited to identity cards, passports, and driving licences.

II. RELATED WORK
Most of the modern text line recognition techniques can be divided into two large groups: the ones with explicit per-character segmentation followed by recognition and the end-to-end solutions. In sections II-A and II-B we briefly describe the existing approaches from both groups.

A. RECOGNITION WITH EXPLICIT SEGMENTATION
Per-character line segmentation is a process of finding bounding rectangles of characters and is one of the essential problems of text recognition [12], [14]. Segmentation can be applied to binarized or grayscale images. However, in the case of camera-captured images, binarization can have a drastic effect of introducing background noise [7]. After the segmentation is performed, a classifier is used for OCR. The classification approaches vary greatly from pattern matching to the most modern architectures of neural networks. The majority of the recent studies focus on the segmentation part of these methods as it is believed to be more difficult than recognition [15].
Two main approaches to per-character segmentation exist: projection analysis and connected components processing. The typical problem faced by the standard methods is the segmentation of either connected symbols or individual symbols consisting of several separate primitives, especially in the context of complex backgrounds and camera-specific distortions. Most methods include various heuristics about possible glyphs [5], [16], [17], which allow cutting or merging components of specific shapes. To overcome these difficulties, over-segmentation methods have been developed. These methods try either to evaluate different segmentation paths and select the best one [7] or to restore the correct segmentation from an over-segmented line using dynamic programming [18]. But even these methods use prior information about the text. For instance, the method in [7] is fine-tuned for text printed with a monospaced font, and the method in [18] does not consider the existence of characters consisting of several primitives. Remarkably, methods based on over-segmentation often employ artificial neural networks (ANN) for result improvement in the following way: a classifier, e.g., an ANN, is trained to distinguish a character from its over-segmented part, and the classification results are taken into account for the choice of the best path [19], [20]. To the best of our knowledge, it is the only way ANNs are used in explicit per-character segmentation. It is an omission, as ANNs, especially fully convolutional networks (FCN), demonstrate state-of-the-art results in the related task of semantic segmentation [21], [22].
The current situation with per-character segmentation resembles that of omnifont character recognition before the rise of deep learning. While it is possible to perform optical character recognition (OCR) exploiting features that are pre-defined in the algorithms without any ANNs, such approaches are usually outperformed by modern ANN-based methods. However, most of the modern segmentation methods still work in the same way, with predefined language-dependent features. As ANNs have already influenced OCR, they can significantly improve per-character segmentation as well.

B. END-TO-END RECOGNITION
Techniques without explicit segmentation, or end-to-end recognition, appeared due to the claim that segmentation algorithms are highly error-prone for distorted images [23]. Modern segmentation-free text recognition methods are predominantly represented by various ANN-based methods and include sliding-window classifiers [24] and the ones based on recurrent neural networks (RNN) [25], [26]. Nowadays RNNs demonstrate state-of-the-art results in the text-in-the-wild recognition problem [27], [28], and are used in various OCR engines, e.g., Tesseract 4.00 [29]. The main advantage of RNNs is their ability to handle sequences of characters, but it can make them ineffective for identity document recognition as these documents contain, firstly, fields without a language model (document numbers) and, secondly, rare spellings of popular names which can be ''corrected'' by the classifier [30]. The other variant of segmentation-free recognition is the utilization of per-word classifiers, i.e., one class stands not for a character but for a whole word [31]. This approach faces two main problems: the impossibility of recognizing document fields without a language model and an enormous number of resulting classes.

C. ANNS IN ''ON THE DEVICE'' OCR SOLUTIONS
OCR is the core part of any text recognition framework. The results of the ICDAR2015 competition on smartphone document capture and OCR [32] demonstrate that OCR in camera-captured images remains an open problem even for documents with simple backgrounds. At the same time, the extensive use of mobile devices makes OCR a ''must-have'' application on modern gadgets [33]. Since the introduction of the convolutional neural network (CNN) LeNet-5 and its results on the MNIST dataset [34], the OCR task is usually solved with various ANNs that demonstrate state-of-the-art results on public datasets for object classification [35]-[39]. However, to be usable, ANNs employed in ''on the device'' software should satisfy the tight constraints on computational power and memory size [40]. In particular, it is essential for multi-language applications that require several classifiers. Consequently, many deep architectures need re-thinking for such applications [41]. Moreover, such methods as model ensembling employed in [35] could become quite resource-intensive. Thus, a number of different approaches to ANN modifications were introduced in recent years. Some of them suggest the creation of efficient lightweight architectures and modification of the state-of-the-art ones [42]-[48], others introduce solutions with 8-bit fixed point or binarized weights [49]-[51]. It should be mentioned that a few hardware-aware and hardware-adapted solutions exist [49], [52], but we do not consider them in our study as we need a solution for off-the-shelf systems. Most of the suggested architectures have hundreds of thousands of weights. As recent studies claim that the majority of modern networks have excessive capacity and are prone to overfitting [53], we intend to use feed-forward networks that have a small number of trainable parameters and are suitable for embedded systems.
The light-weight ANN architectures that we propose in our framework are based on the previous papers of the authors. For example, in [54]-[56] we showed the capabilities of the light-weight architecture in OCR problems, in [58] we employed a light-weight neural network for both optical font recognition and OCR, and in [59] the ability of a light-weight ANN to detect vanishing points was demonstrated.

D. TEXT LINE RECOGNITION TOOLSETS
Nowadays, a number of text recognition systems are available. Some of them are available as on-premises solutions, and some provide client-server interfaces. In our study, we limit the set of considered systems to the two most commonly used: Tesseract OCR and ABBYY FineReader. Tesseract OCR is a free OCR engine, and ABBYY FineReader is a commercial product.

1) TESSERACT OCR
Tesseract OCR [60] is one of the most popular open-source OCR engines [61], widely employed by both developers and users. According to [32], [61]-[64], this system is commonly used within the community as a baseline method for recognition quality evaluation. Nowadays, Tesseract is being developed by Google. To assess our method, we compare its recognition results with those of Tesseract OCR versions 4.00 and 3.05. Tesseract OCR 4.00 was released on October 29, 2018, and employs an ANN with LSTM blocks. This ANN contains $7.8 \times 10^{5}$ weights and has been trained on a large amount of purely synthetic data [29]. While the whole training data is not available online, the provided examples and the generation process description show that Tesseract 4.00 can be used for reference [65]. We additionally provide the results of the previous version, i.e., Tesseract OCR 3.05, as, firstly, it allows comparison with earlier studies, and, secondly, it demonstrated competitive results with other methods [61], [62].

2) ABBYY FINEREADER
ABBYY FineReader [66] is a state-of-the-art commercial OCR application [61] that is used in both scientific studies [7], [32], [62], [65] and business. In our experiments, we used the latest ABBYY FineReader 15. FineReader is considered one of the best solutions available on the market, but neither its source code nor the employed algorithms have been published. Judging by the information provided on the official website, we can presume that machine learning algorithms with language models are used for text recognition.

III. DATASETS
Datasets are a widely used means of evaluating and comparing methods. The importance of public datasets in studies cannot be overestimated: they remove the need to reproduce previously published methods for comparison and, what is more, allow comparison with state-of-the-art commercial systems that do not provide enough information for method reproduction.
Throughout our study, we want to evaluate both the proposed method and its separate components. As a result, we need three types of test datasets: one suitable for the segmenter evaluation, one for the classifier evaluation, and one for the full framework evaluation. To evaluate character classification, we used the famous MNIST dataset (Sec. III-A). The dataset for the segmentation evaluation is trickier, as to directly assess segmentation we need per-character segmentation ground truth. To the best of our knowledge, no public dataset provides such a ground truth. Thus, we use synthetic data (Sec. III-D) for segmenter evaluation.
As for the overall framework evaluation, we use two recently introduced public datasets: MIDV-500 (Sec. III-B) and the test part of the 1961 Census for England and Wales from [61] (Sec. III-C). These datasets are of great interest to us as they provide text line segmentation ground truth, i.e., they allow evaluating text line recognition independently from document location and text line segmentation methods. In fact, such datasets are a rarity within the modern scientific community. Firstly, private datasets are widely used [7], [62], [65] for text line recognition evaluation. Secondly, public datasets are often designed for overall document recognition systems. A vivid example of this problem is the widely used SmartDoc-2015 dataset [32]. This dataset contains camera-captured images of documents and text ground truth. Yet, if one wants to use it for the evaluation of text line recognition quality, they have to employ external methods for document detection and text line segmentation.
Since our method employs ANNs, we also need training data. We mainly use synthetic training data as they allow us to achieve sufficient accuracy and to get samples for various languages and scripts. Also, synthetic data provides the necessary font and background diversity without laborious and time-consuming data preparation. Thus, we employ synthetic training data (Sec. III-D) in all experiments except the one with MNIST. We use the MNIST default training sample to compare our classifier with the previously published ones without training data influence.

A. MNIST
MNIST consists of 70000 images of handwritten digits, 60000 of which form the training set, and the remaining 10000 are treated as the test set. The state-of-the-art result reported for MNIST is the 0.23% error rate for non-ensemble classifiers. We use this dataset to assess the suggested light-weight architecture of the per-character classifier: we train our per-character classifier on the training set of MNIST and then calculate the classification error rate of the acquired classifier on the MNIST test set.

B. MIDV-500
MIDV-500 [11] is of particular interest for us as it contains images of ID samples. Each document is captured in five different conditions with two mobile devices (Apple iPhone 5 and Samsung Galaxy S3). Resulting snapshots contain complex background and various distortions that are typical for images acquired with smartphones [11] (Fig. 1). Text lines in MIDV-500 are printed with multiple proportional and monospaced fonts.
We preprocess MIDV-500 using the provided ground truth to 1) select snapshots with fully visible documents, as the dataset contains some images with documents partially hidden off-screen; 2) perform projective rectification; 3) extract images of separate text lines and divide these lines into four groups which are distinguished in the ground truth: ''Dates'' (numeric dates), ''Latin names'' (names and surnames printed with Latin letters without diacritical marks), ''MRZ'' (machine-readable zones [67]), and ''Docnum'' (document numbers). The exact statistics on the acquired text lines are given in Table 1. We do not filter out the blurred, highlighted, or otherwise degraded images. Fig. 4 shows the examples for each type of text line. Initially, the total number of character classes in the test dataset was equal to 70 and included digits, punctuation symbols ''(),.-/< '', and both capital and small Latin letters. As we train case-insensitive classifiers and also unify the letter ''O'' and the digit ''0'', the resulting number of classes became 43.
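The class counts quoted above can be checked with a few lines of Python. This is only a sketch under two assumptions that the text does not state explicitly: the trailing space inside the quoted punctuation string is itself a class, and the letter ''O'' is mapped to the digit ''0'' (the opposite direction would give the same count). The names `PUNCTUATION` and `normalize` are illustrative.

```python
import string

# Punctuation listed in the text; the trailing space is assumed to be a class.
PUNCTUATION = "(),.-/< "

def normalize(ch: str) -> str:
    """Map a character to its class label (assumed direction: letter O -> digit 0)."""
    ch = ch.upper()                      # case-insensitive classes
    return "0" if ch == "O" else ch

full_alphabet = set(string.digits + string.ascii_uppercase
                    + string.ascii_lowercase + PUNCTUATION)
merged_alphabet = {normalize(ch) for ch in full_alphabet}

print(len(full_alphabet), len(merged_alphabet))  # 70 43
```

Under these assumptions, 10 digits, 52 letters, and 8 punctuation symbols give 70 classes, and merging the letter case and the ''O''/''0'' pair leaves 43.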

C. THE 1961 CENSUS FOR ENGLAND AND WALES SUBSAMPLE
The employed subsample of the 1961 Census for England and Wales is presented in [61] and is available for download from the PRImA website [68]. It contains scanned images of documents printed with one font. This test sample is suitable for our method, as the authors provide the geometrical ground truth as well as the textual one. Thus, to perform experiments, we used the provided ground truth to acquire separate images of text lines. The text lines in this set contain capital Latin letters, punctuation symbols ''&,.-/'' and digits. Fig. 5 presents the text line images from this dataset.

D. SYNTHETIC DATASETS
We utilize no data from MIDV-500 for training CNNs. Instead, we use a method described in [69] to generate a synthetic training dataset. This method allows us to create an unlimited number of images with projective transformation, motion blur, and ink degradation. As we want our model to be as language-independent as possible, we employ no dictionaries in the generation process. To preserve the generalization capability of our ANNs, we do not choose any specific fonts. We utilize 600 different fonts that are available on GoogleFonts [70] and backgrounds that were acquired What is more, we use synthetic data as both training and test sets to evaluate the per-character segmentation part of the proposed method. To be exact, we generate three training datasets (Chinese characters, Cyrillic characters, and a mixture of Cyrillic, Latin, and Armenian characters) and three test datasets (Chinese characters, Cyrillic characters, and Armenian characters). We select these writing systems for the test datasets as, firstly, they are rather different, and, secondly, two of them are in the top five most widespread scripts around the world, and the third one is a rare writing system which is used mostly in Armenia and, consequently, is rarely considered in OCR systems. We do not use the Latin characters in test sets as they are similar to the Cyrillic ones.
Each dataset contains ground truth for each image. The annotation was performed automatically by the data generation engine. The ground truth is given in JSON format and is organized as follows (Fig. 7): 1) one JSON dictionary stands for one text line; 2) ''line_rect'' is the bounding box of the text line, ''cuts_x'' stands for the cuts between the letters, i.e., the ideal segmentation results, ''start_x'' and ''end_x'' outline the exact coordinates of the character end points in the horizontal direction, ''values'' are the Unicode character codes in the decimal numeral system, ''let_blines'' stands for the upper and lower ordinates of each character, and ''blines'' represents the lines from Fig. 3 except for the ascender line.
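To make the annotation layout more concrete, the following sketch parses one hypothetical record. The key names are taken from the description above, but the record itself, the value types, and the layout of ''line_rect'', ''let_blines'', and ''blines'' are assumptions made for illustration; Fig. 7 remains the authoritative reference.

```python
import json

# Hypothetical record: keys follow the text, values and their layout are assumed.
record_json = """
{
  "line_rect": [0, 0, 60, 20],
  "cuts_x": [0, 20, 40, 60],
  "start_x": [2, 22, 42],
  "end_x": [18, 38, 58],
  "values": [1040, 1041, 1042],
  "let_blines": [[3, 17], [3, 17], [3, 17]],
  "blines": [3, 17]
}
"""

def decode_line(record: dict) -> str:
    """Turn decimal Unicode code points into the text of the line."""
    return "".join(chr(code) for code in record["values"])

def character_spans(record: dict) -> list:
    """Pair every character with its horizontal extent (start_x, end_x)."""
    text = decode_line(record)
    return list(zip(text, record["start_x"], record["end_x"]))

record = json.loads(record_json)
print(decode_line(record))
print(character_spans(record))
```

Here the three code points 1040, 1041, and 1042 decode to the first three capital Cyrillic letters.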

IV. A METHOD FOR TEXT LINE RECOGNITION
In a classic Heuristic Over-Segmentation method, a segmentation graph is formed from the vertical projection (projection on the horizontal axis) $P$ [34], [71]. To solve the task, one looks for the best path from the start node to the end node. Crucially, the cuts from $P$ are usually selected once, before the graph construction. It means that if a cut was initially missed, it cannot be restored by the method. Besides, the start and end nodes should be determined beforehand. Such limitations seem too restrictive for any practical use in the case of images captured with a mobile device.
To overcome the drawbacks of the classical approach, we propose a method based on two ANNs, namely a segmenter and a classifier, and dynamic programming. To be exact, we employ an FCN $NN_{segm}$ to build $P$ and a CNN $NN_{class}$ for candidate character recognition. $NN_{segm}$ and $NN_{class}$ are described in detail in Sec. IV-A and Sec. IV-B. Fig. 8 presents the flowchart of our algorithm:
1) Crop an image using the baseline and cap line, and scale the resulting image to a predefined height.
2) Apply $NN_{segm}$ and obtain $P$.
3) Build $P_1$ as a non-maximum suppression of $P$.
4) Build pairs of non-zero points from $P_1$ with the distance from the predetermined interval $fd \in [fd_{min}, fd_{max}]$.
5) Classify the preliminary candidate character images with $NN_{class}$.
6) Employ dynamic programming to build the segmentation path optimizing the sum of the cut scores from $P$ and the confidences of the corresponding candidate characters, if any.
7) Classify characters at the positions from the built path that were not processed in steps (3)-(4), if any.
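Steps 3 and 4 of the list above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the framework: the tie-breaking rule for plateaus and the function names `non_max_suppression` and `candidate_pairs` are assumptions.

```python
def non_max_suppression(P, eps=0.0):
    """Keep P[x] where it is a local maximum above eps; zero elsewhere."""
    P1 = [0.0] * len(P)
    for x, value in enumerate(P):
        left = P[x - 1] if x > 0 else float("-inf")
        right = P[x + 1] if x + 1 < len(P) else float("-inf")
        # Assumed tie rule: on a plateau, only the rightmost column survives.
        if value > eps and value >= left and value > right:
            P1[x] = value
    return P1

def candidate_pairs(P1, fd_min, fd_max):
    """All (x_l, x_r) pairs of non-zero points with distance in [fd_min, fd_max]."""
    cuts = [x for x, value in enumerate(P1) if value > 0]
    return [(xl, xr) for i, xl in enumerate(cuts) for xr in cuts[i + 1:]
            if fd_min <= xr - xl <= fd_max]

# Toy projection with clear peaks at columns 0, 4, and 8.
P = [0.9, 0.2, 0.1, 0.3, 0.8, 0.4, 0.1, 0.2, 0.7, 0.1]
P1 = non_max_suppression(P)
pairs = candidate_pairs(P1, fd_min=3, fd_max=5)
print(pairs)  # [(0, 4), (4, 8)]
```

The pair (0, 8) is rejected because its width of 8 columns falls outside the allowed interval.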
Firstly, we apply $NN_{segm}$ to the cropped text line image scaled to the predefined height and obtain the projection $P$ that represents the score of a cut for every column in the image. Fig. 9 provides the result obtained with the segmentation network for the text line image from Fig. 8.
Then, we build the initial subset $P_1$ of the most probable cuts from $P$ by selecting the local maxima of $P$. Based on $P_1$, we classify preliminary candidate characters with $NN_{class}$. The preliminary candidate characters can be mutually exclusive, i.e., their bounding boxes can overlap. It should be mentioned that we do not classify all the candidate characters. We sort the candidate characters according to the sum of the scores of their forming cuts from $P_1$, and then select the $m$ most probable ones, where $m$ is the integer quotient of the input image width and the mean possible character width from $fd$. After the candidates are classified, we employ dynamic programming to select the best segmentation path. Crucially, the absence of some necessary cuts in $P_1$ does not inevitably lead to an incorrect result, as we allow the discontinuity of the best segmentation path in relation to the preliminary candidate characters. To be more exact, when we build a segmentation path, we look into $P$, not $P_1$, for the cut scores and consider all possible candidate characters. We evaluate each segmentation path as the mean of the scores of its candidate characters. The score $s$ of one candidate character is calculated in (1), where $x_l$ and $x_r$ are the cuts used to form the candidate, and $c$ is the confidence of the first alternative obtained with $NN_{class}$ for preliminarily classified characters, or 0 if the candidate was not classified. After the best segmentation path is chosen, we check it for the missed character candidates and classify them.
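The pre-selection of candidates for classification described above can be sketched as follows. The reading of ''mean possible character width'' as the midpoint of $[fd_{min}, fd_{max}]$ is an assumption, and `select_candidates` is a hypothetical helper name.

```python
def select_candidates(pairs, P1, width, fd_min, fd_max):
    """Keep the m candidates with the largest sum of forming cut scores."""
    mean_width = (fd_min + fd_max) / 2       # assumed meaning of "mean width"
    m = int(width // mean_width)             # integer quotient
    ranked = sorted(pairs, key=lambda p: P1[p[0]] + P1[p[1]], reverse=True)
    return ranked[:m]

# Toy data: local maxima at columns 0, 4, 7, and 8 of a 10-column image.
P1 = [0.9, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.5, 0.7, 0.0]
pairs = [(0, 4), (4, 7), (4, 8)]             # widths within [3, 5]
selected = select_candidates(pairs, P1, width=10, fd_min=3, fd_max=5)
print(selected)  # [(0, 4), (4, 8)]
```

With a width of 10 and a mean character width of 4, only $m = 2$ candidates are sent to the classifier; the weaker candidate (4, 7) is skipped but can still appear in the final path with a confidence of 0.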
Speaking about candidate characters, we obtain their images not from the cropped image, but from the original input one. The reason is that at this stage we analyze the horizontal projections of individual character images and adjust their vertical borders to handle diacritics, descenders, and ascenders. Also, we allow setting a restriction on the width/height ratio of the selected candidate character images as an input parameter to our framework. This restriction is set in the form of the minimal and maximal possible width/height ratio and is used to forbid impossibly narrow or wide symbols. Fig. 10 explains how we acquire the result with dynamic programming in more detail. In Fig. 10, the current image width $x$ is plotted along the horizontal axis and the number of characters $k$ along the vertical one. As we mentioned before, we preset the interval of possible character widths $fd \in [fd_{min}, fd_{max}]$. Thus, we calculate the maximal number of characters $k_{max}$ as in (2), where $w_{img}$ denotes the width of the input image. As we limit the possible width of the character, we do not need to calculate all the possible states in dynamic programming. In Fig. 10, we denote the possible states as green cells. Bright green cells denote the possible initial states, and dark green ones denote the final states. In our solution, we explicitly restrict the possible initial and final states: any initial state satisfies $x \in [0, fd_{max}]$, and any final state satisfies $x \in [w_{img} - fd_{max}, w_{img}]$.
As the initial state of our value function, we set the scores of all the possible first characters to
$$dp(x, 0) = 0.5 \times P(x). \tag{3}$$
In the recurrence for the remaining states (4), $x^*$ denotes the previous cut coordinate, and $C(x^*, x)$ is the confidence of the first alternative obtained with $NN_{class}$ for the candidate character between $x^*$ and $x$, or 0 if the candidate was not classified. Fig. 11 shows part of the best segmentation path selection process for the image from Fig. 8.
We use two separate ANNs instead of an end-to-end solution for two main reasons. The first one is that the character recognition quality depends on the correspondence between the alphabets of the classifier and text lines. The second one is that segmentation can be considered as a language-independent problem for a large number of languages, and, to some extent, as the one independent of the writing system as well. For the same reasons, we do not utilize segmentation methods based on specific knowledge about characters in the considered text lines. Conversely, the recognition ANN will be error-prone if we merge several alphabets. For example, the resulting set of characters from all the Latin-script alphabets would contain about ten ''I'' letters with different diacritical marks; a composition of the English and Russian alphabets would include many identical symbols, so we would be bound to perform a postprocessing stage to obtain the result.
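Returning to the dynamic programming step of this section, the sketch below shows one way the path search could be organized. It is not the authors' implementation: the character score $0.5 \times (P(x_l) + P(x_r)) + c$, the bound $k_{max} = \lfloor w / fd_{min} \rfloor$, and the use of the last column index as the image width are assumptions chosen to agree with the initial state $dp(x, 0) = 0.5 \times P(x)$ and with the restricted initial and final states described above.

```python
def best_segmentation_path(P, confidence, fd_min, fd_max):
    """Return (mean_score, cuts) of the best path under the assumed scoring.

    P          -- cut scores, one per image column
    confidence -- dict {(x_l, x_r): c} of pre-classified candidates (else 0)
    Assumed character score: 0.5 * (P[x_l] + P[x_r]) + c.
    """
    w = len(P) - 1                       # last column index, used as the width
    k_max = w // fd_min                  # assumed bound on the character count
    NEG = float("-inf")
    # dp[k][x]: best sum of k character scores for a path ending at cut x,
    # plus 0.5 * P[x] pre-added as the left half of the next character.
    dp = [[NEG] * (w + 1) for _ in range(k_max + 1)]
    prev = [[None] * (w + 1) for _ in range(k_max + 1)]
    for x in range(min(fd_max, w) + 1):  # restricted initial states
        dp[0][x] = 0.5 * P[x]
    for k in range(1, k_max + 1):
        for x in range(w + 1):
            for x_prev in range(max(0, x - fd_max), x - fd_min + 1):
                if dp[k - 1][x_prev] == NEG:
                    continue
                value = dp[k - 1][x_prev] + P[x] + confidence.get((x_prev, x), 0.0)
                if value > dp[k][x]:
                    dp[k][x], prev[k][x] = value, x_prev
    # Restricted final states; a path is rated by the mean character score.
    best_mean, best_k, best_x = NEG, None, None
    for k in range(1, k_max + 1):
        for x in range(max(0, w - fd_max), w + 1):
            if dp[k][x] > NEG:
                mean = (dp[k][x] - 0.5 * P[x]) / k
                if mean > best_mean:
                    best_mean, best_k, best_x = mean, k, x
    if best_k is None:
        return NEG, []
    cuts, k, x = [best_x], best_k, best_x
    while k > 0:                         # follow back-pointers to the first cut
        x = prev[k][x]
        cuts.append(x)
        k -= 1
    return best_mean, cuts[::-1]

P = [0.9, 0.1, 0.1, 0.2, 0.8, 0.1, 0.1, 0.2, 0.7, 0.1, 0.1, 0.2, 0.9]
confidence = {(0, 4): 0.9, (4, 8): 0.8, (8, 12): 0.9}
mean_score, cuts = best_segmentation_path(P, confidence, fd_min=3, fd_max=5)
print(cuts)  # [0, 4, 8, 12]
```

Keeping the number of characters $k$ as a separate state dimension is what allows paths of different lengths to be compared by their mean score rather than by their sum.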

A. SEGMENTER
As already mentioned, we employ the FCN to compute a vertical projection of the image. Using the information from its receptive field, the segmentation neural network decides whether there is a cut at a given position or not. As a result, we acquire a projection containing the network's scores for a cut at each column in the image. In the inference stage, this neural network can be applied to an image of arbitrary width since all its layers are size-independent. Table 2 describes the architecture of the FCN, which we use in our experiments. The approximate number of trainable parameters of this FCN is $5.7 \times 10^{3}$. To train the network, we use grayscale images of fixed size 131×33 and the ground truth in the form of 131×1 images that represent an ideal projection, i.e., a zero-filled image with one-filled points at the places of correct cuts. All the images and ground truth are taken from the dataset described in Sec. III-D. The training was performed with the minimization of the Euclidean distance between the calculated and ideal projections.
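The training target and loss described above can be illustrated with a short sketch. The helper names `ideal_projection` and `euclidean_distance` are hypothetical, and the cut positions are made-up values; only the 131-column width and the zero/one encoding come from the text.

```python
import math

def ideal_projection(cuts_x, width=131):
    """Zero-filled vector with ones at the columns of the correct cuts."""
    target = [0.0] * width
    for x in cuts_x:
        if 0 <= x < width:
            target[x] = 1.0
    return target

def euclidean_distance(predicted, target):
    """Distance minimized during training, computed for one sample."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)))

target = ideal_projection([0, 40, 85, 130])   # hypothetical cut columns
predicted = [0.0] * 131
predicted[40] = 1.0                           # the network found one of four cuts
distance = euclidean_distance(predicted, target)
print(distance)                               # square root of 3
```

Three missed cuts each contribute a squared error of 1, so the distance equals the square root of 3.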

B. CLASSIFIER
To recognize the characters, we employ an ANN with a light-weight architecture, which is provided in Table 3. This CNN takes grayscale images of fixed size 15×19 as an input. The number of trainable parameters of this CNN depends on the size of the alphabet $|A|$, as its last layer is a fully-connected one. For a classification task with an alphabet size of 30, it is approximately $3.4 \times 10^{4}$.

A. CLASSIFIER EVALUATION
To begin with, we experiment with MNIST to evaluate the error rate of the recognition network with the architecture from Table 3 with $|A| = 10$. We do not employ any additional training data or data preprocessing but use online augmentation [55]. Table 4 provides the results obtained by the suggested CNN and those of previously published studies, and also shows the number of trainable parameters in the applied ANNs. We get an error rate of 0.25% against 0.23% demonstrated by the best non-ensemble model and 0.14% by the ensemble one. In Table 4, we present the best results that can be found in the published papers. Reference [72] provides the results known before 2013. We also provide the human performance error rate (≈0.20%) to emphasize that the majority of the classifiers presented in Table 4 are almost equal to human recognition ability. We claim that this result on the MNIST database supports the applicability of such a light-weight ANN to the OCR problem.

B. SEGMENTER EVALUATION
In the second stage, we focus on the proposed segmentation network and its claimed language-independence. It should be emphasized that by the language-independent method we mean that our segmentation network can be built with a lot of languages and writing systems taken into account. To verify the proposed method, we use the datasets that are described in Sec. III-D. Fig. 12 provides example test images. The images in Fig. 12 are intended to show both the diversity of characters and distortions in the test set. This diversity prevents us from using classic image processing methods.
To evaluate only the segmentation, we exclude the second summand from eq. 1. We calculate the segmentation error rate as in eq. 5, where $L_{total}$ denotes the total number of lines, $l^i_{cuts}$ is the ground truth cuts of the $i$-th line, $P^i$ is the computed projection, and $CM(P^i, l^i_{cuts})$ is the result of the discrete two-sample Cramér-von Mises test [56]. Table 5 provides the segmentation results, the value of eq. 5 if no ''cuts'' were found, and the worst-case value of eq. 5 for each test set. It should be emphasized that the Cramér-von Mises test depends on the number of ''cuts'' in the text line, on the width of the characters, and on the distance between characters. Consequently, the results of one segmentation algorithm on several datasets cannot be directly compared. In the case of Table 5, it means that we can compare the values in one column, but to compare the values in one row, we should take the ''No segmentation'' results into account. In Fig. 13, we provide an example of such non-comparability. Fig. 13a presents the beginning of the text line, and Fig. 13b shows the cumulative sums of the number of cuts from the ground truth and the segmentation result that are used in the Cramér-von Mises test. As one can see, even for the first cut, there is a difference between the ideal and the obtained cuts. Such a small difference is allowable, but it will affect the resulting segmentation rate. The effect caused by such differences accrues with the length of the text line. In Table 5, the segmentation error for the ''Cyrillic'' test set is higher as the average ''Cyrillic'' text line is longer than the ''Chinese'' and ''Armenian'' ones.
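The way small cut offsets accumulate with line length can be illustrated with a simplified discrepancy between the two cumulative curves. This sketch is not the discrete two-sample Cramér-von Mises statistic from [56]; it only sums squared differences of the cumulative cut counts, which is enough to show the accrual effect. The function names are hypothetical.

```python
def cumulative_counts(cuts, width):
    """Number of cuts at or before each column."""
    cut_set = set(cuts)
    counts, total = [], 0
    for x in range(width):
        total += 1 if x in cut_set else 0
        counts.append(total)
    return counts

def cumulative_discrepancy(found_cuts, true_cuts, width):
    """Sum of squared differences between the two cumulative curves."""
    a = cumulative_counts(found_cuts, width)
    b = cumulative_counts(true_cuts, width)
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Every found cut is shifted by one column from the ground truth.
short_line = cumulative_discrepancy([1, 11], [0, 10], width=20)
long_line = cumulative_discrepancy([1, 11, 21, 31], [0, 10, 20, 30], width=40)
print(short_line, long_line)  # 2 4
```

The per-cut error is identical in both lines, yet the longer line receives twice the penalty, which mirrors why lines of different lengths are not directly comparable.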
As one can see from the results (Table 5), the ''mixture'' segmentation network outperforms the other ones even on the Chinese hieroglyphs that were not present in its training set. Also, the ''Cyrillic'' network demonstrates results equal to those of the ''Chinese'' network on the Chinese set, but not vice versa. The reason is, on the one hand, the similarity of the forms of most hieroglyphs, so that the FCN mostly ''saw'' straight line segments and almost no arcs, and, on the other hand, the variety of geometric primitives present in the Cyrillic script. We hypothesize that the results on the Armenian set implicitly confirm this assumption, as the ''Chinese'' network demonstrates worse results than the ''Cyrillic'' one.

C. EVALUATION OF TEXT LINE RECOGNITION FRAMEWORK
All components of our framework consider a single-channel image as an input. Thus, when an RGB image comes in, we average the three intensity values of each pixel. Also, we resize the image if it is necessary for the component, i.e., we resize the image to the height of 33 pixels preserving the width for the segmenter (Sec. IV-A) and we resize each candidate character image to 15×19 pixels as required by the classifier (Sec. IV-B). No other image pre-processing is performed.
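The pre-processing above can be sketched in a few lines. The sketch makes several assumptions that the text does not fix: images are nested lists of pixels, resizing uses nearest-neighbour interpolation, and ''15×19'' is read as width × height. All function and constant names are illustrative.

```python
SEGMENTER_HEIGHT = 33          # input height required by the segmenter
CHAR_WIDTH, CHAR_HEIGHT = 15, 19  # assumption: 15x19 read as width x height


def to_grayscale(rgb_image):
    """Average the three intensity values of each pixel."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in rgb_image]


def resize_nearest(image, new_height, new_width):
    """Nearest-neighbour resize (the interpolation method is an assumption)."""
    old_height, old_width = len(image), len(image[0])
    return [
        [image[y * old_height // new_height][x * old_width // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


def prepare_for_segmenter(rgb_image):
    """Convert to one channel and rescale the height to 33 pixels, keeping the width."""
    gray = to_grayscale(rgb_image)
    return resize_nearest(gray, SEGMENTER_HEIGHT, len(gray[0]))


def prepare_for_classifier(gray_character):
    """Rescale a candidate character image to the classifier input size."""
    return resize_nearest(gray_character, CHAR_HEIGHT, CHAR_WIDTH)
```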

1) MIDV-500 RESULTS
Finally, we experiment with MIDV-500 to evaluate the complete text line recognition framework. We train the segmentation network (Table 2) and the classification network (Table 3) employing traditional single-precision (32-bit) computations. As the evaluation measure, we calculate the per-character recognition rate PCR over all text lines, where $L_{total}$ denotes the total number of lines, $len(l^i_{ideal})$ is the length of the i-th line, and $lev(l^i_{ideal}, l^i_{recog})$ stands for the Levenshtein distance between the recognized text and the ground truth.
Table 6 provides the results of the proposed approach in comparison with Tesseract 4.00, Tesseract 3.05, and ABBYY FineReader 15. In our experiments, we used the modes of both versions of Tesseract OCR and of FineReader that allow recognition of Latin characters, digits, and punctuation marks. In Tesseract OCR, we also disabled postprocessing with dictionaries as, firstly, some of the text lines do not have any language model, and, secondly, our framework does not employ any postprocessing.
To evaluate the time consumption of the proposed algorithm in comparison with Tesseract, we measure the total time (in seconds) necessary for processing all images of each field on a personal computer with an AMD Ryzen 7 1700 CPU, using a single thread (Table 7). We do not provide a comparison with the execution time of ABBYY FineReader, as it is an application with a graphical user interface. Also, we do not provide a time comparison with the methods from Table 4, as they are designed for individual character recognition, not for text line recognition.
A significant problem faced by recognition systems for camera-captured images is errors in document boundary detection. Such errors result in incorrectly rectified field images and distorted characters. To estimate the robustness of our method to such errors, we model document detection errors employing the method described in [11]. For each image available in MIDV-500, we modify the document quadrangle by adding normally distributed noise to each vertex. We change each coordinate by five pixels at most as we want, firstly, to preserve text visibility and, secondly, to avoid highly distorted images that are unsuitable for Tesseract 4.00. Then we acquire the text line images as we did for the original quadrangles and perform the text line recognition process for the resulting images. Table 8 provides the absolute difference between the results from Table 6 and the ones of this experiment.
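The evaluation measure and the vertex perturbation can be sketched as follows. This is a minimal illustration under stated assumptions: PCR is taken as one minus the summed Levenshtein distances divided by the summed ground-truth lengths, with each per-line distance capped by the line length so that the value stays in [0, 1]; the noise standard deviation `sigma` and the clipping used to enforce the five-pixel limit are also assumptions, as the text does not specify them.

```python
import random


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def pcr(ideal_lines, recognized_lines):
    """Per-character recognition rate over a set of lines (assumed form)."""
    errors = sum(
        min(levenshtein(ideal, recog), len(ideal))  # cap is an assumption
        for ideal, recog in zip(ideal_lines, recognized_lines)
    )
    total_length = sum(len(ideal) for ideal in ideal_lines)
    return 1.0 - errors / total_length


def perturb_quadrangle(quad, sigma=2.0, max_shift=5.0, rng=random):
    """Add Gaussian noise to each vertex coordinate, clipped to +/- max_shift.

    `sigma` and the clipping strategy are hypothetical choices.
    """
    def shift():
        return max(-max_shift, min(max_shift, rng.gauss(0.0, sigma)))
    return [(x + shift(), y + shift()) for x, y in quad]
```

For instance, recognizing ''MARlA'' instead of ''MARIA'' and ''TOTAL'' correctly gives one error in ten characters in this sketch, i.e., a PCR of 0.9.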
According to the experimental results, our method outperforms the latest Tesseract 4.00 on all types of text lines in both accuracy and speed. The proposed method is at least 2.07 times faster than Tesseract 4.00. Moreover, in all fields our method yields substantially fewer errors than both Tesseract 4.00 and ABBYY FineReader 15 even though it does not employ any language model postprocessing. The results on the ''Docnum'' images are of particular interest as these text lines, on the one hand, contain unstructured data that are unlikely to be represented in the training dataset and, on the other hand, are one of the most important fields of the document.
Most errors of our method stem from its case insensitivity and the independent recognition of each symbol. To be exact, its most frequent inaccuracy is mistaking the capital ''I'' for the small ''l'' and vice versa, e.g., we obtain ''MARlA'' instead of ''MARIA''. Such inaccuracies could be fixed during the postprocessing stage with language models. As for Tesseract 3.05, which was added due to its popularity in earlier studies, it is substantially surpassed by both the proposed method and Tesseract 4.00. The correspondence between the results of Tesseract 3.05 and ABBYY FineReader 15 is consistent with previously published results.
As for the mentioned speed advantage of our method, we suppose the main reason is that we employ two ANNs with approximately $4.0 \times 10^4$ weights in total, while Tesseract 4.00 employs an ANN with $7.8 \times 10^5$ weights. Judging by the results in Table 8, the proposed framework is more robust than the end-to-end solution employed in Tesseract 4.00. Remarkably, the most severe decrease in Tesseract 4.00 PCR happens in the ''Latin names'' text lines, where it demonstrated the result closest to the proposed method. A possible reason for such an outcome is that isolated recognition errors affect the end-to-end solution more seriously than they influence our method with independent per-character recognition.

2) THE 1961 CENSUS FOR ENGLAND AND WALES SUBSAMPLE RESULTS
The second experiment that we perform to evaluate the whole framework uses The 1961 Census for England and Wales subsample. We use the same segmentation network but retrain the classifier, as the previous one does not contain ''&'' and '','' in its alphabet. The training data is acquired in the same way as described in Sec. III-D. The other difference between this experiment and the one with MIDV-500 is the calculation of character recognition accuracy. The previously published results for this sample are calculated with PRImA TextEval [80]. This tool uses the so-called University of Nevada measure, which is based on the Levenshtein distance, to compute character recognition accuracy. As TextEval is freely available, we use it to get recognition results comparable with the previously presented ones.
Table 9 provides the experimental results of the various methods and systems on the 1961 Census subsample. It should be mentioned that the authors of [61] retrain the Tesseract engine with data obtained from the documents of the same 1961 Census for England and Wales. In other words, their training sample consists of character images printed with a specific font, while we do not adjust the synthetic data generator for this particular document. The acquired results demonstrate that our language-independent model surpasses methods that employ both implicit language models (LSTM in Tesseract OCR 4.00) and explicit ones (dictionaries in [61]).
Moreover, the most frequent error of our method is mistaking ''O'' for ''C'' when the former is distorted, as demonstrated in Fig. 14. Fig. 15 provides the text line images with incorrectly recognized ''O''. Our method is prone to these errors as it does not employ any dictionaries and cannot recognize that ''TOTAL'' is much more probable than ''TCTAL''. Even the simplest dictionary-based postprocessing could significantly improve the accuracy of our method, as this error accounts for 35.10% of all character recognition errors committed by our method.
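To make the last remark concrete, the following sketch shows what a very simple dictionary-based correction might look like. It is not part of the proposed framework, which employs no postprocessing; the confusion table, the dictionary, and the candidate limit are hypothetical and chosen only to cover the ''O''/''C'' and ''I''/''l'' confusions mentioned in this section.

```python
from itertools import product

# Hypothetical confusion table: each character maps to itself followed by
# the characters it is commonly mistaken for.
CONFUSABLE = {"C": "CO", "O": "OC", "I": "Il", "l": "lI"}


def correct_word(word, dictionary, max_candidates=4096):
    """Return a dictionary word reachable by swapping confusable characters.

    The word is returned unchanged if it is already in the dictionary,
    if no candidate matches, or if the candidate set would be too large.
    """
    if word in dictionary:
        return word
    options = [CONFUSABLE.get(char, char) for char in word]
    candidate_count = 1
    for option in options:
        candidate_count *= len(option)
    if candidate_count > max_candidates:
        return word
    for candidate in product(*options):
        joined = "".join(candidate)
        if joined in dictionary:
            return joined
    return word
```

With a dictionary that contains ''TOTAL'', the recognized string ''TCTAL'' is mapped back to ''TOTAL'', while strings without a dictionary match, such as document numbers, are left untouched.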

3) EXTENDED LATIN AND NON-LATIN TEXT LINE EXAMPLES
To check the language-independence of the framework on real data, we applied it to several images from MIDV-500 containing non-Latin characters and Latin characters with diacritical marks. For these experiments, we also trained the classifiers for four different alphabets without re-training the segmenter. We employed the procedure described in Sec. III-D to generate training data for the classifiers. Fig. 16 provides the results acquired for Chinese, Azerbaijani, Czech, and Russian characters. According to the results, we claim that the segmentation network can be made language-independent if trained with enough data. It should be mentioned that for recognition of the Chinese characters, we employ an ANN described in [57]. This ANN is similar to our classifier but provides the embeddings of the input images. Thus, the number of neurons in the last layer does not become enormously large, as it represents the dimensionality of the embedding space, not the alphabet size. To be exact, the architecture proposed in [57] contains $7.7 \times 10^4$ weights, which is twice the size of our basic classifier, but it is light-weight for the case of Chinese characters.
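The benefit of an embedding output can be illustrated with a small sketch. How [57] maps embeddings to character classes is not described in this section, so the nearest-reference rule with Euclidean distance used below is an assumption, and all names are illustrative. The point is that the network output has the size of the embedding, while the alphabet only determines the number of stored reference vectors.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_class(embedding, class_embeddings):
    """Return the label whose reference embedding is closest to `embedding`.

    `class_embeddings` maps each character label to one reference vector;
    using a single reference per class is an assumption.
    """
    best_label, best_distance = None, float("inf")
    for label, reference in class_embeddings.items():
        distance = squared_distance(embedding, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```

Adding a new character in this scheme means adding one more reference vector rather than one more output neuron.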

VI. CONCLUSION
In this paper, we present our method for text line recognition that employs two ANNs interconnected by a dynamic programming algorithm. The primary motivation for the proposed approach is to solve the per-character segmentation task as a language-independent one. As data acquisition is an obstacle for training ANNs for different languages, we utilize only synthetic training data in our central experiment.
We provide a comparison of the recognition accuracy of our method, LSTM-based Tesseract 4.00, the algorithmic method from Tesseract 3.05, and ABBYY FineReader 15 on MIDV-500, the public dataset for camera-captured ID recognition. The acquired results show that our framework is substantially more accurate than ABBYY FineReader 15 and both versions of Tesseract OCR. Also, we provide the results of our method on the 1961 Census for England and Wales dataset. We achieve the highest recognition accuracy in comparison with previously published results of several LSTM-based and algorithmic methods. Moreover, the recognition accuracy of our framework on images with names and other dictionary words can be improved significantly by language model postprocessing. We demonstrate the transferability of the segmentation network to different scripts if connected with the appropriate recognition ANN.
To justify the applicability of the suggested light-weight classifier, we experiment with the classic MNIST dataset and acquire results comparable with the state-of-the-art ones. To examine the segmentation method, we show that the segmentation networks trained on the data with different alphabets perform almost equally on different datasets. We employ synthetic data for this experiment as we need per-character segmentation ground truth.
To conclude, our framework demonstrates the powerful capabilities of employing FCNs for text line segmentation and of using extremely light-weight ANNs for camera-captured image recognition.