Urdu-Text Detection and Recognition in Natural Scene Images Using Deep Learning

Urdu is written in a cursive script and belongs to the non-Latin family of cursive scripts such as Arabic, Chinese, and Hindi. Urdu text poses a challenge for detection/localization in natural scene images and, consequently, for recognition of individual ligatures in scene images. In this paper, a methodology is proposed that covers detection, orientation prediction, and recognition of Urdu ligatures in outdoor images. As a first step, a custom FasterRCNN algorithm is used in conjunction with well-known CNNs such as Squeezenet, Googlenet, Resnet18, and Resnet50 for detection and localization on images of size $320\times 240$ pixels. For ligature orientation prediction, a custom Regression Residual Neural Network (RRNN) is trained and tested on datasets containing randomly oriented ligatures. Recognition of ligatures is done using a Two Stream Deep Neural Network (TSDNN). In our experiments, five sets of datasets containing 4.2K and 51K Urdu-text-embedded synthetic images were generated using the CLE annotation text to evaluate the different tasks of detection, orientation prediction, and recognition of ligatures. These synthetic images contain 132 and 1600 unique ligatures for the 4.2K and 51K images respectively, with 32 variations of each ligature (4 backgrounds and 8 font-color variations). Also, 1094 real-world images containing more than 12K Urdu characters were used for TSDNN's evaluation. Finally, all four detectors were evaluated and compared for their ability to detect/localize Urdu text using average precision (AP). The Resnet50-feature-based FasterRCNN was found to be the best detector with an AP of .98, while the Squeezenet, Googlenet, and Resnet18 based detectors had test APs of .65, .88, and .87 respectively. RRNN achieved an accuracy of 79% and 99% for the 4.2K and 51K images respectively. Similarly, for character classification in ligatures, TSDNN attained a partial sequence recognition rate of 94.90% and 95.20% for the 4.2K and 51K images respectively, and a partial sequence recognition rate of 76.60% for real-world images.


I. INTRODUCTION
As autonomous vehicles and other intelligent devices, such as cell phones/mobile devices [1] and robots [2], come online, they need to understand the environment in which they operate. Urdu is a cursively written language and is more complex than non-cursive languages like English and Russian. Urdu is the national language of Pakistan and an official language in six Indian states [3], hence covering more than 260 million people. Different kinds of instructions for people's guidance are often written in public places, especially on sign-boards, hoardings, and banners along the roads. The non-availability of outdoor datasets for training models is a problem for Urdu text/script, so outdoor Urdu text datasets are needed to evaluate different kinds of learning models and to judge their effectiveness for text detection, orientation prediction, and ligature recognition. Besides that, there is a need for a robust script detection system that can locate Urdu script in complex open outdoor environments from a given input image patch. The detected text may need orientation correction and, consequently, the detected text region is further processed by Optical Character Recognition (OCR) to recognize what has been written. The output, in turn, can be used for content analysis and indexing. The main motivation for Urdu photo-OCR in the modern-day world is its various applications, such as scene understanding, robot/autonomous vehicle navigation, text reading for the visually impaired, image retrieval, data mining of Google Street View images, real-time multi-lingual translation, product advertisement identification, political posters identification, etc.
Detection of Urdu text is challenging due to problems such as the type of script (Nastaliq, Naskh, handwritten, hybrid, etc.), scale, large variance of text patterns, oriented text, complicated backgrounds, color, contrast, perspective distortions, image quality, and variation of text size [4]-[6]. Figure 1 shows key challenges in the detection of Urdu text/script. These issues make it difficult for any photo-OCR system to detect and recognize text with high confidence. Detecting text (especially English and Chinese) in outdoor environments has been studied extensively [7]-[10]. Recently, a little work has been done on outdoor Urdu text detection in images [11] along with Arabic text detection [12], although there are research studies on artificial Urdu text detection in video frames [4]-[6], [13]-[17]. Regarding Urdu script orientation determination, no well-known study exists. Generally, any text detection and recognition system (also called photo-OCR) for outdoor imagery consists of two main stages/pipelines: detection and recognition. Detection is the first step and feeds the subsequent recognition stage in any photo-OCR system; its importance is two-fold, as a first module that cannot robustly identify text regions dooms the later recognition stage and results in a low recognition rate. Natural outdoor text images are more challenging than Urdu text written on documents/books, as different writing styles are found in natural outdoor images, with complex backgrounds, orientations, and illumination, as shown in Figure 2.
In the context of training a machine learning algorithm from given Urdu text images, constructing annotations for these images is a difficult task due to multiple ligatures and multi-line ligatures. Traditionally, projection profiles [18], [19], Connected Component [3], [20], or MSER [21]-[23] based segmentation has been used to partition digital images into multiple segments of text or non-text regions. Text segmentation/detection is thus inevitable for further recognition of text.
Much of the current research on Urdu recognition is performed on cleaned and segmented, artificially generated Urdu Nastaliq text such as Urdu Printed Text Images (UPTI) [24], custom extracted text [15], generated text with a clear background [25], video tickers [26], or handwritten Urdu text [27], where the text is given on a clear white background, as opposed to text extracted from outdoor or real-world images with complex backgrounds. This work is a step in that direction and integrates synthetic Urdu text into natural outdoor images. The words text and script are often used interchangeably in the literature. For purely outdoor Urdu text detection, we were only able to find one relevant study [16]. Chandio et al. [11] and Ali et al. [28] used manually cropped Urdu characters from natural outdoor scenes for recognition of Urdu characters, while in [16] Asghar et al. used custom annotated Urdu images together with Arabic text from ICDAR 2017-MLT [7] for detection; they dealt with Urdu and Arabic text at the word level. Similar studies were done by Ahmed et al. [29] and Oulladji et al. [30] for Arabic outdoor text. Likewise, numerous studies [3], [31]-[36] exist for the recognition of Urdu and give comparisons to prove the worthiness of their research. Besides custom datasets [6], there is only one well-known artificial Urdu dataset for text detection, the Artificial-Urdu-Text-Dataset [37], which is openly available and consists of 1000 video frames from news channels. Table 1 summarizes the detection results on artificial Urdu text in video frames; the highest result achieved so far has a precision of 0.83 [5].
As datasets are the benchmark for evaluating the accuracy of algorithms, Table 2 summarizes the frequently used datasets in Urdu text detection and recognition along with the state-of-the-art datasets used for English and other languages. It can be seen that the English language has the most available datasets [40]-[43], covering horizontal, oriented, and curved text styles. English datasets with multilingual text [7], [8], [15], [44] also cover languages other than Urdu. The number of images in English datasets varies from 500 to more than 60K. Although researchers have mentioned Urdu text datasets [6], [15], [16], [37] for detection and recognition, only one dataset [37] is publicly available for text detection in video frame images. Also, scholars [45], [46] have done bilingual recognition on MSRA-TD [43] and other datasets.
This paper performs a comprehensive analysis of CNN features to determine their ability and effectiveness for Urdu text detection in outdoor pictures at the ligature level, for predicting the orientation of ligatures, and for recognizing individual characters in a text. The scripting style focused on in this study is Nastaliq. To perform a balanced analysis using CNN features, a synthetic dataset with two variations of ligatures is created. The main contributions of this paper are as follows:
• The development of datasets with embedded Urdu text in natural scene images. These synthetic images contain 132 and 1600 unique ligatures for 4.2K and 51K images respectively, with 32 variations of each ligature, i.e., each ligature is projected onto 4 backgrounds with 8 font-color variations.
• Comparison of CNN models' features for evaluating their ability to detect Urdu text using a FasterRCNN algorithm with an enlarged image-input layer. Convolutional features of the well-known CNNs Squeezenet, Googlenet, Resnet18, and Resnet50 are used for comparison.
• The first study in the Urdu text/script detection literature that uses four different kinds of CNN features for detection purposes.
• Prediction and Correction of Oriented Urdu text in images.
• TSDNN based recognition of Urdu ligatures using Resnet50 and Googlenet features together with a BLSTM, for both synthetic and real outdoor images.
The rest of the paper is organized as follows. Section 2 describes the proposed methodology, datasets, the various feature extractors used, the evaluation protocol, RRNN, and TSDNN. Section 3 discusses the experimental results and Section 4 gives a general discussion, while Section 5 concludes the paper and gives future directions.

II. PROPOSED METHODOLOGY
The proposed methodology consists of four main modules. The first is 'Synthetic Dataset Generation', the second is the custom 'FasterRCNN builder', the third is the 'Regression Residual Neural Network' for Urdu text angle prediction, and the fourth is the Two Stream Deep Neural Network (TSDNN) for recognizing the sequence of characters in Urdu text.
The first module generates five synthetic datasets (SD): SDAi, SDA, and SDB. SDAi embeds CLE text ligatures as small images at random locations in other images of size 320 × 240, while SDB contains small images with rotated Urdu text embedded in them; ligatures are rotated in the range of −50° to 50°. SDA further has two sets, SDA4 and SDA51, corresponding to 4.2K and 51K images respectively, for recognizing the sequence of Urdu characters. Similarly, SDB has two sets, SDB4 and SDB51, corresponding to 4.2K and 51K rotated ligature images. A hold-out ratio of 0.9 was chosen for all sets. SDAi also has 4.2K images.
The second module is a custom FasterRCNN builder, which generates four different types of FasterRCNNs. It consists of two sub-modules: a CNN model-selection module and a training module, which trains FasterRCNN using the features from the selected CNN model. This module has two inputs, an image and the location of the Urdu text in that image. Each CNN input layer is changed according to the parameters mentioned in column 5 of Table 3; this is a preprocessing step before any further processing by the second module. Figure 4 illustrates the working of the proposed methodology. Initially, an input image with a bounding-box annotation (rectangular coordinates) is given along with the selected CNN model. The bounding box(es) identify the region of interest (ROI) in the CNN features, and with these bounding boxes, constrained features from the training images are selected. FasterRCNN then trains a classifier and a regressor/bounding-box predictor for the given regions. The classifier classifies a region as foreground (text) or background, while the regressor learns to predict the positions of potential bounding boxes in each image. The third module is a Regression Residual Neural Network (RRNN), which is trained to predict the orientation (angle of rotation) of a given Urdu ligature. The fourth module is the Two Stream Deep Neural Network (TSDNN), which takes the detected or ground-truth rectangles in the image and recognizes the text/ligature written there; essentially, it is a kind of sequence-to-sequence classifier.
In the end, accuracies are calculated for each module using the standard measures of the computer vision field.

A. DATASET PREPARATION FOR DETECTION
The presence of quality, uniform training data is key to training any text detector and results in a good-quality detector. For our comparison purposes, we generated more than 4.2K images (SDAi) from more than 132 unique ligature images/classes based on the textual annotation of the CLE [39] dataset. Note that the original CLE dataset has ligature images with a black background and varying aspect ratios; a few images from the CLE dataset are shown in Figure 6. We take the textual annotation of each ligature from CLE, generate a transparent ligature image for it, and embed it in outdoor natural scene images with 32 variations at random locations. These variations are created by using 4 different backgrounds and rendering the Nastaliq font in 8 different colors, namely 'cyan', 'white', 'red', 'green', 'blue', 'black', 'magenta', and 'yellow'. The dataset (SDAi) is divided into two parts: the first part consists of 3801 training images and the second part consists of 423 test images, a hold-out split of 90% training and 10% testing. Synthetically generated ligature variations with different backgrounds and font colors from our dataset are shown in Table 4.
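This composition step can be pictured with the short sketch below, assuming a transparent ligature image rendered from the CLE annotation; the file paths, helper name, and Pillow-based approach are illustrative and not the authors' actual generation code.

```python
# Minimal sketch: embed a transparent ligature image into a background at a random
# location with a random font color, and return the bounding-box annotation.
# File names and helper are hypothetical; assumes the ligature is smaller than the background.
import random
from PIL import Image

FONT_COLORS = ['cyan', 'white', 'red', 'green', 'blue', 'black', 'magenta', 'yellow']

def embed_ligature(background_path, ligature_path, out_size=(320, 240)):
    bg = Image.open(background_path).convert('RGB').resize(out_size)
    lig = Image.open(ligature_path).convert('RGBA')      # transparent ligature image

    # Recolor the ligature strokes, using its alpha channel as the stroke mask.
    colored = Image.new('RGBA', lig.size, random.choice(FONT_COLORS))
    colored.putalpha(lig.split()[-1])

    # Paste at a random location that keeps the ligature fully inside the image.
    x = random.randint(0, out_size[0] - lig.width)
    y = random.randint(0, out_size[1] - lig.height)
    bg.paste(colored, (x, y), colored)

    bbox = [x, y, lig.width, lig.height]                 # [x, y, width, height] annotation
    return bg, bbox
```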

B. CUSTOM FEATURE FASTERRCNN BUILDER
Convolutional Neural Networks (CNNs) or Deep Neural Networks (DNNs) are a special class of neural networks developed primarily to identify, locate, and recognize visual features directly from 1D, 2D, or ND matrices. In this study, image pixels serve as the input to the CNNs. Four existing deep neural networks, namely Squeezenet, Googlenet, Resnet18, and Resnet50, were used as feature extractors for training the FasterRCNN algorithm. The details of each CNN model are given in Table 3, which shows the name of the layer from which the features are taken for further processing by FasterRCNN, and the size to which the query image is resized before feature extraction by the corresponding layer. The four FasterRCNNs are trained for 20 epochs each using the stochastic gradient descent with momentum (SGDM) algorithm for weight updates. The learning rate was set to 0.001 for all 20 epochs. An overlap ratio of 0.5 to 1 was used to define positive samples for learning. The anchor boxes chosen for training all FasterRCNNs were [67, 57], [136, 116], and [272, 232]. The hold-out ratio was kept at 0.9, resulting in 3801 images as the train set and 423 images as the test set. Train-set images were used for training, while test-set images were used for testing the detection accuracy. Table 3 also mentions the number of layers in each of the selected CNNs, along with the name of the layer used for extracting the CNN features necessary for training FasterRCNN. The detailed inner working of this module can be understood from Algorithm 2.
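As a rough illustration of this detector setup (the authors implemented it in Matlab), the following PyTorch/torchvision sketch wires a Resnet50 feature extractor into FasterRCNN with approximately square anchors and the reported SGDM settings; the layer cut, anchor sizes, and torchvision usage are assumptions, not a reproduction of the original code.

```python
# Rough torchvision approximation of the described detector setup. Only the reported
# hyperparameters come from the paper: SGDM, learning rate 0.001, 20 epochs, text vs. background.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Backbone: a Resnet50 feature extractor (Squeezenet, Googlenet, or Resnet18
# features would be plugged in the same way).
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc
backbone.out_channels = 2048

# Anchor sizes chosen to loosely cover the paper's boxes (~[67,57], [136,116], [272,232]).
anchor_generator = AnchorGenerator(sizes=((64, 128, 256),),
                                   aspect_ratios=((0.85, 1.0, 1.2),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=2,  # Urdu text vs. background
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# A standard 20-epoch detection training loop over the 3801 training images
# (images plus their text bounding boxes) would follow here.
```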

1) SQUEEZENET
Squeezenet is a small deep neural network with a size of only 5 MB. It was released in 2016 and has a top-5 accuracy of 80.3% on ImageNet [48]. Squeezenet was developed with the goal of creating a smaller neural network, with fewer parameters, that can more easily fit onto mobile computers and be transmitted over a network [49]; its architecture is explained in detail in [50]. The main idea in Squeezenet is the squeeze and expand blocks.
In the squeeze block, 1 × 1 (point-wise) filters replace 3 × 3 filters and act as a bottleneck that reduces the channel depth of the computation before the following expand block, which uses both 1 × 1 and 3 × 3 filters. Together these two blocks are called a fire module, as shown in Figure 7. The Squeezenet network consists of a stack of fire modules, several max-pooling layers, and a global average pooling layer. The feature map size remains the same through the squeeze and expand layers of the fire module, while pooling reduces the spatial size of the feature maps. The network expects an input of 227 × 227 pixels.
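For reference, a minimal fire-module sketch in PyTorch (class name and channel counts are illustrative) shows the squeeze bottleneck followed by the parallel 1 × 1 and 3 × 3 expand convolutions described above.

```python
# Minimal sketch of a SqueezeNet fire module: a 1x1 "squeeze" bottleneck followed by
# parallel 1x1 and 3x3 "expand" convolutions whose outputs are concatenated.
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                    # channel bottleneck
        return torch.cat([self.relu(self.expand1x1(x)),   # point-wise expansion
                          self.relu(self.expand3x3(x))],  # 3x3 expansion
                         dim=1)                           # spatial size is preserved

# e.g. Fire(96, 16, 64, 64) maps a 96-channel map to a 128-channel map of the same size.
```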

2) RESNET18
Resnet stands for Residual Network and is based on the concept of skip connections [51], as shown in Figure 8. Residual networks are constructed by utilizing skip connections, or jumps over some layers, that connect the feature values of non-adjacent layers. Typically, there are single-layer jumps or skips. An additional weight matrix may be used to learn the skip weights; such models are known as HighwayNets. ResNet-18 [52] is a convolutional neural network (CNN) trained on more than a million images from the ImageNet [48] dataset. The network originally has a depth of 18 layers and its architecture is explained graphically in [53]. It can classify images into 1000 object categories, including common household items such as a keyboard, remote, TV, mouse, and pencil, as well as many animals. Having been trained on so many ImageNet examples, this network has learned rich feature representations for a wide range of images. The network expects an input of 224 × 224 pixels.
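A basic residual block can be sketched as follows, assuming the identity-mapping case without down-sampling; this is a generic illustration of the skip connection, not the exact ResNet-18 block configuration used in this work.

```python
# Minimal sketch of a ResNet-style basic block: the input bypasses two 3x3 convolutions
# and is added back to their output before the final activation.
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)          # add the bypassed input back
```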

3) RESNET50
ResNet-50 is also a CNN trained on more than a million images from the ImageNet [48] dataset, and is a deeper variant of the residual architecture. In its architecture, the output of an earlier layer bypasses two layers and is added back, and this pattern is repeated up to the fully connected (FC) layer. Bypassing two layers is a key intuition; as per the authors, bypassing a single layer did not give much improvement. The two-layer jump between layers can be thought of as a Network-In-Network model.

4) GOOGLENET
Googlenet [55] is a 144-layer pre-trained convolutional neural network and the winner of the ILSVRC 2014 competition. It was developed by a team at Google and is sometimes known as Inception V1. Googlenet had a top-5 error rate of 6.67% in the competition, which was very close to human-level performance and made it challenging for the competition organizers to evaluate the results, since its performance was rated against human performance. Pretrained Googlenet models are available trained on two kinds of datasets, either ImageNet [48] or Places365 [56]. The first network was trained on ImageNet like the other mentioned CNNs and can classify images into 1000 classes. A variant model, trained in the same way on Places365, classifies images into 365 different place categories such as field, park, runway, and lobby.

C. FASTERRCNN
The time for generating region proposals in FasterRCNN is much smaller than with the edge box algorithm. Succinctly, the ranking of anchor boxes is done by the RPN, which indicates the anchor boxes most likely to contain objects of interest. So, generating region proposals is faster in FasterRCNN and is better adjusted to the input data. FasterRCNN produces two types of output: one is the classification category and the other is the coordinates of the predicted rectangles. The basic components and working mechanism of FasterRCNN are nicely depicted in [60].

D. EVALUATION PROTOCOL (AVERAGE PRECISION)
Average precision (AP) or mean average precision (mAP) is the most commonly used metric for measuring the accuracy of object detectors like R-CNN, Faster-RCNN, YOLO [62] (also known as YOLOv2), SSD [63], etc.; the same terms are also familiar from information retrieval. mAP is the average of AP: the AP for each class is calculated and then the values are averaged. Under the Common Objects in Context (COCO) [44] evaluation context, there is no difference between AP and mAP. Average precision computes the average of the precision values over recall values from 0 to 1.
In the PASCAL Visual Object Classes (VOC) [65] and MS COCO [44] challenges, AP is an appropriate and commonly used metric for object detection and information retrieval tasks. In our context of Urdu text detection, AP depends upon the concept of IoU (intersection over union of rectangles or bounding boxes), as shown in Figure 5.
In Figure 5, Y represents the ground-truth rectangle and S represents a detected/predicted rectangle containing text. The final decision on whether text has been detected is made according to the following condition.

IF IoU > ϕ [True or Text Detected] ELSE [False or Text Not Detected] END
Here ϕ represents a certain threshold, say 0.5. If the rectangle is correctly identified against an annotation, then it is a true positive (TP). If a rectangle is predicted but there is no text, then it is a false positive (FP).

$\text{precision} = \dfrac{\text{Intersection of GT (Object) and Predicted Rectangle}}{\text{Predicted Rectangle}}$ (1)
Also, if there is text in an image according to the annotation but the system does not detect it, then it is a false negative (FN). Mathematically, Eq. 1 shows the precision, where GT represents the ground-truth object and the other term is the predicted/detected rectangle. It can also be understood easily in graphical form, as shown in Figure 9.
For calculating AP, we iterate through all image queries as per Eq. 1, while mAP is calculated using Eq. 2,

$\text{mAP} = \frac{1}{Q}\sum_{i=1}^{Q} AP(q_i)$ (2)

where Q is the number of image queries and $AP(q_i)$ is the average precision for a given query image category $q_i$ (a single image can contain multiple detections). It simply means that, for a given query image category $q_i$, we compute its corresponding AP, and the mean across all query image categories of all the AP scores gives a single number, called mAP. It quantitatively describes how good the trained model is at detecting rectangles with respect to the ground-truth rectangle coordinates.
So, AP measures the accuracy of the predictions, i.e., the percentage of positive predictions that are correct with respect to the retrieved corpus [66].
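The IoU test and the precision of Eq. 1 can be made concrete with a small sketch for axis-aligned [x, y, width, height] boxes; this is an illustrative computation, not the evaluation code used in the experiments.

```python
# Illustrative sketch of the IoU decision rule and the precision measure of Eq. 1
# for axis-aligned boxes given as [x, y, width, height].
def intersection_area(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    return max(0, x2 - x1) * max(0, y2 - y1)

def iou(gt, pred):
    inter = intersection_area(gt, pred)
    union = gt[2] * gt[3] + pred[2] * pred[3] - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(gt, pred, phi=0.5):
    # IF IoU > phi -> text detected (true positive), ELSE false positive
    return iou(gt, pred) > phi

def precision_eq1(gt, pred):
    # Eq. 1: overlap of ground truth and predicted rectangle over the predicted area
    return intersection_area(gt, pred) / (pred[2] * pred[3])
```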

E. LIGATURE ORIENTATION PREDICTION
A residual classification network is adapted to work as a regression residual neural network (RRNN) to learn the ligatures' angular rotations [66]. The 63-layer residual neural network with a regression layer is shown in Figure 10. Similar to the Resnet architecture, the RRNN has residual or direct connections from preceding layers that bypass the normal flow of parameter values to forward layers. These residual/shortcut connections enable gradients to propagate more easily from the regression layer back to the earlier layers of the network during training, which makes it possible for those layers to robustly learn the desired features for a given problem. The RRNN, with an image input layer of size 140 × 100, was trained on 3801 images and tested on 423 images (SDB4), and it was also trained on 46036 images and tested on 5115 images (SDB51). A learning rate of 1e-3 and a mini-batch size of 50 were chosen for training. The predictions of the model are shown in Figure 11, where the green line shows the ground-truth angle and the red line shows the predicted angle for a given ligature image.
Various predictions for four ligature images are given. As the detected ligature's angle of rotation can be determined by the RRNN, its orientation can also be corrected, as shown in Figure 12. Orientation-corrected ligatures can be seen on the right side of the figure, tiled against a black background. These corrected ligatures can then be fed to an OCR for improved text recognition. The model can correct ligature rotations in the range of −50° to 50°.
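A minimal sketch of such an angle regressor is given below, assuming a generic residual backbone with a single-output regression head trained with a mean-squared-error loss; the exact 63-layer RRNN architecture is not reproduced, and only the reported learning rate and mini-batch size are taken from the text.

```python
# Minimal sketch of adapting a residual classification network into an angle regressor.
# Hyperparameters from the text: learning rate 1e-3, mini-batch size 50; the backbone
# choice and loss are assumptions.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)   # single regression output: angle in degrees

criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, angles):
    """images: (N, 3, 100, 140) ligature crops (H x W assumed), angles: (N, 1) in [-50, 50]."""
    optimizer.zero_grad()
    loss = criterion(model(images), angles)
    loss.backward()
    optimizer.step()
    return loss.item()

# A prediction is counted as correct when it falls within a tolerance Th of the ground-truth
# angle (Th = 3 or 5 in the experiments reported below).
```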

F. LIGATURE RECOGNITION
The Two Stream Deep Neural Network (TSDNN) module in the proposed methodology is inspired by [67], with the differences of handling color images, the number of features, and using the CNN features found to be robust for Urdu text detection, i.e., Googlenet and Resnet50. These two CNNs were selected based on their performance in the detection step of our experimentation. The length of the feature vector obtained from Googlenet is 1 × 104388, and Resnet50 gives a feature vector of length 1 × 200704. When these two features are concatenated, they give a combined feature vector of 1 × 304192. These features are then arranged into a grid pattern of 4992 × 64, which has 319488 cells or values, including zero padding for values beyond the true number of feature values. Here the 64 columns correspond to the 32 Unicode characters that need to be learned, i.e., two values per Unicode character serve as training targets in the network.
They are arranged so that the features have a positional dependency that is learned in the later stages by a double-layer BLSTM. The features extracted in the earlier steps are treated as a sequence of inputs σ, and the corresponding ligature labels as a sequence of target vectors ρ. In our case, ρ has a maximum length of 32 primary Urdu characters, inclusive of spaces; if a ligature is shorter than 32 characters, the remaining positions are treated as space characters. A learning rate of 0.0001 and mini-batch sizes of 40 and 200 were selected for the SDA4 and SDA51 models respectively. The TSDNN module shown in Figure 4 succinctly depicts the main components of this module. Also, Algorithm 1 describes the steps required for training the TSDNN model, while Algorithm 3 describes the steps to recognize the Urdu ligature text in test-set images.
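The sequence stage can be sketched roughly as follows, assuming the two CNN feature vectors are already extracted; the hidden size, class count, and padding scheme shown here are assumptions, used only to illustrate the 4992 × 64 grid arrangement and the two-layer BLSTM.

```python
# Rough sketch of the TSDNN sequence stage: concatenated CNN features are zero-padded,
# reshaped into a 4992 x 64 grid, and fed to a two-layer bidirectional LSTM emitting
# per-step character scores (the exact layer configuration is not reproduced here).
import torch
import torch.nn as nn

NUM_CLASSES = 40          # hypothetical size of the Urdu character label set (incl. space)
SEQ_LEN, FEAT_DIM = 64, 4992   # 64 steps cover 32 characters, two steps per character

class TSDNNHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(input_size=FEAT_DIM, hidden_size=256,
                             num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * 256, NUM_CLASSES)

    def forward(self, googlenet_feat, resnet50_feat):
        # Concatenate the two CNN feature vectors and zero-pad to 64 * 4992 = 319488 values.
        feats = torch.cat([googlenet_feat, resnet50_feat], dim=1)   # (N, combined_len)
        padded = torch.zeros(feats.size(0), SEQ_LEN * FEAT_DIM, device=feats.device)
        padded[:, :feats.size(1)] = feats
        grid = padded.view(-1, SEQ_LEN, FEAT_DIM)                   # (N, 64, 4992)
        out, _ = self.blstm(grid)                                   # (N, 64, 512)
        return self.classifier(out)                                 # per-step character scores
```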

III. EXPERIMENTAL RESULTS
To validate the proposed methodology, its potential is demonstrated through experiments on each submodule, tuning the parameters for each module. A brief description of the conducted experiments is given below. The proposed methodology was implemented in Matlab and run on NVIDIA GTX 1070 GPU hardware. Two sets of synthetic datasets were used: first, a total of 4.2K ligature images were processed, and second, 51K ligature images were processed, to evaluate the different modules.

A. DETECTION OF LIGATURES
Synthetic outdoor text dataset images (SDAi) were presented to the four trained FasterRCNNs, each trained on features extracted by a different well-known CNN. Each trained model produced its output in the form of rectangles or bounding boxes on the input images. Table 5 shows the predicted bounding boxes for the same ligature query image using the different CNN model features. Column 1 shows the query image containing the ligature, while columns 2 to 5 show the rectangles predicted by FasterRCNN with the corresponding feature extractors. It can be seen that column 2 has multiple detections for the Squeezenet-feature-based FasterRCNN. The Resnet50-feature-based FasterRCNN achieved the highest test AP of .9812, while Googlenet (test AP .8836) and Resnet18 can be ranked 2nd and 3rd respectively in terms of their ability for text detection, albeit Resnet18's train AP (.9366) is higher than Googlenet's (.8816). The worst test-set AP of .6585 was obtained using Squeezenet; the train-set average precision reflects a similar pattern. Although Squeezenet was created with fewer parameters and has a small size for use on mobile computers, it appears that fewer parameters may be adequate for classification purposes but do not make it a suitable feature extractor for Urdu script/text detection. Figure 13 graphically summarizes the results of our experiments and the overall performance of all four CNN models on both the train and test sets, where the Resnet50 performance lines for the train and test sets show it as the top performer, at the cost of the most training time. Similarly, Squeezenet can be seen as the worst performer for the Urdu text detection problem.

B. ORIENTATION OF LIGATURES
To judge the robustness of the RRNN, two types of experiments were carried out, with different numbers of training and testing images. In the first experiment, the RRNN was trained using 3.8K images (SDB4) and evaluated on the 0.42K test images. Using a threshold Th of 3, an accuracy of 79.70% on the training set and 55.34% on the test set was achieved, while with a threshold Th of 5, training and testing accuracies were 95.16% and 79.33% respectively.
In the second experiment, the 51K (SDB51) images were used, with 46036 images in the train set and 5115 in the test set. The RRNN was again evaluated using a threshold Th of 3, achieving an accuracy of 92.37% on the training set and 97.77% on the test set, while with a threshold of 5, training and testing accuracies were 99.99% and 99.06% respectively. These results show that more training data improves accuracy. Table 6 and Table 7 depict these results.

C. RECOGNITION OF LIGATURES (SYNTHETIC DATASET)
Ligatures/Urdu text are considered a sequence of primary characters of the Urdu script, with a maximum of 32 basic isolated characters. The recognition rate was calculated for the given ligature images based on two measures, exact and partial sequence matching. A sequence may match only partially, i.e., differ by one or more characters; such sequences are not counted as a positive match under exact (one-to-one) matching but are under partial matching. The partial sequences of recognized characters or Urdu ligatures were matched using the Levenshtein distance [68], also known as edit distance [69]. Table 8 shows the difference between the one-to-one and partial matching mechanisms and how it changes the accuracy values. An exact/one-to-one match happens only if every character appears in the correct order and is the same [67]; even a single character mismatch is considered a zero match.
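The two matching measures can be illustrated with a generic Levenshtein-distance sketch; the exact scoring formula used by the authors is not given, so the partial-match rate below is only one plausible character-level interpretation.

```python
# Illustrative sketch of exact vs. partial (edit-distance based) sequence matching
# for recognized ligatures (a generic Levenshtein implementation, not the authors' code).
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two character sequences.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def exact_match(pred, truth):
    return pred == truth                     # any single mismatch counts as zero

def partial_match_rate(pred, truth):
    # Fraction of characters recovered correctly; 1.0 for a perfect sequence.
    return 1.0 - levenshtein(pred, truth) / max(len(truth), 1)
```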
To judge the capability of TSDNN, two types of experiments were conducted to determine the accuracy of the model, with different numbers of training and testing images. In the first experiment (SDA4), 3.8K images were selected for training and 0.42K images for testing, while in the second experiment (SDA51), 46K images were used for TSDNN training and 5.1K images for testing. A few images from SDA51 were eliminated from training or testing for minor reasons, such as annotated rectangular coordinates being too close to or crossing the boundary of the image. For the SDA4 test set we obtained a partial sequence recognition rate of 94.90%, while the same measure for SDA51 reached 95.20%. Finally, TSDNN generates editable Urdu text. Table 9 elucidates the results on both the train and test sets.

D. RECOGNITION OF LIGATURES (NATURAL DATASET)
To demonstrate the robustness of TSDNN, it was also trained and tested on more than 1000 real outdoor images, which were collected and annotated by us. A few of the images from this dataset were shown earlier in Figure 2; it can be seen that the real-world images contain Urdu text of varying length. Overall, this dataset contains more than 12000 (12K) Urdu characters. Six types of experiments were carried out to determine the sequence classification capability of the proposed TSDNN model. In the first experiment, TSDNN was trained on real images and tested against real-world images, while in the other experiments, the real-world test images were tested against the 4K and 51K models, which were trained only on synthetic images, and showed very fair results. Table 10 shows the outcome of all experiments.
The synthetic models show very good results even on real-world images on which they were never trained. The 46K synthetic-image model has a performance of 69.55%, which is only about 7% short of the 76.6% achieved by the model trained on real-world images. This further highlights that, as real-world images have a variety of font styles and noise, further expanding the synthetic data with noise and font-style augmentation could narrow the performance gap between the synthetic-image model and the model trained on real-world images. A few of TSDNN's recognition results on real outdoor test images are shown in graphical form in Table 11, where the columns show the original and predicted text, and a small green band marks a missing character in the predicted text.

IV. DISCUSSION
We presented a novel methodology for Urdu text, covering the entire spectrum of text detection, orientation prediction, and recognition. Synthetic Urdu datasets with text embedded in outdoor images were also generated. To determine the best CNN for Urdu text detection, the first set of 4.2K synthetic dataset images was taken as input to the CNNs. Four different CNNs were used as feature extractors to train four FasterRCNN models, which were then evaluated to compare their abilities for the task of Urdu text/script detection. FasterRCNN performed best for detection using Resnet50 based features, with an AP of .9812. Googlenet and Resnet18 have comparable performance, while Squeezenet features do not seem well suited for Urdu text detection.
To determine the rotation angle of the detected ligatures, a custom residual regression neural network was trained on both sets of the synthetic datasets. Its accuracy in determining the angle of rotation of Urdu text was 79.33% for the first set and 99.06% for the second set. These trained models can be further utilized to correct the angle of text or to improve the recognition accuracy of text in the subsequent OCR stages.
TSDNN used Googlenet and Resnet50 features in combination with a 2-layer BLSTM to recognize Urdu text. The model achieved a training accuracy of 99% on both sets, while test accuracies of 94.9% and 95.2% were achieved for the 4.2K and 51K datasets respectively. Also, the exact-match rate of 46.92% on the test set shows that increasing the number of images improves exact-match accuracy. The TSDNN results on real-world images are very encouraging because the synthetic models were not trained on words or even lines containing multiple characters, yet were able to handle Urdu text of a variety of lengths in real outdoor images fairly well. The experimental results demonstrate that the proposed methodology is effective for both Urdu scene text detection and recognition, due to its high accuracy and the completeness of Urdu text structure feature extraction using Resnet50 and Googlenet. It is also observed that more artificial training samples are essential for improving and validating the Urdu text recognition rate and orientation prediction rate.

V. CONCLUSION AND FUTURE DIRECTIONS
In this paper, the proposed methodology covering the essential phases of photo-OCR was presented and evaluated using various sub-architectures. These phases include the detection of text, the orientation determination of text, and finally recognition of the written text in outdoor images. Five dataset subsets, with a maximum of 51K images, were also generated for evaluating the three main phases of photo-OCR; the presented work assesses performance on the datasets SDAi, SDA4, SDA51, SDB4, and SDB51. With the proposed methodology, the achieved performance is very encouraging. For the detection problem, the Resnet50 based FasterRCNN was found to be best for Urdu text detection, while for text rotation angle prediction our RRNN was found to be very successful.
Regarding Urdu text recognition, the proposed TSDNN neural network was a resounding success for both synthetic text and real-world outdoor images. TSDNN finally generates editable Urdu ligatures in Unicode format.
In the future, large datasets with more transformations covering noise, rotation, color, background, and font variations are strongly needed for judging the robustness of this and other techniques; they are necessary for any successful Urdu script/text detector as well as for a robust sequence classifier for Urdu text. Features from other CNN models such as Inception [70], Nasnet [71], and Xception [72] can also be evaluated. Powerful detectors ultimately reduce the false outcomes of photo-OCR systems, which will improve the results of OCR in terms of recognition rate.