Robust Arabic and Pashto Text Detection in Camera-Captured Documents Using Deep Learning Techniques

In the realm of Document Image Analysis (DIA), the primary objective is to transform image data into a format that can be readily interpreted by machines. Within a DIA-based system, layout analysis plays a crucial role in pre-processing, enabling the identification and extraction of precise and error-free textual segments. However, document images in the Pashto language have not been explored so far. Pashto text detection in camera-captured documents is a challenging task due to variations in image quality, lighting conditions, complex backgrounds, unavailability of labeled documents, cursiveness, shape-context dependency, multiple scripts per image, and language-specific layouts. This research examines the case of Pashto and Arabic text and contributes in two aspects. First, it introduces a real dataset that contains 1080 images of Pashto documents captured by a handheld camera. Second, it examines deep learning based classifiers that perform layout analysis and detect Pashto and Arabic text per document. For layout classification, we used deep learning models, namely Single-Shot Detector (SSD), Yolov5 and Yolov7. Baseline results are obtained by using 30% of the images as a test set, achieving a mean average precision (mAP) of 84.51% on SSD, 88.50% on Yolov5 and 91.30% on Yolov7. The proposed methods have the potential to contribute to various applications, such as document analysis, information retrieval, and translation, for Pashto and Arabic language users.


I. INTRODUCTION
Document images are digital images produced by either a scanner or a camera. Such documents include books, articles, postal addresses, bank cheques, forms, topographic maps, engineering drawings, license plates, billboards, etc. [1]. These document images are in pixel form, and they cannot be searched or analyzed by computers [2]. In addition, these images occupy a large amount of storage space. To convert such document images into a machine-readable format, we need a system that can analyze them. This field is called Document Image Analysis (DIA). DIA is a subfield of Artificial Intelligence (AI) and can be further categorized as an application of computer vision (CV). The DIA system and its five components are shown in Figure 1. The major focus of this research is on layout analysis and classification, as highlighted in Figure 1.
The associate editor coordinating the review of this manuscript and approving it for publication was Turgay Celik.
Layout analysis and classification is the third step, which is used to find the physical structure of a document and to determine its components. The fundamental components of a document are text-lines, scripts, number of columns, and non-textual regions (such as graphs, tables, figures, charts, etc.). Before the OCR step, these components need to be classified, and their order of composition and hierarchy information need to be preserved. After the OCR stage, this preserved information is required to render the given document back [3].
This research addresses the problem of layout analysis and classification for the Pashto language and analyses the most basic layouts with respect to the DIA system. For the Pashto language, there is little work in the area of DIA. The reasons are language-specific complexities, which include writing direction, the presence of different languages per document, language-specific layouts, etc. For example, several Pashto documents contain Arabic as well as Pashto text in a single document (as shown in Figure 2). Ignoring this specific pattern leads to the extraction of textual blocks that contain Arabic text, Pashto text, or a mix of both. As a result, it becomes a multi-language case for OCR that cannot be handled easily by a single OCR system.
The major contributions of this research are given below.

II. RELATED WORK
There is extensive research on DIA systems. However, this research is mainly concerned with document layout analysis. Therefore, the related work covers only the layout analysis stage of a DIA system. The following section discusses the most closely related work in detail.

A. WORK REGARDING LAYOUT ANALYSIS
O'Gorman and Kasturi [4] presented a comprehensive book that covers a document spectrum system for the structural layout analysis of a page. They discuss various techniques, including bottom-up, nearest-neighbor clustering of page elements. The book can be considered a good starting point for those who have an interest in DIA systems.
Simon et al. [5] introduced a bottom-up approach for document layout analysis that uses Kruskal's algorithm [6] to build the structure of the real page with a particular distance metric between the segments. The computational complexity of their algorithm is limited, since it is linear in the number of relevant elements [4].
Thomas Breuel and Warren [7] introduced many novel algorithms and statistical methods for layout analysis. The methods consist of (1) finding rectangles of tall white space and assessing them as candidates for channels, column separators, etc., (2) finding the text-lines with respect to the column structure of the document, (3) recognizing paragraphs, headings and titles based on spacing, size, indentation, etc., and (4) determining the reading order by using geometric and linguistic information. These algorithms are also applicable to cursive script.
Laven et al. [8] presented an algorithm using statistical patterns, such as grammar-based and rule-based techniques. They first introduced a software tool for the manual segmentation and labeling of pages. Their dataset contains 932 page images from academic journals.
Shafait et al. [9] introduced a system for the layout analysis of cursive script. Their scheme was tested on 25 scanned images taken from various sources such as magazines, newspapers, and books. Their algorithm obtained 90% precision in line detection, while for newspaper images 72% precision was achieved. Shafait [10] also proposed a system for classification and layout analysis that applies Breuel's Roman-script text-line model [11] to Nastaliq script.
Erkilinc et al. [12] introduced an algorithm for document classification and page layout analysis. They tested one module for text detection, which is based on wavelet analysis and the Run Length Encoding (RLE) [13] method, and a second module to detect the image and graphic sections in the input document.
Bukhari et al. [14] introduced a layout analysis system that provides well-organized and robust text and non-text segmentation, text-line extraction, and reading order determination methods for Urdu and Arabic document images.
Tran et al. [15] proposed a system for the analysis of textual and non-textual elements in document images. Their method combines white-space analysis with multi-layer uniform areas. The system was validated on the page segmentation competition held by ICDAR-2009 [16]. They achieved above 90% accuracy for text detection, non-text detection, text region detection, and non-text region detection.
Ahmad et al. [17] introduced a text-line extraction method for the extraction of titles and large headings in cursive script, i.e., Pashto [18], [19]. Their method is based on the Horizontal Projection Profile (HPP) [20] and the Hanning window smoothing technique [21]. They obtained an accuracy of 99.30%. However, the system needs de-skewed images for better performance.
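As an illustration of the general idea behind profile-based line extraction, the following is a minimal sketch in plain Python. It is not the implementation of [17]: the function names, the window size and the threshold are hypothetical, and the page is assumed to be already binarized (1 for ink, 0 for background) and de-skewed.

```python
import math

def horizontal_projection(binary_rows):
    """Count ink pixels (value 1) in every row of a binarized page."""
    return [sum(row) for row in binary_rows]

def hanning_window(size):
    """Hann window w(n) = 0.5 - 0.5*cos(2*pi*n/(size-1)), normalized to sum to 1."""
    if size < 3:
        raise ValueError("window size must be at least 3")
    weights = [0.5 - 0.5 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(profile, window):
    """Convolve the profile with the window; rows outside the page count as 0."""
    half = len(window) // 2
    smoothed = []
    for i in range(len(profile)):
        acc = 0.0
        for k, weight in enumerate(window):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += weight * profile[j]
        smoothed.append(acc)
    return smoothed

def find_text_lines(profile, threshold):
    """Return (first_row, last_row) for every run of rows above the threshold."""
    lines, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            lines.append((start, i - 1))
            start = None
    if start is not None:
        lines.append((start, len(profile) - 1))
    return lines

def make_demo_page():
    """Hypothetical 20x10 page: two text lines, the second with one empty row."""
    page = [[0] * 10 for _ in range(20)]
    for r in (2, 3, 4, 5, 12, 13, 15, 16):
        page[r] = [1] * 10
    return page
```

On the hypothetical demo page, thresholding the raw profile splits the second text line at its empty row, whereas thresholding the smoothed profile keeps it as one line. This is the kind of effect smoothing is meant to provide.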
Goswami et al. [22] presented a multi-lingual text detector using Faster Region-based CNN (Faster R-CNN) to detect English, Hindi, and Gujarati text in images. However, Faster R-CNN is a predecessor of the SSD and Yolo models. In addition, their approach considers text as one class and does not further classify the text into its respective languages. Khan and Mollah [23] used a deep learning model to localize multiple languages in images. However, their approach only localizes text and cannot classify the text into its respective languages.
It can be concluded that research on multi-script detection for cursive scripts (such as Arabic and Pashto) is limited. Therefore, our research on multi-script detection per document image is important and helps to fill this gap.

III. DATASET CREATION
Data plays an important role in the evaluation of any technique. It becomes more crucial for a task like layout analysis and classification when appropriate data is not available. Thus, this research also contributes a new dataset as a benchmark for examining layout analysis in mixed languages (e.g., Arabic and Pashto), as shown in Figure 3. We call the new dataset Camera Captured Pashto Text Imagebase (CCPTI). The dataset has been created from real-world data that covers the real challenges of layout analysis and classification. It contains real-world samples of camera-captured images and thus presents an appropriate benchmark for a Pashto DIA system. This research therefore not only contributes to the scientific domain but also helps the research community working on regional languages. On the other hand, real data presents more challenges due to the different stages involved in the process. The next section explains the acquisition and creation process of the new dataset.

A. DATA ACQUISITION
The images were acquired via a handheld camera. We chose two books (Tafseer-ul-Quran and Meshkat-sharif) that contain plenty of pages. However, we captured only those pages where both Pashto and Arabic text were notable. Additionally, while capturing the images, care was taken to avoid skew and perspective distortion. However, blurriness and shadow effects are present in the acquired images due to varying lighting conditions.

B. DATASET ANNOTATION AND DESCRIPTION
In supervised learning, data must be transcribed or annotated with suitable labels. In our case, it is important to label/transcribe the Arabic and Pashto text blocks separately.
135790 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
We use a tool created by MIT named LabelMe (footnote 2) for the annotation. Each textual block is annotated by tracing its contour/edges with a polygon. The annotation for each image is stored in a separate .json file. The prefix of the image filename is the same as that of the annotation (json) file. A detailed description of the dataset is given in Table 1. In pre-processing, we did nothing except set the width of each image to 600 pixels while keeping the aspect ratio locked. In this way, each image has a width of 600 pixels, while its height varies.
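The pre-processing and annotation handling described above can be sketched as follows. This is a minimal sketch under stated assumptions: the JSON key names (`imageWidth`, `imageHeight`, `shapes`, `label`, `points`) and the label strings are hypothetical and are not confirmed by the dataset description, and the polygons are assumed to have been drawn on the full-resolution image before resizing.

```python
def resized_size(width, height, target_width=600):
    """New (width, height, scale) when the width is fixed and the aspect ratio is kept."""
    scale = target_width / width
    return target_width, round(height * scale), scale

def polygon_to_bbox(points, scale=1.0):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a polygon."""
    xs = [x * scale for x, _ in points]
    ys = [y * scale for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

def load_blocks(annotation, target_width=600):
    """Turn one parsed annotation (assumed key layout) into (label, bbox) pairs
    expressed in the coordinates of the resized image."""
    _, _, scale = resized_size(
        annotation["imageWidth"], annotation["imageHeight"], target_width
    )
    return [
        (shape["label"], polygon_to_bbox(shape["points"], scale))
        for shape in annotation["shapes"]
    ]

# Hypothetical annotation for a 1200x1600 photograph with one Pashto block.
example = {
    "imageWidth": 1200,
    "imageHeight": 1600,
    "shapes": [
        {"label": "pashto",
         "points": [[100, 200], [500, 210], [490, 400], [110, 390]]},
    ],
}
blocks = load_blocks(example)
```

Converting polygons to enclosing boxes is one possible way to feed polygon annotations to detectors that expect bounding boxes. Whether the dataset was converted this way is not stated.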

IV. PROPOSED METHODOLOGY
To accomplish the objectives and to examine the layout analysis phase of DIA for cursive script languages, we used Convolutional Neural Networks (CNN) with a deep learning approach. Our proposed models are the Single-Shot Detector (SSD) [24], Yolov5 (footnote 3) and Yolov7 [25]. All the models are designed for object detection and provide bounding boxes. We extend these models to exploit the basic structure of Pashto and Arabic text by treating the textual blocks as objects. The subsequent sections explain the architecture of the proposed models in more detail.

A. SSD MODEL
Unlike its predecessors R-CNN and Faster R-CNN [26], SSD does not use a region proposal network; instead, it predicts the bounding boxes and classes directly from the feature maps in a single pass. That is why it is known for its speed and performance on object detection problems. An important and appealing aspect of the SSD model is its ability to handle objects of different scales and aspect ratios. In our case, the textual blocks vary in size and have different height-to-width ratios. This feature of SSD makes it suitable for our problem. The architecture of an SSD model consists of two main components: (1) the SSD backbone and (2) the SSD head. Here, the backbone is the Visual Geometry Group (VGG-16) network, used as a feature extractor [27] (similar to the CNN in Faster R-CNN) [24], [26]. The SSD head usually contains one or more convolutional layers. The main function of the SSD head is to interpret the output of the SSD backbone as bounding boxes and object classes at their respective spatial locations via the activations of the final layers. Figure 5 shows a typical SSD model.
2 http://labelme.csail.mit.edu/Release3.0/
3 https://github.com/ultralytics/yolov5
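To make the handling of different scales and aspect ratios more concrete, the following minimal sketch generates SSD-style default boxes. It follows the scale rule given in the SSD paper [24], where the scale grows linearly from s_min to s_max across the feature maps and each box has width s*sqrt(a) and height s/sqrt(a) for aspect ratio a. The grid size and the set of aspect ratios in the example are illustrative assumptions, not the configuration used in our experiments.

```python
import math

def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (relative to image size)."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) centred on every cell of a square feature map.

    All values are normalized to [0, 1]; the aspect ratio is width / height.
    """
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((cx, cy, width, height))
    return boxes

scales = layer_scales(6)               # one scale per feature map
coarse = default_boxes(3, scales[-1])  # few large boxes on a coarse map
fine = default_boxes(38, scales[0])    # many small boxes on a fine map
```

Wide boxes (aspect ratio above 1) are a natural match for text lines and headings, while fine feature maps with small scales cover short text blocks.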

B. YOLOV5
Yolov5 (Yolo being an acronym for 'You Only Look Once') is a complete object detection framework that divides images into a grid system. Each cell in the grid is responsible for detecting objects within itself.
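The grid idea can be illustrated with a minimal sketch that finds the cell containing the centre of a text block. The function name and the grid size are hypothetical, and the actual target assignment in Yolov5 is more elaborate (it works with anchors at several scales), so this only mirrors the simplified description above.

```python
def responsible_cell(box, image_width, image_height, grid_size):
    """Return (row, col) of the grid cell that contains the centre of the box.

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2 / image_width   # normalized centre, 0..1
    cy = (y_min + y_max) / 2 / image_height
    # Clamp so that a centre lying exactly on the right/bottom edge stays inside.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

# A text block in a 416x416 image divided into a 13x13 grid (illustrative values).
cell = responsible_cell((100, 50, 300, 150), 416, 416, 13)
```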

C. YOLOV7
Yolov7 is a recent object detection deep learning model that is reported to surpass the earlier Yolo variants in accuracy and speed. The Yolov7 model is based on the Efficient Layer Aggregation Network (ELAN) and proposes the Extended ELAN (E-ELAN) architecture, which improves accuracy and speed [25].

D. EVALUATION CRITERIA
We use two metrics to evaluate our proposed models. The first is mean average precision (mAP) [28], shown in Eq. 1, where avP(q) is the average precision (AP) for a given query q and Q is the total number of queries.

\[ \mathrm{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{avP}(q) \tag{1} \]

The second metric is intersection over union (IoU) [28], shown in Eq. 2.

\[ \mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}} \tag{2} \]

V. EXPERIMENTAL SETUP
The experiments were carried out by splitting the dataset into a train set and a test set. We used the hold-out method, in which nearly 70% of the data goes to the train set and the remaining 30% goes to the test set. In the training phase, the train set was evaluated, and the loss and classification error were monitored. Once the error converged, we stopped the training and the model was saved as a checkpoint. Meanwhile, the test set was also evaluated every 100 epochs. We trained three deep learning models, namely SSD, Yolov5 and Yolov7.
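The two metrics and the hold-out split described above can be sketched as follows. This is a minimal sketch, not the evaluation code used in the experiments: the AP values are assumed to be already computed (for example, one per class), the boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, and the random seed and function names are hypothetical.

```python
import random

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_average_precision(ap_values):
    """Mean of the per-query average precision values avP(q)."""
    return sum(ap_values) / len(ap_values)

def holdout_split(items, test_fraction=0.3, seed=0):
    """Shuffle once and return (train_items, test_items)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 1080 images split roughly 70/30, as in the hold-out method above.
train_ids, test_ids = holdout_split(range(1080))
```

With 1080 images and a test fraction of 0.3, this yields 756 training images and 324 test images.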

A. EXPERIMENT ON SSD MODEL
The whole experimental process was carried out via the Object Detection API [29], [30]. It is a customizable tool that lets users create their own powerful image classifiers. Further, the frozen model was used for the evaluation of individual images as well as the entire test set.
The training was done on Google Colab. The training and evaluation process was also monitored by running TensorBoard on our machine. The training ran for over 20k steps with a batch size of 24 and achieved good results in about 8 hr 30s, at 21210 epochs.

B. EXPERIMENT ON YOLOV5 MODEL
We fine-tuned 'yolov5s.pt' on our dataset with an image size of 416, a batch size of 16, and the number of epochs set to 500. Training stopped automatically after 247 epochs because the accuracy had not improved during the last 100 epochs. The best accuracy was obtained at epoch 147.
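The automatic stop described above is consistent with a patience-based early-stopping rule. The following is a minimal sketch of such a rule, assuming a patience of 100 epochs; the function name, the monitored metric and the epoch indexing are assumptions and do not reproduce the training script itself.

```python
def run_with_patience(scores, patience):
    """Simulate early stopping on a sequence of per-epoch validation scores.

    scores[i] is the score after epoch i + 1. Training stops as soon as
    `patience` epochs have passed without a new best score.
    Returns (stop_epoch, best_epoch).
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        if epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(scores), best_epoch

# Hypothetical score curve: improves until epoch 147, then stays below the best.
curve = [e / 1000 for e in range(1, 148)] + [0.1] * 353
stop_epoch, best_epoch = run_with_patience(curve, patience=100)
```

For this hypothetical curve, the best epoch is 147 and training stops at epoch 247, well before the configured maximum of 500 epochs.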

C. EXPERIMENT ON YOLOV7 MODEL
We fine-tuned 'yolov7.pt' on our dataset with an image size of 640, a batch size of 8, and the number of epochs set to 200. The best accuracy was obtained at epoch 62.

VI. RESULTS AND DISCUSSIONS
For SSD, we stopped the training after 21210 epochs, and the final model was selected for evaluation. The evaluation was done on the test set, and the proposed model achieved a mean average precision (mAP) of 84.58%. The same procedure was carried out to complete the training process for Yolov5 and Yolov7. The test results show that Yolov7 outperforms the SSD and Yolov5 models. TABLE 2 shows the comparison of the SSD, Yolov5 and Yolov7 models in terms of mAP.
Further, to check the effectiveness of the proposed work, a comparative experiment was done with the Detectron2 [31] model. The Detectron2 model is known for its use on the Facebook platform and for its strong performance on the PubLayNet dataset [32], [33]. Detectron2 was fine-tuned on our dataset for 3000 epochs, and the final model was then evaluated on the test set. The overall performance of the models that we evaluated, plus the comparison with the Detectron2 model, is given in TABLE 2.
Similarly, the training processes were examined for how they reduce the total loss for the SSD, Yolov5 and Yolov7 models.

A. DISCUSSION
After a detailed analysis of the results as well as of individual images, we arrived at the findings explained in the following sections. Before assessing the results empirically, Figure 7 shows some predictions of our proposed models on completely unseen data. To understand the visual representation of our examples, note the following conventions. The blue, yellow and green bounding boxes represent ''Pashto text'', and the red, magenta, aqua and light green bounding boxes represent ''Arabic text'', while GT denotes the ground truth bounding boxes.

B. OBSERVATION NO 1
The first row in Figure 7 represents the case where the SSD, Yolov5 and Yolov7 models show similar performance and produce equal detection results.

C. OBSERVATION NO 2
The middle row in Figure 7 represents images with moderate complexity. Visually, all three models give acceptable detection of textual blocks. However, in terms of bounding boxes, some large text with smaller bounding boxes was skipped, although each such small bounding box was enclosed in a larger textual block. Apparently, this should not count as an error, but with respect to the IoU metric it increases the error. Apart from these errors, all three models performed well and equally.
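The effect on the IoU metric can be seen from a short worked case. It is a sketch with hypothetical box sizes, assuming that a small ground truth box B_s lies entirely inside a larger detected block B_l, so that the overlap equals the small box and the union equals the large one.

```latex
\[
B_s \subseteq B_l
\;\Longrightarrow\;
\mathrm{IoU}(B_s, B_l)
= \frac{|B_s \cap B_l|}{|B_s \cup B_l|}
= \frac{|B_s|}{|B_l|}
\]
\[
\text{e.g. } |B_s| = 100 \times 40 = 4000,\quad
|B_l| = 400 \times 200 = 80000
\;\Longrightarrow\;
\mathrm{IoU} = \frac{4000}{80000} = 0.05
\]
```

Such a value is far below commonly used matching thresholds, so the enclosed block is counted as missed even though its text is visually covered by the larger box.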

D. OBSERVATION NO 3
The last row in Figure 7 shows some images with fewer detections and misclassifications for the SSD model, whereas Yolov5 and Yolov7 show better results than SSD. Visual inspection indicates that SSD was not very effective at classifying textual blocks that are enclosed in smaller bounding boxes. On the other hand, Yolov5 and Yolov7 performed better in classifying textual blocks with small bounding boxes.

E. OBSERVATION NO 4
With respect to training time, Yolov7 produces better results than the Yolov5 and SSD models. Yolov7 converges and produces its best results at epoch 62, Yolov5 at epoch 147, and SSD at 21210. This is due to the fact that the Yolov7 architecture has been reformed with new modules such as the Extended Efficient Layer Aggregation Network (E-ELAN).

VII. CONCLUSION AND FUTURE WORK
This work presents, for the first time, a study of layout analysis and classification of Pashto document images. The research particularly examined the classification of Arabic text vs Pashto text in Pashto document images. This work contributes mainly in two aspects. First, we have created a new dataset that contains real Pashto document images. The images were acquired via a handheld camera. The dataset will be a significant resource for the research community for analysing the DIA domain in cursive scripts.
The second contribution is the application of deep learning-based methods to examine how we can detect/classify Arabic text vs Pashto text in a single document image. We have chosen the SSD, Yolov5 and Yolov7 models. SSD is a hybrid model that uses VGG16 as its convolutional layers together with a neural network for learning highly distinctive features. On the other hand, Yolov5 and Yolov7 are known for their light weight, speed and accuracy. Yolov7 extends the Efficient Layer Aggregation Network into E-ELAN, which increases the accuracy and speed of Yolov7 compared with the other Yolo versions. We achieved an mAP of 84.51% on the SSD model, 88.50% on Yolov5, and 91.30% on Yolov7. The results show that Yolov7 outperforms Yolov5 and SSD. In addition to the deep learning models SSD, Yolov5, and Yolov7, a comparison was also done with the Detectron2 model by fine-tuning the already trained model on our train set. However, evaluation on the test set gives an mAP of 83.61%, which is lower than that of Yolov5 and Yolov7.
In future, we would like to extend the dataset with more layouts that occur in Pashto-language content. We could also use CNN-based architectures that are more sophisticated than SSD, Yolov5 and Yolov7.

FIGURE 1. A generic view of a DIA system. The stage with a red rectangle is the focus of this article.

1) Creation of a new dataset based on camera-captured images of Pashto documents.
2) Fine-tuning of CNN-based deep learning models, including SSD, Yolov5 and Yolov7, for baseline evaluation.
The rest of the paper is organized as follows. Section II reviews the related work. Sections III and IV introduce the dataset creation and the proposed methodology. Sections V and VI present the experimental setup and discuss the results in detail. Finally, Section VII discusses the conclusion and future work.

FIGURE 2. Sample images with different layouts from the dataset with their ground truth annotation, where Pashto text is enclosed in a blue bounding box and Arabic text is enclosed in a light green bounding box in a Pashto document.

FIGURE 3. Samples of camera-captured images containing Pashto and Arabic text-blocks.

Figure 4 shows two samples from the CCPTI dataset along with their ground truth illustration.

FIGURE 4. Annotated samples of camera-captured images containing Pashto and Arabic text-blocks.

FIGURE 6. The loss was gradually reduced for SSD, Yolov5 and Yolov7.

Figure 6 depicts the loss value at each epoch of the training phase for the respective models.

FIGURE 7. Visual results for SSD, Yolov5 and Yolov7 compared to the original input image along with ground truth.

TABLE 1. Sources of acquired books for our newly created CCPTI dataset.