Handwritten Urdu Characters and Digits Recognition Using Transfer Learning and Augmentation With AlexNet

Automated recognition of handwritten characters and digits is a challenging task. Although a significant amount of literature exists on automatic recognition of handwritten characters in English and other major world languages, a wide research gap remains for the Urdu language. Variations in writing style, in the shape and size of individual characters, and similarities with other characters add to the complexity of accurate classification of handwritten characters. Deep neural networks have emerged as a powerful technology for automated classification of character patterns and object images. Although deep networks provide remarkable results on large-scale datasets with millions of images, their use on small image datasets remains challenging. The purpose of this research is to present a classification framework for automatic recognition of handwritten Urdu characters and digits with high recognition accuracy by utilizing transfer learning and pre-trained Convolutional Neural Networks (CNN). The performance of transfer learning is evaluated in two ways: by using the pre-trained AlexNet CNN model with a Support Vector Machine (SVM) classifier, and by using a fine-tuned AlexNet for both feature extraction and classification. We fine-tune the AlexNet hyper-parameters to achieve higher accuracy, and data augmentation is performed to avoid over-fitting. Experimental results and quantitative comparisons demonstrate the effectiveness of the proposed research for recognition of handwritten characters and digits using the fine-tuned AlexNet. The proposed fine-tuned AlexNet outperforms the related state-of-the-art research, achieving classification accuracies of 97.08%, 98.21% and 94.92% for the Urdu characters, digits and hybrid datasets respectively.
The presented methods can be applied to research on Urdu characters and in diverse domains such as handwritten text image retrieval, reading postal addresses, bank cheque processing, and the preservation and digitization of historical manuscripts.

translating handwritten records into a digital format, Optical Character Recognition (OCR), Urdu machine transliteration, integration with other languages, image restoration, automatic reading of postal addresses and house numbers, and robotics [7], [10], [11], [12], [13]. One of the main issues in the classification of handwritten characters is the massive variety in handwriting styles across different people and distinct languages that the recognition system has to deal with. Variations in writing style, in the shape and size of individual characters, and similarities with other characters add to the complexity of handwritten recognition. Machine learning and deep neural network techniques have been widely used for automatic recognition of characters and digits of different languages, and in various classification-based problems [14], [15], [16], [17]. This research aims to apply a pre-trained Convolutional Neural Network (CNN) approach to the recognition of handwritten Urdu characters, since little work has been reported in the literature in this direction so far. The pioneer dataset used in this research was introduced in 2020 [1], in which the authors applied an unsupervised algorithm called an autoencoder, and a CNN, for recognition of Urdu handwritten characters. AlexNet is one of the simplest deep learning models and has shown commendable performance in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) over the past few years. The distinguishing characteristics of AlexNet compared to other deep learning models are: many more filters in each layer, pooling layers in addition to stacked convolutional layers, faster computing time and limited hardware dependency [18]. In this work, we propose two frameworks for classification of handwritten characters and digits using the pre-trained AlexNet neural network.
The performance of the proposed research is evaluated on a recently introduced dataset of handwritten Urdu characters and digits [19]. The main contributions of this research are as follows:
• First, we applied the pre-trained CNN AlexNet as the basic transfer learning model and used the extracted features to train an SVM classifier. We tested different transfer configurations to obtain the optimal classification performance for Urdu characters and digits recognition.

• Second, fine-tuning of the pre-trained CNN AlexNet hyper-parameters and data augmentation are applied, to prevent the network from merely memorizing the exact details of the training images and to avoid overfitting. Transfer learning is applied to transfer the layers to the new classification task.

• A quantitative comparison is presented between the classification performance of the pre-trained CNN AlexNet with the SVM classifier and that of transfer learning from the fine-tuned AlexNet for feature extraction and classification.

The rest of the article is organized as follows: a literature review covering the current state of the art is presented in Section II. Section III describes the architectural details of the pre-trained CNN AlexNet model and Section IV presents …

… character image, and achieves 97.1% accuracy while requiring only 3.3 megabytes of storage. Ahmad et al. [27] applied a Stacked Denoising Autoencoder … of 93% to 96%, which are better than previous Urdu OCR (Optical Character Recognition) systems [27]. However, the scope of the study in [27] … [37] proposed a method based on script spatial and temporal knowledge, and many others with comparatively low performance compared to recent trends in deep learning.

Deep learning has overtaken traditional machine learning as the method of choice for the majority of AI-related challenges over the past several years. The obvious reason is that deep learning has repeatedly shown better performance on a number of tasks, including speech, natural language, vision, and game playing.

Compared to traditional Machine Learning (ML) methods, deep learning approaches can be applied to a variety of domains and applications far more simply. First, with transfer learning, reusing pre-trained deep networks for various applications within the same domain is now efficient.

For instance, in computer vision, object recognition and segmentation networks frequently use feature-extraction front-ends that were trained as image classification networks. Using these pre-trained networks as front-ends facilitates training of the full model, and frequently leads to better performance in a shorter amount of time.

Additionally, deep learning's fundamental principles and methods are frequently extremely portable across fields. For instance, since the fundamental concepts are relatively similar, understanding how to apply deep networks to the field of natural language processing is not too difficult once the underlying deep learning theory for the domain of speech recognition is understood. This is not at all the case with conventional ML (Machine Learning), as feature engineering and domain- and application-specific ML techniques are needed to create high-performance ML models. Depending on the topic and application, the knowledge base of classical ML differs significantly and frequently necessitates in-depth specialist study in each field.

… to make them suitable for AlexNet, because the network processes 3-channel images. The proposed framework comprises 5 convolutional layers, followed by 3 Max Pooling (M-POOL) and ReLU layers. Consider an input image I of size W × H × C subjected to a convolutional layer CONV_i with square kernel size K and M output maps. Then N_i = WHM, P_i = K²CM and U_i = WHK²CM are the number of output units, weights (parameters) and connections respectively for CONV layers. For FC layers, N_i = WHM, P_i = K²H²CM and U_i = W²H²CM are the number of output units, weights (parameters) and connections respectively. For the CONV1 layer, there are 96 kernels (output channels), each of size 11 × 11 × 3; with stride 4, the input W and H shrink by a factor of 4. The convolution for the image I with coordinates (i, j) is defined as in Equation 1, where G is the feature map and F is the convolution filter [51]:

G(i, j) = Σ_u Σ_v F(u, v) · I(i − u, j − v)    (1)

After ReLU, pooling is applied to reduce the number of features, which reduces the size of the input passed to the next convolutional layer.
The activation by ReLU followed by pooling is repeated after every convolutional layer. For the CONV2 layer, there are 256 kernels (output channels), each of size 5 × 5 [52]. For the next two CONV layers, there are 384 kernels (output channels), each of size 3 × 3. The last CONV layer has 256 kernels (output channels), each of size 3 × 3. A summary of the units, weights and connections of the proposed method is also presented.
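As a sanity check of the counting formulas above, N_i, P_i and U_i can be computed for CONV1. The 55 × 55 output size used below is an assumption based on the standard AlexNet configuration (227 × 227 input, kernel 11, stride 4), not a value stated in the text:

```python
# Counting output units (N), weights (P) and connections (U) of a
# convolutional layer, following N = W*H*M, P = K^2*C*M, U = W*H*K^2*C*M,
# where W x H is the output map size, M the number of output maps,
# K the square kernel size and C the number of input channels.
def conv_layer_counts(W, H, M, K, C):
    N = W * H * M               # one output unit per position per map
    P = K * K * C * M           # one K x K x C kernel per output map
    U = W * H * K * K * C * M   # each output unit sees K*K*C inputs
    return N, P, U

# CONV1 of AlexNet: 96 kernels of size 11 x 11 x 3; with a 227 x 227 input
# and stride 4 the output feature map is 55 x 55 (assumed standard config).
N1, P1, U1 = conv_layer_counts(W=55, H=55, M=96, K=11, C=3)
print(N1, P1, U1)  # 290400 34848 105415200
```

Note that biases are excluded here; including them would add M (96) parameters per layer.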

The main objective of this study was to analyze the performance of the pre-trained CNN AlexNet for handwritten Urdu digits and characters recognition. We have followed two approaches for our system, as shown in Figure 3. … Among such classifiers, the SVM is the most efficient and popular, and is widely used.
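As an illustrative sketch of the SVM classification step (using scikit-learn's `SVC` on a tiny synthetic 2-D dataset rather than actual AlexNet features), the classifier fits a hyperplane that separates the classes:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data standing in for deep features:
# class 0 clusters near (0, 0), class 1 near (4, 4).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
X1 = rng.normal(loc=4.0, scale=0.5, size=(20, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

# A linear-kernel SVM fits the maximum-margin separating hyperplane.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[0.2, -0.1], [3.8, 4.1]]))  # [0 1]
print(clf.score(X, y))  # 1.0 on this cleanly separable toy set
```

With high-dimensional features extracted from a CNN layer, the same `fit`/`predict` interface applies unchanged; only the feature matrix X and the kernel choice differ.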

It is a supervised learning technique, mostly utilized for image classification, outlier detection and regression [53], [54]. It is a simple algorithm that creates a hyperplane to separate the data into a number of classes. … A drawback of this type of isolated learning is that it lacks memory: it does not store previous knowledge and apply it to future learning. As a result, a huge number of training instances are required in order to learn successfully. To save time and overcome the isolated learning problem, we used an approach based on transfer learning with a deep neural network model named AlexNet, together with data augmentation. AlexNet is a pre-trained model and is able to categorize objects into 1000 categories. The main reason to use data augmentation with AlexNet is to reduce overfitting and to manage the learning capacity of the neural network (the size of the NN).
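A minimal sketch of this kind of augmentation, using simple NumPy transforms on a dummy 28 × 28 character image. The specific transforms below are purely illustrative (flips and 90-degree rotations may not be label-preserving for script characters; the paper's actual augmentation settings are those listed in its hyper-parameter table):

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((28, 28))  # dummy grayscale character image

def augment(img, shift=2, rng=rng):
    """Return the original image plus four transformed variants."""
    out = [img]
    out.append(np.fliplr(img))                       # horizontal reflection
    out.append(np.roll(img, shift, axis=1))          # small horizontal shift
    out.append(np.roll(img, shift, axis=0))          # small vertical shift
    out.append(np.rot90(img, k=rng.integers(1, 4)))  # random 90-degree rotation
    return np.stack(out)

batch = augment(image)
print(batch.shape)  # (5, 28, 28)
```

Each original training image thus yields several variants, enlarging the effective training set without collecting new samples.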

In the transfer learning approach, the basic objective is to learn the conditional probability distribution in the new or target domain using the knowledge learned from the old or source domain and the old or source task [55]. Formally, a domain D = {F, P(X)} consists of a feature space F and a marginal probability distribution P(X), and a task T = {L, p(·)} consists of a label space L and a related predictive function p(·), which can be learned from the source domain (training data). The predictive function of the task T can be written as [56] and [55]:

p(·) = P(Y | X)

where L is the label space, p(·) is the predictive function, and F is the feature space.
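The same setting can be written compactly in the notation common to the transfer learning literature (the source/target subscripts S and T below follow that convention and are not taken verbatim from this paper):

```latex
% Domain: feature space plus marginal distribution over inputs
\mathcal{D} = \{\mathcal{F},\, P(X)\}, \qquad X = \{x_1, \dots, x_n\} \subset \mathcal{F}

% Task: label space plus predictive function
\mathcal{T} = \{\mathcal{L},\, p(\cdot)\}, \qquad p(\cdot) = P(Y \mid X)

% Transfer learning: improve the target predictive function p_T(\cdot)
% in \mathcal{D}_T using knowledge from \mathcal{D}_S and \mathcal{T}_S,
% where \mathcal{D}_S \neq \mathcal{D}_T \ \text{or}\ \mathcal{T}_S \neq \mathcal{T}_T
```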

The images of the Urdu Characters, Digits and Hybrid datasets are divided in a 70:30 ratio for training and testing purposes. Conversion from gray scale to red, green and blue channels is performed as preprocessing. Then the images are passed to the AlexNet model for feature extraction. We have extracted features from CONV5 (the fifth convolutional layer) …

We have changed the fully connected layers to the same size as the number of classes in the dataset: for example, we set it to 40 for Urdu characters, 10 for digits and 50 for the hybrid dataset. For the fully connected layers, the values of ''WeightLearnRateFactor'' and ''BiasLearnRateFactor'' are increased so that the new layers learn faster than the transferred layers. The hyper-parameters tuned during the research are shown in Table 2.

The UHaT dataset contains handwritten characters and digits of the Urdu language. There are 40 classes of characters and 10 classes of digits. The dataset was created by Hazrat Ali and is written by more than 900 individuals [19]. The resolution of all images is 28 × 28, and the dataset is publicly available. Class representatives from the digits and characters categories are shown in Figure 9.

Recall, also referred to as sensitivity, is calculated as Recall = TP / (TP + FN). … Table 3. …

Table 5 shows the classification accuracy comparison of the proposed research with the traditional classification models Logistic Regression, the KNN classifier, a Neural Network and SVM for the digits category. The experimental results demonstrate that the proposed approach AlexFT outperforms these methods, achieving 12.21%, 6.23%, 6.12% and 2.42% higher recognition accuracy than Logistic Regression, the KNN classifier, the Neural Network and SVM respectively.
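A minimal sketch of the gray-to-RGB preprocessing step described above, using NumPy only. The nearest-neighbor resize to 227 × 227 is an assumption about the exact input size expected by the AlexNet variant used; channel replication stands in for the gray-to-RGB conversion:

```python
import numpy as np

def gray_to_alexnet_input(img, size=227):
    """Replicate a 2-D grayscale image into 3 channels and resize it.

    AlexNet processes 3-channel inputs, so the single gray channel is
    copied to R, G and B; nearest-neighbor resizing maps 28x28 to 227x227.
    """
    h, w = img.shape
    rows = np.arange(size) * h // size  # nearest-neighbor row indices
    cols = np.arange(size) * w // size  # nearest-neighbor column indices
    resized = img[np.ix_(rows, cols)]
    return np.stack([resized] * 3, axis=-1)  # H x W x 3

img = np.random.default_rng(1).random((28, 28))  # dummy 28 x 28 sample
x = gray_to_alexnet_input(img)
print(x.shape)                               # (227, 227, 3)
print(np.array_equal(x[..., 0], x[..., 1]))  # True: channels are identical
```

In practice a library resize (e.g. with anti-aliasing) would be preferable; the point here is only the channel replication and the target shape.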
AlexFT achieves the highest recognition accuracy compared to the state-of-the-art algorithms CNN [19] and Autoencoder [19], providing 1.51% and 0.91% higher classification performance respectively. The average precision, recall, F-score and error rate obtained for AlexFT on the digits dataset are 98.23%, 98.21%, 98.22% and 1.79% respectively. The comparison of the proposed approaches demonstrates that AlexFT achieves state-of-the-art performance compared to the deep learning algorithms and the proposed AlexSVM.
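Aggregate metrics of this kind can be derived directly from a confusion matrix. The sketch below uses a small made-up 3-class example and macro averaging; the paper's own averaging convention is not stated here, so macro averaging is an assumption:

```python
import numpy as np

# Rows = true class, columns = predicted class (made-up 3-class counts).
cm = np.array([
    [50,  2,  3],
    [ 4, 45,  1],
    [ 1,  2, 47],
])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # per-class TP / (TP + FP)
recall = tp / cm.sum(axis=1)      # per-class TP / (TP + FN), i.e. sensitivity
f_score = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()
error_rate = 1.0 - accuracy

print(round(precision.mean(), 4))  # macro-averaged precision
print(round(recall.mean(), 4))    # macro-averaged recall
print(round(error_rate, 4))
```

The diagonal holds the correct classifications; column sums give per-class predicted totals (for precision) and row sums give per-class true totals (for recall).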

The confusion matrices obtained using the four different settings with AlexSVM, and with AlexFT using fine-tuned hyper-parameters and data augmentation, for Urdu digits are shown in Figure 12 (a-e) respectively. The rows in the … Table 6. It can be seen that among the different settings …

The quantitative results for the hybrid dataset are presented in Table 7. It can be evidently seen that AlexSVM (FC6) outperforms AlexSVM (CL5), AlexSVM (FC7) and AlexSVM (FC8). However, the best classification performance is obtained by the proposed AlexFT, which outperforms the state-of-the-art methods, i.e. the Autoencoder and CNN, by 12.92% and 12.12% respectively. It can be safely concluded that both of the proposed methods, i.e. AlexSVM (FC6) and AlexFT, outperform the state-of-the-art results, with AlexFT being the best approach in terms of recognition accuracy. The average precision, recall, F-score and error rate obtained for AlexFT on the hybrid dataset are 94.91%, 94.88%, 94.75% and 5.08% respectively.

VOLUME 10, 2022