Comparative Study on Deep Convolution Neural Networks DCNN-Based Offline Arabic Handwriting Recognition

Recently, deep learning techniques have demonstrated efficiency in building better-performing machine learning models, which are required in the field of offline Arabic handwriting recognition. Ancient civilizations left valuable handwritten manuscripts that need to be documented digitally. Compared with Latin, isolated Arabic character recognition is much more challenging due to the similarity between characters and the variability of writing styles. This paper proposes a multi-stage cascading system to serve the field of offline Arabic handwriting recognition. The approach starts with applying the Hierarchical Agglomerative Clustering (HAC) technique to split the database into partially inter-related clusters. The inter-relations between the constructed clusters support representing the database as a big search tree model and help to attain a reduced complexity in matching each test image with a cluster. Cluster members are then ranked based on our new proposed ranking algorithm. This ranking algorithm starts by computing the Pyramid Histogram of Oriented Gradients (PHoG), followed by measuring divergence with the Kullback-Leibler method. Finally, the classification process is applied only to the highly ranked matching classes. A comparative study is made to assess the effect of six different deep Convolution Neural Networks (DCNNs) on the final recognition rates of the proposed system. Experiments are done using the IFN/ENIT Arabic database. The proposed clustering and ranking stages lead to using only 11% of the whole database in classifying test images. Accordingly, lower computational complexity and better classification results are achieved compared to recent existing systems.


I. INTRODUCTION
Psychological science [1] highlighted the importance of using handwriting in our personal life for retaining more information and improving personal performance and comprehension. Handwriting helps us to re-frame topics in our own words. This re-framing process triggers parts of our brains that are not activated by typing verbatim on digital devices. One of the most recent challenging tracks of computer vision is automatic Arabic handwriting recognition. Although it has been a pervasive research field for a long time, a lot of effort is still needed to achieve better recognition accuracy and fast response time. This field is important to represent handwritten text in a digitized symbolic form.
The associate editor coordinating the review of this manuscript and approving it for publication was Xin Luo.
In spite of all the technological progress, we can never deny that ancient civilizations made significant contributions and presented valuable handwritten manuscripts that need to be documented digitally. Automated handwriting recognition aims to preserve old manuscripts and to create electronic libraries of digitized handwritten documents. It is useful in many other areas like forgery detection and signature analysis. Much effort and time are saved when a handwritten text is automatically recognized and digitized.
Formerly, authors categorized the handwriting recognition process into online and offline processes [2]. The online recognition process uses more information than the offline recognition process, thus achieving higher accuracy levels. Online recognition is based on the sequential order of writing and the instant temporal information, while offline recognition is based solely on images and pixel information and, according to the literature, is more challenging. This paper is concerned only with offline recognition.
Arabic is one of the major worldwide languages used in documenting sources [3], and it poses many challenges due to different handwriting styles and changeable letter shapes corresponding to different spatial localities. Around 27 languages use the Arabic alphabet. Arabic is the native language of more than 420 million people around the world, making it the sixth most widely spoken language [4]. It is a cursive language and has 28 different characters. Characters are not represented in lower or upper cases, but each character has four different shapes according to its position in the word: isolated, start, middle, and end. Arabic script is written from right to left. Each word may consist of several subwords called Pieces of Arabic Words (PAWs). A PAW can be a single letter or a set of connected letters. Among the peculiar features of the Arabic language is that some characters do not join to their left neighbors, as shown in Figure 1a. One stand-alone character does not join to left or right neighbors, as shown in Figure 1b. Some characters are very similar and can only be distinguished by the number and positions of their dots, as shown in Figure 1c. Some diacritical marks can be placed above or below particular characters and change the sound and/or the meaning of the character.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Automatic recognition of Arabic handwriting requires building a hybrid recognition system, computing different types of features, and applying various classifiers together to achieve improved performance. Different classifiers [5], [6] were applied in a parallel or cascading manner to achieve high automatic recognition rates. Data-mining raw databases [7] is also essential for more meaningful data representation, reduced testing time, and consequently higher recognition rates.
Offline Arabic handwriting recognition approaches are categorized into two main types. The first type represents the traditional approaches that are based on extracting features, while the second type represents the deep learning-based approaches that automatically extract features from raw images. The traditional approaches apply three main consecutive steps: preprocessing, feature extraction, and finally classification.
Recently, deep learning techniques have demonstrated better recognition performance than hand-crafted feature detectors in computer vision and in building recognition systems. The deep convolutional neural network (DCNN) is one of the most prominent deep learning methods. It is a feed-forward network that can extract valuable topological features from raw images. Instead of extracting the traditional features, neural networks are capable of extracting features from large-scale raw data [8]. Recent surveys stated the effectiveness of deep learning techniques [9]-[11] in the field of character recognition. Different architectures have been proposed in the field of deep Convolution Neural Networks (DCNN) [12]-[15]. Many research efforts were directed to designing new effective architectures for feature extraction and classification.
We propose a state-of-the-art solution for holistic recognition of Arabic handwritten words. This approach is composed of three consecutive stages: matching, ranking, and classifying. The first main contribution is representing the IFN/ENIT database as a big tree-like search model of inter-related clusters. Each cluster includes the database classes with similar regional and geometric features. This model aims to reduce the matching process complexity. The matching process relates each test image with one of the constructed clusters.
The second main contribution is proposing a new ranking approach that is applied before classification with CNN for reducing computational complexity. The ranking stage sorts the matching cluster's members from the nearest to the furthest relative to the test image. Accordingly, only a set of high-rank classes passes for final classification. Finally, we investigate the effect of the most popular DCNN architectures in the final classifying stage in terms of accuracy and different complexity measures. The proposed approach achieved the highest accuracy among all the previously proposed holistic approaches.
The efficiency of deep CNNs requires a lot of training samples, which increases the computational complexity. The proposed multi-stage approach aims to compensate for this increased complexity by reducing the number of database classes participating in the final classification stage. Different DCNN architectures were recently introduced, each suggesting a different innovative idea to improve learning efficiency. The architectures differ in the number and design of the convolutional layers. Some of these architectures were applied before for text recognition, while others were not. It was challenging to apply the different architectures to our research problem and to investigate their effect on the final classification stage of our proposed approach.
Section II summarizes previous related work. The section includes an analysis of offline Arabic handwriting recognition systems. Section III lists our contribution in points and introduces our proposed approach. Experimental results are presented in section IV. Complexity analysis is presented in section V. Finally, conclusion and future work are presented in sections VI and VII, respectively.

A. CONCEPTS AND ANALYSIS
In the field of automatic handwriting recognition, applying a combination of hybrid cooperative classifiers is a thought-provoking way to enhance the overall system performance [8]. There are two main categories of combination techniques: feature fusion and decision fusion. Feature fusion is based on combining different features into one feature vector followed by applying one classifier. Some approaches [16] emphasized the effectiveness of this early integration method. Decision fusion techniques integrate different classifiers' decisions; each classifier is trained on a different set of features. More recent approaches [8], [17], [18] stated that this high-level combination strategy improves the performance of handwriting recognition systems.
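The difference between the two combination strategies can be illustrated with a minimal sketch. The function names and the score-averaging rule are assumptions made for illustration; the cited systems may combine features or decisions differently.

```python
def feature_fusion(feature_sets):
    """Early integration: concatenate several feature vectors into one
    vector that a single classifier would then consume."""
    fused = []
    for features in feature_sets:
        fused.extend(features)
    return fused


def decision_fusion(score_lists):
    """Late integration: average the per-class scores produced by several
    classifiers (an assumed fusion rule) and return the winning class index."""
    n_classes = len(score_lists[0])
    mean_scores = [
        sum(scores[c] for scores in score_lists) / len(score_lists)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=mean_scores.__getitem__)
```

In this sketch, feature fusion changes only the input of one classifier, whereas decision fusion leaves each classifier untouched and merges their outputs.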
One of the important concepts related to building robust recognition systems is feature localization. Different automatic Arabic recognition systems [8], [19]-[21] applied sliding window-based feature extraction techniques for localization. It was also stated [22] that applying multiple sliding windows achieved superior recognition rates. A special sort of localized features [23], called Pyramid Histogram of Oriented Gradients (PHoG), was applied to achieve localization on the ordinary HoG features.

B. CATEGORIZATION BASED ON SEGMENTATION APPROACHES
Some approaches [24] are based on segmenting words into letters. Other recent systems [5], [25] segment words into different categories of segments: core shapes, sub-core shapes, and diacritics. Still others [19], [26]-[28] recognize words in a holistic manner without any prior segmentation. This holistic way avoids recognition errors caused by wrong segmentation, but it requires robust features. The proposed holistic approaches [19], [26]-[28] stated that statistical and structural features achieved this required robustness. On the other hand, some proposed analytical approaches [19], [22], [26] are based on modeling each word by concatenating its character models.

C. DATABASES
The number of available Arabic databases is limited. Among the most well-known are HACDB, which is a character database, Ibn Sina dataset of Arabic manuscript, and IFN/ENIT database [29] of Arabic words (as described later in section IV).

E. INTEGRATION OF CLASSIFIERS
Different approaches [8], [34] applied a combination of multiple Bidirectional Long Short-Term Memory - Connectionist Temporal Classification (BLSTM-CTC) architectures. These deep neural networks were integrated together to build a more robust Arabic handwriting recognition system. The recurrent neural network (RNN) was also integrated with BLSTM [30] and showed effectiveness. A combination of multidimensional RNN (MDRNN) and Connectionist Temporal Classification (CTC) was proposed [40], and the error rate was 8.57% on IFN/ENIT. Another approach applied a multidirectional long short-term memory (MDLSTM)-based system [41] and achieved satisfactory results with a recognition rate of 89.9%. For overfitting avoidance, a combination of Deep Bidirectional LSTM (DBLSTM) and a special type of RNN was proposed [42].
Other approaches [35] integrated CNN and Support Vector Machine (SVM). SVM was applied with the RBF kernel for the final classification. CNN was also integrated with HMM in other approaches [43], [44]. CNN was applied for feature extraction and classification as well [45]. A combination of multiple residual networks (ResNet) [18] was applied for Arabic words recognition and achieved superior results that outperform the single residual network.
Different types of deep networks, like deep belief network (DBN) [48] and multi-column deep neural network (MCDNN) [49], were recently proposed and achieved satisfactory results but still need to be improved. In a comparison [50] between DBNs and Vertical-Horizontal HMM, the latter showed more feasibility when applied to the IFN/ENIT. Another approach [51] proved that the combination between a Convolutional DBN (CDBN) and Support Vector Machine (SVM) outperforms the DBN-based approaches.

F. POST-PROCESSING SUPPORT
It was observed that some proposed approaches achieved high recognition rates due to the support of the post-processing stage after classification. Some approaches [8], [34], applied the Connectionist Temporal Classification layer (CTC) for sequence labeling after classification. Others [5], [25], [30], achieved high recognition rates due to using a dictionary for post-processing.

G. CONCLUSIONS AND RESULTS
Different systems applied a combination of hybrid cooperative classifiers together. Such a combination has been described as thought-provoking as it enhances the overall system performance. Accordingly, our approach includes different classifiers that use a combination of different extracted features in a cascading fashion.
Based on the literature [35], pre-trained and fine-tuned CNNs were suggested for exploration. Applying a Support Vector Machine on features extracted by CNN [35] achieved a 92.95% recognition rate. These CNN-based features outperformed other SVM approaches applied on traditional features [7], [20], [38]. Accordingly, our approach studied the effect of SVM and CNN in the classification phase.
It was stated [19], [27], [28] that local densities are suitable for all cursive languages, and that statistical features are computed faster than other types of features. When MSHMM was trained on a set of statistical and structural features, [46] achieved a 91.1% recognition rate [26]; when trained on a combination of density-based and contour-based features, it achieved a 79.8% recognition rate.
The analytical approaches [26] such as HMMs outperformed holistic ones, but they have a large number of unstructured parameters and Viterbi algorithm is expensive in terms of time and memory. The HMM outperforms BLSTM [8], [24], [32], when applied on the distribution of concavity features due to character modeling, while the main advantage of the holistic approaches [19] is the absence of the pre-segmentation step. Relatively low recognition rates were achieved by holistic approaches.
K-means algorithm was applied for clustering [7] and achieved 85% recognition rate. The clustering process enhanced the final performance of SVM when applied on HoG features. On the other hand, PHoG [37] outperformed the ordinary HOG features. Based on the out-performance of the PHoG, it is employed in our approach and achieved satisfactory results.
The performance of traditional RNN was enhanced by applying different residual models. The temporal residual achieved higher accuracy than hybrid temporal and spatial residual models. On the other hand, the training process is accelerated by hybrid temporal and spatial residual models.

H. APPLICATIONS ON DEEP LEARNING AND WORD RECOGNITION
Some applications in the field of Arabic handwriting recognition were proposed recently. An approach [52] based on hybrid peer-to-peer (P2P), Grid computing, and agent technology was applied to distribute tasks among multiple peers and grids for tolerating several types of faults. The approach [52] used character segmentation and pattern matching, achieving fast execution time, and optimum speedup factor. Another approach [53] combined Histogram of Oriented Gradients (HOG) and Gray Level Run Length (GLRL) features, and applied decision fusion on scores for implementing a better writer identification system. Another application [54] was designed for the early detection of Parkinson's disease. Short time series were used to train long short-term memory (LSTM) neural networks. Other adaptive neural networks (NNs) [55] were designed for tracking and controlling non-linear systems. Another deep learning approach [56] diagnosed faults of electric motors.

III. THE PROPOSED APPROACH
This approach is composed of three consecutive stages: matching, ranking, and classifying, as shown in Figure 2. The output of each stage is used as an input to the next stage. The matching and ranking stages aim to pass a small set of training classes to the final classification stage. This aims to reduce the classification complexity and to increase the final recognition accuracy. The effect of our matching and ranking techniques on the performance of different popular DCNN architectures is studied in the final classification phase. In Figure 2, the labels m, L, and W represent the total number of database classes, the number of classes per cluster, and the number of high-rank classes that pass to the final classification stage, respectively. It is worth noting that L is much less than m, and W is much less than L.
The contribution of the proposed multi-stage approach is summarized in the following points:
1) Applying Hierarchical Agglomerative Clustering (HAC) on the IFN/ENIT database to construct small inter-related clusters. Inter-relations between the constructed clusters support representing the database as a big search tree during the training phase. This constructed tree-like search model is used to match each test image with a cluster. Accordingly, the matching process complexity becomes O(log(v)) rather than the complexity of the ordinary search, where v is the number of clusters.
2) Observing that the largest cluster includes 423 different classes, which is 49.2% of the database size. Only this matching set of database classes is involved in the next phases, instead of the whole database, to reduce computation complexity.
3) Applying the Random Forest feature selection technique to measure the importance of Dileep features [57], in order to adapt them to Arabic, rather than English, handwriting recognition.
4) Proposing a new ranking approach, detailed in section III-B. The effect of different parameters on the true positive rate (TPR) is recorded. The ranking is applied on the matching set per test image to vote for the best subset of the nearest database classes.
5) Observing that the correct class is always included in the first 100 ranked classes in our experiments. That is why we applied final classification on the highest 100 ranked classes in the matching set, which is 11% of the total database classes. This means a further reduction in computation complexity.
6) Improving final classification accuracy through multi-staging to around 95.6% using DCNN. This reduction in error rate is due to applying classification on only 11% of the whole database. This enhanced the final performance achieved by DCNN architectures, which is due to the effect of the clustering and ranking stages.
7) Comparing the effect of the most popular DCNN architectures in the field of Arabic handwriting recognition.
Before applying the matching stage, the data-mining technique is applied to the IFN/ENIT dataset to split the raw classes into clusters of similar classes, as detailed in section III-A. Based on the literature review, clustering was applied to enhance final recognition results, as reviewed in section II-G.

A. DATA-MINING PROCESS
The applied data-mining technique, shown in Figure 3, aims to represent the training section of IFN/ENIT as a big decision tree. Each node per tree-level includes a cluster of similar dataset classes. The similarity is determined based on a range of regional and geometric feature values [57].
Some constructed clusters intersect together by including some common classes. Some clusters are subsets of other bigger ones. That is why the dataset is represented as a tree-like model. This model relates smaller subsets to bigger supersets.
After building the training database model, each test image is matched against one of the constructed clusters. The Big O notation of this matching process is O(log(v)), where v is the number of clusters in the database. It is observed that the biggest cluster includes 423 different classes, i.e. the worst case is to pass only 49.2% of the database size to the second stage. Details of how the training tree-like model is built will be shown in the following three subsections. As shown in Figure 3, the database data-mining process starts by applying the preprocessing and the segmentation, as described in the next section III-A1.

1) PREPROCESSING AND SEGMENTATION
Passing testing and training database images through preprocessing and segmentation is essential for consistent post-analysis. This leads to better final recognition rates.
The database binary images are cropped and normalized to a one-pixel stroke width. This removes any variations due to using different writing tools or different handwriting styles. Image negatives are computed to concentrate computations on the word's pixels, as shown in Figure 4. After the preprocessing and segmentation process, a set of 13 regional and geometrical features is computed [57], as described in the next section III-A2.
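The cropping and negative steps can be sketched as follows. This is a minimal illustration that assumes a binary image stored as a list of rows with ink pixels equal to 0 and background equal to 1; the thinning step that produces the one-pixel stroke width is omitted, and both helper names are hypothetical.

```python
def crop_to_ink(image, ink=0):
    """Crop a binary image (list of rows) to the bounding box of ink pixels.
    Assumes ink pixels carry the value `ink` (0 here, i.e. black on white)."""
    rows = [r for r, row in enumerate(image) if ink in row]
    if not rows:
        return []  # no ink pixels: nothing to crop
    cols = [c for c in range(len(image[0]))
            if any(row[c] == ink for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def negative(image):
    """Invert a 0/1 image so that ink becomes 1 and background becomes 0."""
    return [[1 - pixel for pixel in row] for row in image]
```

After taking the negative, only word pixels are non-zero, so later sums and histograms ignore the background.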

2) REGIONAL AND GEOMETRIC (REG-GEO) FEATURES EXTRACTION
After computing the 13 regional and geometrical features [57], the Random Forest feature selection technique is applied to select the most effective subset of these 13 features. Feature selection is based on measuring feature importance. Importance measurement depends on how much the final recognition accuracy is affected by including the feature in the recognition process. Figure 5 shows the measured importance (y-axis) of the 13 features (x-axis). Accordingly, a set of 9 features is selected to be computed in this stage, numbered 4 and 6 to 13. The x-axis value 4 is for the 'Extent' regional feature, while the values from 6 to 13 are a set of geometric features [57]. These nine features have the highest computed importance levels. The other features with lower importance are excluded from our approach. Those excluded features, labeled 1, 2, 3, and 5, are the count of piece of Arabic words (PAWs), number of holes, eccentricity, and orientation with the x-axis, respectively.
Extent is a measure of the normalized word skeleton area. It depends on the word's area and its ratio to the area of its bounding ellipse, as shown in Figure 6a. Image contours are represented by the set of geometric features, numbered 6 to 13. Geometric features are computed in six different image partitions [57]. Partitioning images into six different zones aims to consider the position of line segments as a feature. Eight different concavity-based features are computed to represent the normalized length $\hat{L}$ and the number of occurrences $\hat{N}$ per line type in each partition. An example is shown in Figure 6b. Line types are horizontal, vertical, left diagonal, or right diagonal. The normalized length $\hat{L}$ and occurrence $\hat{N}$ of each line type are defined by equations (1) and (2), respectively.
The impact of feature values on the model's final accuracy is shown in Figure 7. The x-axis represents the number of constructed decision trees and the y-axis represents the overall percentage of matching error (ER). The figure summarizes the effect of the 13 features on ER, using the Random Forest feature selection technique. It is clear in Figure 7a that features labeled 1 (number of holes), 2 (PAWs), 3 (eccentricity), and 5 (word orientation) cause 21% misclassification error, while Figure 7b shows that feature 4 (Extent) reduces misclassification error exponentially from 21% to 9%. Adding the geometric features labeled 6-13 to feature 4 (Extent) further reduces misclassification error to 6%, as shown in Figure 7c. That is why only the chosen set of features, numbered 4 and 6 to 13, is considered in this stage.
The database classes are then clustered based on the selected set of features, as described in the next section III-A3.

3) HIERARCHICAL AGGLOMERATIVE CLUSTERING HAC
The clustering technique [58] is used to structure database classes as a tree-like model. There are two categories of hierarchical clustering algorithms: top-down and bottom-up. In bottom-up clustering, each data point is initially a single cluster. Single clusters are then merged iteratively. Agglomeration is an alternative name for the merge operation. Merge operations are performed consecutively, based on a distance measure between each pair of clusters, until all clusters are included in one whole cluster in a hierarchical manner. That is why this category of clustering is known as Hierarchical Agglomerative Clustering (HAC). The root of the tree is the entire set of database classes included in one unique cluster.
The first advantage of this clustering technique is that the number of clusters need not be known before building the tree-like model; it is determined while building the tree. The second advantage is that the hierarchical clustering criteria fit the hierarchical structure of the database classes. The main disadvantage is the cubic time complexity O(n^3), compared with linear clustering algorithms such as K-Means and GMM. Although linearity would seem an attractive property in K-Means and GMM, the number of clusters should be pre-determined and there should not be any intersections between the constructed clusters. That is why HAC is preferred and more applicable to our working conditions.
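The bottom-up merging described above can be sketched in a few lines. This is a naive illustration, not the authors' implementation: single linkage and Euclidean distance are assumptions, and `hac_merges` is a hypothetical helper that returns the sequence of merged clusters, so that smaller clusters appear as subsets of later, bigger ones.

```python
import math


def hac_merges(points):
    """Naive bottom-up agglomerative clustering (single linkage assumed).

    Starts with one cluster per point and repeatedly merges the two closest
    clusters until a single root cluster remains. Returns the merge history
    as (distance, sorted member indices) tuples, from leaves to root.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = sorted(clusters[x] + clusters[y])
        merges.append((d, merged))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges
```

Each pass compares all pairs of points across clusters, and there are n - 1 passes, which is where the cubic cost of the naive procedure comes from.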
Testing images are matched with one of the clusters after applying preprocessing, segmentation, and feature computation, as mentioned in sections III-A1 and III-A2. The matching process is performed based on satisfying the defined range of feature values. Each matching cluster includes classes with the same range of the selected regional and geometric features. Different clusters/sets may include common classes. Some smaller sets may be subsets of other bigger sets, which allows the matching process to be implemented as a binary search tree, reducing complexity to O(log(v)), where v is the number of clusters in the database. The test image is now ready to be matched with a set of similar database classes included in one of the clusters. The test image and its matching cluster are both ready for the ranking stage. Algorithm 1 represents the matching process and how the clusters are related to each other in the form of a tree. The inputs to this algorithm are the test image, the database clusters, and the max cluster size. The algorithm runs iteratively on all the clusters in ascending order to relate each cluster C with its super-cluster SC. By default, the database of all the clusters is a parent node to all the smaller clusters. Finally, the output of the algorithm is the cluster matching the input test image based on the computed nine Reg-Geo features.
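The tree-based matching idea can be sketched as a descent from the root cluster to the smallest sub-cluster whose feature ranges contain the test features. This is not Algorithm 1 itself: the `Cluster` structure, its fields, and the `match` function are hypothetical, and the logarithmic cost holds only if the tree is reasonably balanced.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    ranges: list      # one (low, high) pair per selected feature
    classes: set      # labels of the database classes in this cluster
    children: list = field(default_factory=list)  # sub-clusters (subsets)

    def contains(self, features):
        """True if every feature value lies inside this cluster's range."""
        return all(lo <= f <= hi for f, (lo, hi) in zip(features, self.ranges))


def match(root, features):
    """Descend from the root to the smallest cluster that fits the features."""
    node = root
    while True:
        for child in node.children:
            if child.contains(features):
                node = child
                break
        else:
            # No child accepts the features: the current node is the match.
            return node
```

A usage example with a single, made-up feature: a root covering (0.0, 1.0) with children covering (0.0, 0.5) and (0.5, 1.0) returns the second child for a test value of 0.7, and the classes of that child are then passed to the ranking stage.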
The next stage III-B describes how the classes of the matching cluster are ranked relative to the input test image.

B. STAGE 2: RANKING STAGE (PHoG AND KL-DIVERGENCE)
Our ranking approach starts by computing the pyramid histogram of oriented gradients (PHoG) features [59]. These features are computed for each input image and its matching cluster's members. They represent a statistical descriptor which was stated to achieve robustness, as reviewed in sections II-B and II-D.
This statistical descriptor is based on dividing images into segments at different resolutions called levels L, then extracting the histogram of quantized oriented gradients (HOG) [60]. This form of calculation analyzes orientation information at different resolutions and levels L, as shown in Figure 8, which extracts fine details and effectively discriminates between the information of different word skeletons [37]. Representing oriented gradients by a single histogram, as a quantized feature space, provides a visual vocabulary known as bag-of-words [61]. This way of feature representation is an effective way of measuring similarity. PHoG effectiveness is achieved by representing local shapes of the Arabic handwritten words, in addition to extracting the spatial layout of word shape information. The ordinary HOG (quantized into K bins) is responsible for the extraction of local shapes. Each bin represents the frequency of edges within a certain range of angular orientations. The main contribution of the PHoG compared to the ordinary HOG is satisfying the concept of spatial layout representation by segmenting the image into sub-regions. At each level L, segmented regions/grids are doubled relative to the number of grids at level L − 1, as shown in Figure 8. The x-axis represents the quantized oriented gradients and the y-axis represents their corresponding normalized histograms. The spatial distribution of edges is captured in a 1-dimensional feature vector. This vector concatenates the HOG of all segmented sub-regions at each level L. The feature vector length is proportional to the number of bins K and the number of levels L, as defined in (3).
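The pyramid construction can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not the authors' implementation: gradients use centered differences rather than the Prewitt mask, the grid is doubled along each axis per level (so level l has 2^l x 2^l cells), histograms are weighted by gradient magnitude, and normalization is left out. The function name `phog` and its parameters are hypothetical.

```python
import math


def phog(image, bins=8, levels=2):
    """Unnormalized pyramid histogram of oriented gradients (sketch).

    `image` is a 2-D list of numbers. Returns the concatenation of one
    `bins`-long histogram per cell, for levels 0..levels.
    """
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    ori = [[0] * w for _ in range(h)]
    # Signed gradient orientation in [0, 360) quantized into `bins` bins.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag[y][x] = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            ori[y][x] = int(angle / (360.0 / bins)) % bins
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells per axis at this level (assumption)
        for cy in range(cells):
            for cx in range(cells):
                hist = [0.0] * bins
                for y in range(cy * h // cells, (cy + 1) * h // cells):
                    for x in range(cx * w // cells, (cx + 1) * w // cells):
                        hist[ori[y][x]] += mag[y][x]
                feature.extend(hist)
    return feature
```

Under these assumptions, 8 bins and levels 0 to 2 give a vector of 8 x (1 + 4 + 16) = 168 values, which shows how the length grows with both the number of bins and the number of levels.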
Although, according to the literature, the most preferred gradient mask for computing the oriented gradients was the centered 1-D mask [−1, 0, 1], the Prewitt 2-D edge detector mask [62] works better in computing edge information in our case. Orientation values in [0°, 360°] are quantized to eight orientations. The effect of the number of bins and the orientation range on the final true positive rate (TPR) of the system is shown in Figure 9. The figure shows an increase of the TPR when more ranks are considered. The x-axis represents the top N ranks, at N = 1, 20, 40, 60. It is shown in Figure 9 that fine coding and increasing the number of orientation bins improve performance up to eight bins, and no improvement is achieved beyond that. Increasing the number of bins beyond eight causes overfitting in our case. It is also shown that ignoring the gradient sign degrades performance, even after doubling the number of bins. In conclusion, including sign information is essential for preserving the information representing Arabic handwritten words.
The eight angular histogram bins are normalized to form the feature vector using L1-sqrt. Considering 8 bins over [0°, 360°], the effect of different block normalization methods [60] on the true positive rate (TPR) is shown in Table 1. It is clear in the table that L2-hys, L2-norm, and L1-sqrt achieve almost the same performance. L2-norm outperforms L1-sqrt in early top ranks (first three left columns), while they perform similarly at higher top ranks (fourth column). L1-sqrt is preferred for its lower computational complexity, since the top 100 ranks are passed from this stage to the final stage. L1-norm achieves relatively lower performance, while omitting normalization entirely causes great degradation, as shown in Table 1. The L2-norm, L1-sqrt, and L1-norm block normalization methods are defined by equations (4), (5), and (6), respectively. L2-Hys normalization is simply the original L2-norm clipped to 0.2 and re-normalized [63].
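For reference, the standard forms of these three schemes in the HOG literature [60] are shown below, on the assumption that equations (4) to (6) follow them:

```latex
\begin{align}
\text{L2-norm:}\quad & \hat{v} = \frac{v}{\sqrt{\lVert v \rVert_2^2 + \epsilon^2}} \\
\text{L1-sqrt:}\quad & \hat{v} = \sqrt{\frac{v}{\lVert v \rVert_1 + \epsilon}} \\
\text{L1-norm:}\quad & \hat{v} = \frac{v}{\lVert v \rVert_1 + \epsilon}
\end{align}
```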
where $v$ is the feature vector before normalization, $\hat{v}$ is the normalized feature vector, and $\epsilon$ is a small corrective constant. At this point, handwritten words are characterized as probability distributions of local gradients. This characterization enables ranking of matching classes relative to the test image by measuring divergence between their probability distributions [64]. Accordingly, the next step in our ranking algorithm is measuring Kullback-Leibler (KL) divergence [65]. It is a generalized measure of Shannon entropy, as defined by (7). The significance of entropy in measuring information was stated [66], and complemented by KL-Divergence [67].
The KL-divergence is measured between the PHoG feature vector of the test image and the PHoG feature vectors of the cluster's members. Based on the measured divergence, the classes of the matching cluster are ranked in ascending order relative to the test image. This process aims to pass only the highly ranked classes of the cluster to the final classification stage, as shown in our system overview (Figure 2). The KL-divergence $D_{KL}(p(x) \,\|\, q(x))$, defined by equation (8), is a non-symmetric measure of divergence between probability distributions. It is a non-negative measure, and the value measured from p(x) to q(x) differs from the value measured from q(x) to p(x). The divergence becomes zero only when p(x) = q(x).

The KL-divergence $D_{KL}(p(x) \,\|\, q(x))$ in (8) can be expressed in terms of Shannon entropy [68], as shown in equation set (9).
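In the standard textbook form, which is assumed here to be what equations (8) and (9) express, the divergence and its entropy decomposition read as follows. The symbols $H(p)$ for the Shannon entropy of $p$ and $H(p, q)$ for the cross-entropy of $q$ relative to $p$ are assumed names.

```latex
\begin{align}
D_{KL}\big(p(x) \,\|\, q(x)\big) &= \sum_{x} p(x) \log \frac{p(x)}{q(x)} \tag{8} \\
&= \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x) \notag \\
&= -H(p) + H(p, q) \tag{9}
\end{align}
```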
The KL-divergence $D_{KL}(p(x) \,\|\, q(x))$ is quite effective in measuring divergence between probability distributions due to the following properties:
1) It is a non-negative measure, where $D_{KL}(p(x) \,\|\, q(x)) \geq 0$, and it equals 0 if and only if P = Q.
2) It is a convex measure with respect to both P and Q.
3) It satisfies independence, that is, when two independent factors X and Y are considered, (10) is satisfied.
According to the measured divergence, the matching classes are ranked before the classification stage. Figure 10 shows the relation between the expected TPR (y-axis) and the different ranks (x-axis). It was found that the correct class is always included in the top 100 ranks, as shown in Figures 10a, 10b, and 10c. That is why only 100 classes, which are only 11% of the total database classes, are passed to the final stage for classification. The steps of the ranking algorithm are shown in Algorithm 2.
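As an illustration of the divergence-based ranking described above, the following is a minimal sketch and not a reproduction of Algorithm 2. The function names, the smoothing constant `eps`, the direction of the divergence (test image as p, class as q), and the toy 4-bin histograms are all assumptions made for the example.

```python
import math


def to_distribution(hist, eps=1e-10):
    """Turn a non-negative histogram into a probability distribution.

    eps is an assumed smoothing constant that keeps every bin strictly
    positive, so the logarithm in the divergence is always defined.
    """
    smoothed = [h + eps for h in hist]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_classes(test_hist, class_hists, top_n=100):
    """Return up to top_n class labels in ascending order of divergence."""
    p = to_distribution(test_hist)
    scored = [(kl_divergence(p, to_distribution(h)), label)
              for label, h in class_hists.items()]
    scored.sort()
    return [label for _, label in scored[:top_n]]


# Hypothetical 4-bin histograms standing in for PHoG feature vectors.
cluster = {
    "class_a": [8, 1, 1, 0],
    "class_b": [1, 8, 1, 0],
    "class_c": [7, 2, 1, 0],
}
ranking = rank_classes([8, 1, 1, 0], cluster, top_n=2)
print(ranking)
```

Only the `top_n` labels with the smallest divergence would then be handed to the classifier, which mirrors the idea of passing the top 100 ranks to the final stage.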
Figures 10a, 10b, and 10c show that higher PHoG levels always achieve faster saturation. The proposed approach categorized the database into three main parts. The fast saturation in Figure 10a, which corresponds to the database part labeled 30, is due to the availability of enough training samples relative to testing ones. This is a crucial property that is required in our DCNN comparative study, since a DCNN is a classifier that needs enough training samples for efficient training. Saturation in Figures 10b and 10c is slower than in Figure 10a, i.e., high recognition rates are not easily attainable. The final classification stage is detailed in Section III-C.

C. STAGE 3: CLASSIFICATION USING DEEP CONVOLUTIONAL NEURAL NETWORK
Convolutional neural networks (CNNs) were introduced about 20 years ago [69] for visual object detection and recognition. CNNs achieve robust feature extraction and classification, as reviewed in section II-E. Many contributions to the CNN structure were recently reported to construct deep CNNs (DCNNs). In deep learning, propagating the information or gradient of input images through many layers may cause important parts of this information to be lost. That is why many recent publications proposed different architectures that create short paths between layers while still achieving the deep learning concept. The most popular CNN architectures are Residual Networks (ResNets) [15], AlexNet [70], VGG-16 [13], GoogLeNet [14], ResNeXt [71], and DenseNet [72]. These CNN architectures vary in the number and type of layers. These variations depend on the nature of the application, the data size, and the complexity. The different types of layers are the input layer, convolution layer, batch normalization layer, pooling layer, dropout layer, and output layer [73]. The convolution process constructs feature maps using filter weights and biases.
Designing deep architectures is a challenging task with a large set of hyper-parameters (width, filter sizes, strides, etc.). Research efforts have concentrated on two main architecture-based categories of CNN, as shown in Figure 11: classic network architectures and modern network architectures. Classic network architectures represent the traditional CNNs, resembling the LeNet-5 architecture [74], where successive stacked convolutional layers are applied with a few new modifications. LeNet-5 was the first effective CNN model designed to recognize handwritten digits for the postal service. It outperformed earlier architectures in the field of document recognition and reading bank checks. The modern architectures [15], [71], [75], [76] added further contributions to the traditional CNNs. New ways of constructing convolutional layers are applied to achieve more efficient learning. These architectures were previously applied in different applications, especially in text recognition, as described in the following subsections.

1) AlexNet
The AlexNet architecture [70] includes five convolutional layers in addition to some max-pooling layers and three fully-connected ones. The softmax function is applied to the output of the final layer to differentiate between the 859 classes in the IFN/ENIT database. The Rectified Linear Unit (ReLU) is used at the convolutional and the fully-connected layers. ReLU introduces non-linearity after the linear convolution operation. The number and size of the convolutional filters are shown in Figure 12.

2) VGG-16
The VGG network [13] is a very deep architecture that has been applied to offline handwriting recognition [45], [77]. Although it is deeper, as shown in Figure 13, it is simpler than the previous architecture [70].
Simplicity was achieved by proposing a new strategy for designing very deep networks based on equally sized building blocks. The increased depth is compensated by applying only small convolution filters of size 3 × 3.
Input images of size 224 × 224 are passed through a sequence of convolutional layers. The details of the layers are shown in Figure 13. The ReLU non-linear function is employed. Response normalization did not improve performance and increased memory consumption and computation time.

3) INCEPTION (GoogLeNet)
The concept of the Inception network was first introduced by researchers at Google [14]. This type of network was applied for classification and detection purposes. The model adopts the concept of the ''Inception cell''. In the inception cell, shown in Figure 14, multiple convolution operations are applied at different scales. The input channel depth is reduced by applying 1 × 1 convolutions. For each cell, a set of 1 × 1, 3 × 3, and 5 × 5 convolution filters is used. These multiple convolutions aim to extract and merge features at different scales. Max pooling and same-padding are applied to unify the dimensions of the features and enable their concatenation. The pooling layer is responsible for non-linear downsampling and reduces the spatial size of the feature maps.
Large convolution filters, for example 5 × 5 or 7 × 7, can extract features at a large scale; however, they suffer from high computation complexity. An alternative solution was proposed to replace such large filters, as shown in Figure 15a. It was found that a 5 × 5 convolution filter can be replaced by two successive stacked 3 × 3 filters. In that way, instead of the 25c parameters of a 5 × 5 × c filter, this alternative design requires only 18c parameters for two 3 × 3 × c filters with a linear activation between them, where c is the number of channels. To reduce computation further, a 3 × 3 convolution filter can in turn be replaced by two successive convolution filters of sizes 3 × 1 and 1 × 3, as shown in Figure 15b.
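The parameter counts quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names and the channel count c = 64 are assumptions chosen for illustration, and bias terms are ignored.

```python
def conv_weights(kernel_h, kernel_w, channels):
    """Weights in one filter spanning `channels` input channels (no bias)."""
    return kernel_h * kernel_w * channels


def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


c = 64  # hypothetical channel count, used only for illustration

single_5x5 = conv_weights(5, 5, c)                           # 25c
stacked_3x3 = 2 * conv_weights(3, 3, c)                      # 18c
single_3x3 = conv_weights(3, 3, c)                           # 9c
asymmetric = conv_weights(3, 1, c) + conv_weights(1, 3, c)   # 6c

print(single_5x5, stacked_3x3, single_3x3, asymmetric)
print(receptive_field([5]), receptive_field([3, 3]))
```

Two stacked 3 × 3 filters cover the same 5 × 5 region as a single 5 × 5 filter while using 18c instead of 25c weights, and the 3 × 1 plus 1 × 3 pair uses 6c instead of 9c.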

4) ResNet
Deep residual networks (ResNets) are deeper networks with hundreds of layers instead of tens of layers. The goal of building very deep networks is to learn more complex data with high accuracy. Nonetheless, adding a large number of layers may have negative effects, such as the degradation problem. ResNet is one of the proposed solutions to overcome degradation. This type of network is composed of residual blocks, as shown in Figure 16. The identity shortcut, x, is added to the output of the applied residual mapping function to preserve the computations of the stacked layers without adding extra parameters or increasing computation complexity. Different residual blocks [15] were introduced to ease training and achieve high accuracy.
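The identity shortcut can be illustrated with a minimal sketch. For brevity, small dense layers stand in for the convolutional layers of a real residual block; the function names and weight matrices are assumptions made for the example.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Dense layer without bias: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear maps with a ReLU between."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the residual mapping F(x) is 0, so the block
# simply passes the input through the shortcut.
y = residual_block(x, zeros, zeros)
print(y)
```

The shortcut adds no parameters: the only weights are those of the residual mapping F, which only has to learn a correction to the identity.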

5) ResNeXt
Another contribution to the deep residual network is the ResNeXt architecture [71], which applies the ''split-transform-merge'' strategy instead of the standard residual block. Instead of calculating a series of convolutions over the full feature map, convolutional filters are applied in parallel and merged to form feature maps.
It is a multi-branch architecture with a small set of hyperparameters. Group convolutions are applied in parallel to save computation and processing time. The main contribution is extracting features of various characteristics through the split process.
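The saving from group convolutions can be sketched by counting weights. This is an illustration only: the function name is hypothetical, and the layer sizes (128 channels, 32 groups) are assumed values rather than the configuration used in the experiments.

```python
def grouped_conv_weights(kernel, in_channels, out_channels, groups=1):
    """Weight count of a kernel x kernel convolution split into groups.

    Each group maps in_channels/groups inputs to out_channels/groups
    outputs, so the total is kernel^2 * in * out / groups (no bias).
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = kernel * kernel * (in_channels // groups) * (out_channels // groups)
    return per_group * groups


standard = grouped_conv_weights(3, 128, 128, groups=1)
grouped = grouped_conv_weights(3, 128, 128, groups=32)
print(standard, grouped, standard // grouped)
```

For a fixed number of input and output channels, splitting the convolution into g parallel groups divides the weight count by g.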

6) DenseNet
Traditional CNNs are composed of L layers with L connections, i.e. one connection between each pair of successive layers. Dense CNNs are composed of L(L+1)/2 direct connections. Each layer makes use of all feature maps computed in all preceding layers. This enables such networks to outperform the previously described ones. DenseNet is considered a very powerful solution to the vanishing-gradient problem [78], since it propagates feature maps and reduces the number of parameters. This results in less computation and higher performance. The final feature maps are constructed by concatenating all referenced feature maps from the earlier layers. The number of filters used per layer is called the growth rate k, i.e. each layer contributes k additional feature maps to the input of the layers that follow it. According to the literature, DenseNet achieved better performance than the ResNet models. Finally, the DenseNet architecture is constructed by replacing the main unit in the ResNet model architecture with the dense block, as shown in Figure 17.

FIGURE 18. System recognition rate using SVM. The x-axis represents the three labeled database parts; the y-axis represents the final achieved recognition accuracy.
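For the DenseNet description above, the connection count and the effect of the growth rate k can be illustrated with a short sketch. The function names and the example values (5 layers, 16 initial channels, k = 12) are assumptions chosen for illustration.

```python
def dense_connections(num_layers):
    """Direct connections in a densely connected network: L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


def input_channels(layer_index, initial_channels, growth_rate):
    """Feature maps seen by layer `layer_index` (1-based) in a dense block.

    Every earlier layer contributes `growth_rate` maps, which are
    concatenated with the block input.
    """
    return initial_channels + growth_rate * (layer_index - 1)


L = 5
print(L, dense_connections(L))  # traditional vs dense connection counts
print([input_channels(i, 16, 12) for i in range(1, L + 1)])
```

The input width of each layer grows linearly with depth, which is why a small growth rate is enough to keep the total parameter count low.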

IV. EXPERIMENTAL RESULTS
The IFN/ENIT database [29] is one of the challenging Arabic handwriting datasets. It provides training and testing images to support the construction of handwriting recognition systems. It is composed of about 2200 binary images (300 dpi) of handwriting sample forms from 411 different writers. The 26,000 binary word images, which represent town/village names, have been extracted from the forms and saved individually in five different sets. These five sets are labeled a, b, c, d, and e. Formerly, sets a, b, and c were used for training, while set d was used for testing. Experiments were also conducted using sets a, b, c, and d for training and set e for testing. The number of different classes per set is shown in Table 2. The number of samples varies from one class to another. The database provides researchers with a ground truth file for each word.

We categorized the database into 3 parts. The first part includes the classes where the number of training samples exceeds the number of testing samples by 5%. The second part includes the classes where the number of training samples exceeds the number of testing samples by 20%. The third part includes the classes where the number of training samples exceeds the number of testing samples by 30%. Table 2 summarizes the statistics of the database parts used in our experiments. Before testing our approach, the ground truth files were compiled and the images of the same city name were grouped in a folder. In this way, the whole dataset is organized in folders instead of raw image files. Each folder represents a class that has a city name and contains all the images of that city written by different writers.

A. MATCHING PROCESS SENSITIVITY
The number of classes differs from one cluster to another, and different cluster sizes occur with different frequencies. Sensitivity [79] is reported in Figure 19: the difference between the normalized frequency of cluster sizes and the corresponding measured sensitivity represents 13% of the whole area under the curve. The weighted average matching set size computed from equation (11) is approximately 172.6 classes. This average is about 18% of the total number of classes in the database, so on average only 18% of the database is passed to the next stage for ranking.

$$\bar{x} = \frac{\sum_i w_i x_i}{\sum_i w_i} \qquad (11)$$

where $\bar{x}$ is the weighted average, $x_i$ is the size of set i, and $w_i$ is the normalized weight of set i. The largest set includes 423 different classes (worst case), which is 44.7% of the 946 training classes in the database, as shown in Figure 19.

Table 3 shows the effect of the clustering algorithm on the performance of the AlexNet CNN classifier when applied to the IFN/ENIT dataset. This experiment used sets a, b, and c for training, while set d was used for testing. When AlexNet was applied to all 946 database classes, without any clustering or ranking, it was trained in 72.1 minutes and achieved a 76% recognition rate. When clustering was applied prior to AlexNet, without any ranking, the recognition rate increased to 84.1%. Finally, the best recognition rate was achieved when clustering and ranking were applied consecutively before classification. Table 3 also shows that the training time was reduced sharply in this third case.
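The weighted average of equation (11) and the percentages quoted above can be illustrated as follows. This is a sketch under assumptions: the set sizes and weights in `sizes` and `weights` are made-up values, since the actual cluster statistics are only shown in Figure 19; only 172.6, 423, and 946 are taken from the text.

```python
def weighted_average(sizes, weights):
    """Weighted mean of set sizes; weights need not sum exactly to 1."""
    total = sum(weights)
    return sum(w * x for x, w in zip(sizes, weights)) / total


# Hypothetical matching-set sizes and their normalized frequencies.
sizes = [50, 200, 423]
weights = [0.5, 0.3, 0.2]
avg = weighted_average(sizes, weights)

# Shares of the 946 training classes reported in the text.
avg_share = 100 * 172.6 / 946    # average matching set, in percent
worst_share = 100 * 423 / 946    # largest matching set, in percent

print(round(avg, 1), round(avg_share, 1), round(worst_share, 1))
```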

B. THE RANKING PROCESS
For each test image, the ranking process causes the correct class to advance and occupy an early position among the cluster members. This leads to faster saturation at high recognition rates, as shown in Figure 10a. Fast saturation, approaching a 100% recognition rate, demonstrates the effectiveness of the ranking process and its positive effect on the final classification stage.
Because ranking places the correct class in an early position, only a subset of highly ranked classes needs to be passed to the classification stage. After clustering, the maximum number of classes used for training the classifier is reduced from 845 to 423. Ranking the cluster members reduces this further to only 100 classes. Reducing the number of classes that participate in the final classification improved the final recognition rate from 84.1% to 95.6%, as shown in Table 4, and also reduced computation and time complexity.

1) CLASSIFICATION USING SUPPORT VECTOR MACHINES (SVM)
The SVM is applied with different kernel types at different PHoG levels, as shown in Figure 18. The figure shows the final recognition accuracy (y-axis) reported on the three labeled database parts (x-axis). The linear SVM with the fifth PHoG level achieved the highest recognition rate. The linear kernel gave the best classification results, while the polynomial kernel was better than the quadratic one. Classification errors had several causes: visual similarity, as some samples have common PAWs (Figure 21a); bad handwriting styles that may mislead the system (Figure 21b); inaccurate thinning results that may remove important details and cause misclassification (Figure 21c); and, finally, a lack of training samples in some database classes.

Table 5 presents the experiments and results of similar systems compared to our proposed approach using SVM. Some classification systems applied word segmentation to the character level, while others did not, as shown in the table. Segmentation helps in predicting the correct sequence of word characters based on a previously defined dictionary, but it increases computation complexity, and segmentation errors affect the final recognition rates. The proposed approach is applied without segmentation and without the use of a previously defined dictionary. Similar systems are evaluated using the word error rate (WER) metric. The approach is applied once using sets a, b, and c for training and set d for testing, and once more using sets a, b, c, and d for training and set e for testing.

Table 5 shows that some approaches applied HMM classifiers. The HMM-based approaches [39] depend on character modeling, as shown in Figure 22. Each model has a defined number of hidden states and transitions. Modeling defines the transition and observation probabilities. Recognition is based on the sequence of characters that achieves the maximum likelihood. Features are extracted using a sliding window.
Table 5 shows that our proposed solution achieves one of the four highest accuracies. Moreover, the proposed approach did not model any characters, and words were recognized in a holistic manner without any character-level segmentation. Some approaches proposed a combination of different HMMs for the final decision [22], [24] to improve the final performance.
Other approaches [47] applied re-ranking to improve HMM performance. Still others [27] reported that gamma distributions outperform the Gaussian distribution in modeling states. The semi-continuous HMM (SCHMM) [23], [27], [31] achieved better performance than the standard HMM. In our approach, different techniques were applied consecutively.

2) CLASSIFICATION USING DEEP CONVOLUTION NEURAL NETWORKS (DCNN)
Applying a DCNN classifier requires enough training samples. That is why it is applied in our study only to the IFN/ENIT dataset classes labeled 30. These classes were tested after training the system with their cluster members. The number of training classes per cluster is 100, based on the analysis done during the ranking stage. Figure 23 shows the results of the experiments using the different CNN architectures. The experiments summarized in the figure use sets a, b, and c for training and set d for testing. AlexNet achieved the best testing accuracy at a learning rate of 0.01. Almost all the testing samples were recognized correctly, as shown in Figure 23a. AlexNet outperformed the other architectures in time and accuracy, as shown in Figure 23a. Accuracy is measured at three different learning rates. Increasing the learning rate decreases the total training time. The second best architecture is VGG-16, whose highest accuracy is also achieved at a learning rate of 0.01. VGG-16 shows higher time and memory complexity than AlexNet. Based on Figure 23, it can be deduced that the more advanced and complex the architecture is, the lower the accuracy achieved on the IFN/ENIT dataset. The explanation for this behavior is that the complexity of the CNN architecture should match the complexity of the data. The IFN/ENIT dataset is composed of binary images, and all images are thinned to a width of one pixel in our early preprocessing stage. Clearly, this is a simple form of data that requires a CNN of moderate complexity.
Figure 23 shows that some architectures failed to classify test images at specific learning rate values. The learning rate [80] is one of the most important hyper-parameters, and it deserves careful tuning to reach its optimal value. Small learning rates may cause long training times and high training errors. On the other hand, large learning rates may cause the CNN weights to grow excessively, which may prevent the training from converging. Therefore, learning rates should be neither too large nor too small. Typical learning rate values range from 1 to 0.000001. Comprehensive research is needed to find the optimal design of a CNN for achieving the best performance [55]. Another explanation for the low results in Figure 23 is the depth of some architectures, which does not suit a small dataset. It was reported that the deeper the architecture, the lower the achieved identification rate [81]. Deep architectures are also sensitive to noise and image degradation [82], which in our case may result from thinning the images to a width of one pixel.

Table 6 compares our results with the results of others using different types of neural networks (NN). This comparison is based on using sets a, b, c, and d for training and set e for testing. A CNN applied as a feature extractor, with a Bidirectional Long Short-Term Memory (BLSTM) network followed by a Connectionist Temporal Classification (CTC) layer as a classifier, achieved 92.21% on the IFN/ENIT database [34]. When a CNN was applied as a feature extractor and an SVM as a classifier on only 56 classes [35], the recognition rate was 92.95%. When a CNN was applied as a classifier [36] on Pyramidal Histogram of Characters (PHOC) features, the achieved recognition rate was 92.14%. A recent approach [49] applied a deep belief network (DBN) for feature extraction and classification and achieved a recognition rate of 94.99%. Their enhanced results were based on character-level segmentation using a morphological algorithm, but there was an additional 6.5% segmentation error. A Long Short-Term Memory (LSTM) network [48] achieved a 12.06% character error rate when applied to IFN/ENIT. A multi-column deep neural network (MCDNN) [83] achieved better recognition results than a Recurrent Neural Network (RNN) when applied to IFN/ENIT.
It has been demonstrated that results obtained using CNN features [43] are better than those obtained with hand-crafted features. Another recent approach [33] achieved superior recognition rates by applying a CNN as a feature extractor followed by a dynamic Bayesian network for classification. The present multi-stage approach, when applied to the database, achieved 95.6% using the AlexNet CNN architecture without any character-level segmentation. This result demonstrates the effectiveness of the proposed multi-stage approach with the AlexNet CNN as a classifier. Based on the literature, the superior recognition results [45] shown in Table 6 were due to building a set of Arabic characters and training the CNN on Arabic uni-grams rather than on whole words. Classification in that approach [45] was also supported by a matching dictionary.

V. COMPLEXITY ANALYSIS
The complexity of the hierarchical clustering technique, in terms of the total number of database classes n, is O(n²) for space and O(n³) for time. The similarity matrix is 2-D and requires on the order of n² entries to be stored in RAM, while n iterations are required for updating and re-storing the similarity matrix. After storing the database as a tree-like model of clusters along with their computed ranges of features, test images are matched with one of these clusters. Although the complexity of the clustering technique is very high, building the tree-like model reduces the complexity of the matching process to O(log(v)), where v is the number of clusters. The biggest cluster includes 423 different classes; this is only 49.2% of the whole database size.
If the number of classes in a cluster is L, then the KL divergence performs L operations in the second stage. This results in O(L) space and time complexity. After ranking the cluster members, it was observed that the maximum rank of the correct class is 100, as shown in Figure 10a with the fifth-level PHoG features. This observation enables passing only a subset of at most 100 classes to the classification stage. This maximum number of classes is only about 11% of the total 859 database classes. This reduction in the number of classes involved in classification reduces recognition complexity, and thus higher recognition accuracy can be expected. A comparison between the different architectures on the IFN/ENIT dataset is shown in Figure 24. The figure relates the top-1 average accuracy to the memory and computational cost. The experiments were run on an Intel(R) Core(TM) i7-2670QM processor. VGG is the most expensive architecture in terms of the recorded number of operations and memory complexity, as shown in Figure 24. In some other architectures, reduced complexity affects accuracy negatively. The x-axis represents the required number of operations per architecture. The required memory is represented by circles of different sizes. The memory ranges from 5 MB to 155 MB.

VI. CONCLUSION
The paper presents a multi-stage approach that aims to mine a large dataset before classification. The ranking stage is an intermediate stage that passes a set of the nearest classes per test image to the final classification, making it less complex. For classification, the effects of SVM and CNN were reported independently. SVM classification is based on a feature vector extracted from the input image, while CNN classification is based on the raw pixel values of the input image.

VII. FUTURE WORK
Although the KL-divergence is one of the most recommended methods for measuring divergence between probability distributions according to the literature [84], we plan to apply other methods and compare their performance. These methods include the Pearson (PE) divergence, the relative Pearson (rPE) divergence, and the L2-distance [85]. It is worth noting that these methods are based on quadratic computation, while the method we used is linear. Also, the relation between the different learning rates and the achieved results should be studied in depth. It is essential to implement an algorithm that computes the optimum learning rate.
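As a starting point for such a study, a simple search over log-spaced learning rates can be sketched as follows. This is only an illustration under strong assumptions: it uses plain gradient descent on a toy quadratic objective instead of a CNN, and the function name, step count, and starting point are hypothetical. The candidate range from 1 down to 0.000001 follows the typical range mentioned in Section IV.

```python
def train_quadratic(learning_rate, steps=50, start=5.0):
    """Gradient descent on f(w) = w**2; returns the final loss."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w   # the gradient of w**2 is 2w
    return w * w


# Candidate learning rates: 1, 0.1, ..., 0.000001.
candidates = [10.0 ** -k for k in range(0, 7)]
losses = {lr: train_quadratic(lr) for lr in candidates}
best_lr = min(losses, key=losses.get)

for lr in candidates:
    print(lr, losses[lr])
print("best:", best_lr)
```

On this toy problem the largest rate oscillates without reducing the loss and the smallest rates barely move within the step budget, which mirrors the trade-off between too large and too small learning rates.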