Stroke Extraction for Offline Handwritten Mathematical Expression Recognition

Offline handwritten mathematical expression recognition is often considered much harder than its online counterpart due to the absence of temporal information. In order to take advantage of the more mature methods for online recognition and save resources, an oversegmentation approach is proposed to recover strokes from textual bitmap images automatically. The proposed algorithm first breaks down the skeleton of a binarized image into junctions and segments, then merges segments to form strokes, and finally normalizes stroke order by using recursive projection and topological sort. Good offline accuracy was obtained in combination with ordinary online recognizers, which were not specially designed for extracted strokes. Given a ready-made state-of-the-art online handwritten mathematical expression recognizer, the proposed procedure correctly recognized 58.22%, 65.65%, and 65.22% of the offline formulas rendered from the datasets of the Competitions on Recognition of Online Handwritten Mathematical Expressions (CROHME) in 2014, 2016, and 2019 respectively. Furthermore, given a trainable online recognition system, retraining it with extracted strokes resulted in an offline recognizer with the same level of accuracy. In addition, the entire pipeline was fast enough to facilitate on-device recognition on mobile phones with limited resources. To conclude, stroke extraction provides an attractive way to build optical character recognition software.


Introduction
Mathematical expressions constitute an essential part of engineering and scientific documents. Digitizing them would maximize their usability by enabling retrieval [1] and integration into the semantic web [2]. Compared with natural language, mathematical expressions can present some concepts more concisely because of their two-dimensional structure. At the same time, such a compact representation is more difficult to recognize mechanically.
Enabling people to input mathematical expressions in the same way they normally write on paper or a blackboard is advantageous. Traditional input devices like keyboards are designed for sequences of characters. Although spatial relationships between symbols can be represented by markup such as TeX or MathML, inputting mathematical expressions by typing in a computer language is not user friendly, as new users are asked to learn a new language and remember a lot of commands. On the other hand, entering mathematical expressions with a graphical equation editor by choosing structural elements and symbols from toolboxes is inefficient for frequent users.
Turning handwritten mathematical expressions into machine-manipulable syntax trees is what a recognition system is expected to do. Online recognition enables people to take notes or solve equations by writing on a touch-based device or dragging a mouse. Offline recognition, on the other hand, enables people to digitize existing manuscripts by scanning, or to record lecture notes on blackboards by taking photos. The major difference between the two kinds of recognition is that temporal information (coordinates and possibly pressure at each moment) of strokes is available to an online recognizer, while only a bitmap image is provided to an offline recognizer.
It is not surprising that online handwriting recognition often achieves a much higher accuracy, given that an online recognition problem can be trivially reduced to the corresponding offline problem by rendering the strokes. In the opposite direction, if the sequence of strokes can be recovered from a bitmap image, an online recognizer can also be applied to offline recognition [3]. Being a computer vision procedure, recovery of strokes is unlikely to be perfect, so the underlying online recognition system must tolerate errors. This should not be a strong constraint, since online recognition systems already have to cope with diverse stroke orders.
The proposed approach is especially preferable for real-time use cases on devices with limited resources. Online recognition engines often occupy less memory and run faster than native offline recognition engines. Optical recognizers based on convolutional neural networks typically depend on large models, and prediction is computationally intensive too. It is possible to offload the computational burden to a cloud, but it is difficult to ensure that time constraints are met due to network delay.
Although this paper targets mathematical expressions, the idea of reducing an offline recognition problem to its online counterpart is quite general. In principle, the same approach can be applied to other handwriting such as chemical expressions, musical notation and diagrams. Since the idea allows advances in online recognition to be propagated immediately to the offline case, developing independent recognition systems for online and offline handwriting may no longer be necessary. Instead, makers of online recognition systems can enter the offline market without abandoning existing investments.

Related works

Online handwritten mathematical expression recognition
Online handwritten mathematical expression recognition is a long-standing problem, and much work has been done since Anderson [4]. In the past decade, the problem has attracted more and more attention thanks to the benchmarking datasets released by the Competitions on Recognition of Online Handwritten Mathematical Expressions (CROHME) [5], which were held in 2011, 2012, 2013, 2014, 2016 and 2019.
Traditionally, the problem is further divided into symbol recognition and structural analysis [1]. For example, Álvaro et al. [6] used hidden Markov models to recognize symbols and a parser for a predefined two-dimensional stochastic context-free grammar to analyze the structure.
Yamamoto et al. [7] suggested parsing handwritten mathematical expressions directly from strokes using the Cocke-Younger-Kasami algorithm, so that symbol segmentation, character recognition and structural analysis can be optimized simultaneously. Awal et al. [8] introduced another global approach that applies a segmentation hypothesis generator to deal with delayed strokes.
Recently, with the advances in deep learning and computational power, Zhang et al. [9] proposed an end-to-end trainable neural network with an attention mechanism for online mathematical expression recognition. The framework also used offline information by appending the rendered image to the input.

Offline handwritten mathematical expression recognition
In contrast, dedicated work on offline handwritten mathematical expression recognition was almost absent from the literature until very recently. An offline task was first added to CROHME in 2019 [10].
In the past, the closest problem addressed was the more constrained problem of printed mathematical expression recognition. Again, in a typical system, symbols are first segmented and recognized, then the structure of the expression is analyzed [11]. For instance, in the system developed by Suzuki et al. [12], symbols are extracted by connected component analysis and then recognized by a nearest neighbor classifier. Structural analysis is finally performed by finding a minimum spanning tree in a directed graph representing spatial relationships between symbols.
Recently, Deng et al. [13] and Zhang et al. [14,15] developed end-to-end trainable neural encoder-decoder models to translate an image of a mathematical expression into TeX code. It should be noted that this method is so general that it can be applied to any image-to-markup problem. The grammar of neither mathematical expressions nor TeX is given to the systems explicitly, because it is learned from data.

Stroke extraction
Stroke extraction has been studied for offline signature verification [16] and East Asian character recognition [17]. A typical stroke extractor first detects candidate sub-strokes and then reassembles them into strokes by resolving ambiguities. Sub-strokes can be detected by breaking down the skeleton or by approximating the image with geometrical primitives such as polygonal chains.
Lee et al. [16] designed a set of heuristic rules to trace the skeleton. Boccignone et al. [18] tried to reconstruct strokes by repeatedly joining the pair of adjoining sub-strokes having the smallest difference in direction, length and width. Doermann et al. [19] proposed a general framework to integrate various temporal clues.
Jäger [20] reconstructed strokes by minimizing the total change in angle between successive segments in a stroke. Lau et al. [21] used another cost function taking the distance between successive segments and the directions of the segments into account. Unfortunately, this kind of formulation is essentially a traveling salesman problem, which is NP-hard, so an optimum may not be computed efficiently if there are more than a few sub-strokes.
In order to prevent a combinatorial explosion, Kato et al. [22] restricted themselves to single-stroke scripts subject to certain assumptions on junctions, where strokes can be extracted by graph traversal. Nagoya et al. [23] extended the technique to multi-stroke scripts under assumptions on how strokes intersect.
Nevertheless, the effectiveness of existing stroke extraction methods for offline recognition is either untested or not evaluated on a common dataset, so it is difficult to judge their performance.

Offline to online reduction

Overview
Given a bitmap image containing a mathematical expression, it must be converted to a sequence of strokes before being passed to an online handwritten mathematical expression recognition engine. In more detail, the key steps of the proposed offline handwritten mathematical expression recognition system are:
1. Adaptive binarization. Convert the possibly colored input image into a black and white image.
2. Stroke width transform. Estimate the stroke width for each foreground pixel.
3. Thinning. Convert the binary image into a skeleton.
4. Decomposition of the skeleton. Break it down into segments and junctions.
5. Construction of an attributed graph. Segments and junctions form the edges and vertices of the graph respectively.
6. Simplification of the attributed graph. Remove vertices and edges that are likely to be noise from the graph.
7. Reconstruction of strokes. Merge segments into strokes using bottom-up clustering.
8. Fixing double-traced strokes. Reuse some segments to join separated strokes.
9. Determination of stroke direction. Ensure that the points in each stroke are ordered by the time they are likely to have been written.
10. Stroke order normalization. Sort the strokes according to when they are expected to have been written.
11. Online recognition. Use any online handwritten mathematical expression recognition engine to recognize the extracted sequence of strokes.

Preprocessing
Since the skeleton roughly preserves the shape of strokes but is much simpler, it is easier to trace strokes from the skeleton than from the full image. Before skeletonization, a colored image should be binarized. The colored image is first converted to a grayscale image by averaging the color channels (possibly weighted). Among the large number of binarization methods available, Sauvola's method [24] is chosen.
Compared with global thresholding such as Otsu's method [25], such an adaptive approach addresses commonly seen degradations including uneven illumination and random background noise. However, pixels that do not belong to the mathematical expression may still be marked as foreground, for instance text next to the expression and grid lines on a notebook. Mathematical expression localization and separation [26] can be used to tackle the problem, but they are beyond the scope of this paper.
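As a rough illustration of the binarization step, the following sketch converts a color image to grayscale by unweighted averaging and then applies Sauvola's local threshold T = m(1 + k(s/R - 1)), where m and s are the local mean and standard deviation. The function names and the parameter values (window of 15 pixels, k = 0.2, R = 128) are assumptions chosen for illustration and are not taken from the paper.

```python
def to_gray(rgb):
    """Average the color channels of an image given as rows of (r, g, b) tuples."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]


def sauvola(gray, window=15, k=0.2, r=128.0):
    """Return a binary image (True = ink) using Sauvola's local threshold.

    window, k and r are illustrative defaults, not values from the paper.
    """
    h, w = len(gray), len(gray[0])
    # Integral images of the values and of their squares (extra zero row/column).
    s1 = [[0.0] * (w + 1) for _ in range(h + 1)]
    s2 = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            v = gray[y][x]
            s1[y + 1][x + 1] = v + s1[y][x + 1] + s1[y + 1][x] - s1[y][x]
            s2[y + 1][x + 1] = v * v + s2[y][x + 1] + s2[y + 1][x] - s2[y][x]
    half = window // 2
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        for x in range(w):
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            n = (y1 - y0) * (x1 - x0)
            total = s1[y1][x1] - s1[y0][x1] - s1[y1][x0] + s1[y0][x0]
            total_sq = s2[y1][x1] - s2[y0][x1] - s2[y1][x0] + s2[y0][x0]
            mean = total / n
            var = max(0.0, total_sq / n - mean * mean)  # guard against rounding
            threshold = mean * (1.0 + k * (var ** 0.5 / r - 1.0))
            out[y][x] = gray[y][x] < threshold  # dark pixels are foreground
    return out
```

The integral images keep the cost per pixel constant regardless of the window size, which matters on devices with limited resources.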
After binarization, the skeleton of the image is obtained by using a thinning method by Wang [27], which is a variant of the original method by Zhang and Suen [28] but better preserves the shape of diagonal strokes. Figure 1 compares an image with its skeleton.
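To make the thinning step concrete, here is a minimal sketch of the original Zhang-Suen algorithm [28]. It is not the variant by Wang [27] used in the proposed system, whose modified conditions are not reproduced here. The function name and the representation of the image as rows of 0/1 integers are assumptions.

```python
def zhang_suen(image):
    """Thin a binary image (rows of 0/1 integers) with the Zhang-Suen algorithm."""
    h, w = len(image), len(image[0])
    # Pad with a background border so that every pixel has eight neighbors.
    img = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in image] + [[0] * (w + 2)]

    def neighbors(y, x):
        # P2..P9, clockwise, starting from the pixel above.
        return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1], img[y + 1][x + 1],
                img[y + 1][x], img[y + 1][x - 1], img[y][x - 1], img[y - 1][x - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(1, h + 1):
                for x in range(1, w + 1):
                    if not img[y][x]:
                        continue
                    p = neighbors(y, x)
                    b = sum(p)  # number of foreground neighbors
                    # Number of 0 -> 1 transitions in the cyclic sequence P2..P9, P2.
                    a = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((y, x))
            # Delete only after the whole sub-iteration has been evaluated.
            for y, x in to_delete:
                img[y][x] = 0
            changed = changed or bool(to_delete)
    return [row[1:-1] for row in img[1:-1]]
```

Pixels are deleted in two alternating sub-iterations until nothing changes, so applying the function to its own output leaves the skeleton unchanged.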
For printed document recognition, skew detection and correction are often performed. However, they should not be applied to a single mathematical expression because the number of symbols may not be enough to estimate the angle reliably. To make things worse in the present situation, symbols in a handwritten formula need not stick to a single baseline, so expressions like $x^{x^{x}}$ may fool skew estimation based on line detection such as the Hough transform.

Decomposition of skeleton
After skeletonization, the skeleton is decomposed into segments and junctions, so that it can be viewed as a graph.
A foreground pixel that has exactly two foreground pixels in its 8-neighborhood, where these two pixels are not 4-neighbors of each other, is called a segment pixel. Other foreground pixels are called junction pixels. Figure 2 illustrates the rules.
A connected component of the set of segment pixels is called a segment, while a connected component of the set of junction pixels is called a junction. The set of segments and the set of junctions can be computed using any standard algorithm for connected component analysis [29]. In Figure 3, segments are filled but junctions are not.
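The classification rule and the grouping into segments and junctions can be sketched as follows. The function names and the use of (row, column) tuples are assumptions, and 8-connectivity is assumed for the connected components.

```python
def classify_pixels(skeleton):
    """Split the foreground pixels of a skeleton (rows of 0/1) into segment and junction pixels."""
    h, w = len(skeleton), len(skeleton[0])
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    segment, junction = set(), set()
    for y in range(h):
        for x in range(w):
            if not skeleton[y][x]:
                continue
            nb = [(y + dy, x + dx) for dy, dx in offsets
                  if 0 <= y + dy < h and 0 <= x + dx < w and skeleton[y + dy][x + dx]]
            # Exactly two foreground neighbors that are not 4-neighbors of each other.
            if len(nb) == 2 and abs(nb[0][0] - nb[1][0]) + abs(nb[0][1] - nb[1][1]) != 1:
                segment.add((y, x))
            else:
                junction.add((y, x))
    return segment, junction


def components(pixels):
    """Group a set of pixels into 8-connected components by depth-first search."""
    remaining, result = set(pixels), []
    while remaining:
        start = remaining.pop()
        stack, comp = [start], {start}
        while stack:
            y, x = stack.pop()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    q = (y + dy, x + dx)
                    if q in remaining:
                        remaining.remove(q)
                        comp.add(q)
                        stack.append(q)
        result.append(comp)
    return result
```

Note that end points of the skeleton have a single foreground neighbor and therefore count as junction pixels under this rule, which is what lets every open segment touch a junction at both ends.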
For each segment $S_i$, its pixels can be listed in such a way that successive pixels are 8-neighbors of each other. More formally, $S_i = \{p_{i,1}, \ldots, p_{i,\ell_i}\}$, where $p_{i,k}$ is in the 8-neighborhood of $p_{i,k-1}$ for $k = 2, \ldots, \ell_i$. If $p_{i,1}$ is in the 8-neighborhood of $p_{i,\ell_i}$, the segment is topologically a circle unless $\ell_i = 1$, and it does not touch any other junction or segment. Otherwise, the segment is topologically a line segment: $p_{i,1}$ touches exactly one junction and so does $p_{i,\ell_i}$, while the other pixels in the segment never touch any other segment or junction.
For the sake of consistency, a "junction" is imposed on each looped segment to ensure that every segment has a start pixel and an end pixel, each of which touches a junction. Therefore, a junction can be considered as a vertex in the sense of graph theory, while a segment can be considered as an edge connecting two (possibly identical) vertices. Furthermore, a path in this undirected graph corresponds to a possible trace of ink in the input image, and a connected component of this graph corresponds to a connected component of the skeleton. Figure 5a shows the graph obtained from the same example as in Figure 3.

Figure 4: Stroke width is the minimum of the four directional run lengths

Noise reduction
Subtle features such as salt-and-pepper noise in the input image can affect the skeleton: salt noise results in very short segments, while pepper noise results in isolated junctions. In addition, thinning may introduce distortions. Since these artifacts can distract the stroke extractor and the recognition engine, they should be discarded from the graph. An absolute threshold is not used because it would not work for all resolutions. Since stroke width is observed to be roughly uniform within a piece of handwriting, it is chosen as the reference length.
The stroke width transform is an image operator that assigns an estimated stroke width to each foreground pixel. It was proposed for scene text detection [30], where strokes are considered as contiguous pixels having approximately constant stroke width locally. From a straightforward viewpoint, the stroke width of a pixel can be estimated by the minimum length of the four directional runs [31] passing through it, as shown in Figure 4, where squares represent foreground pixels and arrows represent the run lengths of the filled pixel. Under this definition, the stroke width transform can be computed in linear time with respect to the size of the binary image by caching the number of successive foreground pixels found in each direction.
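The linear-time computation can be sketched as below, assuming that the four directions are horizontal, vertical and the two diagonals. For each direction, one forward pass and one backward pass cache the number of successive foreground pixels ending at and starting from each pixel. The function name is hypothetical.

```python
def stroke_width_transform(binary):
    """Estimate the stroke width of each foreground pixel as the minimum of four run lengths."""
    h, w = len(binary), len(binary[0])
    # Any run is shorter than h + w, so that value serves as an initial upper bound.
    width = [[(h + w) if v else 0 for v in row] for row in binary]
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        # fwd[y][x]: successive foreground pixels ending at (y, x) along (dy, dx).
        fwd = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                if binary[y][x]:
                    py, px = y - dy, x - dx
                    prev = fwd[py][px] if 0 <= py < h and 0 <= px < w else 0
                    fwd[y][x] = prev + 1
        # bwd[y][x]: successive foreground pixels starting at (y, x) along (dy, dx).
        bwd = [[0] * w for _ in range(h)]
        for y in reversed(range(h)):
            for x in reversed(range(w)):
                if binary[y][x]:
                    ny, nx = y + dy, x + dx
                    nxt = bwd[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0
                    bwd[y][x] = nxt + 1
        for y in range(h):
            for x in range(w):
                if binary[y][x]:
                    run = fwd[y][x] + bwd[y][x] - 1  # the pixel itself is counted twice
                    width[y][x] = min(width[y][x], run)
    return width
```

Each pixel is visited a constant number of times per direction, so the total cost is proportional to the number of pixels.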
For each set of pixels, its width is estimated by the maximum stroke width among its pixels. Furthermore, the tip size of the pen is estimated by the average stroke width over all the segments. Now, the rules used to reduce noise can be stated:
1. For each edge with a length smaller than a multiple of the average stroke width, remove it from the graph and merge its end points.
2. For each vertex with degree 0 and a width less than a multiple of the average stroke width, remove it from the graph.
Figure 5b shows the simplified graph obtained from the same example as in Figure 3.
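The two rules can be sketched on a simple graph representation as follows. The data layout (a dictionary of vertex widths and a list of (u, v, length) edges), the function name and the two multipliers are assumptions, since the paper does not state the multiples used. A merged vertex is given the maximum width of its members, in line with the width estimate for a set of pixels.

```python
def simplify(vertex_width, edges, avg_width, edge_factor=1.0, vertex_factor=1.0):
    """Apply the two noise reduction rules to a skeleton graph.

    vertex_width: dict mapping each vertex to its width.
    edges: list of (u, v, length) tuples.
    edge_factor, vertex_factor: hypothetical multiples of the average stroke width.
    """
    parent = {v: v for v in vertex_width}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    kept = []
    for u, v, length in edges:
        if length < edge_factor * avg_width:
            parent[find(u)] = find(v)  # rule 1: drop the edge, merge its end points
        else:
            kept.append((u, v, length))
    kept = [(find(u), find(v), length) for u, v, length in kept]

    width = {}
    for v, wd in vertex_width.items():
        root = find(v)
        width[root] = max(width.get(root, 0), wd)
    degree = {root: 0 for root in width}
    for u, v, _ in kept:
        degree[u] += 1
        degree[v] += 1  # a self-loop contributes 2 to the degree
    # Rule 2: drop isolated vertices that are too thin.
    vertices = {root: wd for root, wd in width.items()
                if degree[root] > 0 or wd >= vertex_factor * avg_width}
    return vertices, kept
```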

Stroke tracing
Clearly, an isolated vertex in the graph represents a dot in the mathematical expression, possibly a decimal point or part of a character like "i". Therefore, a stroke containing a single point is extracted for each vertex with degree 0. In addition, a path in the skeleton graph indicates a candidate stroke. Although there may be multiple ways to combine the edges into paths, some combinations are more likely to form strokes of a mathematical expression written by a human being. Here are some heuristic principles:
• The total number of strokes should be minimal. Since lifting the pen from the paper requires additional time, a unicursal way is preferred.
• The difference of directions between two successive segments should be as small as possible.
Subject to these considerations, each edge is assigned to exactly one path by bottom-up clustering. Initially, each edge forms a path on its own. While there is a pair of paths having a common end point, choose a pair such that the angle between them is minimum, then merge them into one path. Repeat the procedure until no path can be merged.
It should be noted that the two principles may not always agree. If the number of strokes is considered more important, its minimum can be obtained by merging each path with circuits that have a common vertex with it, just like the algorithm that searches for an Eulerian path.
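The bottom-up clustering can be sketched as follows. This is a simplified illustration in which each edge is a polyline of at least two (x, y) points, two paths share an end point when their end coordinates are equal, and the direction at an end point is estimated from the last two points only. These representation choices and the function names are assumptions.

```python
import math


def turning_angle(a, b):
    """Angle in [0, pi] between the direction in which path a arrives at its
    last point and the direction in which path b leaves its first point."""
    ax, ay = a[-1][0] - a[-2][0], a[-1][1] - a[-2][1]
    bx, by = b[1][0] - b[0][0], b[1][1] - b[0][1]
    return abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))


def trace_strokes(edges):
    """Greedily merge paths sharing an end point, smallest turning angle first."""
    paths = [list(e) for e in edges]
    while True:
        best = None
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                # Try the four ways of joining the two paths head to tail.
                for a in (paths[i], paths[i][::-1]):
                    for b in (paths[j], paths[j][::-1]):
                        if a[-1] == b[0]:
                            angle = turning_angle(a, b)
                            if best is None or angle < best[0]:
                                best = (angle, i, j, a + b[1:])
        if best is None:
            return paths  # no pair of paths shares an end point any more
        _, i, j, merged = best
        paths = [p for k, p in enumerate(paths) if k not in (i, j)] + [merged]
```

For a "+" shaped skeleton with four edges meeting at one junction, the sketch joins the two collinear pairs first and returns one horizontal and one vertical stroke. The exhaustive search over pairs is kept for clarity; a practical implementation would index paths by end point.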

Fixing double-traced strokes
Sometimes, a segment should be shared by more than one stroke or appear in a stroke multiple times due to reentry during writing, as shown in Figure 6b. The tracing procedure above would handle such cases incorrectly by producing too many strokes, as in Figure 6a.
In order to fix double-traced strokes, a search for shared segments is needed, so that they can be used to reconnect separated strokes. Candidate shared segments should meet the following criteria:
• The segment has two different end points, and they are vertices of odd degree in the graph. Otherwise, the number of vertices of odd degree in the graph does not decrease when the segment is doubled.
• Each end point of the segment is also an end point of a path produced by stroke tracing, and the angle between them is not close to π/2. This condition prevents the two strokes of the symbol "T" from being merged.
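The parity argument behind the first criterion can be checked with a small sketch. Only the degree condition is covered here; the angle condition of the second criterion is omitted. The function names and the representation of edges as (u, v) pairs are assumptions.

```python
from collections import Counter


def odd_vertices(edges):
    """Return the set of vertices of odd degree in a multigraph given as (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1  # a self-loop adds 2 to the degree of its vertex
    return {v for v, d in degree.items() if d % 2 == 1}


def doubling_candidates(edges):
    """Edges with two different end points that both have odd degree."""
    odd = odd_vertices(edges)
    return [(u, v) for u, v in edges if u != v and u in odd and v in odd]
```

Doubling a candidate edge raises the degree of both end points by one, so the number of odd-degree vertices drops by two, which in turn lowers the minimum number of strokes needed to cover the graph.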

Stroke order normalization
The stroke tracing procedure naturally gives rise to an ordering of the points inside a stroke. However, the opposite ordering may also make sense. Since people usually write from left to right and from top to bottom, a simple rule is sufficient to determine the direction of each stroke in most cases. Let the coordinates of the first and the last point of a stroke be $(x_{start}, y_{start})$ and $(x_{end}, y_{end})$ respectively, then the list of points should be reversed if $2x_{end} + 3y_{end} < 2x_{start} + 3y_{start}$.
Finally, the order in which people write down the strokes needs to be recovered. There may be multiple orders in which the same formula can be written. For example, some people prefer to write the square root sign first, while others write the radicand first. Therefore, it is not always possible to recover the original ordering. What can be done is to assign a reasonable ordering.
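The direction rule stated earlier in this subsection amounts to a single comparison. The sketch below assumes image coordinates, with x growing to the right and y growing downward, and strokes given as lists of (x, y) points; the function name is hypothetical.

```python
def normalize_direction(stroke):
    """Reverse a stroke if 2*x_end + 3*y_end < 2*x_start + 3*y_start."""
    (x_start, y_start), (x_end, y_end) = stroke[0], stroke[-1]
    if 2 * x_end + 3 * y_end < 2 * x_start + 3 * y_start:
        return stroke[::-1]
    return stroke
```

Because the weight on y is larger than the weight on x, a stroke that moves equally far rightward and upward is still treated as being written top to bottom.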
A hierarchical approach is applied to sort the strokes. At first, strokes are grouped by recursive projection, then the groups are sorted in a left-to-right and top-to-bottom manner. After that, the strokes inside each group are sorted by a topological sort, where a stroke $T_i$ precedes another stroke $T_j$ if one of the following conditions holds:
• $T_i$ is on the left of $T_j$, and their projections onto the y-axis (but not the x-axis) intersect;
• $T_i$ is above $T_j$, and their projections onto the x-axis (but not the y-axis) intersect.
Further ambiguities are resolved by using the coordinates of the top left corners of the bounding boxes. Figure 7 illustrates how strokes are sorted, in which groups are separated by dotted lines and precedence relationships are represented by arrows.
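The sorting of strokes inside one group can be sketched with Kahn's algorithm and a priority queue. Grouping by recursive projection is not included. The tie-breaking key (x first, then y, of the top left corner) and the fallback for a cyclic precedence relation are assumptions, as are the function names.

```python
import heapq


def bounding_box(stroke):
    """Return (x0, y0, x1, y1) of a stroke given as a list of (x, y) points."""
    xs = [x for x, _ in stroke]
    ys = [y for _, y in stroke]
    return min(xs), min(ys), max(xs), max(ys)


def precedes(a, b):
    """Precedence test between two bounding boxes, following the two conditions."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    x_overlap = ax0 <= bx1 and bx0 <= ax1
    y_overlap = ay0 <= by1 and by0 <= ay1
    left = y_overlap and ax1 < bx0   # disjoint on the x-axis, a lies to the left
    above = x_overlap and ay1 < by0  # disjoint on the y-axis, a lies above
    return left or above


def sort_strokes(strokes):
    """Topologically sort strokes; ties are broken by the top left corner."""
    n = len(strokes)
    boxes = [bounding_box(s) for s in strokes]
    successors = [[j for j in range(n) if j != i and precedes(boxes[i], boxes[j])]
                  for i in range(n)]
    indegree = [0] * n
    for i in range(n):
        for j in successors[i]:
            indegree[j] += 1

    def key(i):
        return (boxes[i][0], boxes[i][1], i)

    ready = [key(i) for i in range(n) if indegree[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        i = heapq.heappop(ready)[-1]
        order.append(i)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                heapq.heappush(ready, key(j))
    # Defensive fallback in case the relation contains a cycle.
    done = set(order)
    order += sorted((i for i in range(n) if i not in done), key=key)
    return [strokes[i] for i in order]
```

With this sketch, the three strokes of a fraction come out as numerator, fraction bar, denominator regardless of the input order, because each stroke lies above the next while their projections onto the x-axis intersect.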

Datasets
In order to evaluate the proposed system, datasets from the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME), which are rather standard in handwritten mathematical expression recognition, are used. The system is evaluated on both the test set of task 2 (mathematical expression recognition) in CROHME 2014 [32] and the test set of task 1 (formula recognition) in CROHME 2016 [33], because results of some other systems are only available on one of them. The former contains 986 expressions and the latter contains 1147 expressions.
For each mathematical expression in a dataset, the list of points in each stroke is provided together with the ground truth. In addition to a MathML representation of the expression, the ground truth also includes the correspondence between symbols and strokes. Since a bitmap image of a mathematical expression can be obtained by rendering the strokes, the datasets can be used to evaluate offline recognition systems as well. Following the settings of task 2 (offline handwritten formula recognition) in CROHME 2019, formulas are rendered at a resolution of 1000 × 1000 pixels using the script provided by the organizers.
However, it should be noted that rendered images differ from scanned or camera-captured mathematical expressions in the level of background noise. We are not able to evaluate our system on such real-world images because, to our knowledge, no standard dataset of annotated images of scanned or camera-captured handwritten mathematical expressions is publicly available. As a consequence, the effect of binarization on the overall performance is still not well tested.

Performance of stroke extraction
In order to evaluate the performance of the stroke extraction procedure ignoring ordering, for each expression in the dataset, the strokes are painted onto a bitmap image, then the stroke extraction procedure is applied, and finally the recovered strokes are compared against the original strokes. Two strokes are considered matched if and only if the Hausdorff distance between them is less than a multiple (4) of the stroke width used in rendering. Experimental results are shown in Table 1. Over half of the expressions have all their strokes correctly extracted.
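The matching criterion can be sketched as follows, with strokes treated as finite sets of (x, y) points and the Euclidean metric assumed. The function names are hypothetical, and the quadratic-time computation is kept for clarity.

```python
import math


def hausdorff(a, b):
    """Hausdorff distance between two non-empty point lists."""
    def directed(p, q):
        # Largest distance from a point of p to its nearest point of q.
        return max(min(math.hypot(u[0] - v[0], u[1] - v[1]) for v in q) for u in p)
    return max(directed(a, b), directed(b, a))


def strokes_match(extracted, original, stroke_width, factor=4):
    """Two strokes match if their Hausdorff distance is below factor * stroke_width."""
    return hausdorff(extracted, original) < factor * stroke_width
```

Since the distance is symmetric and ignores the order of points, a stroke traced in the opposite direction still matches its original.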
In order to evaluate the performance of stroke direction detection, the rule is checked for each stroke in the two datasets. Experimental results are shown in Table 2. The majority of strokes are assigned the same direction as in the ground truth.
In order to evaluate the performance of the stroke order normalization procedure, for each mathematical expression, the strokes are permuted randomly before the stroke order normalization procedure is applied, then the generated ordering is checked against the ground truth. Experimental results are shown in Table 3. Manual inspection suggests that most of the generated orderings are acceptable. Significant errors can be divided into two types:
• Wrong grouping due to recursive projection. For example, the algorithm considers that the subscript "n" in the expression $\lim_{n\to\infty} \frac{1}{n}$ shown in Figure 8a was written before the character "i", because "n" is in the first horizontal group together with "l".
• Misleading precedence relationship where a symbol is on the top right of another. For example, the algorithm considers that the subscript "2" in the expression $\sqrt{a_2 - a_1}$ shown in Figure 8b was written after the operator "-", because the "2" is under the minus sign.
If both the set of strokes and their ordering are taken into account, the sequences of strokes extracted from 18.86% and 19.88% of the rendered expressions from CROHME 2014 and 2016 respectively matched the ground truth. However, the sequence of strokes given by the ground truth may not be the only correct way to write down a formula, so the extracted strokes are acceptable most of the time. In fact, a paper [34] claimed that only 851 out of 986 expressions from CROHME 2014 have a correct ordering of strokes.

Performance of offline recognition
In order to evaluate the overall performance of offline handwritten mathematical expression recognition, each formula in a dataset of handwritten mathematical expressions is painted onto a bitmap image, then stroke extraction is applied, and finally the detected strokes are passed to version 1.3 of the MyScript Math recognizer, the winner of CROHME 2016 [33].
In line with the offline task in CROHME 2019 [10], expression level metrics computed from the symbol level label graphs of formulas are used to evaluate the proposed system. The structure rate measures the percentage of recognized expressions that match the ground truth when all the labels of symbols are ignored. The expression rate measures the percentage of recognized expressions that match the ground truth up to a certain number of labeling mistakes on symbols or spatial relationships. On the other hand, the stroke classification rate, symbol segmentation rate, symbol recognition rate and metrics based on the stroke level label graph are inapplicable because offline recognition does not produce a correspondence between symbols and strokes.
Experimental results on the CROHME 2014 test set are shown in Table 4. The first seven are online recognition systems that participated in CROHME 2014 [32]. The proposed system outperformed all systems participating in CROHME 2014 except MyScript. Since MyScript itself has evolved over the past few years, MyScript Interactive Ink version 1.3, which is up to date as of this writing, is evaluated in the online setting too. On the other hand, Harvard [13], WAP (Watch, Attend and Parse) [14] and MSA (Multi-Scale Attention) [15] are also offline recognition systems, and the proposed procedure achieved better performance than them.
Experimental results on the CROHME 2016 test set are shown in Table 5. The first five are online recognition systems that participated in CROHME 2016 [33]. As expected, MyScript is better than the proposed system in all of the metrics, because the latter is based on the former and the original strokes are available to the former. The proposed system significantly outperformed all the remaining systems participating in CROHME 2016. On the other hand, WAP [14] and MSA [15] from USTC are offline recognition systems, and the proposed system outperformed both of them on this dataset.
The proposed system participated in CROHME 2019 [35], where there were 1199 expressions in the test set. The provisional results are shown in Table 6, where online and offline systems are listed separately. It should be noted that the results may change after the competition because of corrections to the ground truth. The proposed system was

Discussion
In order to find out the factors that may affect the accuracy of recognition, the experimental results on CROHME 2016 are examined more carefully.

Grammar
Table 7 shows the recognition performance given different grammars of mathematical expressions. In the experiment, MyScript is customized with the official grammar provided by CROHME, the default grammar of MyScript, and a customized grammar obtained by removing unused terminals and production rules from the default one. The customized grammar produced a slightly better result than the default one by eliminating candidate symbols and constructs that never appear in the dataset. However, the official grammar was counterproductive. One explanation is that such a restrictive grammar does not allow some neighboring symbols to be combined at an early stage. The observations suggest that one needs to strike a balance between the language model and the geometric layout model.

Complexity of expression
Figure 9 shows that, in general, recognition rates decrease as the number of symbols in an expression grows, since more and more symbols and spatial relationships need to be recognized correctly. The same trend is observed in online recognition too [5]. However, the group of expressions having the lowest complexity does not achieve the highest accuracy, since contextual information is useful for distinguishing symbols. Formulas having more than 24 symbols are not shown in the figure because the sample size is too small.

Accuracy of stroke extraction
Figure 10 shows how the accuracy of stroke extraction ignoring ordering affects the recognition rates, where the F-score is used to measure the accuracy of stroke extraction on a formula. The group of expressions having all strokes correctly extracted achieved the best recognition performance as expected, since an online recognizer makes use of temporal information. However, as stroke extraction becomes inaccurate, the recognition rates remain rather steady. This indicates that the underlying online engine is robust to different ways of writing, so a stroke extractor that cannot always recover the original strokes exactly is still useful for offline recognition.
In some cases, mistakes made by the stroke extractor can also be viewed as a kind of normalization and may in fact enhance performance by eliminating unusual stroke order [36].
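The paper does not spell out how the F-score of a formula is computed. One plausible reading, assumed here, is the balanced F1 score with precision taken over the extracted strokes and recall over the original strokes. The function name and arguments are hypothetical.

```python
def f_score(n_matched, n_extracted, n_original):
    """Balanced F-score of stroke extraction on one formula (assumed definition)."""
    if n_matched == 0:
        return 0.0
    precision = n_matched / n_extracted  # share of extracted strokes that are correct
    recall = n_matched / n_original      # share of original strokes that are recovered
    return 2 * precision * recall / (precision + recall)
```

For example, a formula with 6 original strokes, from which 4 strokes are extracted and 3 of them match, has precision 0.75, recall 0.5 and an F-score of 0.6.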

Future works
Although promising accuracy on offline recognition is achieved, the rates are still approximately 10 percentage points lower than those of the underlying online recognition engine. This indicates that the proposed system still suffers from the discrepancy between extracted strokes and written strokes. In particular, strokes from touching symbols may not be well segmented, ambiguities at junctions may not be resolved correctly, and the ordering of strokes may not conform to all conventions. In order to overcome these weaknesses, the following work can be done in the future:
• Development of data-driven stroke extractors based on machine learning models such as convolutional and recurrent neural networks. Developers of note-taking programs and input methods on touch-based or pen-based devices are able to collect a massive amount of online handwriting. Such data can cover more cases than predefined heuristic rules in the long run, so it may give rise to a more accurate stroke extractor.
• Use of extracted strokes instead of the written strokes to train an online recognizer, so that it is optimized for strokes given by the stroke extractor. By applying this technique, the notion of an extracted stroke is decoupled from the way human beings write, thus the stroke extractor can be simplified by ignoring corner cases.

Conclusion
Stroke extraction can be applied to build an offline handwritten mathematical expression recognition system upon an online one. A proof-of-concept implementation of the proposed stroke extractor is publicly available as free software. The proposed procedure produced good results on benchmarking datasets taken from CROHME, while its consumption of resources is fairly low. Accuracy can be further improved if the underlying online recognition engine is trainable. Therefore, the potential of the reduction from offline recognition to online recognition is justified. In general, the methodology is applicable to other types of handwriting such as chemical expressions, musical notation, and diagrams.

1. Adaptive binarization. Convert the possibly colored input image into a black and white image.
2. Stroke width transform. Estimate the stroke width for each foreground pixel.
3. Thinning. Convert the binary image into a skeleton.
4. Decomposition of the skeleton. Break it down into segments and junctions.

Figure 1: Thinning

Figure 2: Segment pixels and junction pixels. (a) Centers of these windows are segment pixels. (b) Centers of these windows are junction pixels.

Figure 3: Segments and junctions

Figure 7: Stroke order normalization. Left-to-right versus top-to-bottom.

Figure 9: Relationship between number of symbols in an expression and the accuracy of recognition

Figure 10: Relationship between the performance of stroke extraction and the performance of offline recognition

Table 1: Performance of stroke extraction

Table 2: Performance of stroke direction detection

Table 3: Performance of stroke order normalization

Table 4: Recognition performance on CROHME 2014 test set

Table 7: Recognition performance on CROHME 2016 test set given different grammars