CArDIS: A Swedish Historical Handwritten Character and Word Dataset

This paper introduces a new publicly available image-based Swedish historical handwritten character and word dataset named Character Arkiv Digital Sweden (CArDIS) (https://cardisdataset.github.io/CARDIS/). The samples in CArDIS are collected from 64,084 Swedish historical documents written by several anonymous priests between 1800 and 1900. The character dataset contains 116,000 Swedish alphabet images in RGB color space with 29 classes, whereas the word dataset contains 30,000 image samples of ten popular Swedish names as well as 1,000 region names in Sweden. To examine the performance of different machine learning classifiers on the CArDIS dataset, three experiments are conducted. In the first experiment, classifiers such as Support Vector Machine (SVM), Artificial Neural Networks (ANN), k-Nearest Neighbor (k-NN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Random Forest (RF) are trained on existing character datasets, namely Extended Modified National Institute of Standards and Technology (EMNIST), IAM, and CVL, and tested on the CArDIS dataset. In the second and third experiments, the same classifiers as well as two pre-trained classifiers, VGG-16 and VGG-19, are trained and tested on the CArDIS character and word datasets. The experiments show that the machine learning methods trained on existing handwritten character datasets struggle to recognize characters in the CArDIS dataset, indicating that the characters in CArDIS have unique features and characteristics. Moreover, in the last two experiments, the deep learning-based classifiers provide the best recognition rates.

datasets have several limitations: 1) the lack of availability of samples in Latin and Swedish languages; 2) the samples are generally from relatively recent document images with mild degradations; and 3) the samples are primarily written with ballpoint pens and in modern handwriting styles. Therefore, to alleviate these limitations, we introduce a new publicly available image-based historical handwritten character and word dataset, named CArDIS. The samples in CArDIS are from 64,084 Swedish historical handwritten birth record documents (e.g. Fig. 1) written by several anonymous priests with various inks, nibs, and dip pens. CArDIS consists of character and word datasets comprising 116,000 single-character images in Latin and Swedish alphabets with 29 classes and 30,000 Swedish names with 10 classes as well as 1,000 region names. The experiments demonstrate that machine learning methods trained on existing character datasets and tested on the CArDIS character dataset provide low recognition accuracy. Thus, it is necessary to create a new handwritten character and word dataset for historical handwritten text recognition. In summary, the main contributions of this work are as follows:
• Introducing a new handwritten historical character and word image dataset named Character Arkiv Digital Sweden (CArDIS), publicly available from https://cardisdataset.github.io/CARDIS/.
• The CArDIS is the first publicly available Swedish handwritten character and word image dataset.
• The CArDIS consists of 116,000 letters with 29 classes, 30,000 Swedish female and male names with 10 classes, and 1,000 region names.
• An extensive analysis of machine learning methods on the created dataset and existing Handwritten Character Recognition (HCR) datasets is carried out.
• Examining the similarities and differences between created CArDIS character dataset and existing character datasets which are Extended Modified National Institute of Standards and Technology (EMNIST), IAM and CVL.

II. RELATED WORK
OCR is one of the leading research topics in pattern recognition, and it has been widely used to recognize handwritten or machine-printed characters in document images collected from heterogeneous sources [10]. Generally, the existing OCR systems include four main steps comprising pre-processing, segmentation, feature extraction, and recognition [11]. The first step aims to eliminate undesired artifacts or characteristics in a document image using binarization, skew correction, and denoising techniques. The second step aims at isolating text elements in a document image from the image's background. Usually, this process starts with line segmentation, followed by word segmentation, and, if necessary, ends with character segmentation. In the third step, various features from character/word images are extracted. In the last step, the extracted features are fed into a classifier to identify characters/words. Here, instead of reviewing the vast body of literature on OCR, we discuss commonly used machine learning methods in alphabetic character and word recognition. A comprehensive survey of OCR methods can be found in [12]- [17].

A. ALPHABETIC CHARACTER RECOGNITION
In the optical alphabetic character recognition, statistical classifiers such as Hidden Markov Model (HMM), Decision Trees (DT), k-Nearest Neighbor (kNN) have been widely used. For instance, [18] propose an HMM-based alphabetic character recognition system for Greek Polytonic on historical documents. Firstly, in this approach, geometric and Principal Component Analysis (PCA) features are extracted from character images. Next, a Gaussian Mixture Model (GMM) is used to model the feature vector. Lastly, alphabetic character images are classified using HMM through the probability calculated by the GMM. This method obtains a character error rate of 8.61%. [19] develop a method for recognizing Telugu handwritten alphabet characters. In this approach, characters are segmented from palm leaves, and then a DT algorithm is used to recognize the character images with an overall accuracy of 93.10%. Amongst all the statistical classifiers, kNN is one of the most used machine learning approaches in OCR systems [20], [21]. For example, [22] proposes a kNN-based machine learning method for Lanna Dharma handwritten alphabet character recognition on palm leaves manuscript images. To achieve this, firstly, two different wavelet transforms, and region properties are used to extract 3 different features from input alphabet character images. Then, the kNN classifier is adopted for recognition and achieves an accuracy of 95.48%. Many other kNN-based OCR methods have also been proposed [23], [24]. In OCR, another recognition approach is the use of Support Vector Machine (SVM) technique. For instance, [25] introduce an SVM-based OCR system for English handwritten character recognition. The proposed model starts with using a thinning pre-processing algorithm to produce unique skeletons representing the original handwriting characters. Then, to extract features, Freeman chain code is used. Finally, the English character images are classified using an SVM with a radial basis kernel function. 
The machine learning methods are trained and tested on NIST dataset [1] and the average classification accuracy of 86% is achieved. [26] develop a handwritten character recognition system for Gurmukhi alphabets. Firstly, the method extracts horizontal and vertical projection features. Then, the feature vector is fed into SVM classifier with linear and polynomial kernel functions. This approach obtains an average accuracy of 97.4%. Moreover, many other SVM-based OCR systems have been proposed to recognize alphabetic characters in different languages [27], [28].
Artificial Neural Networks (ANNs) are another widely used class of classifiers in OCR. For example, [29] presents a character recognition method for broken Kannada characters using ANN. This method consists of three steps. Firstly, an end point technique is applied to reconstruct broken characters. Secondly, zonal features are extracted from character images. Finally, ANN is used to recognize character images and obtains a recognition accuracy of 98.9%. [30] introduce a handwritten character recognition system using ANN to classify English handwritten letters. This method has three phases. Firstly, the English handwritten character images are converted into binary images. Secondly, the binary image is segmented into individual characters, and then each character image is resized to 30 × 20 pixels. Finally, the character images are classified and recognized using an ANN classifier with an overall accuracy of 94.15%. [31] propose an ANN-based OCR system for recognition of handwritten characters of the English language. In this approach, binary character images are recognized using a multi-layered ANN classifier, which delivers an average classification accuracy of 85.62%.
In the last decade, Convolutional Neural Networks (CNNs) have achieved outstanding performance in character recognition. For instance, [32] propose a two-stage CNN method to detect and classify tight Chinese characters in historical documents. To achieve this, two CNNs are used simultaneously. The first CNN aims to localize characters with bounding boxes, whereas the second CNN, based on VGG-16, aims to recognize characters in each bounding box. [33] design a CNN-based OCR system for handwritten Arabic character recognition. The CNN consists of three convolutional layers and a fully connected layer. The CNN model is tested on two different Arabic handwritten character datasets and achieves average recognition accuracy of 94.7% and 94.8%, respectively. [34] develops a model focused on integrating CNN and SVM for Arabic handwriting recognition. A CNN is used for extracting image features in this approach, and SVM is utilized as a recognizer. The CNN architecture is tested on the IFN/ENIT [35] database and achieves an error rate of 7.05%. [36] propose a CNN-based OCR method to recognize Arabic handwriting characters automatically. The CNN architecture consists of three convolutional and two fully connected layers. This method achieves an accuracy of 97%. Many other deep-learning-based OCR frameworks have been designed to achieve a high accuracy rate for different handwritten character datasets [37]-[39].

B. WORD RECOGNITION
Generally speaking, in word recognition, two main types of strategies have been applied: 1) analytical approach and 2) holistic approach. A word must first be segmented into units such as letters, graphemes, strokes, or pseudo-letters in the former one. Then, the word units are recognized using a machine learning algorithm. Finally, the likelihood for each word in the lexicon can be estimated to recognize the word. For instance, [40] propose a hybrid handwritten word recognition approach based on ANN and HMM. First, a slicing technique is used to build a graph which shows all possibilities to segment a word into letters. Then, ANN is utilized to compute probabilities for each letter in the graph. Finally, HMM is used to classify the words. For evaluation, a French word database has been used, namely IRONOFF [41] handwriting database, and the system achieves an accuracy of 99.1%. [42] develop a CNN-based method to recognize words in RIMES dataset. A CNN architecture is used first to measure and then re-sample an input word image to a canonical representation in this approach. Then, a fully connected CNN architecture is designed to predict the characters. Finally, the words are recognized by a vocabulary-matching method. Another CNN-based method is proposed in [43]. In [43] an attention-based sequence-to-sequence model is used for handwritten word recognition in IAM dataset. To form encoder stage, the ResNet feature extraction is combined with bidirectional LSTM. Then, to predict words, a decoder is integrated with a content-based attention mechanism. [44] proposes another attention-based encoder-decoder model to recognize handwritten text using sequences of characters, extracted from IAM dataset. In another work [45], an attention-based method combines CNN Recurrent Neural Networks (RNN) encoder with an RNN-decoder to recognize line or word. In this method, the encoder extracts features from the handwritten texts and sequentially encodes temporal context. 
Then, the decoder recognizes characters one by one, using an attention mechanism. In [46], a Generative Adversarial Network (GAN) architecture is proposed to recognize handwritten words at the character level using the IAM dataset. The method consists of two main steps. The first step is a discriminator which consists of a path signature feature extractor and a CNN-LSTM binary classifier to distinguish realistic from forged handwritten data. The second step is a generator used to produce random handwritten characters.
In the holistic approach, word recognition is performed on the whole representation of words, without segmenting them into units (e.g. letters). In this manner, [47] propose a holistic-based lexicon reduction method to recognize 200 region names written in Farsi/Arabic language. First, the word's holistic features, such as single, double, and triple dots, their order from left to right, and their up or down position in a word, are extracted. Then, the extracted features are fed into an HMM model to recognize words. [48] present a handwritten Arabic word recognition system. The system uses Pseudo Zernike Moments as a feature extraction technique. Then, the HMM is used as a classifier. The OCR framework is tested on 100 Arabic names and provides a word recognition accuracy of 88%. [49] propose a word recognition method based on HMM. In this method, three feature sets based on black-and-white transition, image gradient, and contour chain code are employed. Then, each of those is modeled with an individual HMM. In the recognition step, the outputs of the HMMs are combined using a multi-layer perceptron. The method is tested on the Iranshahr 3 dataset and provides 89% recognition accuracy. [50] propose a word recognition system based on a 12-layer CNN and canonical correlation. The method is tested on three datasets, namely IAM [51], RIMES [52], and IFN/ENIT [35], and achieves error rates of 6.45%, 3.9%, and 3.24%, respectively.

III. EXISTING HANDWRITING DATASETS
A. HANDWRITTEN CHARACTER AND WORD DATASETS
Different handwritten character and word datasets in different languages have been created from handwritten document images. The generated datasets are used for developing machine learning based OCR models. The extensively used datasets and the generated CArDIS handwritten character and word datasets are listed in Table 1 and explained below.
The EMNIST letters dataset [1] consists of 145,600 isolated handwritten letter images with 26 balanced classes. The letters in the dataset were collected from handwritten documents written in the English language by 3,600 writers. This dataset contains gray-scale letter images which are size-normalized and denoised. It is publicly available to the research community. The QUWI dataset contains handwriting in Arabic and English written by 1,017 volunteers of different ages, nationalities, genders, and education levels [53]. This dataset is mostly used for developing writer identification systems. The dataset has 4,068 document images and consists of 60,000 words in Arabic and 100,000 words in English. The dataset is available upon request. The IAM dataset contains 5,685, 13,353, and 115,320 isolated and annotated handwritten sentences, text lines, and handwritten word images, respectively [51]. In this dataset, the words were automatically collected from the handwritten document images using a hidden Markov model (HMM) based automatic segmentation model [51] and were verified manually. All the images in this dataset are scanned with a resolution of  [54]. This dataset is publicly available and it is used for writer identification and word spotting. The CEDAR dataset [55] comprises 10,570 handwritten words in gray-scale. This dataset also contains 12,821 isolated handwritten uppercase letters and 8,487 isolated lowercase letters in binary. The database is imbalanced and was collected manually from scanned US mail. This dataset is not publicly available. The IRONOFF online/offline handwritten dataset contains isolated French characters, digits, and cursive words [41]. This dataset was extracted from approximately 1,000 digitized forms written by French writers.
In addition to the Latin handwritten character and word datasets, other handwritten character and word datasets have been generated in different languages. Several of these datasets are explained and described below.
The IFN/ENIT [35] is an Arabic handwritten character dataset that consists of 212,211 handwritten Arabic characters and 26,459 words. It was created to develop Arabic OCR systems. This dataset was created from 2,200 binary handwritten document forms written by 411 different writers, including names of towns and villages in Tunisia. The dataset is publicly available. The KHATT dataset [56] is another Arabic handwritten dataset that contains 1,000 handwritten forms written by 1,000 different writers. All the handwritten forms are scanned at 200, 300, and 600 dpi resolution, and they are pre-processed using Otsu's method to convert the handwritten images into binary images. The dataset is publicly available. The Chars74K dataset [58] is composed of 7,705 handwritten, 3,410 hand-drawn, and 62,992 synthesized characters which were collected from different natural images of street scenes in Bangalore, India. This dataset has 74,000 handwritten characters with 64 classes, and they were written in Latin, Hindu, and Arabic languages. This dataset is publicly available. The GRPOLY-DB [59] is a character and word dataset collected from printed and handwritten polytonic Greek document images. The scanned printed and handwritten documents were written between 1838 and 1912. This dataset was extracted from 399 document images and consists of 102,596 words and 171,511 characters. This dataset is publicly available.
In addition to the datasets mentioned above, there are other character and word datasets in other languages such as Urdu [8], Chinese [60], and Persian [5]. A comprehensive review of OCR systems and datasets is given in [61].

B. HANDWRITTEN DOCUMENT IMAGES
In recent years, various handwriting document image databases in Latin have been introduced to solve different problems in document analysis applications. For instance, the George Washington database [62], [63] is one of the well-known databases and contains 20 different historical handwritten document images. The documents were written in English in the eighteenth century. The handwritten document images are labelled with 4,894 word instances, 1,471 different word classes, 82 letters, and 656 text lines. Another database is the Esposalles database [2], [64], which includes 173 Spanish handwriting document images. The Spanish documents were written between the fifteenth and twentieth centuries. In the handwriting document images, the text blocks and lines, as well as the transcriptions, are annotated. The Germana database [65] is another Spanish database, and its handwritten documents are from 1891. Germana contains 764 annotated pages of Spanish document images. In addition to Esposalles and Germana, the Rodrigo database [66] was created from an older document named "Historia de España del arçobispo Don Rodrigo", which is from the sixteenth century. The database has nearly 20,000 annotated text lines and 231,000 annotated words.

IV. CARDIS DATA COLLECTION
The CArDIS dataset consists of sample Swedish historical handwritten character and word images collected from 64,084 Swedish birth record handwritten document images acquired by Arkiv Digital. In the handwritten document images, each Swedish birth record contains a newborn child's name, birth date, baptism date, birthplace, father's name, and mother's name. Various anonymous priests recorded the handwritten documents between 1800 and 1900 in Swedish churches located in different counties such as Gotland, Gävleborg, Norrbotten, Västerbotten, Västernorrland, Västmanland, Älvsborg, and Örebro. The scanned document images have a resolution of 6000 × 4000 in RGB color space and include various complexities such as handwriting styles, background color, and a variety of degradations. The collections of the CArDIS dataset (publicly available from https://cardisdataset.github.io/CARDIS/) are explained below.
CArDIS Dataset I is generated from 64,084 historical Swedish birth record handwritten document images and contains only isolated lowercase handwritten Latin letters from 'a' to 'z' as well as the special Swedish letters (å, ä, ö). Each letter is manually segmented and cropped from handwritten document images as illustrated in Fig. 2. Note that only lowercase letters that can be read and perceived are selected (e.g. blue boxes in Fig. 2). In contrast, the uppercase letters and degraded lowercase letters are ignored, as depicted by the red boxes in Fig. 2. To the best of our knowledge, the CArDIS dataset is the first historical handwritten Swedish lowercase letter dataset that provides image samples in RGB color space with original sizes, as shown in Fig. 3. Moreover, in this dataset, the lowercase letter images may contain extra parts from neighboring characters and include various artifacts such as degradation, noise, line dashes, and underlines. This dataset contains 116,000 lowercase letter images with 29 classes, where 26 classes ('a'-'z') belong to the Latin alphabet and 3 classes (å, ä, ö) belong to the Swedish alphabet (see Fig. 4), with 4,000 images per class. This dataset is generated to further improve lowercase letter recognition and segmentation for OCR systems in historical document images with different degradations and complex backgrounds. This dataset will be publicly available.
CArDIS Dataset II is a Swedish word dataset that is manually obtained from 64,084 birth record handwritten document images. To generate this dataset, the ten most popular Swedish female and male names as well as Swedish region names are collected from these birth record documents. The female names are Anna, Brita, Maria, Johanna, and Christina, whereas the male names are Anders, Olof, Lars, Carl, and Pehr. Moreover, there are various Swedish region names in CArDIS Dataset II. This dataset includes 30,000 images of Swedish female and male names with 10 classes, and each class contains 3,000 images. In addition, the Swedish region names comprise 1,000 images. Fig. 5 shows several female, male, and region name images in CArDIS Dataset II. The Swedish male, female, and region names are manually segmented and cropped from handwritten document images and stored in original sizes and in RGB color space. Moreover, the collected images may contain artifacts such as noise, dash lines, underlines, bleed-through, faint text, and many others, as shown in Fig. 5. This dataset can be used in name indexing and word recognition applications and will be publicly available.

A. SWEDISH CHARACTER AND WORD DATA CHARACTERISTICS
The CArDIS dataset is generated based on the Swedish historical document records written by different priests in the 19th century. Thus the dataset has multiple unique characteristics, as explained below.
• Degradation: The low quality of the ink and paper used in the 19th century, the age of the documents, and distortions affect the characteristics of the words and letters in the CArDIS dataset. These issues result in multiple degradations and artifacts. For instance, the age of the documents causes deterioration of texts and characters (i.e. faint text). Moreover, other artifacts in the document images are background variation, show-through, bleed-through, and smear. Consequently, the CArDIS dataset exhibits many different inter- and intra-class variations.
• Handwriting styles: History indicates that each person has their own unique writing style [61]. In the birth record documents, the texts were written in Gothic, cursive, and copperplate styles by various priests using different inks, nibs, and dip pens, which results in distinct appearances. For instance, applying different pressures on a nib releases different amounts of ink, generating different character appearances. Moreover, in the documents, the same word and character were written in many different sizes. Thus, the shapes can be diverse. Hence, in the CArDIS dataset, the words and characters are written in various styles, sizes, directions, widths, and arrangements. These variations in handwriting patterns, due to individual writing styles and the materials used to write the texts, generate many inter-class variations.
• Special characters: The birth record documents were written in the Swedish language. Thus, the documents do not follow the standard Latin alphabet. Although the overall writing of the documents is quite similar to Latin, three extra letters (å, ä, ö) are included (see Fig. 4).
The characteristics mentioned above generate many distortions in the appearance of words and characters and lead to a unique dataset where the words and characters appear with many inter- and intra-writing variations.
Thus, the CArDIS dataset overcomes multiple limitations of the existing datasets. For instance, most datasets such as EMNIST are based on characters in the Latin alphabet written in modern handwriting styles with ballpoint and rollerball pens. Besides this, they are collected from non-degraded documents, and they are size-normalized. These characteristics of the existing datasets restrict the application of existing methods to handwritten historical character and word recognition, where variability and complexity become more dominant. Therefore, to support research in Latin and Swedish handwritten character and word recognition, a new dataset based on historical handwritten documents is generated to resolve the problems of the existing ones.

A. LEARNING ALGORITHMS AND HYPERPARAMETERS
For quantitative evaluations, various learning classifiers have been used. In this work, k-Nearest Neighbour (k-NN), random forest, a one-versus-all SVM classifier with RBF kernel, an artificial neural network (ANN), a recurrent neural network (RNN), a convolutional neural network (CNN), and two different pre-trained deep learning methods (VGG-16 and VGG-19) have been selected to recognize handwritten characters and words. In the k-NN classifier, the distance is first calculated using the Euclidean distance, and then handwritten characters and words are identified by the majority class of the k nearest neighbors. It is important to note that the raw pixel values of image samples are used in the k-NN classifier, and the k value is empirically selected as 5 for classification of handwritten characters and words. Random Forest is another classifier used to evaluate the quantitative results. In this classifier, the raw pixels of image samples are first normalized between 0 and 1. After that, the random forest classifier is applied to the normalized pixel values. The classifier has two parameters: (1) the number L of trees, and (2) the number K of random features preselected in the splitting process. In the Random Forest classifier, we set the parameters as L = 100 and K = 12. A complete assessment of these parameters is given in [16].
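The k-NN decision rule described above can be illustrated with a minimal pure-Python sketch. It assumes flattened raw pixel vectors as inputs; the function names (`euclidean`, `knn_predict`) and the toy data are hypothetical and are not taken from the paper's implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=5):
    """Return the majority class among the k training samples nearest to `query`."""
    ranked = sorted(zip(train_x, train_y), key=lambda s: euclidean(s[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (assumption): flattened 2x2 "images" standing in for raw pixel vectors.
train_x = [[0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
           [9, 9, 9, 9], [9, 8, 9, 9], [8, 9, 9, 9]]
train_y = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(train_x, train_y, [1, 1, 0, 0], k=5)  # k = 5 as in the text
```

With k = 5, the query above collects three votes for class "a" and two for class "b", so the majority vote decides the label.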
The third classifier is RBF kernel SVM. In order to get the results, two different input types are used for the SVM classifier which are the raw pixels of image samples and the features of image samples extracted by using histogram of oriented gradients (HOGs). As a result, these create two experimental structures named SVM and SVM-HOG in the rest of the paper. In the SVM classifier, we set two parameter values as γ = 0.001 and C = 1.
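To make the role of γ concrete, the following sketch evaluates an RBF kernel between two feature vectors. It assumes the common parameterization K(x, y) = exp(-γ‖x - y‖²); the text reports γ = 0.001 but does not state the formula, so this form is an assumption, and `rbf_kernel` is a hypothetical helper name.

```python
import math

GAMMA = 0.001  # value reported in the text

def rbf_kernel(x, y, gamma=GAMMA):
    """RBF kernel, assuming the form exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel([0.0, 0.0], [0.0, 0.0])    # identical vectors give similarity 1
apart = rbf_kernel([0.0, 0.0], [10.0, 0.0])  # squared distance 100 gives exp(-0.1)
```

Under this assumption, a small γ makes the kernel decay slowly with distance, so samples that differ in many pixels still receive a non-negligible similarity.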
Another handwritten character and word classifier is developed based on a Recurrent Neural Network (RNN). In the RNN classifier, a four-layer neural network model is designed and used to obtain the results. Firstly, the pixel values of the image sample are normalized, and then the normalized pixel values are used as inputs for the RNN classifier. The batch size and the number of iterations are selected as 64 and 10, respectively. Moreover, the Rectified Linear Unit (ReLU) is employed as the activation function in the hidden layers, and in the output layer the Softmax function is used to estimate the probabilities of the output classes of handwritten characters and words. The Artificial Neural Network (ANN) based classifier includes 5 hidden layers and an output layer. This classifier follows the same strategy as the RNN classifier.
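The two activation functions named above can be sketched in a few lines of pure Python. This is only an illustration of the standard definitions of ReLU and Softmax; the function names and the sample values are hypothetical.

```python
import math

def relu(values):
    """Rectified Linear Unit: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]

def softmax(logits):
    """Turn output-layer scores into class probabilities that sum to one."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = relu([-1.5, 0.0, 2.0])
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The predicted class is the index with the highest Softmax probability, which is also the index of the largest output score.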
The CNN-based handwritten character and word classifier consists of the following layers: 1) an input layer, 2) three convolutional layers, 3) three fully connected layers, and 4) one output layer. The first two convolutional layers use 64 filters with a kernel size of 5 × 5, and the last convolutional layer uses 128 filters with the same kernel size. The convolutional layers are followed by fully connected layers, each of which contains 128 nodes. In addition, ReLU is employed as the activation function in all layers except the final layer. Softmax is used in the final layer to estimate the probabilities of the output classes of handwritten characters and words. In the CNN model, the batch size and the number of iterations are set to 200 and 10, respectively. In VGG-16 and VGG-19, the number of neurons in the last fully connected layer has been changed to the number of classes. In both methods, the learning rate, number of epochs, and batch size are set to 0.001, 200, and 32, respectively.
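As a worked example of what these layer sizes imply, the sketch below counts trainable parameters for the layers whose shapes follow from the description. It assumes a 3-channel RGB input, one bias per filter or node, and an output layer fed by the last 128-node layer; none of these details are stated in the text. The first fully connected layer is omitted because its size depends on the input resolution, pooling, and padding, which are not given.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel convolution."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units

# Assumption: the first convolution sees 3 input channels (RGB).
conv_layers = [
    conv_params(5, 3, 64),    # first convolutional layer, 64 filters
    conv_params(5, 64, 64),   # second convolutional layer, 64 filters
    conv_params(5, 64, 128),  # third convolutional layer, 128 filters
]
hidden_dense = dense_params(128, 128)     # a 128-node layer fed by a 128-node layer
output_characters = dense_params(128, 29) # 29 character classes
output_words = dense_params(128, 10)      # 10 name classes
```

Under these assumptions the three convolutional layers hold 4,864, 102,464, and 204,928 parameters, and the 29-class output layer holds 3,741.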

B. EXPERIMENTAL SETUP
Since the EMNIST dataset is the only available Latin character dataset, we manually collected character samples from the IAM and CVL datasets to characterize the characters in the CArDIS dataset more effectively. Each of these collected datasets consists of 104,000 character images with 26 classes. Moreover, 80% of each character dataset is randomly selected and used for training, and the rest is used for testing. To obtain the results, six different classifiers are used: RNN, k-NN (k = 5), RF (L = 100 and K = 12), SVM with RBF kernel function, ANN with 5 hidden layers, and CNN with 3 convolutional layers and 3 fully connected layers. In addition to these algorithms, the Histogram of Oriented Gradients (HOG) feature extraction technique is used with two conventional classifiers which are SVM and

D. COMPARING CLASSIFIERS ON CARDIS CHARACTER DATASET
In contrast to the first experiment, the second experiment includes the Swedish characters, giving 29 classes. To conduct this experiment, 87,000 handwritten samples in CArDIS are used to train the classifiers and 29,000 samples are used to evaluate their performance. Fig. 6

E. PERFORMANCE OF CLASSIFIERS ON CARDIS WORD DATASET
The third experiment aims at understanding and evaluating the performance of the machine learning classifiers using the CArDIS word dataset, which contains 30,000 Swedish names with 10 classes. To obtain the results, the word dataset is first divided into 80% training and 20% testing; thus, 24,000 image samples are used to train the classifiers and 6,000 image samples are used to test their performance. Table 3 tabulates the recognition accuracy rates. The results in Table 3 indicate that the Swedish names in the CArDIS dataset are complex and difficult to recognize, since the classifiers do not achieve high recognition performance.
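The split sizes above can be checked with a short sketch. It assumes a per-class (stratified) random split, which the text does not specify; the text only states that samples are divided 80%/20%. The function name `stratified_split` and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into train/test, keeping the class ratio (assumption)."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_fraction))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# Ten name classes with 3,000 images each, as described for the word dataset.
names = ["Anna", "Brita", "Maria", "Johanna", "Christina",
         "Anders", "Olof", "Lars", "Carl", "Pehr"]
labels = [name for name in names for _ in range(3000)]
train_idx, test_idx = stratified_split(labels)
```

This yields 24,000 training and 6,000 test indices. Calling the same function with `train_fraction=0.75` on 29 classes of 4,000 samples gives 87,000 and 29,000, which matches the counts reported for the character experiment.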

VI. CONCLUSION
In this paper, a new historical handwritten character and word dataset, named CArDIS, is introduced and made publicly available to the research community (https://cardisdataset.github.io/CARDIS/). CArDIS is manually collected from Swedish birth record handwritten document images written in the 19th century. This handwritten dataset consists of (1) alphabetic character images in Latin and Swedish languages with original appearances in RGB; and (2) image samples of 10 popular female and male Swedish names and of Swedish region names with original appearances. In this paper, various classifiers have been trained on three different handwritten Latin character datasets and tested on the CArDIS character dataset. The results show that these classifiers achieve low recognition accuracy, indicating that the characters in CArDIS have different features and characteristics from those in the existing Latin character datasets. Moreover, deep learning-based methods provide the best performance in recognizing the Swedish characters and names compared to the other methods. CArDIS is made publicly available to help further improve the performance of OCR systems.

ACKNOWLEDGMENT
This work is supported by the Research Project "Scalable resource efficient systems for big data analytics" funded by the Knowledge Foundation, Sweden, under Grant 20140032.