Image Classification in Arabic: Exploring Direct English to Arabic Translations

Image classification is an ongoing research challenge. Most of the current research focuses on image classification in English with very little research in Arabic. Expanding image classification to Arabic has several applications and benefits. This paper investigates the accuracy of direct translations of English labels that are available in ImageNet, a database of images labeled in English that is commonly used in computer vision research, to Arabic. A dataset comprising 2,887 labeled images was constructed by randomly selecting images from ImageNet. All of the labels were translated to Arabic using an online translation service. The accuracy of each translation was evaluated by a human judge. Results indicated that 65.6% of the generated Arabic labels were accurate, with the highest accuracy achieved when the labels consisted of only one word. This study makes three important contributions to the image classification literature: (1) it determines a baseline level of accuracy for Arabic image classification algorithms; (2) it provides 1,910,935 images classified with accurate Arabic labels (based on accurately labeling 1,895 images that consist of 1,643 unique synsets); and (3) it measures the accuracy of translations of image labels in ImageNet to Arabic.


Introduction
In recent years, advances in artificial intelligence research have been significant. One area that has seen particular progress is computer vision and image classification. The goal of image classification is to generate an accurate label or a group of labels for an image that captures the content(s) of the image (Akata, Perronnin, Harchaoui, & Schmid, 2014). Several highly accurate image classification algorithms currently exist, and this can be attributed to the recent availability of large databases of labeled images that can be used to train and evaluate image classification algorithms. The most well-known database is ImageNet (Deng et al., 2009), which includes over 14M images tagged with English labels. Each image in the dataset is labeled with a WordNet synset for each object present in the image. An annual competition, the Large Scale Visual Recognition Challenge, is held for researchers and scientists to complete different tasks using this database (Russakovsky et al., 2015).
Most of the current research on image classification focuses on developing algorithms for labelling objects present in an image in English. The focus on the English language could be due to the lack of databases, like ImageNet, that contain images labelled in other languages. While there has been some work on developing algorithms and methods to classify images in languages other than English, image classification for the Arabic language remains an unexplored problem. Because Arabic is one of the most spoken languages in the world, and because image classification is used in several applications that users interact with directly, exploring image classification for Arabic is relevant and requires investigation.
It is hypothesized that state-of-the-art image classification algorithms that rely on a training dataset such as ImageNet (with English labels) should perform equally well when the underlying training dataset has images with labels written in other languages such as Arabic or Chinese. In other words, the original language of the labels in the database should not affect the accuracy of the classification algorithm. The question of "would the most accurate image classification algorithms be able to produce highly accurate results if the dataset used for training includes labels in Arabic?" is an important question that should be investigated. However, it is outside of the scope of this study, which focuses on measuring the accuracy of labels generated when English labels or categories in ImageNet are directly translated to Arabic. A high accuracy would suggest that using highly accurate image classification algorithms to classify images in English and then translating the labels to Arabic or other languages could produce highly accurate results. In contrast, low accuracy results would suggest that other alternatives should be considered.
It is relevant to note that in the domain of natural language processing, researchers often focus on developing solutions specific for a target language such as Arabic or Chinese for common tasks. These tasks include, for example, document classification and document summarization (Alanzi & Abuzeina, 2017; Al-Thubaity, Alhoshan, & Hazzaa, 2015; Kanan & Fox, 2016; Zhang, Xu, Su, & Xu, 2015). Therefore, focusing on novel methods to generate labels for images in Arabic is similarly important. The primary objective of this study is to investigate the accuracy of Arabic labels that are directly translated from English labels available in ImageNet. To explore this problem, a sample of 2,887 images was randomly selected from ImageNet. An English-to-Arabic online translator was used to translate the ImageNet labels to Arabic, and the translations were then evaluated to measure their accuracy.
The present work makes several contributions to the literature. First, this is one of the first studies to focus on generating Arabic labels for images. The results from this study can then be used as a baseline for future Arabic image classification methods. Second, because English is the primary focus of image classification research, this study is one of the first to examine the accuracy of direct translations of ImageNet's labels to other languages. If the accuracy of such direct translations is high, similar techniques can be applied to translate labels into other languages. Finally, this study provides a database of 1,895 images with accurate Arabic labels that can be used for other Arabic image classification methods.

Image Classification
Image classification is the task of identifying and labelling an object or list of objects present in an image. Recent advances in the field are partly due to the availability of new large-scale image datasets such as ImageNet (Deng et al., 2009). ImageNet has helped accelerate the progress of artificial intelligence research on a broader scale and image classification research in particular (He, Zhang, Ren, & Sun, 2016; Krizhevsky, Sutskever, & Hinton, 2012; Ren, He, Girshick, & Sun, 2015; Simonyan & Zisserman, 2015). Improvements in the performance of recent image classification methods are also due to the use of convolutional neural networks (Huang et al., 2017). Many of these recent methods focus on "zero-shot" learning, where objects are recognized even if they were not present as labeled data in the training dataset (Xian, Schiele, & Akata, 2016). Other tasks related to image classification have also seen major advances. These tasks include object detection and object tracking. The goal of object detection is to find the boundaries of multiple objects in images. Highly accurate object detection algorithms include YOLO9000 (Redmon & Farhadi, 2016) and R-FCN (Dai, Li, He, & Sun, 2016). As for object tracking, the goal is to track the movement of an object in a scene, and recent work has been promising (Gaidon, Wang, Cabon, & Vig, 2016; Held, Thrun, & Savarese, 2016).
Identifying the objects present in an image could be beneficial in the development of several text-based applications. For example, image classification can be used to generate image captions that consist of full sentences that describe the contents of an image rather than captions that only list the objects in the image (Karpathy & Fei-fei, 2017; Xu et al., 2015). Another application of image classification is visual question answering (Antol et al., 2015; Yang, He, Gao, Deng, & Smola, 2016). In this task, the objective is to answer a question about an image in a natural language. For example, when viewing an image of two teams playing soccer, a question could be "what are the colors of the soccer teams' shirts?" A successful answer would be one that contains the correct colors. There are also several applications in specific industries. In the healthcare industry, image classification methods can be used to generate medical text reports and relevant keywords that are based on images (Litjens et al., 2017), tasks that are undoubtedly important. Although image classification research has important applications, the focus has been on the English language. This is one limitation that needs to be addressed because image classification applications have direct interactions with users who do not speak English.

Arabic Natural Language Processing
Arabic is one of the most common languages spoken around the world; thus, it is important to study computational solutions applied to the Arabic language. Researchers have studied various problems related to processing and analyzing texts in Arabic. Although several of the problems overlap with common NLP tasks, there are some Arabic-specific issues that are being investigated. Examples of such problems include developing methods for named entity recognition in Arabic (Oudah & Shaalan, 2016; Shaalan & Raza, 2009), sentiment analysis (Al-smadi, Talafha, Al-Ayyoub, & Jararweh, 2018; Rushdi-saleh, Martín-valdivia, Ureña-lópez, & Perea-ortega, 2011), and question answering systems (Azmi & Alshenaifi, 2017; Nicosia et al., 2015).
Several scholars have discussed the difficulties associated with developing natural language processing methods and algorithms for Arabic. These challenges include the ambiguity and complexity of Arabic (Kanan & Fox, 2016; Salloum, Al-emran, & Shaalan, 2016), the prevalence of several commonly used dialects in Arabic (Samih et al., 2017; Zalmout, Erdmann, & Habash, 2018), and the limited number of freely available datasets that can be used in the research and development of Arabic computational solutions (Zeroual & Lakhouaja, 2018). This study further investigates the complexity of Arabic and the problems associated with computational solutions that do not incorporate Arabic dialects. Additionally, to address the problem of limited Arabic databases, this study introduces a new dataset of images with Arabic labels. This study was conducted to address the need to build solutions specific for Arabic and the challenges faced when common algorithms are run on Arabic datasets.

Image Classification for Arabic
As stated above, image classification research has focused primarily on English and other Latin languages. However, several remotely related works exist; for example, in one paper, the authors attempted to create a method that recognizes Arabic text in images (Slimane, Kanoun, Hennebert, Alimi, & Ingold, 2013). In another paper, the authors created a new dataset of images extracted from Arabic books and newspapers that contain Arabic writing, with the objective of aiding research that uses such images to transcribe the Arabic text in the documents (Saad, Elanwar, Kader, Mashali, & Betke, 2016).
There is only one paper directly related to the current study, which focused on generating full Arabic captions for images (Jindal, 2018). In Jindal (2018), the author used a convolutional neural network to generate full sentences in Arabic that describe the contents of images. The author incorporated Arabic root words in the training set. His method achieved a BLEU-1 score of 65.8 when it was tested on the Flickr8k dataset (with Arabic labels that were written by Arabic translators) and a BLEU-1 score of 55.6 when it was tested on 405,000 captioned images scraped from Arabic websites. BLEU is an evaluation metric that is often used in machine translation and similar tasks (Papineni, Roukos, Ward, & Zhu, 2002). The author indicated that his method performed better than an approach in which common image classification methods were used to generate full captions in English and a translation service was then used to translate the captions to Arabic. While promising, no details on the translation or evaluation of the translation were given.

Methodology
This study's procedure is summarized in Figure 2. The first step was to randomly select 10,000 images from ImageNet. Following that, based on inclusion criteria determined before the study, the sample was reduced to 2,887 images. Then, the English labels for the images from ImageNet were translated to Arabic using Google Translate. Subsequently, the translated Arabic labels were evaluated to determine if they accurately described the objects in the images. The details of how ImageNet was used and the translation process employed are in the following subsections.

ImageNet and Dataset
The dataset used in this study was constructed from the Fall 2011 release of the ImageNet database (multiple releases or versions of the dataset currently exist) (Deng et al., 2009). The dataset consists of 14,197,122 annotated images. The dataset is used in a well-known competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015). The competition has multiple categories of challenges, and teams participate by writing solutions to target the specific challenges of each task. The challenge (and therefore the database) acknowledges that an image could have multiple objects present. For example, a picture with a cat sitting on a table has at least two objects: a cat and a table. Furthermore, the database includes data on bounding boxes that define the location of an object in an image. In this study, the bounding boxes and the existence of multiple objects in an image are not considered as they are outside of the scope of the study. Instead, only one of the categories for the image as identified in the ImageNet dataset was used in this study.
In ImageNet, several attributes are available for each image. These attributes include the synset of the object in WordNet as well as the URL of the webpage where the image was initially downloaded. In this study, a randomly selected sample of 2,887 images from ImageNet was used. To construct the sample of 2,887 images, a larger sample of 10,000 images was first randomly selected from ImageNet using the "random" library in Python. Because the URLs of the images were used in this study to view the images (instead of downloading the entire dataset, which is well over 1TB), a Python script was used to determine if each image was accessible online and usable for this study. After removing images that were no longer accessible online, the sample of images was reduced to 2,887 images. The images in ImageNet are labeled with a WordNet synset. In WordNet, words may have multiple synsets, where each is a unique definition of the word. Furthermore, each synset includes its part of speech, such as noun or verb. For example, for the word "chair", the first definition (n_01) is about the piece of furniture while the second (n_02) refers to a job that a professor may have. In ImageNet, the synset number is included with the identified object. For example, for an image of a person sitting in a chair, the synset number n_01 is included. In this study, the synset number was not incorporated in the translation process. Based on the available literature, there are no methods that translate a synset in the English version of WordNet from English to Arabic.
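The sampling and availability filtering described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the record layout (a dict with a `url` key), the function names, and the content-type test are all hypothetical, and the study only states that inaccessible images were removed.

```python
import http.client
import random
import urllib.error
import urllib.request


def sample_records(records, k, seed=None):
    """Draw k records uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, k)


def is_accessible(url, timeout=10):
    """Return True if the URL answers HTTP 200 with an image content type.

    The content-type test is an assumption; the source only says that
    images no longer accessible online were dropped.
    """
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return response.status == 200 and content_type.startswith("image/")
    except (OSError, ValueError, http.client.HTTPException):
        # Covers unreachable hosts, HTTP errors, timeouts, and malformed URLs.
        return False


def filter_accessible(records, check=is_accessible):
    """Keep only the records whose 'url' passes the availability check."""
    return [record for record in records if check(record["url"])]
```

Passing the check as a parameter keeps the filtering step testable without network access, for example `filter_accessible(sample, check=lambda url: True)`.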

Translation Process
For the translation process, the API of Google Translate was used. Google Translate is an online service where an input text that is written in a particular language can be translated to another selected language. While no reliable information is available on the accuracy of the API when used to translate text from English to Arabic, according to one study, compared to three other online translation services that include Microsoft Bing, Google Translate was more accurate at translating sentences from English to Arabic (Al-shalabi, Kanaan, Al-Sarhan, Drabsh, & Al-Husban, 2017). For this reason and because of its reliability and (perceived) popularity, Google Translate was used in this study. Using other online translation services or human translators to translate the labels of images in ImageNet from English to Arabic may produce results that are different from the ones found in this study.
Google Translate was used to translate all the labels for the sample of 2,887 images from English to Arabic. The results of the translation were added to an online datasheet that contains the following: the unique identifier for each image, the label for each image in English as it appeared in ImageNet, the label for each image in Arabic as translated by Google Translate, and embedded images that can be viewed in the sheet. This datasheet was used to evaluate the results.
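A datasheet of this shape could be assembled as in the sketch below. It deliberately takes the translator as a plain callable rather than calling any particular translation API, so the column names, the record keys, and the `translate` parameter are illustrative assumptions and not the study's actual code.

```python
import csv

# Hypothetical column names for the datasheet described above.
FIELDNAMES = ["image_id", "label_en", "label_ar", "url"]


def build_datasheet(records, translate):
    """Return one row per image, adding the translated Arabic label.

    `translate` is any callable mapping an English label to an Arabic
    string, for example a thin wrapper around a translation service.
    """
    rows = []
    for record in records:
        rows.append({
            "image_id": record["image_id"],
            "label_en": record["label_en"],
            "label_ar": translate(record["label_en"]),
            "url": record["url"],
        })
    return rows


def write_datasheet(rows, handle):
    """Write the rows as CSV to an open text handle (use UTF-8 for Arabic)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```

When writing to a file, opening it with `open(path, "w", newline="", encoding="utf-8")` keeps the Arabic text intact and avoids blank lines between rows.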

New Dataset of Images with Arabic Labels
Following the evaluation process (which is explained in the experiment section of this paper), all the images with a correct Arabic translation were added to a separate dataset that consists of images with correct Arabic labels. This dataset could be used to test other similar methods for generating Arabic labels for images. However, one potential limitation is that the dataset contains only images whose labels are perhaps easier to translate from English to Arabic. Furthermore, the dataset compiled as a result of this study may include a smaller percentage of fine-grained categories (detailed categories such as specific types of trees or breeds of dogs) than the ImageNet database.

Experiment
To investigate whether the textual structure of the labels affects the performance of a translation service when translating image labels from English to Arabic, the dataset was divided into smaller sections based on the number of words in the object's name, and three classes were created. The first class of "unigrams" included labels that consist of only one word; the second class of "bigrams" included labels that consist of two words; the third class of "ngrams" included labels that consist of three or more words. Table 1 includes the number of images for each of the three classes of words. The purpose of dividing the dataset into these three categories is to investigate if there are differences in performance for each class. It is possible that the accuracy of the translation will be higher when labels consist of one word.
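The three classes can be derived mechanically from the word count of each label. A minimal sketch, assuming words are separated by whitespace (so a hyphenated term counts as one word); the function name is hypothetical.

```python
def label_class(label):
    """Classify an English label by its number of whitespace-separated words."""
    n_words = len(label.split())
    if n_words == 0:
        raise ValueError("label must contain at least one word")
    if n_words == 1:
        return "unigram"
    if n_words == 2:
        return "bigram"
    return "ngram"  # three or more words
```

For example, "trogon" falls in the unigram class and "great blue heron" in the ngram class.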

Evaluation Process
Because the objective of this study is to investigate if direct translation of English labels to Arabic will generate meaningful results, it is important to judge and determine if a translated label is correct.
One of the characteristics of ImageNet is the fine-grained aspect or specificity of the categories generated for objects. For example, instead of labeling an image of a bird with the caption "bird", the specific name of the bird such as "great blue heron" or "trogon" is used as a label. In this study, several Arabic dictionaries were used in the evaluation process because several of the images in the dataset have fine-grained categories. Furthermore, Google Translate produces Arabic results in Modern Standard Arabic (MSA), and the definition of these results may not be widely known. More specifically, when the translation was not clear or if it needed further clarification, a search was conducted for the definition of the text that was translated to Arabic.
The online version of WordNet (Princeton University, 2010) and other online sources were also used to aid in the evaluation process. For example, when the usage of the label was not easily recognized, WordNet was used to find and identify the full definition of the label.
To evaluate the translation of each image, the image, its English label, and its translated Arabic label were assessed for accuracy. The evaluation was conducted by one person, and for each image, accuracy was categorized based on four options: correct, inaccurate, neutral, and English.
Labels categorized as "correct" indicated that the Arabic definitions were identified as accurately describing the object in the image. Labels categorized as "inaccurate" indicated that the translated label did not accurately describe the object in the image. Labels categorized as "neutral" indicated that there was not enough information to confidently indicate whether the translation is accurate or not. This could happen for fine-grained categories such as the names of birds or trees. In these instances, it could not be determined whether the Arabic names of these birds and trees as given by Google Translate were correct. Finally, the label of "English" was used when the output from Google Translate's API was an English word rather than an Arabic word. For example, the API translated the word "barouche" to "barouche". It is unclear why this occurs. However, a manual inspection of these instances suggests that there are no equivalent Arabic words for these English words. For some translations, the text included both an Arabic word and an English word. These instances were also labeled with the category "English".
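Outputs that belong in the "English" category can be pre-screened automatically before the human judge sees them. The sketch below is an assumed helper, not part of the study's procedure (the evaluation was manual): it flags any output containing a basic Latin letter, which also catches the mixed Arabic and English case, and it only checks the basic Arabic Unicode block.

```python
import re

# Basic Latin letters and the basic Arabic Unicode block (U+0600 to U+06FF).
LATIN_LETTER = re.compile(r"[A-Za-z]")
ARABIC_LETTER = re.compile(r"[\u0600-\u06FF]")


def script_flag(translated):
    """Return 'English', 'Arabic', or 'unknown' for a translated label.

    Any Latin letter yields 'English', mirroring the rule that outputs
    mixing Arabic and English words were also categorized as 'English'.
    """
    if LATIN_LETTER.search(translated):
        return "English"
    if ARABIC_LETTER.search(translated):
        return "Arabic"
    return "unknown"
```

An output flagged "Arabic" still needs human evaluation; the flag says nothing about whether the translation is correct.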
Only images labeled "correct" were counted as accurately translated. Images labeled "inaccurate", "neutral", or "English" were all considered to have incorrect translations. The accuracy of the method was calculated by dividing the number of images with correct translations by the total number of images in the dataset.
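The calculation above reduces to a single ratio. A minimal sketch, assuming the judgments are available as a list of category strings; the function name is hypothetical.

```python
from collections import Counter


def accuracy(judgments):
    """Share of judgments equal to 'correct'; every other category counts as incorrect."""
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments to score")
    return counts["correct"] / total


# 1,895 correct labels out of 2,887 images, as reported in this study.
# The remaining 992 are lumped under one placeholder category here.
judgments = ["correct"] * 1895 + ["inaccurate"] * 992
print(f"{accuracy(judgments):.1%}")
```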

Results and Discussion
Results from the evaluation process showed that translated Arabic labels for 1,895 out of the 2,887 images were accurate. In other words, 65.6% of the image labels that were directly translated from English to Arabic accurately described the objects in the images. Conversely, 34.4% of the translated labels did not accurately describe the object in the image. Table 2 includes a summary of the results and performance of the translations on unigrams, bigrams, and ngrams.
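A per-class breakdown of the kind summarized in Table 2 is the same ratio computed within each class. A minimal sketch, assuming each evaluated image is represented as a hypothetical (class, judgment) pair; the toy rows in the usage lines are illustrative only and are not the study's data.

```python
from collections import defaultdict


def accuracy_by_class(rows):
    """Map each label class to the share of its rows judged 'correct'."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for label_class, judgment in rows:
        totals[label_class] += 1
        if judgment == "correct":
            correct[label_class] += 1
    return {name: correct[name] / totals[name] for name in totals}


# Toy input for illustration only.
toy_rows = [
    ("unigram", "correct"),
    ("unigram", "inaccurate"),
    ("bigram", "correct"),
    ("ngram", "English"),
]
print(accuracy_by_class(toy_rows))
```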
While relatively small, the 1,895 images with accurate Arabic labels represent a new dataset that can be used in future research involving tasks related to image classification for Arabic. The dataset consists of a total of 1,895 images (1,288 unigrams, 576 bigrams, and 32 ngrams).
As predicted, results varied based on the three types of textual structures of labels (unigrams, bigrams, and ngrams). Accuracy was highest when the labels were unigrams. In these instances, 71% of the translated labels were classified as correct. Accuracy for bigrams and ngrams was lower, 58% and 45% respectively. One factor that could have contributed to this decrease in accuracy is that several of the English bigram and ngram labels were translated to Arabic as a set of unrelated words instead of a single unit or noun phrase. Future studies should be conducted to determine if other factors also contribute to the lower percentage of accurate translations for bigrams and ngrams.
Four interesting categories of results were noticed during the evaluation process. These categories reveal common mistakes in the translation process and areas where novel solutions can be explored. Several examples of these instances are displayed in Figure 4. The importance of these categories is that minor modifications and preprocessing of the English labels prior to the use of a translation service may help increase the overall accuracy of using a translation service to generate Arabic labels for images. Although the four categories represent a selected number of interesting types of results, there may be other types that were not included in this list of categories. The other types may include a set of results where the labels were translated to Arabic but the translation included one or more English words. Another interesting category is of noun phrases in English that were translated as a group of unrelated words in Arabic.

Category one: Incorrect synset
It is common for words in English to have several synsets or definitions. The first category of interesting results is of objects that contained a specific synset as a label in ImageNet where the translation was inaccurate because Google Translate translated a different synset of the label or word. For example, one image labeled "skunk" shows the animal. However, Google Translate assumed that the word "skunk" was used to refer to obnoxious or unfriendly individuals who are described as "skunks." Thus, the translation was labeled as inaccurate. All the images in this category are classified as ones with inaccurate translation. If a word has multiple synsets, it is unclear how Google Translate determines which synset to use. Providing information that helps the translation service select the correct synset may produce results that are more accurate. It is important to state that ImageNet includes the synset number for the categories used in ImageNet. For example, for the image of a skunk, ImageNet specifies that the fourth noun synset of the word in WordNet is used. The fourth synset's definition is "American musteline mammal typically ejecting an intensely malodorous fluid when startled; in some classifications put in a separate subfamily Mephitinae". Therefore, this synset number can be used to identify the context and usage of the word that is being used by ImageNet.
In the present study, the synset numbers were not incorporated in the translation process as this is outside of the scope of the study. The implementation of preprocessing steps that help clarify the synset used by ImageNet prior to using a translation service may reduce the inaccuracy of the translated labels. However, it is important to note that several image classification algorithms generate labels without providing the synset number. Therefore, an Arabic image classification method that depends on synset numbers will fail if the underlying English image classification algorithm does not specify the synset numbers of the images in its results.
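One possible form of such a preprocessing step is to attach a short hint taken from the synset's definition to the label before it is sent for translation. The sketch below is purely illustrative: the lookup table, its key layout, and the function name are hypothetical, and whether a hint of this kind actually improves the translation was not tested in this study.

```python
# Hypothetical lookup from (word, part of speech, sense number) to a definition.
# The single entry uses the WordNet definition quoted above for "skunk".
GLOSSES = {
    ("skunk", "n", 4): (
        "American musteline mammal typically ejecting an intensely "
        "malodorous fluid when startled"
    ),
}


def query_with_context(word, pos, sense_number, glosses=GLOSSES, max_hint_words=3):
    """Return the label followed by a short parenthesized hint from its definition.

    Falls back to the bare word when no definition is known for the sense.
    """
    gloss = glosses.get((word, pos, sense_number))
    if gloss is None:
        return word
    hint = " ".join(gloss.split()[:max_hint_words])
    return f"{word} ({hint})"


print(query_with_context("skunk", "n", 4))
```

A fuller version would look the definition up in WordNet from the synset supplied by ImageNet instead of a hand-written table.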

Category two: Full definitions
This category covers images for which the translation provided by Google Translate included a full definition in Arabic. In other words, Google Translate translated a single word in English to a full sentence in Arabic that was a full Arabic definition of the English word. For example, for the word "pretzel", the translation in Arabic was كعك مملح و جاف, which can be translated back to English as "dry and salty cake", rather than the Arabic word for "pretzel". Figure 4 includes additional examples of images in this category. It is unclear why the full definitions were given for several of the images. However, it seems that such instances happen for some English words that do not have direct Arabic equivalents. In the evaluation process, the images in this category were classified as accurate only if the Arabic definitions were accurate.

Category three: Correct but uncommon
The third category included Arabic labels that accurately described the images, but the labels or Arabic words are uncommon and rarely used in Arabic. The translation was deemed accurate only after an Arabic dictionary was used to identify the meaning of the words provided by Google Translate. While the translations were accurate, these words may not be recognized by native Arabic speakers. For many of the images in this category, it can be argued that the English words as used by ImageNet are similarly uncommon and rare. Words such as "earthwork" and "teasel", which are displayed in Figure 4, may not be known to native English speakers. Several of the images in this category are part of the class of "fine-grained" categories. All the images in this category were classified as correct. It is possible that several of the images that were classified as "neutral" in the evaluation process contain correct but uncommon Arabic words that were not known to the judge and could not be identified using an Arabic-to-Arabic dictionary.

Category four: Same word, different alphabet
The fourth category included images where the English labels were translated into the same word spelled with letters of the Arabic alphabet. For example, for an image of a "hamburger", the translation was the word "hamburger" but in Arabic letters rather than Latin letters. This presumably happens because the English word has entered Arabic as a loanword, transliterated letter by letter. This is similar to how several Arabic words such as Hummus and Falafel have entered the English dictionary. Most of the images in this category are names of foods. However, it cannot be determined whether all of the images in this category were translated in this way for that same reason. All the images in this category were tagged as accurate in the evaluation process.

Conclusions
An experiment was conducted to examine the accuracy of Google Translate's translations of labels in ImageNet from English to Arabic. A sample of 2,887 images was randomly selected from ImageNet to test the accuracy of the method described above. The major finding of this study was that 65.6% of the images resulted in accurate translations and had objects that were correctly identified. This finding can be used as a baseline accuracy level for other image classification methods for the Arabic language. Additionally, because recent image classification methods for English have low error rates, the results suggest that advanced methods that rely on underlying datasets that consist of Arabic labels should be considered. Furthermore, this study provides a dataset of 1,895 images that are labeled with correct Arabic labels. This is an important contribution as this dataset can be used in subsequent studies that target image classification for Arabic.
With additional modifications and the inclusion of preprocessing steps, using online translation services to translate labels of images to Arabic could produce better results. One common issue was that the translation service translated the wrong sense of a word. By providing contextual information about the image prior to the translation, the accuracy of the translated labels could be higher.
One noticeable cause of incorrect translations was the fine-grained nature of categories in ImageNet. These categories include specific types of birds or breeds of dogs. Several of these birds and dogs may not exist in Arabic-speaking countries, and it is uncertain whether Arabic names for these categories exist. Therefore, simply translating the names of such birds to the Arabic word for "bird" would likely increase the accuracy of the image classification method used in this study. However, one of the primary features of ImageNet is its fine-grained image categories. Therefore, it is perhaps unwise to create methods that overlook this feature.
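For illustration only, the coarsening idea discussed above (replacing a fine-grained label with a more general one) could be sketched as a walk up a parent table until a label with a known translation is reached. The parent table and the set of translatable labels are toy, hypothetical data and do not reproduce the WordNet hierarchy; as noted above, this approach gives up the fine-grained detail that distinguishes ImageNet.

```python
# Toy parent table for illustration; not the actual WordNet hierarchy.
PARENTS = {
    "great blue heron": "heron",
    "heron": "wading bird",
    "wading bird": "bird",
    "trogon": "bird",
}


def coarsen(label, translatable, parents=PARENTS):
    """Walk up the parent table until a label in `translatable` is found.

    Returns None when the chain ends (or loops) before reaching one.
    """
    current = label
    seen = set()
    while current not in translatable:
        if current in seen or current not in parents:
            return None
        seen.add(current)
        current = parents[current]
    return current


print(coarsen("great blue heron", translatable={"bird"}))
```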
Compared to the performance of the latest image classification algorithms, the method demonstrated in this study incorrectly labeled a high percentage of images. To reduce the number of mistakes that occur during translation, there are two main directions for future research. The first is to modify the current method either before or after the translation process, for example by adding a preprocessing step that specifies which synset of the word is used, or by adding another translation service so that more than one translation service is used. The second is to train the latest image classification methods on a dataset that consists of images with Arabic labels. Additional studies should investigate these directions as well as explore the advantages and disadvantages of these two options.

FIGURE 1. Sample of images from ImageNet and their labels

FIGURE 2. Overview of the methodological procedures for this study

FIGURE 3. Sample of images with correct translations

TABLE 1. The dataset and the number of images for each class

TABLE 2. Summary of results