Automatically Discovering Relevant Images From Web Pages

Web pages contain irrelevant images along with relevant ones. Classifying these images is an error-prone process due to the number of design variations of web pages. Using multiple web pages provides additional features that improve the performance of relevant image extraction. Traditional studies use features extracted from a single web page; in this study, however, we enhance the performance of relevant image extraction by employing features extracted from different types of web pages, including standard news, gallery, video, and link pages. The dataset obtained from these web pages contains 100 different web pages for each of 200 online news websites from 58 different countries. The most straightforward approach to discovering relevant images is to extract the largest image on the web page; this approach achieves an f-Measure of 0.451 as a baseline. We then apply several machine learning methods to the features in this dataset to find the most suitable method. The best f-Measure score, 0.822, is obtained with the AdaBoost classifier. Some of these features have been utilized in previous web data extraction studies; to the best of our knowledge, 15 new features are proposed for the first time in this study for discovering relevant images. We compare the performance of the AdaBoost classifier on different feature sets. The proposed features improve the f-Measure by 35 percent. Moreover, the cache feature alone, which is the most prominent feature, accounts for 7 percent of this improvement.


I. INTRODUCTION
The internet provides a great deal of valuable content embedded among HTML tags. Extracting this content automatically is one of the challenging tasks in information retrieval. Researchers have mostly concentrated on obtaining textual content, including titles [1], [2], main content [3], [5], summaries [5], and reviews [6], [7]. There are also a few studies [8], [9] that focus on automatically extracting relevant images; however, their performance results are low. This study aims to improve these results through the proposed novel features.
Previous studies [8], [9] use datasets of around 1,100 web pages, all of which are standard news pages. However, a website consists of many different page types, including standard news, gallery pages, and link pages. A standard news page may have one or more relevant images, or none at all. A gallery page contains many related images, while a link page has no relevant images. In this study, all of these page types are included in the dataset. Moreover, while previous studies focus on a single relevant image, namely a representative image, this study concentrates on all relevant images in the content. For our purposes, we developed a web crawler to download web pages and their images and to create a new dataset. With this crawler, 20,000 web pages and 635,015 images were downloaded from 200 websites in 58 countries. Unlike other studies, a very large dataset is created, which helps the solution generalize.
Features are needed for an automatic prediction process. The first features that come to mind for this task are the height, width, file extension, and file size of an image. Unfortunately, these features are not enough for accurate prediction. To resolve this, we identified features that can be obtained from a web page, taking into account features [5], [10] suggested in the literature. Furthermore, new features are proposed thanks to additional modules added to our crawler. As a result, 30 features were constructed for each image. To the best of our knowledge, 15 of these features are introduced here for the first time. The contribution of these features to the prediction process is examined in the experimental section of this study.
Some approaches manually assign weights to simple features [8], [11] and apply heuristics based on these weights to datasets with limited samples. Increasing the number of samples benefits the prediction accuracy of machine learning models. For determining the most representative image on a web page, Vyas and Frasincar [9] build their prediction model with Support Vector Machines (SVM). Nonetheless, these studies do not compare different classification methods. In this study, we report the best evaluation score along with a comparison of several machine learning methods for discovering relevant images.
Nowadays, web pages have complex and visually rich layouts containing many images and much content, yet only a few of the images are relevant. For this reason, image datasets constructed for relevant image extraction are imbalanced. This imbalanced dataset problem negatively affects the prediction models of machine learning methods. It can be mitigated with ensemble methods, which combine the prediction models obtained from various machine learning algorithms in order to produce an optimal prediction model. The experiments section examines the contribution of these methods on our imbalanced dataset.

In a prediction model derived from a machine learning method, not all features have the same effect; some features may not contribute to the model at all. Irrelevant features in a dataset can decrease the performance of the resulting models. Therefore, finding the required features, namely feature selection, has become a critical issue for this task. Feature selection is a practical way to find a minimal feature subset [12], [13] containing the features that contribute most to model performance. In this study, subsets of features are constructed by taking information gain values into account, and the models obtained from these subsets are then evaluated.
The rest of the study is organized as follows. The second section reviews related studies. The third section gives information about the layouts and designs of web pages. The fourth section introduces our web crawler and the features extracted with it. The fifth section covers the dataset constructed with our crawler, the machine learning methods, the performance metrics, and the experimental evaluations of these methods. The last section reports our conclusions and future studies.

II. RELATED STUDIES
Image retrieval addresses the issue of browsing, searching, and retrieving images from a large collection. Most studies address searching over web data, the world's largest collection. Kherfi et al. [14] compare existing image retrieval algorithms based on textual user queries. They emphasize that collecting the relevant images of web content has become an important issue for obtaining information from websites with little text. Nie et al. [15] propose an approach to automatically estimate image search performance. Kennedy and Naaman [16] introduce an approach to extract representative images for each landmark from the Web. Wang et al. [17] use the images of a web page in order to classify it as objectionable or benign. Hor and Fekri-Ershad [18] present an approach to retrieve images from a large database using local texture information. Bani and Fekri-Ershad [19] utilize texture and colour information obtained from the spatial and frequency domains. In our study, we focus on retrieving relevant images from web pages by using information from the website; the entire website is utilized while discovering the relevant images on a web page.
Some studies focus on a single image, namely a representative image, that represents the content of a web page to users. This image is chosen among multiple images. Helfman and Hollan [20] rate images based on features such as being large, square, and colorful. Gali et al. [8] score images using image size, aspect ratio, alt tag, title tag, image path, and image format, with a heuristic score for every feature. Hu and Bagga [21] introduce an algorithm for automatically classifying images into categories including story and preview images. Bhardwaj and Mangat [22] propose a technique to extract images whose area is greater than 120,000 pixels (for example, 300 × 400 or 400 × 300). Sabri and Man [23] introduce an algorithm that uses the Document Object Model (DOM) and JavaScript Object Notation (JSON) to extract images from a web page. These studies use a single web page. Our study utilizes whole websites, which contain many web pages, thanks to our web crawler.
Web crawlers download a lot of data that can be transformed into information for different purposes. Noh et al. [24] utilize term frequency/document frequency, entropy, and compiled rules for determining the relevance of a web page to a topic. Pant and Srinivasan [25] use the texts around a hyperlink within a web page to predict the relevance of the corresponding link. Batsakis et al. [26] use web page content and link information to determine download priorities in order to improve the performance of their crawler. Uzun et al. [5] develop an intelligent crawler, namely iCrawler, that automatically pulls content out of various layouts to improve the crawling process. Uzun [10] proposes a novel approach that extracts data quickly using string functions and additional information obtained from web pages, including the starting position, the number of inner tags, and tag repetition. The data obtained through web crawlers can be used for many different purposes, and there is a considerable amount of literature [27] on this issue. In this study, we utilize information derived from the crawling process.
Some studies employ machine learning methods for image problems. Hu and Bagga [21], Tong and Chang [28], and Zang et al. [29] use SVM to find informative images based on textual user queries. Vyas and Frasincar [9] focus on determining the most representative image on a web page; they also use SVM together with new features to improve their f-Measure score. In this study, many machine learning methods are tested rather than focusing on just one. Besides, the number of features is reduced with feature selection methods, which the other studies do not consider.
Companies such as Google, Bing, Yandex, and Yahoo, which want to obtain the data on web pages easily, make a joint effort to suggest attributes for HTML tags. The best-known organization on this issue is schema.org, which creates a structured data markup schema supported by major search engines. Such organizations are important for constructing the Semantic Web. On the other hand, when the website concept changes, the proposed structure alters. For example, Facebook recommends the ''Facebook Open Graph'' structure, but this structure is not sufficient for search engines. The standards on this issue are defined by the W3C with RDF (Resource Description Framework, a W3C standard for data interchange on the Web; see https://www.w3.org/RDF/), a framework for representing information on the Web. Although this issue has been studied for many years, it works only if web designers follow the rules. However, designers and developers interpret the suggestions in their own way or do not use them at all. In this study, we aim to present a general solution that does not rely on such suggestions. Our study could also be used in a system that proposes tag attributes to web designers automatically.

III. MOTIVATION
There are many different website layouts that define a website's structure (web layouts consist of patterns that rule the structure of the document; for more examples, see https://www.w3schools.com/css/css_website_layout.asp and https://www.w3schools.com/css/css_templates.asp). Relevant and irrelevant images are included in these layouts. Relevant images are related to the content of the page, while irrelevant images include advertisements, links, headers, logos, etc. Images can be of many different sizes and appear in different parts of a web page. Fig. 1 contains some examples of these designs and of the images in these layouts. When creating an extraction rule, all img elements on the page should be considered. Parent elements can be considered as a solution if no distinguishing feature can be found: in layout design, HTML tags such as div, span, figure, li, article, and aside are used to arrange the visual elements of a page, and these tags serve as the parents of img elements. For example, the class=''easy_img_caption'' attribute might distinguish the relevant image in the example. When these possibilities are tested and an appropriate value is selected, an extraction rule is obtained for this web page. However, whether this extraction rule holds for all web pages on the website is a problem that needs to be investigated separately. As can be seen, determining the relevant image only through HTML elements is an error-prone process.
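As an illustration, such a candidate rule can be tested with a CSS selector, as in the sketch below. The class name comes from the example above; the page file name is a hypothetical placeholder.

```python
# Test a candidate extraction rule: img elements under a parent whose
# class attribute is "easy_img_caption".
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("news_page.html", encoding="utf-8").read(),
                     "html.parser")
candidates = soup.select(".easy_img_caption img")
print([img.get("src") for img in candidates])
```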
A website consists of many different page types, such as the homepage, link pages, content pages, gallery pages, and video pages. Some pages, such as the homepage and link pages, do not contain any relevant images. A content page may contain one or more relevant images, and a gallery page usually contains more than one. For these reasons, automatically identifying the relevant images across a website is a further challenge.

IV. CRAWLER AND FEATURES
Determining the features is one of the crucial issues for a prediction model produced by machine learning methods. Another crucial issue is collecting data for these features. In this study, the web crawler whose flow chart is given in Fig. 3 was developed to obtain the data and extract the features from it.
Our crawler begins with one or more URLs given by a developer. It fetches the web page at each URL and parses the page to extract both images (Image Extractor) and links (URL Extractor). Image Extractor returns all img elements on the web page. URL Extractor finds all links on the web page and stores them in the URL frontier. In this storing process, URLs outside the website are filtered out, and a URL already present in the visited URLs or the URL frontier is not appended again. The scheduler then takes a URL from the frontier and calls the downloader to start the process again.
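To make the flow concrete, the following is a minimal sketch of this crawling loop. The helper functions (fetch_page, extract_images, extract_links) are hypothetical stand-ins, not the actual ImageCrawler API, and error handling is omitted.

```python
# Minimal sketch of the crawling loop in Fig. 3, assuming hypothetical
# fetch_page / extract_images / extract_links helpers.
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed_url, fetch_page, extract_images, extract_links):
    site = urlparse(seed_url).netloc
    frontier = deque([seed_url])          # URLs frontier
    visited = set()                       # Visited URLs
    while frontier:
        url = frontier.popleft()          # scheduler takes the next URL
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)            # downloader
        yield url, extract_images(html)   # Image Extractor: img elements
        for link in extract_links(html):  # URL Extractor
            link = urljoin(url, link)
            # filter URLs outside the website; skip already-seen URLs
            if urlparse(link).netloc == site and link not in visited:
                frontier.append(link)
```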
A classic crawler does not download images. Our crawler downloads the images one by one and collects data for the features used by the machine learning methods. These data are kept in a tab-delimited file, with the features separated by tab characters. In this study, the features are divided into three categories:
• Textual features
• Image features
• Last features

Textual features (1 in Fig. 3) are obtained directly from the HTML elements on the web page. Here, only whether an attribute key exists in the element is determined; the attribute values are not used, as they differ between websites. The data of these features are added to the tab-delimited file first.
Image features (2 in Fig. 3) are obtained from the image itself. If the image is not in the image storage, it is downloaded; otherwise, its information is taken from the storage. This information holds values such as the height, width, file size, and file extension of the image. Features of the img element are not kept here, because features extracted from an img element may differ between web pages. Instead, the src value is used as the key in the image storage, and all search operations are carried out on this key; that is, key/value (src/image information) pairs are stored. The storage uses a hash map data structure, so the time complexity of a search is O(1). If the src value is already in the image storage, the cached feature is set to 0; otherwise, it is set to 1. The corresponding rows of the tab-delimited file are then updated.
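A short sketch of this caching step is given below; download_image is a hypothetical helper that fetches the file and returns its height, width, file size, and extension. Following the convention described above, cached is 0 when the src is already in the storage and 1 otherwise.

```python
# Image storage as a hash map keyed by src, so lookups are O(1) on average.
image_storage = {}   # src -> (height, width, file_size, extension)

def image_features(src):
    if src in image_storage:
        info = image_storage[src]
        cached = 0                    # src seen before on this website
    else:
        info = download_image(src)    # hypothetical helper
        image_storage[src] = info
        cached = 1                    # first time this src is encountered
    return info, cached
```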
Last features (3 and 4 in Fig. 3) are generated after all images are downloaded. Except for one of these features, they use the numerical data obtained from all images; the exception, the cluster feature, uses the textual data of the visited URLs. The data of the last features are calculated after the textual and image feature data are obtained. Finally, the tab-delimited file is updated.

A. TEXTUAL FEATURES
An image element consists of the img tag and attributes that provide additional information about the image. The most important attribute is src, which specifies the path to the image. Other attributes such as alt, class, id, height, width, style, and sizes are optional. These attributes can have different values even within one website. Table 1 shows example image elements, their two parent elements, and their relevance status.
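As an illustration of how such attribute-based features can be read from an img element, the sketch below uses BeautifulSoup; the feature names are illustrative rather than the exact names used in our crawler.

```python
# Sketch: Boolean presence features and numeric attributes of an img element.
from bs4 import BeautifulSoup

def textual_features(img, page_html):
    def to_int(value):                       # "300" or "300px" -> 300
        try:
            return int(str(value).rstrip("px"))
        except (TypeError, ValueError):
            return 0
    return {
        "has_alt":     1 if img.has_attr("alt") else 0,
        "has_class":   1 if img.has_attr("class") else 0,
        "has_id":      1 if img.has_attr("id") else 0,
        "attr_width":  to_int(img.get("width")),
        "attr_height": to_int(img.get("height")),
        "parent_tag":  img.parent.name,      # nominal feature
        "n_attrs":     len(img.attrs),
        "page_chars":  len(page_html),
        "src_chars":   len(img.get("src", "")),
    }

html = open("page.html", encoding="utf-8").read()
soup = BeautifulSoup(html, "html.parser")
rows = [textual_features(img, html) for img in soup.find_all("img")]
```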
Some of the attributes are given in Table 1. Web designers are free to choose attribute names and values, so an attribute's value can differ between elements on a web page as well as from website to website. For this reason, when specifying features, if the attribute value is a string, the feature records whether the element contains the attribute or not; if the attribute value is an integer, it is taken numerically. Table 2 gives the features obtained from the elements; it contains Boolean, integer, and nominal data types. Features with a Boolean value indicate whether the attribute name occurs in the img element; to make these values suitable for machine learning, we convert them (True or False) to numbers (1 or 0). Nominal values are strings, but they represent a fixed set; for example, the parent element can be figure, html, div, body, a, td, span, li, noscript, etc. In the literature, the size, width, and height of the image file are generally used. However, a web designer can change the width and height attributes within the img element to alter the size of the image shown in the web browser. Therefore, this study proposes the attr_width and attr_height features. Also, the number of attributes in the element, the number of characters on the entire web page, and the number of characters in the src attribute are added to our feature set. Finally, Table 2 presents the values obtained for the first example in Table 1.

B. IMAGE FEATURES
Image features are obtained from the downloaded image file itself, such as its width, height, file size, and file extension. Unlike the attr_width and attr_height attributes of the img element, a web designer cannot change the width and height of the image file without using an image processor.

C. LAST FEATURES
After all the images of a web page are processed, new features are computed for them. Since this is the final stage of feature extraction, they are named last features; they are given in Table 4.
In Table 4, the ratio features compare an image with the other images in terms of file size. The values of the order features are obtained by sorting the images according to various criteria, in descending order. The cluster feature is derived from the visited URLs of a website. To produce its values, the Levenshtein distance [30], a well-known string metric for measuring the difference between two strings, is calculated for every pair of visited URLs, yielding a comparison matrix that contains all distances. Then DBSCAN (density-based spatial clustering of applications with noise) [31], which groups together points that are close to each other, is applied to this matrix.
A crawler begins the crawling process with one or more seed URLs. A seed's web page contains images, but none of them are relevant, and seed-like pages collected later likewise contain no relevant images. Here, the Levenshtein distance and DBSCAN are used to detect page similarity; instead of the data of the web pages, only the URLs are used for this task. The information gain of all features is examined in the experimental section.
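A sketch of the cluster feature is shown below: pairwise Levenshtein distances between the visited URLs are collected in a precomputed matrix, on which DBSCAN is run. The eps and min_samples values are illustrative assumptions, not the parameters used in the paper.

```python
# Cluster the visited URLs by edit distance (Levenshtein + DBSCAN).
import numpy as np
import Levenshtein                     # pip install python-Levenshtein
from sklearn.cluster import DBSCAN

def url_clusters(visited_urls, eps=5.0, min_samples=2):
    n = len(visited_urls)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = Levenshtein.distance(visited_urls[i], visited_urls[j])
            dist[i, j] = dist[j, i] = d
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    return dict(zip(visited_urls, labels))   # cluster label per URL
```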

V. EXPERIMENTS
In this section, information about the dataset and the annotation of relevant and irrelevant images is given first. The second subsection presents the performance metrics used in this study. The third subsection gives brief information about the machine learning methods. The last subsections cover the performance results of the experiments.

A. DATASET
Table 5 gives information about the dataset used in this study.
In this study, the dataset is divided into training and testing subsets to evaluate the performance of our model. The training dataset is used to train a prediction model with machine learning methods; the testing dataset is used to evaluate the trained model. However, machine learning methods can overfit, because a website contains many similar web pages. To prevent overfitting, 150 of the websites are used as the training dataset and the remaining 50 websites as the testing dataset.
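The sketch below shows one way to realize this website-level split, assuming the tab-delimited feature file has a website column (the column names are illustrative). Splitting by website keeps all pages of a site on the same side, so near-duplicate pages cannot leak between training and testing.

```python
# Website-level train/test split: 150 sites for training, 50 for testing.
import pandas as pd

data = pd.read_csv("features.tsv", sep="\t")          # one row per image
sites = list(data["website"].unique())                # 200 websites
train_sites, test_sites = set(sites[:150]), set(sites[150:])
train = data[data["website"].isin(train_sites)]
test = data[data["website"].isin(test_sites)]
```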

B. EVALUATION METRICS
There are several metrics for evaluating the performance of machine learning methods. Because our dataset is imbalanced, this study focuses specifically on predictive performance on the relevant images. A balanced dataset contains an equal number of samples for each class, whereas only 3.57% of the images in our dataset belong to the relevant class. For this reason, our metrics focus on evaluating the relevant images correctly.
Accuracy is the ratio of correct predictions to total predictions. However, since the dataset is imbalanced, annotating every image as irrelevant already yields an accuracy of 0.964. Therefore, the f-Measure is used to assess prediction performance on the relevant images. The f-Measure is the harmonic mean of Precision and Recall. Precision is the ratio of correctly predicted relevant images to all images predicted as relevant. Recall is the ratio of correctly predicted relevant images to all images in the actual relevant class. The Precision (1), Recall (2), F-Measure (3), and Accuracy (4) metrics are defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \quad (1)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (2)$$
$$\text{F-Measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (3)$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)$$
where TP (True Positives) is the number of correctly predicted relevant images, TN (True Negatives) is the number of correctly predicted irrelevant images, and FP (False Positives) and FN (False Negatives) are the numbers of incorrect predictions. An FP occurs when the actual class is irrelevant but the model predicts relevant; an FN occurs when the actual class is relevant but the model predicts irrelevant. Moreover, Log Loss [32] is a useful metric for evaluating machine learning methods; it is computed from the predicted probabilities, which take values between 0 and 1. Log Loss over $N$ observations can be calculated as (5):

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \quad (5)$$
where $y_i$ is a binary indicator (1 if observation $i$ belongs to the relevant class, 0 otherwise) and $p_i$ is the predicted probability that observation $i$ is relevant. Log loss measures the uncertainty of the probabilities of the prediction model; a lower log loss value means better predictions.
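For reference, all five metrics can be computed with scikit-learn as in the sketch below, treating the relevant class as positive; y_true and y_prob are toy values for illustration.

```python
# Precision, recall, f-Measure, accuracy, and log loss with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, log_loss)

y_true = [1, 0, 0, 1, 0]               # 1 = relevant image
y_prob = [0.9, 0.2, 0.4, 0.7, 0.1]     # predicted P(relevant)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f-measure:", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("log loss: ", log_loss(y_true, y_prob))
```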

C. MACHINE LEARNING METHODS
In this study, primarily traditional machine learning methods, including Naive Bayesian (NB) [33], Support Vector Machines (SVM) [34], K-Nearest Neighbours (KNN) [35], and Decision Tree (J48) [36], are used to create the prediction model. Then, two popular ensemble methods, Random Forests (RF) [37] and AdaBoost [38], are used to improve the performance of the prediction model. This section gives general information about these methods; their performance results are given in the experiments.

NB is a simple supervised machine learning method based on Bayes' theorem with naive independence assumptions between the features. In our implementation of NB, we employ Gaussian NB (GNB), which supports continuous-valued features and models each as following a Gaussian (normal) distribution. SVM is a discriminative machine learning method formally expressed by a separating hyperplane. Different kernel functions, namely linear, polynomial, and Gaussian radial basis function (RBF) kernels, can be applied to SVM. We use the RBF kernel, which is especially useful when the data points are not linearly separable; moreover, this kernel was used by Vyas and Frasincar [9] to find the representative image. KNN is a non-parametric method that classifies a sample according to the classes of its nearest neighbours in the training data. Since this method depends on distances, normalizing the training dataset can enhance its accuracy. J48 (C4.5) is a popular classifier that constructs a decision tree, a map of the possible classes. This tree is used for class prediction and is observable in the model; the conditions along its branches are easily expressed with Boolean logic.
Classical methods can have problems with our imbalanced dataset. Recently, ensemble methods have provided very successful solutions to this problem [39]. In machine learning, ensemble methods utilize multiple learners to improve generalizability and robustness over a single machine learning method. In this study, we use RF and AdaBoost, which decrease the variance of a single machine learning method by combining several learners. As a result, the performance of the learning model increases, and the predictions are more robust and stable. The RF classifier is an ensemble method that combines a set of tree predictors to create a better prediction model. Boosting is a sequential ensemble method that decreases the bias error and builds strong models; the term means turning a weak learner into a strong learner. In this study, we utilize the popular boosting algorithm AdaBoost [40], developed by Schapire and Freund, who won the prestigious 2003 Gödel Prize for this algorithm.
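A sketch of how these six classifiers can be compared on the same split is given below, using scikit-learn defaults except where noted. X_train, y_train, X_test, and y_test are assumed to come from the website-level split described earlier, and DecisionTreeClassifier stands in for J48 (C4.5).

```python
# Compare the traditional and ensemble classifiers on one split.
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import f1_score, log_loss

classifiers = {
    "GNB": GaussianNB(),
    "SVM (RBF)": SVC(kernel="rbf", probability=True),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),   # sklearn analogue of J48
    "Random Forests": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]
    print(name, f1_score(y_test, y_pred), log_loss(y_test, y_prob))
```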
Feature selection [41] is the process of choosing a subset of relevant features by eliminating redundant ones. In the present study, we use information gain [36], which can serve feature selection by evaluating the gain of each feature with respect to the target feature. This gain shows the importance of a feature.
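In scikit-learn, information gain can be approximated with mutual information; the sketch below ranks the features this way, assuming feature_names lists the columns of X_train.

```python
# Rank features by (estimated) information gain.
from sklearn.feature_selection import mutual_info_classif

scores = mutual_info_classif(X_train, y_train)
ranking = sorted(zip(feature_names, scores), key=lambda t: -t[1])
for name, score in ranking[:10]:      # top 10 features
    print(f"{name}: {score:.4f}")
```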

D. PERFORMANCE OF MACHINE LEARNING METHODS AND SIMPLE HEURISTIC TECHNIQUES
The number of irrelevant images on websites is huge, so the dataset created for this task is imbalanced. In our first experiment, traditional and ensemble machine learning methods are evaluated to understand the impact of this imbalance. Table 6 shows the performance results for the methods. Moreover, we try some simple heuristic techniques, including selecting the image with the biggest file size, width, height, or width × height, and the width × height > 120,000 rule [22], to highlight the difficulty of the problem. Table 6 shows that these simple heuristics are not sufficient for this task, as websites consist of many different web pages. The width × height > 120,000 rule proposed by Bhardwaj and Mangat [22] is not suitable for determining the relevant image on many web pages; this technique has the worst performance results. Selecting the image with the largest file size as the relevant image gives the best performance among the heuristic techniques.
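The heuristic baselines can be expressed in a few lines, as in the sketch below; page is assumed to be a DataFrame holding the images of one web page, with illustrative column names.

```python
# Two of the heuristic baselines evaluated in Table 6.
import pandas as pd

def largest_file_baseline(page: pd.DataFrame) -> pd.Series:
    # mark the single image with the biggest file size as relevant
    return (page["file_size"] == page["file_size"].max()).astype(int)

def area_threshold_baseline(page: pd.DataFrame) -> pd.Series:
    # mark images whose width*height exceeds 120,000 pixels [22]
    return ((page["width"] * page["height"]) > 120_000).astype(int)
```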
When the machine learning methods are examined through accuracy, their results are close to each other. However, the numbers of relevant and irrelevant images are very disproportionate; in other words, the dataset is imbalanced. For this reason, it is better to interpret the results in terms of f-Measure and log loss.
NB is the worst classifier; its performance is even lower than that of the heuristic methods. NB assumes that all features are mutually independent, and these results indicate that the features are not independent. SVM is clearly the second-worst method, although it has been suggested in the literature. Considering that Vyas and Frasincar [9] reach an f-Measure of 0.439 when finding the representative image in a different dataset, our results for SVM are very similar despite the different dataset and task; as seen in Table 6, SVM is insufficient for this task. KNN and the decision tree also do not give good results. The performance of KNN depends on the quality of the data. Although the decision tree classifier performs well in web data extraction studies [5], [42]-[44], it is inadequate for the prediction model in this study. Unfortunately, the imbalanced dataset prevents the construction of the desired prediction model and shows the difficulty of improving it with classical machine learning methods. In this case, ensemble methods are recommended in the literature.
According to Table 6, AdaBoost and Random Forests give the best results for this task. Recent studies on web data extraction have shown that AdaBoost [13] and Random Forests [45] give good results on other imbalanced datasets; although the aim of those studies is not to find the relevant image, the results are similar. Ensemble methods have a positive impact on the performance results, and in particular they improve the f-Measure score for the prediction of relevant images. These methods also reduce the tendency to overfit the training set.

E. COMPARISON OF FEATURE CATEGORIES
In Section IV, the features were divided into three categories: textual features, image features, and last features. This section examines the contribution of each category. In this experiment, a prediction model is first created using only the features of one category and evaluated on the testing dataset; the AdaBoost classifier, which gives the best performance in the first experiment, is used. Then, the categories are combined in order to discover the importance of the groups. Table 7 shows the performance results obtained with these categories.
The performance of the prediction model constructed from the textual features is poor; these features do not contribute to the model due to the large variability of HTML elements. The prediction model based on the image features gives very good results. In particular, this category covers most of the features recommended in previous studies, such as height, width, and file size, and they positively affect the performance of the model. According to the results, the last features contribute more than the textual features. That is, the image and last features are necessary and important for this task. Interestingly, the sub-dataset containing the image and last features yields the best result, with an f-Measure score of 0.822; in other words, there is no need to use textual features when creating a prediction model. The next section examines the features in more detail.

F. EVALUATION OF FEATURE SELECTION
In this study, we use information gain scores to determine the most influential features of our models. Information gain is widely applied in feature selection. Fig. 4 shows the information gain score of each feature.
According to Fig. 4, six of the features proposed in this study, including cached, orderFileSize, ratio_theimg_to_websiteimgs, orderWidth, ratio_theimg_to_Pageimgs, and orderWidthHeight, are among the top 10 features. In particular, the most prominent feature is the cached feature, which records whether an image was already downloaded during the crawling process. The Width, Height, WidthHeight, and FileSize features recommended in previous studies come after the cached feature. The ratio_theimg_to_websiteimgs and ratio_theimg_to_Pageimgs features, which express the importance of an image relative to the other images of the website and the web page, are also among the important features. The order features suggested in our study support the construction of a better prediction model.
To examine the performance contribution of the features, they are divided into groups of five according to their information gain scores. These groups are added to the prediction model cumulatively, starting with the top five features. The AdaBoost classifier is used to create the prediction models, and the models are evaluated with the same testing dataset; the resulting scores are given in Table 8.
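This cumulative procedure can be sketched as follows, reusing the information gain ranking from the previous subsection; X_train and X_test are assumed to be DataFrames whose columns are the feature names.

```python
# Add features in groups of five, best-ranked first, and retrain AdaBoost.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score

ranked = [name for name, _ in ranking]        # from the earlier ranking
for k in range(5, len(ranked) + 1, 5):
    cols = ranked[:k]
    clf = AdaBoostClassifier().fit(X_train[cols], y_train)
    print(f"top {k:2d} features:",
          f1_score(y_test, clf.predict(X_test[cols])))
```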
The results in Table 8 indicate that the prediction model created with the top 25 features is better than the others; in other words, the last five features are unnecessary for this task. The f-Measure score of the prediction model derived from the traditional features alone is 0.611, while the model created with only the new features reaches 0.701; even the new features by themselves outperform the traditional ones. Thanks to the new features proposed in the present study, the performance of the prediction model reaches an f-Measure score of 0.807, and using only the last and image features it reaches 0.822. In other words, the new features enhance the performance results by 35%. If the cached feature, the most important feature by information gain, is excluded, the f-Measure of the model drops to 0.753; that is, this feature adds 0.059 points with respect to the best model result, a 7% improvement in the prediction model.

VI. CONCLUSION
In web data extraction studies, finding suitable features, creating large datasets, and evaluating appropriate machine learning methods are important issues. Conventional studies focus on a single web page, and Vyas and Frasincar [9] state that performance could be improved by using multiple web pages. In our study, a web crawler was developed that can easily create a dataset and extract new features from web pages; a crawler offers a lot of information useful for web data extraction [10]. With this crawler, a large dataset of 20,000 web pages from 200 websites was created. The new features have increased the performance of the prediction model; in particular, the cached feature obtained from the cache mechanism contributes to it.
In the present study, we have an imbalanced dataset of 635,015 images, only 22,682 of which are relevant. This study shows that ensemble methods are suitable for constructing accurate models on such imbalanced datasets; they give much better performance results than conventional methods. The AdaBoost classifier in particular gives the best performance: when all features are used, the prediction model obtained from this classifier has an f-Measure score of 0.807.
The performance of a prediction model can be increased by removing unnecessary features; in the machine learning literature, this process is called feature selection. In this study, feature selection is applied with two different approaches. The first approach examines the feature categories presented in the study separately; our experiments indicate that textual features do not contribute much to the classification performance, and the f-Measure score of the prediction model created without them is 0.822. In the second approach, the features are ranked according to their information gain values, and the models created from the resulting sub-datasets are evaluated; a 0.811 f-Measure score is achieved with the top 25 features.
Semantic web technologies enable us to write rules for handling data. The schema.org community sets standard rules to make web data extraction easier for search engines. However, most web designers do not follow these rules, and some interpret them according to their own requirements. For this reason, the desired development of the Semantic Web has not been achieved despite years of effort. Automatic web data extraction can speed up the creation of the Semantic Web, and this study contributes by identifying the relevant images of web pages.
Our web crawler is an open-source project available at https://github.com/erdincuzun/ImageCrawler. Besides, the dataset of this study is available at https://adys.nku.edu.tr/Datasets/. With the web crawler, the dataset can be enlarged, or new datasets can be created for studies that need image datasets. For future work, we plan three different studies. The first is to develop a regular expression generator that uses negative and positive examples of images. In the second, we aim to find the relevant image only through img elements, without downloading the images. In the last, we will try to determine the relevant image through the texture and colour information used in other studies [18], [19], [46]. In these studies, the best f-Measure score of the present study will be used as a baseline.