A Survey on Skill Identification From Online Job Ads

A changing job market, influenced by different factors such as globalization and demographic growth, urges close monitoring. The digitization of the job market has given researchers the opportunity to better understand job market needs as job postings/ads become more accessible. However, such postings are submitted as unstructured text and need further processing to identify the required skills. The aim of this survey is to review current research on skill identification from job ads and to discuss possible future research directions. In this study, we systematically reviewed 108 research articles on the topic. In particular, we evaluated and classified the prior work to identify the skill bases used for analyzing job market needs; the type of extracted skills; the skill identification methods; the studied sector; and the skill identification granularity. Then, we categorized the existing applications and goals of skill identification. Finally, we present key challenges and discuss recent trends.


I. INTRODUCTION
The introduction of digital transformation and the growth of the internet have deeply impacted our lives, from changing our daily interactions to how we look for our future jobs. The digital era made available tremendous amounts of structured and unstructured data, including videos, images, and text, which has paved the way for big data analytics. When such data is effectively and efficiently captured, processed and analyzed, entities can gain a clear and complete understanding of their business to improve efficiency and lower costs. In the same vein, analyzing online job market data can help narrow the gap between higher education and job market needs [1].
Today there are many websites dedicated to recruitment and job posting as a consequence of the digitization of the job market. Recruiters can directly post job openings on websites called job boards or job portals that can be accessed easily by candidates. The digital transformation is not only changing jobs, from the destruction to the creation of new jobs, but it is also offering the opportunity to better understand job market needs. This is done by analyzing the huge number of job ads posted on a daily basis on the internet in an attempt to capture job market dynamics.
The associate editor coordinating the review of this manuscript and approving it for publication was Biju Issac.
Online job ads may not be representative, given that not all job openings are published online. Some recruiters prefer keeping their job openings in a closed circuit for confidentiality concerns or to reduce the time spent filtering and interviewing candidates. Others choose to contact specific colleges and universities directly to share their job openings with students and alumni. From this perspective, even if we collect all online reported job vacancies, there is a share of vacancies that remains inaccessible online and will therefore fall outside the population sample. Consequently, it remains hard to measure the representativeness of collected job ads, because the population of job vacancies and its structure are practically unknown [2]. However, job ads remain a good source to understand skill requirements, though not necessarily to estimate the number of vacancies on the market, as it seems reasonable to suggest that the core aspects of a given occupation, such as skills, are likely to remain constant across types of establishments and companies. Using job ads from an established portal and interpreting the results with caution to avoid potential biases can be a valid and acceptable choice. Fortunately, online and internet-based job searching is likely to become an increasingly prominent tool for job matching, which promises to improve the representativeness of job ads [2], [3]. In recent years, the amount of labour market information conveyed through specialized internet portals and services has grown exponentially, encouraging and supporting the realization of many internet services and tools related to the labour market, such as job matching services, advertising of job positions, services for sharing curricula, and the creation of professional networks.
Therefore, online job data shows great promise in providing real-time insights into current and ongoing workforce needs in terms of skills.
Providing such information can help different entities to make well-informed decisions and data-driven reforms. The analysis of job ads has been regarded as an established approach to illustrate job market needs, analyze skill mismatch/alignment, and predict possible future trends. Furthermore, findings from job ad analyses have provided input for curriculum development and other applications. While analyzing online job ads is a relatively new trend, it builds upon existing empirical literature based on the analysis of traditional, printed job advertisements [4].
Aside from recruitment companies (CareerBuilder, Textkernel) and professional social networks such as LinkedIn, many projects and initiatives were launched to better understand job market needs by monitoring trends in skills and evolving demands of the job market. By accessing data in real-time, policymakers can evaluate training processes and propose improvements in existing policies. These projects and initiatives encompassed skill identification from online job ads as a way to draw and infer job market needs. They can be summarized as follows:
• CEDEFOP I and II: the European Centre for the Development of Vocational Training (Cedefop) launched a call for tenders to build a system for analysis of online vacancies and to develop a system or tool to analyze vacancies and emerging skills requirements across all EU Member States, realizing a fully-fledged multilanguage (32 languages) system that collects vacancies, extracts skills, and performs real-time monitoring across all 28 EU Member States to support decision-making.
• ESSnet Big Data project: In 2016, the EU and Eurostat launched the ESSnet Big Data project, involving 22 EU Member States with the aim of 'integrating Big Data in the regular production of official statistics, through pilots exploring the potential of selected Big Data sources and building concrete applications'. The aim was to analyze whether online job vacancies can be used to improve labour market statistics of national statistical institutions. The analyzed data was obtained via web scraping or cooperation with employment agencies, job portals or other aggregators. Several countries participated in the project and the main finding was that online job vacancies can indeed be used, but that it is better to obtain them from third parties than to scrape them directly.
Job ads are not only considered a practical information source for researchers, educators, policymakers, and higher education institutions to explore the dynamics of the job market, but they can also be considered an indicator of job market health. Job ads are the primary means through which companies recruit new applicants for available positions. Therefore, a comprehensive analysis of job market needs requires extracting information from job ads, as they are submitted in text format.
More specifically, to identify job market needs, Natural Language Processing (NLP) techniques, including information extraction, are used to build personalized bases of skills from textual media knowledge bases, and Machine Learning (ML) based techniques, including deep neural networks and word embeddings, are utilized to capture, as precisely as possible, the skills within the textual source.

A. EXISTING SURVEYS AND CONTRIBUTION
Over the last few years, several research papers and surveys have been published covering different aspects of job market analysis. Applegate [5] examined where reasonably representative job advertisements for academic libraries may be found. This study inspected the data sources used for academic library job ads and how representative they are. Kim and Angnakoon [6] identified the methodological approaches and procedures employed in library and information science research that used job ads as data. They evaluated the data collection methods and analysis techniques employed. Papou et al. [7] provided a mapping study on knowledge extraction from online sources for the software engineering job market. They inspected the digital sources, the extracted information, the methods and the stakeholders for software engineering job market analysis. However, these contributions did not review the different methods used in skill identification, did not discuss the issue of skill identification granularity, and did not cover the applications of skill identification studies done in different sectors.
We can conclude that existing surveys have inspected the data sources and methods applied in identifying job market needs in specific fields and sectors. Thus, they only covered a part of skill identification methods in the job market. To the best of our knowledge, there is no survey that fully addresses the major aspects of skill identification including different sectors. In this regard, our paper presents a systematic review that aims to fill this gap through an in-depth analysis to cover recent advancements in skill identification of job market needs from online job ads.

B. MAIN CONTRIBUTIONS
In this paper, we present a comprehensive review of related work that builds upon existing skill identification studies. More specifically, we cover publications from the last ten years. We focus on studies and approaches proposed to comprehend and analyze the required skills in online job ads. By examining 108 studies, we evaluate and classify the different approaches to identify the skill bases used for analyzing job market needs; the type of extracted skills; the skill identification methods; the skill identification granularity; the studied sectors of identified skills; and the applications and goals of skill identification. The main contributions of this paper can be summarized as follows:
• We present an in-depth comparative survey of research papers related to skill identification from job ads.
• We provide a comparison of the different skill bases used for skill identification and we explain methodologies applied for skill base generation.
• We pinpoint the different methodologies applied for skill identification, the granularity of the identified skills and the studied sectors.
• We identify the current applications of skill identification from job ads and highlight recent trends.
• We draw insights and discuss future research directions for skill identification from job ads.
The remainder of this article is organized as follows: Section 2 systematically reviews the main available work from the perspectives of skill bases, skill identification methods, skill granularity and studied sectors. Section 3 summarizes the different applications of skill identification and presents contributions of the representative studies and recent trends. Section 4 presents speculations on the future research directions. Section 5 draws conclusions.

II. RESEARCH METHODOLOGY AND RELATED WORK
Studies on skill identification from job ads are gaining increasing interest from researchers. Fig. 1 shows the number of relevant publications in the past ten years. In the last decade, a robust body of research analyzing online job market requirements has been conducted. Nonetheless, an important number of researchers have been analyzing printed job ads in newspapers to identify job market requirements and trends. The first studies conducted to understand online job market needs and determine the required skills by inspecting job ads descriptions used manual content analysis as they dealt with a limited number of job ads. More precisely, manual content analysis was used to discover the scope of skills for a particular position or various occupations in a field.
Then, with the shift of the job advertising medium from newspapers to online job portals, some researchers have resorted to NLP techniques to automate skill identification from job ads. For example, the authors of [8], [9] applied topic modeling to a set of job ads specific to a field or occupation to extract and identify the required skills.
However, the question of how to generalize skill identification remained open. Therefore, researchers have lately presented different methodologies to identify skills from job ads. Some of these researchers proposed the generation of skill bases that encompass the different terminologies of skills to better locate the skills in job ads. Furthermore, human resources experts proposed complete skill bases to better structure skills and competencies, such as ESCO, O*Net and e-CF. Other researchers proposed ML-based methods to identify skills and sentences containing such skills in job ads.
Once the generalization of skill identification was elucidated, researchers took different directions to better understand job market needs. On one hand, some researchers extended such generalization. In particular, different analyses were drawn for a better understanding of job market needs. Researchers also used the identified skills for job and MOOC recommendation. On the other hand, other researchers tried to weigh skills found in job ads according to their importance, and others tried to measure skill mismatch/alignment.
To find prior studies that identify skills from job ads, systematic searches were first conducted on several databases. The databases included are SpringerLink, IEEE Xplore, ScienceDirect, ACL and Google Scholar. The searches were carried out using the keywords ''job ad'' and its derivative forms, ''job market'' and its derivatives, and ''skill'' and its derivatives. Research papers were reviewed to select studies that analyze the job market through online job ads. Additionally, the references of the articles were added to ensure that all relevant studies on job ads are included in the study.
Therefore, studies that address the problem of skill identification from job ads were included. However, studies that exclusively identify skills from other sources (only resumes or social networks) were excluded. More precisely, the criteria that served as a means of judging the relevance of a study are described next.
The inclusion criteria are:
• Studies must be written in English.
• Studies must be accessible online.
• Studies must be peer-reviewed and published in journals, conferences, workshops and poster sessions of conferences.
• Studies must be related to skill extraction from job ads and the purpose of their analysis must be relevant to understand job market needs.
• Studies must be published from January 1st 2010 to January 1st 2021.
The exclusion criteria are:
• Studies that extract skills solely from resumes, social networks or candidate profiles.
• Studies that are not written in English.
• Studies that do not provide access to the online full-text.
• Books and gray literature (not published in journals, conferences, etc.).
• Studies that present extended abstracts or summaries of conferences/editorials.
• Studies published before January 1st 2010 or after January 1st 2021.
In this work, we investigated 108 research articles on skill identification from online job ads to shed light on the ongoing research and its possible future directions. First, we present the different data sources used in the skill identification context. Then, we present the existing skill bases developed by human resources experts and researchers. Finally, we categorize the representative work according to the skill identification methods, the skill identification granularity, and the studied sectors. Figure 2 shows the components of the framework for skill identification from job ads. In the next section, we present the applications and goals of skill identification from job ads.

A. ONLINE DATA SOURCES
The research on skill identification from the online job market mainly depends on analyzing online job ads to identify the required skills. Skill identification is also performed on resumes found in social networks or job portals, curricula and certifications. Next, we define the terms used throughout this paper:
• Job ad: a job ad, also known as a job posting, is an announcement that informs people that a certain job position is available. This announcement is written, generally in an engaging tone, and describes the job position. It has a title and a description. The description provides details about the position, including skill requirements. This announcement is posted on recruitment websites and job portals to attract the right candidates [6], [10].
• Curriculum: the curriculum is an ''academic plan'' that usually includes the purpose of the curriculum (i.e., goals for student learning), the content of the curriculum, the order of the learning experience, the instructional methods and resources, the evaluation approaches, and how adjustments to the plan will be made [11], [12].
• Candidate profile: a candidate profile is a textual document that describes the different skills and knowledge acquired by the candidate. Candidate profiles range from those found on professional social networks to resumes submitted by candidates in job portals.

B. SKILL BASES
A knowledge base is a collection of records in a database, which typically refers to some kind of ''knowledge'' about the world. Different knowledge bases were constructed such as YAGO and DBpedia. YAGO is a knowledge base obtained from Wikipedia's infoboxes and from linking Wikipedia's rich category information to the WordNet taxonomy using basic NLP techniques [13]. DBpedia [14] is a community effort to extract structured information from Wikipedia. In the same context, different domain-specific knowledge bases were developed for specific domains. Such bases contain different terminologies and entities related to a certain domain such as skill bases. Domain-specific knowledge bases can be constructed from the extracted information or by experts. More specifically, the huge amounts of texts published online help in understanding word associations and synonyms which improves information extraction from text and knowledge base construction. In this section, we review the previous studies on skill identification from job ads in terms of the used skill bases such as predefined skill bases and customized skill bases.
In prior work, different skill bases were manually and automatically developed to serve skill identification and standardization of different occupations and sectors. In the literature, we find different terms for skill bases such as taxonomy or ontology. A taxonomy formalizes the hierarchical relationships among concepts and specifies the term to be used to refer to each; it prescribes structure and terminology. An ontology identifies and distinguishes concepts and their relationships; it describes content and relationships. Moreover, an ontology allows multidimensional relationships to be defined. Thus, ontologies and taxonomies are structured skill bases.
These skill bases contain different information about the skills and competencies of different occupations. According to [15], competence (also called tacit knowledge or soft skill) is the ''ability to use knowledge, skills and personal, social and/or methodological abilities, in work or study situations and professional and personal development''. Skill (also called competency, implicit hard skill or explicit hard skill) is the ''ability to apply knowledge and use know-how to complete tasks and solve problems''. Similarly, the authors of [16] consider competence as something ''that can demonstrate the application of a generic skill on some knowledge''. Based on these two definitions, we can say that competencies and skills are used interchangeably: competencies can be soft skills in one context and hard skills in another.

1) SKILL BASES DEVELOPED BY EXPERTS
To structure competencies and skills, skill bases were developed by experts as a reference to human resources services. Table 1 gives a comparison between those skill bases in terms of skills definition, types, occupations, domain independence and languages. Those skill bases are as follows:
• ESCO [17]: this is a project that categorized skills, occupations and other relevant competencies in different languages. The ESCO skill base encompasses 13485 skills and competencies. There is, however, no distinction between skills and competencies. Each of these concepts comes with one preferred term and several non-preferred terms in each of the 27 ESCO languages. Every concept also includes an explanation in the form of a description and a type. ESCO has been developed in an open IT format, is available for use and can be accessed via the ESCO service platform.
• O*Net base [18]: this is used by the U.S. public recruitment services. The core concepts are the job categories, that are associated with generic skills as well as ''tools and technology''. The skills nomenclature is very coarse with few general skills.
• ROME: this is a repository that contains the definitions of different occupations, developed by Pole Emploi. The occupation descriptions are written in French and contain different fields such as the different titles for the occupation and the required skills and qualifications.
• ISCO-08: this is a tool developed by the International Labour Organization for organizing jobs into a clearly defined set of groups according to the tasks and duties undertaken in the job. Its main aim is to provide a basis for the international reporting, comparison and exchange of statistical and administrative data about occupations.
• e-CF: this is the European e-Competence Framework based on competencies. Its purpose is to provide general and comprehensive e-Competences specified at five proficiency levels that can then be adapted and customized into different contexts from the ICT business and stakeholder application perspectives.

2) CUSTOMIZED SKILL BASES
• Manually Built Skill bases: Different studies in the literature analyzed online job market needs by manually building their skill base or list of skills/keywords. Some studies used a predefined list of skills to identify the skills from a set of job ads. These predefined lists of skills can be harvested from prior work or the web; for instance, the work in [20] relies on a list of programming languages found online. Other studies developed their list of skills from scratch by identifying keywords through a manual selection from the most recurring words in the collected data [21]-[23]. We can observe from Table 6 that such manual skill base construction is performed to study sector-specific job ads. Likewise, Boselli et al. [24] enriched the ESCO skill base by capturing the relevant n-gram expressions. An n-gram is a sequence of n words; for instance, a bigram is a two-word sequence. Such n-gram expressions were mapped to ESCO skill labels by leveraging string similarity, and the mappings were validated by domain experts.
• Generated Skill bases from Knowledge Bases and word embeddings: Knowledge bases for skills are called skill bases as they contain a list of skills and/or competencies required by candidates to fulfill the requirements of a specific job position. In an attempt to understand those requirements, various skill bases were developed [25]-[28] as depicted in Table 2.
Most of the developed skill bases relied mainly on external knowledge bases such as DBpedia and Wikipedia to enrich the skill base. Javed et al. [26] and Zhao et al. [28] examined the skill sections of resumes and Wikipedia categories to define and develop a taxonomy of professional skills. Malherbe and Aufaure [27] built a skill base that relied on page redirection in DBpedia, where skills were first extracted from candidate profiles found in various professional social media, and page redirections in Wikipedia were added to these skills as aliases or alternative labels. In [29], the authors built a soft skill taxonomy by combining word embeddings learned from a dataset of job ads with DBpedia hyperlinks. Similarly, in [30], the authors used word embeddings combined with an in-house named entity recognizer and POS tagging to detect relevant skills, which are stored in a skill base for the skill extraction task. In [25], the authors used a data-driven approach to build a comprehensive skill list. They leveraged data found in LinkedIn profiles to harvest skills. Then, using Wikipedia pages, they deduplicated repeated skills, where skills having the same relationship to the same Wikipedia page are considered related to each other.
In [31], the authors started crawling from the node ''Software Engineering'' in the DBpedia ontology graph to build an IT skills ontology/base using a depth-limited breadth-first search algorithm. In [32], the authors relied on word association to build an ontology that bridges job and knowledge elements found in job ads. Instead of generating a base from scratch, the authors of [33] leveraged open knowledge bases, using techniques such as random walks and other algorithms to extract and infer skills from the Wikipedia graph.
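The depth-limited breadth-first crawl mentioned for [31] can be illustrated with a minimal sketch. The graph below is a hypothetical toy adjacency list, not real DBpedia data, and the function name `depth_limited_bfs` and the depth limit of 2 are assumptions chosen for illustration; the original study's crawler and parameters are not described here.

```python
from collections import deque

# Hypothetical toy graph standing in for links between knowledge-base nodes.
TOY_GRAPH = {
    "Software engineering": ["Software testing", "Programming language"],
    "Software testing": ["Unit testing"],
    "Programming language": ["Python", "Java"],
    "Unit testing": ["Test-driven development"],
    "Python": [],
    "Java": [],
    "Test-driven development": [],
}

def depth_limited_bfs(graph, start, max_depth):
    """Return {node: depth} for every node within max_depth hops of start."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand nodes that already sit at the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in depths:  # visit each node once
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

# "Test-driven development" is three hops away, so a limit of 2 leaves it out.
skills = depth_limited_bfs(TOY_GRAPH, "Software engineering", max_depth=2)
```

The depth limit is what keeps such a crawl from drifting into concepts that are only loosely related to the starting node.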

C. SKILL IDENTIFICATION METHODS
In the skill identification literature, we can distinguish four methods. The first method is based on a skill count between the skill base and the job ad. The second method is based on topic modeling to extract skills cited in job ads. The third method relies on skill embedding, where skills are validated by leveraging word embeddings. The fourth method is based on Machine Learning (ML) techniques such as text classifiers and deep neural networks to classify job ad sentences or tag skills. The previous studies on skill identification are summarized in Table 3. Figure 3 presents the publication trends of skill identification methods during the past ten years. The skill count method has received continuous attention, most likely because of its simplicity. Recently, increasing attention has been paid to ML-based methods.

1) SKILL COUNT
Skill count without a prior skill base remains the most reliable method for skill identification. However, it is time-consuming as job ads are read by annotators. Conversely, skill count with a skill base is simple to implement and gives insights about skill requirements quickly. The advantage of skill count is that it directly translates the requirements of employers as expressed in job ads. Its drawbacks are that it cannot capture the context of words and may therefore identify false-positive skills.
Skill count remains the most popular method in skill identification. Such a count can be done using a skill base or can rely on expert judgment and annotators to label and tag skills manually in a given job ad.
The presence of a skill term can be calculated as a boolean value, where the weight of a skill is 1 if the skill term exists in the job ad and 0 otherwise, or by the classical TF-IDF scheme, in which the word weight, indicating its importance to the topic, is computed as a trade-off between its frequency in the job ad and in the entire corpus [10]. Once the skills are counted, statistical analysis is performed to highlight the frequency of skills within the studied dataset of job ads.
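As a rough illustration of the two weighting schemes, the sketch below computes a boolean weight and a TF-IDF weight for each skill in each ad. It assumes one common TF-IDF variant (relative term frequency multiplied by log(N/df)); the exact formula used in [10] is not given here. The function names, the tokenizer and the three example ads are hypothetical, and only single-word skills are handled.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into simple word tokens (keeps '+' and '#')."""
    return re.findall(r"[a-z0-9+#]+", text.lower())

def skill_weights(job_ads, skills):
    """Return, per ad, the boolean and TF-IDF weight of each single-word skill."""
    token_counts = [Counter(tokenize(ad)) for ad in job_ads]
    n_ads = len(job_ads)
    # Document frequency: number of ads that mention the skill at least once.
    doc_freq = {s: sum(1 for c in token_counts if c[s] > 0) for s in skills}
    weights = []
    for counts in token_counts:
        total = sum(counts.values())
        row = {}
        for s in skills:
            tf = counts[s] / total if total else 0.0
            idf = math.log(n_ads / doc_freq[s]) if doc_freq[s] else 0.0
            row[s] = {"boolean": 1 if counts[s] > 0 else 0, "tfidf": tf * idf}
        weights.append(row)
    return weights

# Hypothetical example ads.
ads = [
    "Python developer with SQL and Python testing experience",
    "Java developer with SQL experience",
    "Accountant with Excel and SQL skills",
]
weights = skill_weights(ads, ["python", "sql", "excel"])
```

In this toy corpus "sql" appears in every ad, so its TF-IDF weight is zero everywhere even though its boolean weight is 1, which shows how the two schemes can rank the same skill differently.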

• Skill Count using Skill Bases
Several studies have examined the job skills required by the job market using an exact match of skills in job ads to compute the skill frequency. In order to identify skills in job ads, prior work [15], [27], [36] relied on an exact match with the used skill base or the defined list of skills. The exact match can be done through the exact name of the skill, the alternative labels or aliases of the skill. Also, an exact match of skills in job ads can be performed using a simple keyword search or combined with regular expressions.
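A minimal sketch of exact matching against a skill base, using preferred labels plus aliases and a regular expression per skill, could look as follows. The miniature skill base, the function names and the example ad are hypothetical and are not taken from any of the cited studies.

```python
import re

# Hypothetical miniature skill base: preferred label -> alternative labels.
SKILL_BASE = {
    "JavaScript": ["JS", "ECMAScript"],
    "machine learning": ["ML"],
    "SQL": [],
}

def compile_skill_patterns(skill_base):
    """Build one case-insensitive pattern per skill from its label and aliases."""
    patterns = {}
    for label, aliases in skill_base.items():
        terms = sorted([label, *aliases], key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in terms)
        # Lookarounds rather than \b, so labels ending in symbols can still match.
        patterns[label] = re.compile(
            rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE
        )
    return patterns

def match_skills(job_ad, patterns):
    """Return the preferred labels of all skills mentioned in the job ad."""
    return {label for label, pattern in patterns.items() if pattern.search(job_ad)}

patterns = compile_skill_patterns(SKILL_BASE)
found = match_skills(
    "We need a JS developer with strong SQL skills; Java is a plus.", patterns
)
```

Here the alias "JS" is mapped back to the preferred label "JavaScript", while the standalone word "Java" does not trigger a match. Short aliases remain a source of false positives in real ads, which is the context limitation noted above for skill count.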

• Skill Count without Skill Bases
Skill count without skill bases relies mainly on expert judgment. Such a count is performed either by manually analyzing the content of the written job ads or by curating and inspecting the list of the most frequent terms in the corpus of job ads. Manual content analysis is a data analysis technique used in qualitative research. Content analysis determines the presence of certain themes or concepts within given qualitative data to quantify and analyze the presence, meanings and relationships of such themes or concepts. Traditionally, content analysis is performed manually by reading the content of each document in the study. More recently, computerized content analysis has been proposed to sift and organize the content. The analysis of the most frequent words is performed by experts. They start by identifying the most frequent terms using computerized solutions. Then they clean this list and identify the skills within it.
In the context of analyzing the job market needs, several studies used content analysis to identify the required skills such as [21]-[23], [34], [35], [37], [48]-[50]. Generally, content analysis is used to compute the frequency of skills in specific sectors or domains as depicted in Table 6. Manual analysis can be superior to computerized analysis in terms of the accuracy of the results because words have different meanings in different contexts. However, it is much more time-consuming than computerized analysis and may not be effective for periodic analysis of large amounts of text information [21].
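The computerized first step of the frequent-term approach, producing a ranked term list that experts then clean by hand, can be sketched as below. The stop-word list, the function name and the example ads are hypothetical; a real study would use a fuller stop-word list and its own preprocessing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "and", "in", "of", "the", "to", "with", "we", "for", "is"}

def most_frequent_terms(job_ads, top_n=10):
    """Return the top_n (term, count) pairs across all job ads."""
    counts = Counter()
    for ad in job_ads:
        tokens = re.findall(r"[a-z][a-z0-9+#]*", ad.lower())
        counts.update(token for token in tokens if token not in STOP_WORDS)
    return counts.most_common(top_n)

# Hypothetical example ads.
ads = [
    "We are looking for a nurse with experience in patient care",
    "Nurse needed: patient care and communication",
    "Experience with patient records is required",
]
top_terms = most_frequent_terms(ads, top_n=4)
```

The output is only a candidate list: terms such as "experience" rank highly without being skills, which is why expert curation remains the decisive step.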

2) TOPIC MODELING
Topic modeling is an unsupervised method that learns the set of underlying topics, in terms of word distributions, from a set of documents. Formally, topics contain the important keywords cited in the text. Topic modeling has seen a growing use for various applications, including direct applications to the unsupervised analysis of textual corpora [51]. However, such methods require a final interpretation of the results by domain experts. Thus, the main studies that applied such methods focused on one area, sector or occupation, as the interpretation of the final result is time-consuming. The advantage of such methods resides in their unsupervised nature, where no skill base is required. Different studies identified the required skills by using topic modeling algorithms. They consider the topics of each job ad as the top required skills. Then, they analyze the keywords of the most cited topics and capture the cited skills. Various topic modeling methods were applied. In [8], the authors used Latent Semantic Analysis (LSA), which comprises three phases, to identify the most sought skills in a job ad. The first phase of LSA consists of transforming a collection of documents into a term-document matrix. In the second phase, the term-document matrix undergoes a Singular Value Decomposition (SVD) to reduce its dimensionality without losing essential information by identifying groups of highly correlated terms (i.e., terms that co-occur in documents) and highly correlated documents (i.e., documents that contain similar terms). The result of the SVD is a set of factors (topics) with associated high-loading terms and documents. Together, they form patterns of words that represent topics in the underlying collection of documents. The extracted word patterns are interpreted in the third phase, which usually involves additional statistical analyses and, most importantly, expert judgment [52].
LSA depends heavily on SVD which is computationally intensive and hard to update as new documents appear [53].
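The dimensionality reduction in the second phase of LSA can be written compactly as a truncated SVD. The notation is an assumption introduced here for illustration: X is the m-by-n term-document matrix (m terms, n documents) and k is the number of retained factors (topics).

```latex
X = U \Sigma V^{\top}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n)
```

Here U_k holds the term loadings and V_k the document loadings on the k retained factors, and Σ_k is the diagonal matrix of the k largest singular values. Because the factorization is computed on the whole matrix, adding new documents generally means recomputing or approximately updating it, which is the cost noted above.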
Similarly, in [9], [41], [42], the authors used Latent Dirichlet Allocation (LDA) to identify the most popular keywords referring to skills in job ads. LDA is an unsupervised generative probabilistic method for modeling a corpus. LDA assumes that each document can be represented as a probabilistic distribution over latent topics, and that the topic distributions of all documents share a common Dirichlet prior. Each latent topic in the LDA model is also represented as a probabilistic distribution over words, and the word distributions of topics share a common Dirichlet prior as well [51].
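The generative assumptions of LDA described above can be summarized as follows. The symbols are assumptions introduced here, following common usage: K topics, D documents, α and β the two Dirichlet priors, θ_d the topic distribution of document d, φ_k the word distribution of topic k, and z_{d,n} the topic assigned to the n-th word w_{d,n} of document d.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

In the skill identification setting, each document is a job ad, and the high-probability words of each φ_k are the keywords that experts inspect for skills.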

3) SKILL EMBEDDING
Many studies leveraged skill embeddings to disambiguate skills and create clusters of skills that appear together (see Table 3). Skill embedding methods were proposed to address the limitations of the skill count method. Moreover, skill embedding offers the possibility to tag skills from different sectors. The term Skill Embedding refers to word embedding models trained on a collection of job ads. The training aims to produce vector representations of skills such that similar skills and co-occurring skills are close to each other in the vector space. The distance between vectors is measured by metrics such as the cosine or Jaccard similarity coefficients. Such methods reduce the false positives when performing skill identification. However, they may increase the number of false negatives: by filtering the identified skills using skill embedding, we may exclude some correctly extracted skills if they do not belong to the cluster of skills.
Word embeddings are a class of techniques where individual words are represented as real-valued vectors in a vector space. Each word is mapped to one vector and the vector values are learned from the text.
Once the word embedding model is trained, similarities between skills can be computed to validate the presence of a skill in the job ad, since skills belonging to the same cluster tend to lie close to one another. Different word embeddings were computed from job ads. For instance, in [26], [28], the authors trained the word2vec model on a collection of job ads to obtain a vector representation of the context of skills. These vectors are then used as input to a clustering algorithm such that, in each cluster, the aggregated contexts (represented by vectors) can be used to determine a skill sense. For example, such vectors help in differentiating between the term 'BI' that refers to business intelligence and the same term used to refer to the Bank of Indonesia.
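The similarity-based validation described above can be sketched as follows. The three-dimensional vectors are invented for illustration (real ones would come from a model trained on job ads), and the mean-similarity rule with a threshold of 0.5 is an assumption rather than the criterion of any cited study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def jaccard(a, b):
    """Jaccard coefficient between two sets (e.g., ads in which each skill occurs)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coherent_skills(skills, vectors, threshold=0.5):
    """Keep a skill only if it is, on average, close to the other extracted skills."""
    kept = []
    for skill in skills:
        others = [o for o in skills if o != skill]
        if not others:
            kept.append(skill)
            continue
        mean_sim = sum(cosine(vectors[skill], vectors[o]) for o in others) / len(others)
        if mean_sim >= threshold:
            kept.append(skill)
    return kept


# Hypothetical skill vectors for one job ad.
vectors = {
    "python": [0.9, 0.1, 0.0],
    "sql": [0.8, 0.2, 0.1],
    "pandas": [0.85, 0.15, 0.05],
    "negotiation": [0.05, 0.9, 0.1],
}
kept = coherent_skills(["python", "sql", "pandas", "negotiation"], vectors)
```

In this toy case 'negotiation' is filtered out because it lies far from the data-related cluster. If the ad genuinely required it, the filter would have produced a false negative, which is the trade-off noted for skill embedding methods.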
Similarly, [54] used word2vec embeddings [55] to compute the similarity between skills cited in job ads and professional standards. The latter are a set of practices, ethics, and behaviors to which members of a particular professional group must adhere. Word2vec was trained on a corpus of job ads to learn from the context of a job ad and to cluster skills according to their occurrence in job ads.
In [43], the authors used skill embedding (FastText trained on job ads) to guarantee the coherence of extracted skills, i.e., the vectors of the extracted skills should be close to each other in the skill embedding vector space. FastText [56] is an extension of word2vec for scalable word representation and classification. One of its major contributions is considering sub-word information by representing each word as the sum of its character n-gram vectors. This approach handles out-of-vocabulary cases, since it can generate a vector close to that of the original word despite misspellings. Another advantage over word2vec is that rarely occurring words in the corpus are better represented in the word embedding space.
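The sub-word idea can be illustrated by extracting character n-grams directly. The sketch below follows the convention of marking word boundaries with '<' and '>' and uses n-gram lengths from 3 to 6; the helper names are hypothetical, and no vectors are learned here, so it only shows why a misspelled skill still overlaps with its correct form.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word`, with '<' and '>' marking its boundaries."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


def shared_fraction(a, b):
    """Share of the distinct n-grams of `a` that also occur in `b`."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0


# The misspelling 'javscript' still shares several n-grams with 'javascript',
# so a sum of n-gram vectors would place the two words near each other.
overlap_typo = shared_fraction("javscript", "javascript")
overlap_unrelated = shared_fraction("javscript", "leadership")
```

Because an unseen word is represented through n-grams that were seen during training, it receives a usable vector even though it never appeared in the corpus as a whole word.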
In [30], after identifying explicit skills, the authors infer implicit skills for a job ad using Doc2Vec, on the premise that similar job descriptions sharing common features such as location or company share similar skills. Implicit skills are skills that are not explicitly cited but are required by the position. Doc2Vec [57] is an extension of word2vec that learns document-level embeddings: it computes a feature vector for every document in the corpus, whereas word2vec computes a feature vector for every word. Based on this similarity, the authors add inferred skills to the skills extracted from the job ad.

4) ML-BASED METHODS
Recently, a series of breakthrough advances in artificial neural networks and ML-based methods has resulted in considerable success in several areas of NLP. There are two promising techniques in the context of skill identification from job ads. The first is Named Entity Recognition (NER) using deep learning. NER is the automated task of recognizing and labeling entities in given texts. It is also known as (named) entity identification, entity chunking, and entity extraction. It is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Since text is sequential data, Long Short-Term Memory (LSTM) deep learning models are commonly used to tackle the NER problem.
In the skill identification context, typical entities are the different skills cited within job ads. These deep learning models allow detecting skills by parsing the job ad text. For example, in [44], the authors used an LSTM model, pretrained for skill NER [58], to recognize skills from a job ad text.
The second technique is text classification, which is the process of assigning tags or categories to text according to its content. It is one of the fundamental tasks in NLP, with broad applications such as sentiment analysis and spam detection. It was applied by [45], [46] to classify sentences that contain skills in the job ad description. The authors labeled a large dataset of job ad sentences. Once a sentence is identified as containing a skill, the skill cited within is extracted. Sifting such sentences can help in accurately identifying skills within job ads and prevent falsely extracted skills. In [45], the authors compared Convolutional Neural Network (CNN) and LSTM models for sentence classification. The CNN model takes word order into account by applying a fixed-size window on the input array composed of words and their corresponding word embedding dimensions [59]. The LSTM model leverages the sequential nature of the text, handles long-term dependencies and makes predictions on variable-length input.
Similarly, in [46], the authors leveraged state-of-the-art language modeling approaches such as BERT to classify sentences found in the job ads. BERT is designed to pre-train deep bidirectional representations from the unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create models for a wide range of tasks, such as question answering and text classification, without substantial task-specific architecture modifications [60].
Moreover, instead of using NER, the authors of [47] used multi-label text classification to label each job description with the skills within. Instead of classifying the type of each word appearing in the job description, the authors considered job descriptions as the evidence for the binary relevance of thousands of individual skills. To do so, they leveraged a BERT encoder and added another layer to perform multi-label classification. They also added a Correlation Aware Bootstrapping (CAB) process that encompasses a structured semantic representation of skills and their co-occurrences, in order to account for the missing (implicit) skills of job ads by increasing the number of training examples.
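The binary-relevance view of multi-label classification can be sketched independently of any particular encoder. In the snippet below, the per-skill scores are invented stand-ins for the outputs of a classification layer; applying a sigmoid to each score and thresholding at 0.5 is a common convention and an assumption here, not a detail reported for [47].

```python
import math


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_skills(scores, threshold=0.5):
    """Treat every skill as an independent yes/no decision for one job description."""
    return sorted(skill for skill, s in scores.items() if sigmoid(s) >= threshold)


# Hypothetical raw scores produced for one job description.
scores = {"python": 2.0, "sql": 0.3, "welding": -3.0}
predicted = predict_skills(scores)
```

Unlike a softmax over classes, each skill is decided on its own, so a single job description can receive zero, one, or many skill labels.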
Deep learning-based methods are very powerful at capturing the hidden relationships between words and have shown promising results in different NLP tasks. However, such methods remain data-hungry and require a large labeled dataset and rigorous fine-tuning to give good results in the skill identification context.

D. EVALUATION METRICS
The performance of methods in identifying skills within job ads is evaluated using various metrics according to the method.
For skill count and skill embedding methods, studies evaluated the extracted skills from job ads by computing true and false positives as well as true and false negatives from the extracted skills. Recall, precision and F1-score are then computed on the basis of the evaluation of experts. Precision is the fraction of correctly extracted skills out of all the predictions of a particular class. Recall is the fraction of correctly extracted skills out of all actual members of the class. The F1-score is the harmonic mean of precision and recall. To compute such metrics, studies use either domain experts or sampling-based user surveys to evaluate the performance of skill identification methods. More specifically, these experts and users are invited to verify the accuracy of the extracted skills within the job ad and add the missing skills.
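As a worked illustration of these definitions, the following sketch compares the skills extracted from one job ad with the skills validated by an expert. The skill names are made up, and treating each ad as a pair of sets is a simplifying assumption.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F1 of extracted skills against expert-validated skills."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# The method extracted four skills; the expert confirmed two and added one it missed.
precision, recall, f1 = precision_recall_f1(
    extracted={"python", "sql", "excel", "leadership"},
    gold={"python", "sql", "communication"},
)
```

Here two of the four extracted skills are correct (precision 0.5) and two of the three expert-validated skills were found (recall 2/3), giving an F1-score of 4/7.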
For topic modeling methods, no formal evaluation is needed as the method is unsupervised and learns from the words within the job descriptions. Once the results are generated, experts inspect and interpret the generated topics. For ML-based methods, most studies focus primarily on a binary classification task. Thus, studies rely on precision, recall and F1-score to evaluate their models. For the multi-label classification task, the authors of [47] used Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (nDCG) and recall to evaluate the performance of skill extraction. MRR reflects the position of the first true positive in the predicted list of skills: it averages the reciprocal of that rank over all job ads and yields a score between 0 and 1. To rank the relevance of the returned skills, they also used the nDCG score, which discounts the true positive skills that occur later in the extraction rankings.
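The two ranking metrics can be sketched as follows. The nDCG variant shown uses binary relevance (a skill is either required or not) and a base-2 logarithmic discount; both are assumptions, since the exact formulation used in [47] is not restated here.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant skill, or 0 if none is returned."""
    for rank, skill in enumerate(ranked, start=1):
        if skill in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases):
    """Average reciprocal rank over (ranked_skills, relevant_skills) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in cases) / len(cases)


def ndcg(ranked, relevant, k=None):
    """Binary-relevance nDCG: hits found later in the ranking count for less."""
    ranked = ranked[:k] if k else ranked
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, skill in enumerate(ranked, start=1)
        if skill in relevant
    )
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


cases = [
    (["excel", "python", "sql"], {"python", "sql"}),  # first hit at rank 2
    (["python"], {"python"}),                         # first hit at rank 1
]
mrr = mean_reciprocal_rank(cases)
score = ndcg(["excel", "python", "sql"], {"python", "sql"})
```

In the first case the irrelevant skill at rank 1 pushes both required skills down, so the nDCG falls below 1, whereas a ranking that lists the required skills first would score exactly 1.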
Many studies depend on the involvement of experts for evaluating skill identification, since publicly available labeled datasets for skill identification from job ads are lacking. More recently, in [47], the authors released a labeled dataset of job descriptions with the required skills. Such a dataset could facilitate the evaluation and comparison of skill identification methods.

E. SKILL IDENTIFICATION GRANULARITY
In the skill identification literature, we can distinguish three approaches to skill identification granularity. The first approach identifies skills as single words or multi-word phrases. The second approach identifies skills expressed in sentences. The third approach combines the two, identifying skills both as sentences and as n-grams. In this section, we review the previous studies on skill identification of the job market in terms of skill identification granularity and summarize them in Table 4.
Identifying skills as n-grams is popularly applied to analyze skill requirements from job ads. Due to the complexity of dealing with sentences, we notice a small fraction of studies that deal with such skills.

1) SKILLS AS N-GRAMS
Identifying skills as n-grams treats job descriptions as a bag of words. Skills are identified as single words or multi-word phrases, from bi-grams up to 4-grams, in order to capture composed skills. The first studies inspecting skills from job ads identified skills as n-grams [21], [61]. Through the different skill identification methods depicted in Table 4, researchers captured skills as n-grams from job ads. In skill count, researchers performed naive matching of single-word or multi-word skill terms. In a similar way, the emerging topics from topic modeling were composed of words or multi-word phrases that express skills. Then, researchers resorted to skill embedding [26], [43] in order to reduce the number of falsely extracted skills expressed as uni-grams or multi-grams. Such identification works well for programming languages, natural languages, concept words and even soft skills.
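A naive n-gram match against a skill base, as performed in skill count approaches, can be sketched in a few lines. The tokenizer, the tiny skill base and the function names are assumptions made for the example; real skill bases such as ESCO contain thousands of entries.

```python
import re


def ngrams(tokens, n):
    """All contiguous n-token phrases of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_skills(text, skill_base, max_n=4):
    """Return every uni-gram up to `max_n`-gram of `text` found in `skill_base`."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    found = set()
    for n in range(1, max_n + 1):
        found.update(gram for gram in ngrams(tokens, n) if gram in skill_base)
    return found


# Hypothetical skill base and job ad sentence.
skill_base = {
    "python",
    "machine learning",
    "natural language processing",
    "project management",
}
ad = "We seek a Python developer with machine learning and natural language processing experience."
matched = match_skills(ad, skill_base)
```

The same function would also return 'project management' for a sentence such as "Our company is a leader in project management software", which describes the employer rather than a requirement; this is the kind of false positive that motivates sentence-level approaches.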

2) SKILLS AS SENTENCES
In natural language, the full meaning of a sentence is determined by the words and the syntax. The disadvantage of tagging skills as single words or multi-word phrases is the loss of the structural relations between words in the sentence, which can limit the skill identification. For example, sentences describing the company may cause false-positive skill extractions. Therefore, some researchers have taken a further step by first detecting the sentences that contain skill terms, using deep learning models to label job description sentences. For instance, [45] and [46] identify the sentences containing soft skills and hard skills, respectively; skills are then identified as n-grams from a predefined skill base or list of skills.
Another limitation of skill identification using n-grams resides in its inability to distinguish between sentences like ''Follow the schedule'' and ''Schedule tasks'', where word-level matching fails by identifying 'scheduling' as a skill in both sentences. In [62], the authors use pattern mining to identify emerging skills expressed as sentences with action verbs. Based on stemmed sentences, they examine the occurrences of such sentences to highlight the emergent competencies. Moreover, some researchers have resorted to manually extracting skills from job ads to better identify skills expressed in all formats (sentences and n-grams) [37].

F. STUDIED SECTORS
Existing studies identify the required skills from job ads at different levels and sectors. Skill identification started by analyzing skills from a single sector or occupation. Such a choice was made in order to keep the number of job ads manageable. Recently, the growing number of complex data collection technologies has increased the demand for large-scale and real-time skill identification in the job market. Therefore, researchers have addressed the issue of skill identification from all online job ads without distinction. In this section, we review the previous studies on skill identification of the job market in terms of the studied sectors, i.e., sector-specific and multiple sectors, and summarize them in Table 5.

1) SKILL IDENTIFICATION FROM SPECIFIC SECTORS
Different sectors were analyzed in the literature. Table 5 lists the studied sectors, such as IT, healthcare, renewable energy, operations research and engineering. However, the majority of the studies targeted IT to better understand the soft and hard skill requirements of different IT occupations through the analysis of online job ads. One explanation of such interest is the large volume and availability of IT job ads, which represent the majority of posted job ads [3].

2) SKILL IDENTIFICATION FROM MULTIPLE SECTORS
Recent studies proposed to identify skills from multiple sectors for a comprehensive skill identification and to shed light on skill requirements from multiple sectors without exception. Such studies deal with a larger job ad collection, where the identification is performed automatically to cope with the huge number of job ads. To do so, researchers relied on skill bases to generalize the skill identification across multiple sectors. In [26]-[28], the authors generated a comprehensive skill base relying mainly on the Wikipedia knowledge base. In addition, customized skill bases were manually tailored for the same purpose. In [24], [70], the authors enriched the ESCO skill base by capturing the relevant n-gram expressions. Then, they tagged skills from job ads of different sectors. Moreover, skill embeddings were leveraged by different studies [30], [43], [71] to disambiguate skills of the different sectors, as depicted in Table 3. Furthermore, the authors of [47] used a multi-label classifier to tag skills from job ads without any distinction of the sector related to the job ad.

III. APPLICATION AND EXPERIMENT OF SKILL IDENTIFICATION
Skill identification through online job ads is performed for different applications and goals. In particular, understanding the job market needs is a direct application, where the required skills are extracted for different occupations, sectors and industries. Other applications have been derived such as skill mismatch, Job & MOOC recommendation, skill demand prediction and skill salience measurement. In this section, we review the previous studies on skill identification of the job market in terms of its applications and goals, i.e., skill extraction and job market analysis, curricula design, job & MOOC recommendation, skill mismatch & alignment, skill demand prediction and skill salience measurement; these studies are summarized in Table 6.

A. EXISTING APPLICATIONS OF SKILL IDENTIFICATION
1) SKILL EXTRACTION AND JOB MARKET ANALYSIS
Job market analysis is incontestably a valuable input and a resourceful asset for many entities, e.g. Human Resources departments to better acknowledge their current and future needs, and educators, professionals and job seekers to meet the industry needs.
In an attempt to understand job market needs, several studies extracted skills from job ads and other data sources. The requirements of employers and recruiters are extracted and identified through the posted job ads. Some studies focused on understanding hard skills requirements, while others examined soft skill needs. Such studies varied in scope; [22], [23], [65] inspected specific sectors, while [25]- [28] proposed skill extraction systems that deal with job ads from multiple sectors in order to draw a comprehensive job market analysis.
Moreover, in [20], the authors studied multi-perspective popularity of skills; they identified popular skills by examining the generated graph of co-occurring skills from different perspectives (salary-wise or company-wise). Furthermore, Giabelli et al. [70] generated a graph from online job ads and developed skill bases of different countries to compare and evaluate country-based labor market dynamics for supporting policy and decision-making activities at the European level. The graph encompassed different nodes such as skills, occupations, cities and sectors.

2) CURRICULA DEVELOPMENT
One direct application of the identification of the skills required by the job market is the development or redesign of the curriculum. A curriculum that responds to the job market needs will offer graduates the necessary skills to find their first job. The purpose of [68], [69], [89] was to provide necessary information for curriculum development by identifying the skills that should be encompassed in the program. These studies were mainly conducted for the IT sector to identify the technologies and programming languages sought after in the job market.

3) JOB AND MOOC RECOMMENDATION
The main idea of a job recommendation system is to provide a set of job recommendations in response to a user's current profile. In these systems, the users can typically upload their skills, resumes or their job search criteria; similarly, the employers or their agents can upload a job ad or skill sets needed along with other information such as location, position and other job-specific details. More recently, MOOC recommendation systems were proposed. Such systems find and recommend relevant MOOC materials to master the skills required by the job market.
Many studies built recommendation systems for jobs and MOOCs using the skills identified from job ads. In [130], the authors presented a data-driven job search engine. It offers comprehensive search filters, including attributes focused on the user's skill set and various company attributes. The returned jobs are filtered according to the extracted skills. The authors of [131] proposed Jobsense, a framework that gathers and integrates jobs, skills and careers data from multiple data sources such as job ads. In [30], the authors extracted explicit and implicit skills from job ads and candidate resumes for an accurate job recommendation. For the same purpose, in [43], the authors built a job understanding model to improve the representation of jobs and proposed an improvement for the job posting flow in LinkedIn. In [132], the authors built a job recommender that jointly learns the representation of the jobs and skills in the shared k-dimensional latent space of job transition network, job-skill network, and skill co-occurrence network. In a similar way, the authors of [133] proposed a recommender system that, starting from a set of users' skills, identifies the most suitable jobs as they emerge from a large dataset of online IT job ads, which were processed and represented as a graph of occupations and skills.
Likewise, [121] built a MOOC recommender based on skill requirements found in job ads. This was developed using a machine learning model to predict whether a given video fits a skill extracted and required by the job market. Similarly, the authors of [134] examined the dynamics between online learning platforms and online hiring platforms in the software programming profession by combining four data sources: Stack Overflow, Google Trends, Udemy (a platform offering skill-based Massive Open Online Courses (MOOCs)), and Stack Overflow Jobs. One important finding of the study is that it takes only a few months between the first Stack Overflow appearance of a new skill and its first appearance on Udemy or Stack Overflow Jobs.

4) SKILL MISMATCH AND ALIGNMENT
Skill mismatch is a prevailing issue around the world. There is a growing concern that skills available among the workforce cannot meet the fast-changing demands of the economy, thus creating a major barrier to growth and development. A fast-changing job market, affected by several factors and trends such as globalization and demographic change, gives an impression of an expanding skill gap and brings greater urgency to policy implementation. There are different types of skill mismatch, covering many forms of job market asymmetry. We can cite skill gaps and skill shortages. Skill gaps are when workers lack the skills necessary to do their jobs effectively. Skill shortages are when employers cannot find enough professionals with the right qualifications and skills [135]. They both express an asymmetry between skill supply and skill demand, which made their comparison a focus of different studies.
A skill shortage tool was proposed in [117]; the tool detects skill shortages by correlating different features in the job ad such as its location, salary level and the job ad period being advertised. In [85], they developed a data-driven solution to detect skill shortages from online job ads data. To do so, they identified skills specific to data science and analysis, to capture the labor trends of such an occupation. Then, they inspected the evolution of such skill set in a collection of job ads through different dimensions such as salary levels, education requirements, required experience and posting frequency. Moreover, the authors of [16] tested the alignment of academic profiles and job advertisements, thus detecting the academic topics with which the job offers are most aligned.
We can also cite [61], which is one of the early studies that examined skill mismatch through skill identification from job ads. The study identified current needs and requirements of IT skills through online job ads. Moreover, the authors inspected universities' online courses and module descriptors of UK academic institutions to shed light on the skill mismatch between higher education and job market needs. In [66], based on IT job ads, IT graduate profiles, IT certification schemes and IT national job classification, the authors compared the skills needed in each data source to measure the mismatch.

5) COMPETITIVE INTELLIGENCE AND TALENT SEARCH
Staying attentive to job market needs not only benefits job seekers; it is also valuable intelligence for companies seeking to stay competitive in the market. By tracking trending skills, companies can not only adapt and offer innovative services but also keep track of the most sought-after skills, which guides organizations in composing effective portfolios. In particular, sector-specific studies focused on identifying the popular and the most sought-after skills to better understand job skills. In [8], [9], [40], [41], [127], the authors inspected the skills in data-related positions, such as business intelligence and big data roles, in order to build a classification of the skills required in such roles. This classification provides a framework that organizations can use to inventory their existing workforce competencies and establish clear strategies for the acquisition and the development of the right skills needed to become more data-driven.
Attracting the best candidates and profiles remains a prevailing concern for most companies. Therefore, companies are constantly innovating in recruitment practices by boosting the employer brand, hiring headhunters, and staying active on social networks. In [128], the authors proposed a skill identification tool that serves talent search. By identifying skills cited both in the job ad and in candidate profiles, the tool can find the best candidates available in resume databanks. In [136], the authors proposed a person-job fit neural network that measures the fit between job requirements and candidate resumes.

6) SKILL DEMAND PREDICTION
Anticipating future trends is beneficial for making decisions that follow the evolution of the environment. In the job market context, anticipating future needs will allow job seekers to adapt rapidly. A fast-changing job market, affected by several factors and trends, has fast-changing requirements that need close monitoring. Therefore, the identification of any new trend regarding required skills in the job market will enable job seekers and education and training organizations to better respond to such needs. In [62], the authors inspected such needs by identifying emerging competencies through pattern mining of skills expressed with action verbs in job ads. In [129], the authors used time series to discover high-demand skills. Using co-occurring skills observed in job ads, they generated a skill graph in which nodes represent skills and edges denote co-occurrence. Then, they inspected the evolution of the clusters of skills over time.
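A skill co-occurrence graph of this kind can be sketched as a weighted edge list. The representation below (a counter keyed by alphabetically ordered skill pairs) and the sample ads are assumptions for illustration, not the data structure of [129].

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(ads):
    """Count, for every pair of skills, the number of ads citing both."""
    edges = Counter()
    for skills in ads:
        # Sorting makes (a, b) and (b, a) the same undirected edge.
        for a, b in combinations(sorted(set(skills)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical skill lists extracted from three job ads.
ads = [
    ["python", "sql"],
    ["python", "sql", "spark"],
    ["sql", "excel"],
]
graph = cooccurrence_graph(ads)
```

Building one such graph per time window (for instance, per quarter) and comparing edge weights or clusters across windows is one simple way to observe how groups of skills evolve.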

B. RECENT APPLICATIONS OF SKILL IDENTIFICATION
1) SKILL SALIENCE
Skill salience focuses on identifying the skills that are most significant to the core job function and occupation. The term 'Salience' was introduced in [137], where the authors argue that not all skills are required equally, as some skills may be more important than others to the core job function. They relied on different signals to identify and measure the importance of skills. Such salience is computed through labeled segments and sentences of the job ad. Similarly, the authors of [133] computed a measure of skill importance for each occupation in each country, using the Revealed Comparative Advantage (RCA) measure, inspired by the work in [138]. This makes it possible to focus on skills that are over-expressed in occupations.
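As an illustration of the RCA idea, the sketch below uses the standard ratio-of-shares definition: the share of a skill within an occupation divided by the share of that skill across all occupations. Whether [133] uses exactly this formulation is an assumption, and the counts are invented.

```python
def rca(counts, occupation, skill):
    """Revealed Comparative Advantage of `skill` in `occupation`.

    `counts[occupation][skill]` is the number of ads of that occupation citing the skill.
    Values above 1 indicate a skill that is over-expressed in the occupation.
    """
    occupation_total = sum(counts[occupation].values())
    skill_total = sum(c.get(skill, 0) for c in counts.values())
    grand_total = sum(sum(c.values()) for c in counts.values())
    share_in_occupation = counts[occupation].get(skill, 0) / occupation_total
    share_overall = skill_total / grand_total
    return share_in_occupation / share_overall


# Hypothetical skill mention counts per occupation.
counts = {
    "data scientist": {"python": 80, "communication": 20},
    "sales manager": {"python": 5, "communication": 95},
}
rca_python = rca(counts, "data scientist", "python")
rca_communication = rca(counts, "data scientist", "communication")
```

With these counts, 'python' is over-expressed for data scientists (RCA above 1) while 'communication', although frequently cited overall, is not specific to that occupation (RCA below 1).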
A prior study [10] tried to quantify skill relevance to job titles using the term frequency-inverse document frequency (TF-IDF) model while computing the frequency of skills. The authors argue that if a skill is required by many job ads having the same job title, it is likely to be an important skill.
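One way to apply TF-IDF at the job-title level is to treat all skills cited under a title as one document. The sketch below does so with relative term frequency and an unsmoothed logarithmic IDF; these weighting choices and the sample data are assumptions, as the exact variant used in [10] is not specified here.

```python
import math
from collections import Counter


def tfidf_by_title(skills_by_title):
    """Score each skill per job title: frequent within the title, rare across titles."""
    n_titles = len(skills_by_title)
    document_frequency = Counter()
    for skills in skills_by_title.values():
        document_frequency.update(set(skills))
    scores = {}
    for title, skills in skills_by_title.items():
        term_frequency = Counter(skills)
        scores[title] = {
            skill: (count / len(skills)) * math.log(n_titles / document_frequency[skill])
            for skill, count in term_frequency.items()
        }
    return scores


# Hypothetical skill mentions pooled over the ads of each job title.
skills_by_title = {
    "data analyst": ["sql", "sql", "excel", "communication"],
    "web developer": ["javascript", "javascript", "css", "communication"],
}
scores = tfidf_by_title(skills_by_title)
```

A skill cited under every title, such as 'communication' here, receives a score of zero, while a skill cited often under a single title ranks highest for it.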

2) GENDER BIAS
Examining gender bias in job ads has emerged as a new application of skill identification. The purpose of this application is to measure gender bias and inequality while writing the job ad. It could also shed light on gender bias in the job market and promote fairness by reducing inequality.
In [39], the authors examined soft skills required by women and men in job ads and found that soft skills are associated with gender segregation across occupations and reinforce wage inequalities between men and women by rewarding typically ''male'' characteristics and penalizing ''female'' traits. In [139], the authors examined how job seekers react to requirements in job ads. More precisely, they investigated the impact of job requirements on women's job attraction and decision to apply. To tackle these inequalities, a tool was proposed by [140] to detect keywords that discourage women from applying and improve gender-neutrality in job ads.

IV. DIRECTIONS FOR FUTURE WORK
In the past decade, many studies have been conducted to identify skills from job ads. The availability of job ads online has unlocked many horizons for researchers and gave them the opportunity to examine job market needs at a large scale. Moreover, with the increasing power of artificial intelligence techniques, especially the ability to digest unstructured text, extracting more valuable and accurate information from Web media and capturing job market signals and needs with greater sensitivity and precision have become possible. We believe that the study of job market needs, and more specifically skill identification from job ads, will continue for a long time to come. In this section, we briefly describe areas or aspects that we believe are in need of future research and advancement.
• Skill Identification using Deep Learning: Some studies have already explored the use of deep learning for skill tagging from job ads, e.g. [44]- [46]. However, to the best of our knowledge, a comprehensive skill recognizer that leverages deep learning to detect skills expressed in their different terminologies is still lacking. Therefore, investigating the use of recent deep learning techniques for skill identification is an interesting direction for researchers interested in uncovering the job market needs.
• Skill Identification through Graph Embedding: Uncovering skill connections through graph embedding seems to be a promising area of research. Some studies have already explored the use of graphs in examining skills in IT job ads [20], [129]. Generalizing such studies to other sectors would be interesting.
• Understanding Skill Demand Evolution: Predicting skill demand can help better acknowledge future job market needs. Such information is valuable to different entities as it sheds light on a constantly changing job market. Some work has already examined skill demand prediction (see Section III-A6). Such prediction could be performed through a temporal analysis of skill demand such as time series analysis and emergence detection. To perform such a task, historical data is required to detect the variability in skill demand and then build predictive models. In particular, combining such data with other sources such as scientific papers can help shed light on the evolution of new skills from their first introduction to the job market to their surging demand.
• Level of expertise: It would be interesting to examine and measure the importance of skills and the level of expertise required in job ads. Such a task could be performed by examining the context of the skills, where optional skills should be given less attention than required skills. More specifically, such scaling of skills could be done by classifying skills as 'important', 'required', or 'optional', and expertise as 'high expertise', 'medium expertise' or 'basic knowledge'.
• Skill Bases Generation: Many studies constructed skill bases for skill extraction from job ads. However, it would be interesting to construct a skill base that encompasses a complete occupation description with the required tasks, skills and education requirements. Such evolving skill bases would be valuable in detecting emergent skills and jobs.
• Open Datasets for the Skill Identification Task: Open datasets of job ads with the required skills mapped to the different skill bases can help improve the skill identification methods. Moreover, such datasets would structure the task of skill identification from job ads by evaluating the performance of different identification methods on the same benchmark datasets.

V. CONCLUSION
In this study, we systematically reviewed 108 research articles on skill identification from job ads published between 2010 and 2020. We presented, in particular, a comprehensive survey on skill identification. We provided a framework for dividing the challenge of skill identification into three main issues, i.e., skill base generation, skill identification method, and skill identification granularity, to clarify the paths for further improvements.
We detailed the different applications of skill identification to clarify the importance of such information extraction. These include the identification of skill mismatch, which is a prevailing issue around the world, job recommendation, and job demand prediction. Finally, we provided insightful suggestions for future studies.

ACKNOWLEDGMENT
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of Google.