Job forecasting based on the patent information: A word embedding-based approach

The rapid change in technology makes it challenging to forecast the future of jobs. Previous studies have analyzed economics and employment data or employed expert-based methods to forecast the future of jobs, but these approaches could not reflect the latest technology trends in an objective way. To overcome this issue, this study matches jobs with patents and forecasts the future of jobs based on changes in the number of patents over time. A word embedding model is trained on patent classification code and job description data and used to find patent classification codes similar to each job. For illustration purposes, we identify information technology-related jobs listed in O*NET and discover patent classification codes similar to those jobs. Based on the change in the number of patents, we find promising jobs presenting high technical demand. Several implications of our approach are also discussed.


I. INTRODUCTION
An advance in technology brings considerable changes to the job market. To occupy a dominant position in their business fields, executives need to predict changes in jobs and preemptively adopt new business models. Over the last few decades, companies such as Google, Microsoft, Amazon, and Facebook have developed information technologies and hired many professional experts to maintain and advance those technologies. Following this significant change, many companies modified their initial business models and tried to respond quickly and adequately. On the other hand, some other companies underestimated the technological advances and failed to address the change sufficiently. Consequently, their failure to anticipate the technology led them to lose their status. In other words, this suggests that the proper projection of future change significantly determines the success of business models.
Researchers have considered various approaches to predicting changes in jobs. Some have used economics and employment data to identify quantitative features of the job market and produced prediction reports on changes in the market. This approach was based on statistical data and thus adequate for predicting the near future of the job market. However, it was limited because technological changes were not sufficiently considered. Thus, other researchers have employed expert-based approaches such as Delphi and structured survey methods. These approaches secured experts in the target fields and analyzed their opinions in a systematic way, aiming to render more reliable and practical predictions for the job market. However, because the approach mainly relied on experts, the prediction results could be biased depending on the experts' opinions, beliefs, and convictions.
Although economics and employment data do not contain information about technologies, another type of data, patents, provides meaningful information about how technology will change future jobs. Patents contain the contents of a technology and the right to use it. Thus, people try to register patents as soon as possible to secure their rights to the technology and occupy a dominant position. Because of this property, patents have been utilized in technology forecasting to identify the latest technologies and their futures. In other words, if researchers can calculate similarities between jobs and patents, they can track and forecast changes in a particular job through the related patents. For example, we can argue that the importance of a specific job has increased if the number of patents with a specific patent classification code shows an increasing pattern and the classification code is closely related to the job. Focusing on this issue, this study reviews existing methods for calculating similarities between jobs and patents and suggests a word embedding-based approach to job forecasting. We collect descriptions of job attributes from the website of the Occupational Information Network (O*NET) and descriptions of patent classification codes from the website of the Cooperative Patent Classification (CPC). Based on the descriptions, we train a word embedding model and extract embedding vectors for the descriptions. Then, we find the most similar patent classification code for each job and create a matching table.
The results show that the suggested approach can match job information with patent information better than existing methods. Also, for illustration purposes, we analyze information technology-related jobs and discuss the analysis results in detail. Our approach does not rely solely on people's subjective viewpoints on jobs but provides a forecast based on the change in the number of patents. This enables people to catch future trends in jobs objectively and helps draw plausible predictions about jobs.

II. LITERATURE REVIEW

A. JOB FORECASTING
Predicting changes in the job market has been an important issue in politics and economics research. In the past, researchers argued that the rapid advance of high technologies had a limited impact on the job market because only a small share of the job market was related to high technology, and the employment growth would be small [1]. However, this belief has changed with the fourth industrial revolution. Robotics and artificial intelligence (AI) technologies are expected to replace simple labor and make workplaces more skill-polarized [2,3]. Researchers have tried to predict the future of the job market based on numerical data (e.g., labor employment [4] and statistics of graduates [5]) or textual data (e.g., job postings [6,7], job attributes [8], and Google Trends [9]). However, these approaches mainly focused on occupational data and gave limited consideration to changes in technology.
Several research institutes and firms have utilized Delphi and structured survey methods to forecast the future of the job market, considering the technological aspect. For example, the Millennium Project has conducted Delphi studies for the future of work and released reports based on the findings [e.g., 10]. Similarly, the World Economic Forum has gathered experts' opinions through structured surveys and announced reports for future jobs [e.g., 11]. These studies focused on a macro perspective on the job market.
These methods have also been utilized to forecast specific jobs in detail, such as hotel management [12], marketing [13], and auditing [14], or specific fields such as AI [15]. Researchers tried to forecast future jobs based on experts' opinions in various fields rather than simply predicting broad trends.
On the other hand, some studies suggested a hybrid approach combining experts' opinions and job-related data. For instance, Bakhshi, et al. [16] gathered experts' opinions on jobs and trained a machine learning model based on the opinions, job description data from O*NET, and employment data from the Bureau of Labor Statistics (United States) and the Office for National Statistics (UK). Then, they predicted future demand for jobs using the model and anticipated new occupations. However, similar to Delphi and focus group interview techniques, the expert-based approach is limited because experts' views on future jobs can differ greatly even when they share a similar background. For example, Gruetzemacher, et al. [17] surveyed attendees of AI conferences and found that their forecasts on labor-displacing AI scenarios were significantly different depending on which conferences they attended. This suggests that forecasts of future jobs should be based on quantitative data that objectively shows the change of technology.

B. PATENT ANALYSIS
State-of-the-art technologies are often disclosed to the public through academic articles or patent documents. Because of the different characteristics of these sources, it is hard to determine which one is more useful for capturing technology trends. In general, academic articles are published after peer review, and this process is helpful when researchers need to verify and improve their ideas. However, an academic article offers limited protection of the researchers' rights to an idea because it has no legal force. In contrast, patent documents clarify applicants' rights to an idea and legally protect it. For this reason, patent examiners mainly focus on the distinctiveness and exclusiveness of an idea rather than evaluating the idea itself. In the absence of a peer-review process, patents may not reflect technological advances as promptly as academic articles. However, patents are often preferred in business fields because they carry the legal force to claim exclusive use. In addition, because there is a periodic cost to renew a patent, a patent is more likely to contain practical technologies than an academic article.
Focusing on these characteristics of patents, researchers have analyzed patents to forecast future technologies. Kim and Bae [18] clustered patents based on the patent classification code and assessed the clustering groups based on the top 10 classification codes. Promising patents in the groups were identified by forward citations, triadic patents, and independent claims. Cho, et al. [19] identified keywords related to high-rise building construction and secured 2875 patents in the United States, Europe, Korea, China, and Japan. By analyzing the growth in the number of patents in each technology field, they forecasted promising technologies and diagnosed the levels of technology in each country. Kyebambe, et al. [20] considered that new classes added to the existing patent classification category indicate emerging technologies and utilized the classes to develop a supervised labeling method to forecast emerging technologies. Zhang, et al. [21] suggested that a topic modeling method can capture a technology roadmap. They divided patents into three groups based on their registration years and constructed topic models with the divided patent data. Then they calculated similarities between topics in different models and showed which major topics were considered in the field of blockchain and how they evolved over time.
Although patents have frequently been utilized to forecast future technologies in specific fields, only a few researchers have used patents to forecast future jobs. Dechezleprêtre, et al. [22] analyzed the impact of a particular technology on the labor market based on changes in the number of patents and wages. However, they focused on retrospective analysis from a macroscopic perspective and did not establish explicit links between jobs and patents. Webb [23] employed a dependency parsing algorithm to extract verb-noun pairs from task descriptions of jobs and titles of patents and calculated similarity scores between jobs and patents. The study suggested a patent-based approach to investigating changes in the job market. However, the accuracy of matching jobs and patents could be improved further by utilizing a more sophisticated natural language processing model.
Recent studies showed that text embedding models such as Word2Vec [24], Doc2Vec [25], FastText [26], and BERT [27] can learn from large text corpora and embed texts as numerical vectors. This suggests that a text embedding model trained on descriptions of patents and jobs can transform the descriptions into embedding vectors, and similarities between jobs and patents can be calculated based on those vectors. By analyzing changes in patents and linking patents with similar jobs, we can forecast how jobs will change following changes in technology.

III. METHOD
As reviewed, job forecasting has relied on occupational data or experts' knowledge but not on data reflecting technological changes. Previous studies showed that patents can be a reliable source for capturing technological changes in specific fields. However, patents have not been used for job forecasting because there was no method to bridge job and patent information. To solve this problem, we need to examine several text embedding models for matching patents with jobs and find an effective one. Once such a model is found, we can forecast a particular job's future based on the change in the number of patents related to the job. Developing this analogical process will contribute to more plausible job forecasting.
Concerning the issue, this study suggests a framework to investigate job futures based on patent classification codes. To match jobs with similar patent classification codes, we trained a word embedding model with descriptions of job attributes and patent classification codes and extracted embedding vectors from the descriptions. The extracted vectors were used to calculate similarities of descriptions, and jobs were matched with similar patent classification codes. After the matching process, job futures were forecasted based on the change of patent classification codes with time. FIGURE 1 shows the overall process of the job forecasting framework.

A. TRAINING A WORD EMBEDDING MODEL
To calculate similarities between jobs and patents, we secured data for descriptions of job attributes from the website of O*NET (www.onetonline.org) and patent classification codes from the website of CPC (www.cooperativepatentclassification.org). The O*NET data contains information about 1016 jobs, including job description, knowledge, skill, ability, task, technology skill, tool, and work activity, and has been used for forecasting future jobs [e.g., 16,28]. A patent document is composed of multiple components. Previous studies have focused on the title, abstract, claims, and classification codes of patents to capture the contents of patents [e.g., 29]. The job attributes and patent components can be used to train a text embedding model. Similarities for every pair of job attributes and patent components can be calculated based on the embedding vectors. However, this approach would not be practical because the calculation cost is prohibitive. Thus, we only considered descriptions of patent classification codes rather than all components in every single patent. In this way, we trained a text embedding model with descriptions of job attributes and patent classification codes and extracted embedding vectors for the descriptions.
In general, document embedding models are expected to show better performance than word embedding models because they can reflect contextual information in texts more effectively. However, because the texts used in this study are short and the dataset is small, we considered word and document embedding models together and examined their performance. Among various models, Word2Vec, Doc2Vec, FastText, and BERT are the most popular in previous studies, and we trained the models through unsupervised learning with the descriptions of job attributes and patent classification codes. Because BERT cannot be trained in this unsupervised fashion (it instead requires fine-tuning), we employed the pre-trained BERT-base model for comparison. All model parameters were set to their default values.
Before the training process, we preprocessed the description texts to help the models learn the descriptions effectively. In detail, we tokenized each text into words, lemmatized the tokens using the WordNet lemmatizer, and removed stopwords using the NLTK (Natural Language Toolkit) package in Python. Lemmatization was conducted considering each word's part of speech, which was identified by the NLTK part-of-speech tagger.

B. VALIDATING THE TRAINED MODEL
Because the model was trained by an unsupervised learning method, we additionally devised classification tasks to validate it against existing classification systems. O*NET organizes jobs into 31 career clusters. Similarly, the CPC system has a hierarchical architecture, and classification codes can be categorized into 136 classes. We extracted embedding vectors for the descriptions of job attributes and made job representation vectors for each job. Then, we clustered the job representation vectors into 31 groups and compared them with the career clusters. Likewise, we extracted vectors for patent classification codes, clustered them into 136 groups, and compared them with the classes in the CPC system. To examine a proper form of the job representation vector, we made four different types of job representation vectors extracted from (1) job title, (2) job description, (3) job title and description, and (4) job title and attributes, and examined which one showed the best performance in the validation process. The K-means algorithm and the normalized mutual information score were used for clustering and for comparing the clustered groups, respectively. FIGURE 2 and FIGURE 3 show the validation process in detail.
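The validation step can be sketched with scikit-learn (random vectors and labels stand in for the real job representation vectors and O*NET career clusters):

```python
# Sketch: cluster job vectors with K-means and compare the clusters to
# reference labels via normalized mutual information (NMI).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
job_vectors = rng.normal(size=(90, 100))        # e.g., 90 jobs, 100-dim
career_clusters = rng.integers(0, 31, size=90)  # reference labels (toy)

kmeans = KMeans(n_clusters=31, n_init=10, random_state=0)
predicted = kmeans.fit_predict(job_vectors)

nmi = normalized_mutual_info_score(career_clusters, predicted)
print(round(nmi, 3))
```

NMI is invariant to how cluster ids are numbered, which is why it suits comparing unsupervised clusters against a labeled taxonomy.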

C. MATCHING AND FORECASTING JOBS
We calculated cosine similarities between the job representation vectors and the vectors for patent classification codes and found similar patent classification codes for each job. Different results can be obtained in this process depending on how deep the patent classification codes considered for comparison are (e.g., full codes or codes truncated at a certain level). Thus, we examined similarities between jobs and patent classification codes at different code levels. Based on the similarity calculation, jobs were matched with similar patent classification codes, and their futures were forecasted based on the change of the matched codes over time.

Descriptions of job attributes for each job were vectorized by the trained embedding model and used to construct four different types of job representation vectors (job title, job description, job title and description, and job title and attributes). We applied the K-means algorithm to each type of job representation vector and clustered the vectors into 31 groups. The clusters were compared with the 31 career clusters developed by O*NET and evaluated based on the normalized mutual information.
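The matching step can be sketched as follows (the vectors are hypothetical; `truncate` illustrates comparing codes at a coarser level, e.g., the four-character subclass prefix of a CPC code):

```python
# Sketch: match a job vector to its most similar CPC-code vector by
# cosine similarity, optionally truncating codes to a coarser level.
import numpy as np

def truncate(code, level):
    """Keep only the leading characters of a CPC code (e.g., subclass)."""
    return code[:level]

def best_match(job_vec, cpc_vecs):
    """Return the CPC code whose vector is most cosine-similar."""
    sims = {code: float(np.dot(job_vec, v) /
                        (np.linalg.norm(job_vec) * np.linalg.norm(v)))
            for code, v in cpc_vecs.items()}
    return max(sims, key=sims.get)

rng = np.random.default_rng(1)
cpc_vecs = {"G16H10/60": rng.normal(size=8),
            "A61B5/00": rng.normal(size=8)}
# A job vector constructed to lie near the first code's vector.
job_vec = cpc_vecs["G16H10/60"] + 0.01 * rng.normal(size=8)

print(best_match(job_vec, cpc_vecs), truncate("G16H10/60", 4))
```

Truncating before matching trades specificity for robustness: fewer, broader code groups mean coarser but more stable matches.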

IV. RESULTS

A. MODEL TRAINING AND VALIDATION
We examined four different types of models, and their performances are presented in TABLE 1. As presented in the table, the FastText model showed the best performance in the validation tasks. Normalized mutual information scores for job representation vectors made from (1) job title, (2) job description, (3) job title and description, and (4) job title and attributes were 0.313, 0.415, 0.424, and 0.264, respectively, and we confirmed that the combination of job title and description was the most effective representation of the job information. Descriptions of CPC codes were also vectorized by the trained model (CPC description vectors) and clustered into 136 groups. The clusters were compared with the 136 classes of the CPC system, and the normalized mutual information score was 0.445. A normalized mutual information score close to 0.4 corresponds to a true positive rate of approximately 0.84 [30], so we confirmed that the trained embedding model adequately represented the job and patent information. FIGURE 4 shows the results of job and patent vector clustering.

B. MATCHING AND FORECASTING JOBS: A CASE OF INFORMATION TECHNOLOGY-RELATED JOBS
We calculated cosine similarities between job representation vectors and CPC description vectors and made a matching table. We then tracked the number of patents for the matched CPC codes over the last five years. Based on the results, we can say that people's interest in information security analysts has grown rapidly over the last five years, and technical demand for information security analysts is expected to rise. Interest in computer network architects, data warehousing specialists, network and computer systems administrators, and computer network support specialists has also increased, and technical demands for these jobs are expected to be high. However, database administrators, minor computer occupations, computer programmers, database architects, and video game designers did not show notable increases in the number of patents. Compared to other jobs, these jobs were not receiving much attention, and technical demands for them are expected to remain at their current levels.
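The counting step behind such a forecast can be sketched as follows (the patent records below are fabricated examples, not real filing data):

```python
# Sketch: count patents per matched CPC code per year and flag codes
# whose counts are non-decreasing over the analysis window.
from collections import Counter

records = [  # (cpc_code, filing_year), fabricated examples
    ("H04L63", 2016), ("H04L63", 2017), ("H04L63", 2018),
    ("H04L63", 2019), ("H04L63", 2019),
    ("H04L63", 2020), ("H04L63", 2020),
    ("G06F16", 2016), ("G06F16", 2018),
]

def yearly_counts(records, code, years):
    """Count patents per year for one CPC code."""
    c = Counter(year for cpc, year in records if cpc == code)
    return [c.get(y, 0) for y in years]

years = list(range(2016, 2021))
counts = yearly_counts(records, "H04L63", years)
increasing = all(a <= b for a, b in zip(counts, counts[1:]))
print(counts, increasing)  # [1, 1, 1, 2, 2] True
```

In practice a trend test or fitted slope would be more robust than a strict non-decreasing check, but the bookkeeping is the same.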

V. DISCUSSION
We developed a word embedding model utilizing the O*NET and CPC data to match jobs with patents and forecast jobs from a technological perspective. To develop the model, we employed a FastText model and trained it with descriptions of job attributes and patent classification codes. Because the model was trained by an unsupervised learning method, we devised a validation method with existing classification systems (job clusters developed by O*NET and classes of the CPC system). The validation result showed that embedding vectors extracted by the trained model were consistent with the existing classification systems in terms of the normalized mutual information score, and we confirmed that the model properly reflected the job and patent information.
We examined four different types of job representation vectors and found that combining the job title with the description was the most effective way to represent the job information. Job representation vectors summing all job attributes showed the worst performance among the four types. We suspect this was because not all job attributes were available for every job, so the job information was not adequately reflected in the job representation vector. For instance, the O*NET data described hydrologic technicians only in terms of the job description, tasks, and detailed work activities, whereas business intelligence analysts were described in terms of the job description, knowledge, skills, abilities, tasks, technology skills, tools, and work activities. This unbalanced description of job attributes led to unbalanced job representation vectors and low performance in the validation process.
To specify the jobs in terms of patents, we identified valid CPC codes during the last five years (2016-2020) and matched them to the jobs. In the process, some jobs were matched with CPC codes that were less representative than others. For example, health informatics specialists were matched with CPC codes belonging to the A61B2505 group (evaluating, monitoring, or diagnosing in the context of a particular type of medical care). However, the latest CPC system has the G16H subclass, which directly represents healthcare informatics. CPC codes belonging to the G16H subclass can be regarded as closer to the job than those belonging to the A61B2505 group. However, the G16H subclass was not considered in this study because its codes were added in 2018. This shows that the analysis period not only affects the number of patents at each time point but also determines which CPC codes are available for the analysis, and it can thus yield different job forecasting results.
For illustration purposes, we identified CPC codes related to information technology jobs and tracked the numbers of patents for the identified CPC codes over the last five years. Changes in the number of patents showed that interest in information security analysts, computer network architects, data warehousing specialists, network and computer systems administrators, and computer network support specialists has increased sharply, and technical demands for these jobs are expected to be high. In contrast, we could not find significant increases in technical demand for database administrators, minor computer occupations, computer programmers, database architects, and video game designers. However, before accepting the forecast result, we need to be aware of possible selection bias in patent classification codes when applying for a patent. For example, patent attorneys may rarely use some CPC codes because of implicit rules or customs. In this case, a job forecast would be invalid even though the CPC codes are appropriately matched with jobs. Thus, it is recommended to interpret a job forecast result in comparison with other data and analysis reports.
We considered that an increasing number of patents for a specific technology indicates increasing demand for the technology, and that demand for jobs related to the technology will increase accordingly. However, one could interpret it differently: as the number of patents for the technology increases, the supply of the technology increases, which could in turn decrease demand for jobs related to the technology. It would be hard to confirm which interpretation is correct based on the pattern of patent counts alone. Thus, as mentioned above, researchers should consider other indicators, such as statistics on employment, earnings, and graduates, to reach a reasonable decision.
In addition, although we tried to match various jobs with patents, some jobs were hard to describe in terms of patents and were not adequately matched with patent classification codes. For example, actors and dancers were matched with F21V33/0056 (structural combinations of lighting devices with other articles, not otherwise provided for - personal or domestic articles - audio or video equipment, e.g., televisions, telephones, cameras or computers; remote control devices therefor - audio equipment, e.g., music instruments, radios or speakers) and H04N7/155 (television systems - systems for two-way working - conference systems - involving storage of or access to video conference sessions), respectively. This suggests that not all jobs are suitable for patent-based forecasting, and researchers should assess the appropriateness of forecasting based on the matching results of jobs and CPC codes.

VI. CONCLUSION

The present study suggested a patent-based approach to forecasting the future of jobs. To identify the relationship between patents and jobs, we secured CPC code description and job description data and trained a word embedding model with the data. Validation results confirmed that the trained model properly reflected the context of the CPC code and job descriptions. For demonstration purposes, information technology-related jobs were matched with CPC codes, and the future of these jobs was forecasted based on the number of patents for the matched CPC codes. The forecast results showed that our approach can be practical but is also limited in some cases. We discussed several implications based on the results. Future studies can consider different job and patent data sources to improve job forecasting performance (e.g., online job posts and the titles and abstracts of patents). Also, because our analytical approach mainly relies on patent data, securing reliable and up-to-date patent data is critical. For example, Google Patents, which shows patent information in near real time, can be considered for building a data pipeline for our approach. A data pipeline with real-time data would enable an agile job forecasting system. The results of such studies would confirm the validity and effectiveness of our approach.
The present study suggested a patent-based approach to forecasting the future of jobs. To identify a relationship between patents and jobs, we secured CPC code description and job description data and trained a word embedding model with the data. Validation results confirmed that the trained model properly reflected the context in CPC code and job descriptions. For a demonstration purpose, information technology-related jobs were matched with CPC codes, and the future of jobs was forecasted based on the number of patents for CPC codes. The forecast results showed that our approach can be practical but also limited in some cases. We discussed several implications based on the results. Future studies can consider the different job and patent data sources to improve job forecasting performance (e.g., online job posts and titles and abstracts of patents). Also, because our analytical approach mainly relies on patent data, securing reliable and up-to-date patent data would be critical. For example, Google patent search, which shows patent information in real-time, can be considered for building a data pipeline for our approach. A data pipeline with real-time data would enable an agile system for job forecasting. The results of the studies will confirm the validity and effectiveness of our approach.