Characteristics of Skilled and Unskilled System Engineers in Troubleshooting for Network Systems

As information and communication technology systems become larger and more complex, troubleshooting them becomes more difficult. To date, however, no efficient method for troubleshooting training has been developed, owing to a lack of understanding of how skilled system engineers perform troubleshooting. The goal of this study was to investigate and compare the network troubleshooting characteristics of skilled and unskilled system engineers. We hypothesized that, to troubleshoot a network system efficiently, skilled system engineers divide the overall network into functional and non-functional sub-networks by confirming connections between network devices, and that they do so using similar methods. To observe troubleshooting behavior, we developed a virtual network comprising several servers, routers, and terminals, on which six skilled and unskilled system engineers performed routine troubleshooting activities. The skilled system engineers tended to narrow down the problem space by confirming connections between network devices, and the coincidence of their connection confirmations was significantly higher than that of the other pairs of participants. At the beginning of the troubleshooting assessment, the most skilled participant correctly hypothesized which device was experiencing trouble, based on information presented in advance of the assessment. In contrast, the unskilled system engineers, and those unfamiliar with network troubleshooting, did not narrow the problem space but instead randomly searched selected network devices for obstacle causes. These results suggest that unskilled system engineers should be taught methods for reducing the problem space appropriately and logically in network troubleshooting.


I. INTRODUCTION
As information and communication technology (ICT) systems become extremely large, complex, and integrated [1]-[4], system engineers (SEs) are experiencing increasing difficulty in the operation and maintenance (O&M) of these systems. When an obstacle occurs in an ICT system, the SE must carry out troubleshooting: detecting the obstacle causes and repairing or replacing the affected components. However, a shortage of skilled SEs can delay system recovery. To address this issue, many studies on automating the O&M of ICT systems have been conducted [5]-[9]. Although automation can make troubleshooting more efficient and hasten recovery from obstacles, it is difficult to automate ICT troubleshooting completely, because troubleshooting is typically an ill-structured and ill-defined problem [10]. In addition, ICT systems vary widely, as do their obstacles and the optimal approaches to troubleshooting them. It therefore remains very important to secure appropriately skilled SEs to ensure the stable O&M of ICT systems.

The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Olague.
Unskilled ICT SEs generally learn troubleshooting by dealing with actual obstacles through on-the-job training (OJT), gradually organizing the abundant knowledge gained from experience as they become more skilled troubleshooters [11]. Accordingly, the accumulation of organized troubleshooting experience is critical to the transition from novice to advanced troubleshooter [12]. Automated troubleshooting could therefore deprive beginners of opportunities to learn through OJT, because it would reduce the number of occasions on which SEs deal with system obstacles themselves. An efficient learning method for ICT troubleshooting is therefore required. To date, however, no systematic method for training SEs in troubleshooting has been developed, because it is poorly understood how skilled SEs troubleshoot and how their troubleshooting differs from that of unskilled SEs.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Troubleshooting is a problem-solving task in which problem causes are identified and repaired or replaced to restore the system to its normal state [10], [13]-[15]. The characteristics of skilled and unskilled troubleshooters have been investigated for generator systems [11], [16], electric circuits [17], manufacturing systems [14], chemical plants [18], radar systems [19], [20], and other domains. For example, Johnson [11] demonstrated that an expert troubleshooter could correctly interpret the initial state of a generator system and thereby narrow the problem area to one containing the likely obstacle cause, whereas a novice could not interpret the initial state and instead randomly searched selected parts of the system. Schaafstal et al. [20] described troubleshooting as the cognitive task of searching for obstacle causes in a large problem space. This task comprises four subtasks: formulating a problem description, generating a hypothesis about the obstacle causes, testing the hypothesis, and repairing and evaluating. In formulating the problem description, the troubleshooter divides the overall system into functional and non-functional sub-systems, which excludes from the problem space the areas that need not be investigated. Given the limits of human working memory, this reduction of the problem space is critical to the success and speed of troubleshooting. Formulating a problem description is a strategy used by experts not only in troubleshooting but also in other problem-solving domains such as chess [21]. Troubleshooters who succeed in formulating a problem description early in the process can detect obstacle causes more effectively. The generation of a causal hypothesis depends on whether the troubleshooter is familiar with the obstacles that might be involved.
In a familiar situation, the expert can make decisions through recognition, using symptomatic search [22] or recognition-primed decision-making [23].
In an unfamiliar situation, the expert must instead reason logically, using topographic search [22] or analytical decision-making [23]. Once a hypothesis has been generated, the troubleshooter must test it correctly and interpret the outcome of the test. After identifying the causes through such testing, the troubleshooter repairs them and evaluates the repairs.
It has been unclear whether skilled SEs apply this troubleshooting process to the O&M of ICT systems, and how skilled and unskilled SEs differ in this respect. As mentioned above, ICT systems are generally complex; a network system typically comprises several servers, routers, and terminals, together with the connections between them. We expected that an SE skilled at finding network obstacles would first reduce the problem space in a manner similar to troubleshooters in other fields, and that how well this reduction was performed would affect the success and speed of troubleshooting.
The purpose of this study was to investigate and compare the troubleshooting characteristics of skilled and unskilled ICT SEs. In particular, we investigated how each group formulated a problem description and whether skilled SEs shared common characteristics in how they reduced the problem space. To this end, we observed the troubleshooting behavior of skilled and unskilled SEs on a mock-up network system and then conducted retrospective interviews with the participants to reveal their decision-making processes during troubleshooting. We focused on network troubleshooting rather than on other ICT systems because network troubleshooting requires less technical knowledge than troubleshooting other ICT systems such as databases. We hypothesized that skilled SEs would divide the overall network into functional and non-functional subnetworks by "path checking," or "connection confirmation," between network devices, and that this narrowing of the problem space would allow them to troubleshoot the network efficiently and find the problem cause. We further hypothesized that skilled SEs, referring to the status of the network, would confirm the connections between devices in similar ways.

A. PARTICIPANTS
The participants were six SEs employed by a private company; they had held SE positions for different lengths of time and worked mainly in the O&M of ICT systems (Table 1). Participants A, B, and C were the most experienced in the O&M of ICT networks, each having worked in the field for more than five years; participant A in particular was a highly skilled SE. Of the three, only participant C generally operated and monitored networks remotely. Participant D was also an experienced SE but worked in database O&M. Participants E and F were relatively unskilled SEs who worked in the O&M of networks and applications, respectively. Participant E also generally operated and maintained networks remotely. The experiment was approved by the ethics committee of the University of Tokyo, and written informed consent was obtained from all participants.

B. BEHAVIOR OBSERVATION AND INTERVIEW
1) NETWORK, OBSTACLES, CAUSES, AND SCENARIOS
To observe participant troubleshooting, we created a virtual ICT network system comprising four subnetworks: the service, guest, office automation (OA), and operation networks. Fig. 1 shows the layout of the overall network. The service network included a web server and was connected to the guest, OA, and operation networks via their respective routers. Users could browse information by accessing the web server from the guest and OA terminals via their respective routers. From the operation terminal, O&M SEs could access the web and monitoring servers, all routers, and all terminals. Problems detected in the network by the monitoring server were displayed on the operation terminal. The SEs could access the monitoring server from the operation terminal without passing through any router and, if necessary, could operate the web server, the routers, and the guest and OA terminals. The network was constructed virtually on a PC but was designed to resemble the specifications of actual network equipment, so that participants could operate and troubleshoot it in the same manner as a real network. Johnson [24] categorized troubleshooting problems by frequency and difficulty. If obstacle causes were easy to identify, there would be no difference between skilled and unskilled troubleshooters; conversely, an unskilled SE facing an infrequent obstacle might not take any action at all. We therefore used obstacles that occur frequently but whose causes are difficult to identify. We set two obstacle causes for the participants to detect. The first was a missing static route on the web server: the static route that the web server needed to return traffic to the operation network was absent, because the route setting had not been made persistent. As a result, no communication of any kind was possible between the operation network and the web server.
The second obstacle cause was a failure in the firewall of the web server, resulting in a denial of HTTP connection from the operation network to the web server. Communication failure between devices in a network is a frequent obstacle, and participants who confirmed a connection between the service and guest/OA networks could be falsely led to an initial hypothesis that the operation router had problems, a deceptive feature that made the problem more difficult to solve.
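To make the two injected causes concrete, the following Python sketch models the symptom each one produces. It is an illustration under stated assumptions, not the authors' virtual network: the device names, the link list, and the `check` function are hypothetical, and the model ignores everything except the two causes. With both causes active, every check from the operation network to the web server fails while the guest/OA terminals still reach it, and because the operation router is the only router on the failing path, it is an easy but wrong first suspect.

```python
from collections import deque

# Hypothetical topology loosely modelled on Fig. 1. Nodes ending in "_net"
# stand for LAN segments (switches); all other nodes are devices.
LINKS = [
    ("op_terminal", "op_net"), ("mon_server", "op_net"), ("op_router", "op_net"),
    ("guest_terminal", "guest_net"), ("guest_router", "guest_net"),
    ("oa_terminal", "oa_net"), ("oa_router", "oa_net"),
    ("op_router", "service_net"), ("guest_router", "service_net"),
    ("oa_router", "service_net"), ("web_server", "service_net"),
]
GRAPH = {}
for a, b in LINKS:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

OPERATION_NETWORK = {"op_terminal", "mon_server"}


def path(src, dst):
    """Breadth-first search; returns the devices traversed from src to dst."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in sorted(GRAPH[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    if dst not in prev:
        return []
    hops, node = [], dst
    while node is not None:
        hops.append(node)
        node = prev[node]
    # Drop the LAN-segment nodes so only devices remain.
    return [h for h in reversed(hops) if not h.endswith("_net")]


def check(src, dst, proto, faults):
    """Simulated connection check under the active obstacle causes."""
    if dst == "web_server" and src in OPERATION_NETWORK:
        if "static_route" in faults:
            return False  # no return route: no reply of any kind reaches the source
        if proto == "http" and "firewall" in faults:
            return False  # firewall denies HTTP from the operation network
    return True


if __name__ == "__main__":
    faults = {"static_route", "firewall"}
    print(path("op_terminal", "web_server"))  # ['op_terminal', 'op_router', 'web_server']
    print(check("guest_terminal", "web_server", "http", faults))  # True
    print(check("op_terminal", "web_server", "ping", faults))  # False
```

In this model, repairing only the static route (`faults={"firewall"}`) makes ping from the operation terminal succeed while HTTP still fails, which is one way the second cause can stay hidden until the first is fixed.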
Because an expert troubleshooter might use information about the context in which the ICT system was used and its history, we prepared the following scenario leading up to the obstacles. The original network comprised the service, OA, and guest networks; the operation network was added later for O&M purposes. After the operation network had been added, maintenance was performed to address vulnerabilities in the web server and routers. Before this maintenance, users could connect normally from the guest and OA terminals to the web server, and the monitoring server had detected no trouble. During the maintenance, monitoring of the web server and the three routers by the monitoring server was restricted. The OS kernel of the web server and the firmware of each router were updated, and these devices were restarted. Finally, after no errors had been detected, monitoring was resumed. This pre-obstacle scenario was presented to the participants before the troubleshooting exercise. They were not told, however, that a failure that eventually caused the obstacles had occurred during the maintenance. After updating the web server settings, the maintenance engineer had lifted the restriction that kept the monitoring server from monitoring the web server but had failed to make this setting persistent. Consequently, when the web server was restarted, its settings reverted to their state before the update, producing the two obstacle causes: the failed static route and firewall settings on the web server.
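How an unsaved setting turns into a fault after a restart can be sketched as a device that holds a running and a saved configuration. This is a hypothetical model: the class, the key names, and the assumption that the correct routing and firewall entries for the operation network existed only in the running configuration are ours, used only to illustrate how a restart restores the pre-update state described above.

```python
from dataclasses import dataclass, field


@dataclass
class WebServer:
    """Hypothetical device with a saved (persistent) and a running configuration."""
    saved: dict
    running: dict = field(init=False)

    def __post_init__(self):
        self.running = dict(self.saved)  # boot from the saved configuration

    def set_runtime(self, key, value):
        self.running[key] = value  # takes effect immediately, lost on restart

    def persist(self):
        self.saved = dict(self.running)  # survives later restarts

    def restart(self):
        self.running = dict(self.saved)  # reload whatever was saved


if __name__ == "__main__":
    # Assumption for illustration: both entries were applied only at runtime.
    server = WebServer(saved={"route_to_operation_net": False,
                              "allow_http_from_operation_net": False})
    server.set_runtime("route_to_operation_net", True)
    server.set_runtime("allow_http_from_operation_net", True)
    server.restart()  # persist() was never called, so both entries are lost
    print(server.running)
```

Calling `persist()` before `restart()` is the step that, in this sketch, would have kept both settings.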

2) BEHAVIOR OBSERVATION
The task was to detect the obstacle causes within 60 min. We asked the participants to declare when they had detected the cause of a problem and then to correct it. We did not tell them how many obstacle causes there were or what they were; instead, we instructed them to begin by checking, on the operation terminal, the alerts raised by the monitoring server. On starting, the participants found alerts concerning the connection from the monitoring server to the web server. The SEs were accustomed to operating and maintaining different types of ICT equipment such as servers, routers, and monitoring systems; therefore, to reduce the effect of differences in the participants' knowledge of particular equipment, they were instructed in how to use the operation terminal before starting the task. In addition, they could refer to a list of commands for controlling the servers, routers, and operation terminal and were given a paper copy of the logical configuration of the network system. During the task, the participants could also ask a nearby experimental administrator about operating the virtual system. However, the administrator explained how to operate the guest and OA terminals only when the participants asked.
We assumed that, upon seeing the alert from the monitoring server, unskilled participants would realize that the obstacles could have arisen anywhere between the web server and the operation terminal but would not further separate the network into sub-systems that worked and sub-systems that did not. We also assumed that they would be unable to narrow the problem space appropriately or to form correct and logical causal hypotheses, and would therefore investigate the devices in the network at random [24]. By contrast, we hypothesized that skilled participants would first formulate a problem description, that is, logically divide the overall network into sub-networks that worked and sub-networks that did not [20]. We assumed that they would check the connections between the operation terminal and the web and monitoring servers to confirm that the alert displayed on the operation terminal was correct, and that they would also investigate the connections between the service network and the guest/OA networks. These checks would reveal a disconnection only between the web server and the operation terminal, confirming that the obstacle causes lay between these two devices. Because the web server responded to the guest/OA terminals, participants might then mistakenly hypothesize that the disconnection was caused by the operation router. A skilled SE would presumably test this hypothesis and reject it, because there were no problems in the operation router, and this line of investigation would finally lead them to focus on the web server. By confirming the connections between the guest/OA routers and the operation terminal, they could correctly classify the operation router as normal and the web server as abnormal.
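The narrowing we expected from skilled participants can be sketched as set-based elimination. In this illustration (ours, not a procedure used in the study), each suspect element is a pair (device, source network): a failed check implicates every device on its path for that source network, and a successful check exonerates them. The key simplifying assumption is that a device that handles traffic from a source network correctly on one path does so on every path. The paths and names below are hypothetical.

```python
# Hypothetical device paths (source first) read off Fig. 1.
PATHS = {
    ("op_terminal", "web_server"): ["op_terminal", "op_router", "web_server"],
    ("op_terminal", "guest_router"): ["op_terminal", "op_router", "guest_router"],
    ("op_terminal", "oa_router"): ["op_terminal", "op_router", "oa_router"],
    ("guest_terminal", "web_server"): ["guest_terminal", "guest_router", "web_server"],
    ("oa_terminal", "web_server"): ["oa_terminal", "oa_router", "web_server"],
}
NETWORK_OF = {"op_terminal": "operation", "guest_terminal": "guest", "oa_terminal": "oa"}


def suspects(results):
    """Return (device, source network) pairs implicated by a failed check
    and not exonerated by any successful check from the same network.

    `results` maps (source, destination) to True (reachable) or False.
    """
    implicated, cleared = set(), set()
    for (src, dst), ok in results.items():
        net = NETWORK_OF[src]
        # Skip the source itself: it is the device issuing the check.
        elements = {(dev, net) for dev in PATHS[(src, dst)][1:]}
        (cleared if ok else implicated).update(elements)
    return implicated - cleared


if __name__ == "__main__":
    early = {("op_terminal", "web_server"): False,
             ("guest_terminal", "web_server"): True,
             ("oa_terminal", "web_server"): True}
    print(suspects(early))  # operation router and web server both remain
    full = {**early, ("op_terminal", "guest_router"): True,
            ("op_terminal", "oa_router"): True}
    print(suspects(full))  # {('web_server', 'operation')}
```

With only the guest/OA-to-web-server checks and the failing operation-terminal check, both the operation router and the web server remain suspects, mirroring the misleading operation-router hypothesis; adding the operation-terminal-to-guest/OA-router checks leaves only the web server, for traffic from the operation network.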
We recorded the participants' task performance using a video camera and a wearable camera (HX-A1H, Panasonic Corporation) strapped to the participant's head, which showed the operation from the participant's viewpoint. We also recorded the screen of the PC used by the participants and logged their entries into the PC; these recordings were used in the interviews described below and in the analyses of the participants' behavior.

3) INTERVIEW
We used retrospective interviews to elicit the participants' individual decision-making processes. We chose retrospective interviews over the think-aloud method, in which participants narrate their thoughts while performing a task [25]-[27], to avoid imposing a heavy cognitive load, particularly on unskilled participants. (The think-aloud method also often requires participants to be trained in its use.) We conducted the retrospective interviews on the day after the troubleshooting experiments. Three experimenters interviewed each participant, and the interviews were recorded with a video camera. To help participants recall their behavior and decision-making during the task, we had them watch a video that combined the video camera, wearable camera, and PC screen recordings. Each participant was also shown a list of the commands executed on the terminal, giving, in chronological order, the execution time, the execution source, the command and its explanation, and the result. To elicit the participants' thoughts during the experiment, the experimenters asked them to explain why they had entered each command while referring to the video and the command history. Example questions included: "What did you want to confirm by conducting this command?", "What did you think after seeing the command execution result?", and "What were you going to do next?"

A. IDENTIFICATION OF OBSTACLE CAUSES
Participants A and B, both skilled network O&M SEs, identified both network obstacle causes. Participant C, the skilled network O&M SE who usually worked remotely, identified only the failure of the server's static route setting. Participant D, the skilled database O&M SE, and participants E and F, the unskilled SEs, did not detect any obstacle causes. Table 2 lists the times taken by each participant to identify the respective causes. Although participants A and B identified the two causes in opposite orders, they took similar total times (nearly an hour) to detect both. However, participant A, the most skilled SE, came close to identifying both obstacle causes at the beginning of the task (discussed in detail later). Participant C noticed the firewall setting failure only at the end of the task and could not repair it before the available time expired. These results suggest that experience in network O&M contributes to the ability to troubleshoot ICT networks.

B. BEHAVIOR AND DECISION MAKING
To investigate the characteristics of the skilled and unskilled SEs, we analyzed each participant's command history. We divided the commands executed by the participants into three types: connection confirmation, cause investigation, and repair. Connection confirmation commands checked connections between devices using HTTP, SSH, or ping and could be used to segment the network into normal and abnormal sections. Cause investigation commands searched for obstacle causes by checking the history or settings of individual devices. Repair commands corrected obstacle causes. Table 3 lists the number of commands of each type executed by each participant. Participants A, B, and C each executed more than 39 commands, more than any of the other participants. Of the other three, participant D, the skilled database O&M SE, executed the most commands, while E and F, the unskilled SEs, each executed fewer than 30. These differences in the number of commands may reflect differences in the participants' knowledge. Participants A, B, and C each executed more than 20 connection confirmation commands, more than D and F, whose fields were not network O&M. These results suggest that the skilled SEs divided the overall system into normal and abnormal parts more effectively than the SEs unfamiliar with network O&M. Although participant E, an unskilled network O&M SE, did not find any obstacle causes, they executed nearly as many connection confirmation commands as A, B, and C. Meanwhile, participants D and F executed more cause investigation commands than the other participants. To analyze each participant's troubleshooting further, we examined their respective behavioral histories (Fig. 2a-f).
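The three-way classification of the command histories could be approximated automatically. The sketch below is hypothetical: the prefix rules and the example commands are illustrative assumptions, and in practice the purpose of a command (for example, whether a settings change is a repair) may need human judgment rather than a prefix match.

```python
from collections import Counter

# Hypothetical prefix rules; the study's actual command list is not given here.
CONNECTION_PREFIXES = ("ping", "curl", "ssh")
REPAIR_PREFIXES = ("ip route add", "iptables -A", "iptables -I")


def classify(command):
    """Assign one command-history entry to one of the three Table 3 types."""
    cmd = command.strip()
    if cmd.startswith(CONNECTION_PREFIXES):
        return "connection confirmation"  # ping, HTTP (via curl), or SSH checks
    if cmd.startswith(REPAIR_PREFIXES):
        return "repair"  # commands that change routing or firewall state
    # Everything else (logs, settings, routing tables) counts as investigation.
    return "cause investigation"


def tally(history):
    """Count the commands of each type for one participant."""
    return Counter(classify(c) for c in history)


if __name__ == "__main__":
    history = [
        "ping 192.0.2.10",
        "curl http://192.0.2.10/",
        "ip route show",
        "ip route add 198.51.100.0/24 via 192.0.2.1",
        "cat /var/log/messages",
    ]
    print(tally(history))
```

Keeping the rules as data makes it easy to adjust them to whatever command set a given virtual network exposes.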
Participants A, B, and C all confirmed the connection between the web server and the guest/OA terminals, revealing the good connection between the guest/OA networks and the service network. They also checked the connection between the operation terminal and the guest/OA routers. Based on these confirmations, they could appropriately locate the problem space within the web server. In their interview sessions, these participants reported that they confirmed the connections to segment the network into normal and abnormal parts, indicating that each skilled network O&M SE appropriately formulated a problem description.
However, the order of behavior differed among the three participants. Misled by the information we provided, participant B checked for obstacle causes in the operation router before examining the web server (Fig. 2b). Although participant B confirmed the connection between the OA router and the operation terminal early in the investigation (Fig. 2b, No. 3), they failed to narrow the problem space to the web server. In the interview, the participant stated that they had intended to confirm the connection between the operation router and the terminal but had accidentally confirmed the connection between the OA router and the operation terminal; they had therefore not used this result to narrow the problem space to the web server. Then, after confirming the connections between the web server and the guest/OA terminals (Fig. 2b, Nos. 6-8), participant B followed our misleading information and mistakenly concluded that there were problems in the operation router. After this hypothesis was tested and rejected, participant B hypothesized that the problems lay in the web server and, after confirming the connections between the operation terminal and the guest/OA routers, narrowed the problem space to the web server. Participant C was also misled into investigating the operation router before examining the web server. This participant confirmed the connections between the web server and the guest/OA terminals (Fig. 2c, Nos. 18 and 19) and between the guest/OA routers and the operation terminal (Fig. 2c, Nos. 20, 23, and 24) later than participant B did. In the interview, participant C said that, because they typically monitored and operated network systems remotely, they had put off investigating the guest/OA networks. Although they differed in the order and speed of their connection confirmations, participants B and C both logically formulated a problem description, as we had assumed.
Participant A, the most skilled SE, confirmed the connections between the guest/OA routers and the operation terminal at the beginning of the task (Fig. 2a, Nos. 4 and 6), thereby narrowing the problem space to the web server at a very early stage. In the interview, the participant stated that updating a web server sometimes leads to network obstacles. Upon reading the scenario describing the server update, they suspected that problems had arisen in the web server and therefore immediately checked the connections between the operation terminal and the web server (Fig. 2a, Nos. 1, 3, and 5) and between the operation terminal and the OA/guest routers (Fig. 2a, Nos. 4 and 6). Following these connection confirmations, participant A investigated the web server to find the obstacle causes (Fig. 2a, Nos. 7, 8, and 10). These results suggest that a skilled SE hypothesizes obstacle causes using both logical reasoning and inferences drawn from context and experience. However, because participant A misread the message concerning the server status, they mistakenly rejected the hypothesis generated from the context and assumed that the server was working normally. They then narrowed the problem space again by logically dividing the network into normal and abnormal parts (Fig. 2a, Nos. 11-16, 19, 24, and 26-30).
In this process, they identified no troubles outside of the web server and therefore reconsidered their earlier judgment, realizing that they had potentially misread the web server status at the early stage of the test. After investigating the web server again, they eventually identified the obstacle causes residing in it. This logical formulation method after misreading the message was similar to approaches used by participants B and C.
Participant E confirmed the connections between the operation terminal and the guest/OA routers early in the test (Fig. 2e, Nos. 8 and 9). They also confirmed the connection between the operation terminal and the operation router (Fig. 2e, Nos. 3-5). After these confirmations, however, the participant investigated the operation router for obstacle causes (Fig. 2e, Nos. 6, 7, 16, 17, and 22-25), suggesting that they did attempt to divide the overall system into normal and abnormal components but could not interpret the results of their commands. Later, they confirmed the connection between the OA terminal and the server (Fig. 2e, No. 20) but not between the guest router and the web server. This behavior could be attributed to the fact that participant E, like participant C, was accustomed to operating and managing networks remotely. In the interview, participant E stated that they had not considered operating the guest/OA terminals early in the task because they did not do so directly in their daily work. Participant D did not check the guest/OA networks at any point (Fig. 2d), while participant F confirmed the connection between the OA router and the operation terminal but used far fewer connection confirmation commands than the other participants. These results suggest that the participants unfamiliar with or unskilled at network O&M were unable to formulate an appropriate problem description. As a result, participants D and F randomly searched for obstacle causes in the web server and the operation router. In their interviews, they stated that, upon reading the alert from the monitoring server at the start of the task, they investigated the web server and the operation router in an ad hoc manner. This behavior is consistent with their greater use of cause investigation commands than connection confirmation commands (Table 3).
Although participant F confirmed the connection between the operation terminal and the OA router, their interview responses revealed that they could not interpret the outcome correctly.

C. COINCIDENCE BETWEEN PARTICIPANTS
To investigate similarities between the participants' behaviors, we calculated the coincidence of the device-to-device connections that the participants checked during connection confirmation. The coincidence between two participants was calculated as the ratio of the number of connections confirmed by both participants to the total number of distinct connections checked by either of them. A single connection confirmation could check several connections simultaneously. For example, a participant confirming the connection from terminal X to server Z via router Y also confirmed the connections from X to Y and from Y to Z. Likewise, confirming an HTTP or SSH connection from X to Y also confirmed the ping connection between X and Y. Therefore, when a participant entered a command that checked such a connection, we counted all of the connections it implicitly checked. As mentioned previously, participant A came close to identifying both obstacle causes early in the task; we therefore also calculated the coincidences between each of the other participants, including A, and participant A's commands up to the point at which participant A misread the server status (Nos. 1-10 in Fig. 2a), referred to as participant A′. Table 4 lists the coincidences for each pair of participants. A coincidence of one would mean that two participants checked an identical set of connections, whereas a coincidence of zero would mean that their checks did not overlap at all. All coincidences among participants A, B, and C were greater than or equal to 0.78. By contrast, the coincidences among participants D, E, and F ranged from 0.46 to 0.58, and those between participants who identified at least one obstacle cause (A, B, and C) and those who did not (D, E, and F) ranged from 0.29 to 0.64.
These results suggest that the skilled participants formulated problem descriptions and identified obstacle causes in more similar ways than the unskilled SEs did. The coincidences between A′ and the others, including participant A, were all less than or equal to 0.46, suggesting that participant A's behavior in the early stages of the task was unique.
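The coincidence measure can be sketched as follows, under our reading that "the total number of connections checked" means the distinct connections checked by either participant, so that identical sets give one and disjoint sets give zero (a Jaccard index). The expansion rule follows the text: a check along X, Y, Z also confirms the hops from X to Y and from Y to Z, and an HTTP or SSH check also confirms ping. The function names and the (path, protocol) input format are hypothetical.

```python
def expand(path, proto):
    """Connections implied by one check along `path` (source first, target last).

    The end-to-end connection is confirmed for `proto` and, because an HTTP
    or SSH success implies reachability, also for ping; each intermediate
    hop is counted as a confirmed ping connection.
    """
    src, dst = path[0], path[-1]
    confirmed = {(src, dst, proto), (src, dst, "ping")}
    for a, b in zip(path, path[1:]):
        confirmed.add((a, b, "ping"))
    return confirmed


def coincidence(checks_a, checks_b):
    """Connections confirmed by both participants divided by the distinct
    connections confirmed by either (a Jaccard index in [0, 1])."""
    set_a = set().union(*(expand(p, proto) for p, proto in checks_a))
    set_b = set().union(*(expand(p, proto) for p, proto in checks_b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    http_check = [(["op_terminal", "op_router", "web_server"], "http")]
    ping_check = [(["op_terminal", "op_router", "web_server"], "ping")]
    print(coincidence(http_check, ping_check))  # 0.75
```

Here the HTTP check implies four connections and the ping check three, all shared, giving 3/4.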
We conducted the Kruskal-Wallis test [28] to investigate differences in coincidence among the following groups: the coincidences among the skilled participants, excluding A′ (C_S, N=3); those among all other pairs, excluding A′ (i.e., between skilled and unskilled participants and among unskilled participants) (C_O, N=12); and those between A′ and the other participants, including participant A (C_A′, N=6). The differences among the three groups were significant (χ²(2) = 10.07, p = 0.006, r = 0.59). We then performed multiple comparisons with the pairwise Wilcoxon test [29], with p-values adjusted by the Benjamini-Hochberg method [30]. The results showed significant differences between C_S and C_O (p = 0.034, r = 0.46) and between C_S and C_A′ (p = 0.037, r = 0.46), but not between C_O and C_A′ (p = 0.066, r = 0.4). Thus, the coincidences among the skilled SEs were higher than those among the other pairs, and the coincidences between A′ and the others were lower than those among the skilled SEs.
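For readers who want to see the computations behind these statistics, here is a from-scratch sketch of the tie-corrected Kruskal-Wallis H statistic and the Benjamini-Hochberg adjustment. It does not use or reproduce the study's data, and the effect sizes r are not computed. With three groups the statistic has 2 degrees of freedom, for which the chi-square upper tail has the closed form exp(-H/2); the helper `chi2_sf_df2` relies on that and is not valid for other numbers of groups.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H with the standard tie correction (undefined if all values tie)."""
    values = [v for g in groups for v in g]
    n = len(values)
    ranks = average_ranks(values)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(values).values())
    return h / (1 - ties / (n ** 3 - n))


def chi2_sf_df2(h):
    """Upper-tail chi-square probability; this closed form holds only for 2 df."""
    return math.exp(-h / 2)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value downward
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In practice, `scipy.stats.kruskal` returns the same tie-corrected statistic together with its p-value, and `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` performs the same adjustment.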

IV. DISCUSSION
The purpose of this study was to reveal the network troubleshooting characteristics of skilled and unskilled ICT SEs. It was found that the skilled SEs initiated the process by confirming connections between network devices, reducing the problem space by logically separating the overall system into normal and abnormal sub-systems. By confirming the connections between the service and guest/OA networks, they categorized the latter as normal subsystems. Based on misleading information we supplied, some of them incorrectly hypothesized that the operation router was the source of some of the problems; however, after testing and rejecting this hypothesis, they confirmed the connection between the operation terminal and guest/OA routers and then correctly reduced the problem space to the web server. The coincidence of connection confirmation between the skilled SEs was significantly higher than that between the others. This result suggests that the skilled SEs confirm the connections between devices in a similar manner. The most skilled SE used another method to identify obstacle causes; using their own experience in web server maintenance, they correctly hypothesized early on that the web server was the source of some of the troubles. This SE appropriately reduced the problem space to the server by confirming the connections between the operation terminal and guest/OA routers. The SEs who were unskilled in or unfamiliar with network O&M randomly searched for causes in various network devices and/or failed to correctly interpret their outputs. As a result, they could not accurately identify any obstacle causes.
The connection confirmation coincidences were higher between the skilled SEs than between the unskilled SEs (Table 4), indicating that the former logically formulated their problem descriptions in a common manner, although they differed in the order in which they confirmed connections. The general problem-solving skills exemplified here are known to relate to critical thinking skills [31], including the ability to identify, analyze, and evaluate the information required for decision-making [32]. MacPherson showed that experts in troubleshooting manufacturing systems had highly developed critical thinking skills [14]. As described in the previous paragraph, the skilled SEs in this study also appeared to exercise strong critical thinking skills, which may account for the common method they applied in formulating a problem description. Another aspect of critical thinking is the ability to monitor and evaluate one's own thinking [33], [34]. When they rejected a causal hypothesis, participants B and C both restarted their formulation of the problem description using connection confirmation. Participant A incorrectly rejected one causal hypothesis because of a mistaken reading of the web-server status, but then returned to the same hypothesis and tested it by logically confirming connections between network devices. This reconsideration of the initial hypothesis is consistent with highly developed critical thinking skills on the part of participant A.
However, critical thinking skills alone are not sufficient for appropriate and fast troubleshooting, because those who possess such skills cannot always apply them [31]. Feltovich demonstrated that non-experts cannot correctly apply critical thinking skills when dealing with new obstacles because they lack sufficient experience and cannot organize their knowledge, whereas experts can classify the knowledge gained from experience and access the appropriate knowledge during troubleshooting [35]. Chua [25] noted that experts combine given knowledge with their own experience to build a knowledge structure, while MacPherson [14] demonstrated that years of experience and technical knowledge are important predictors of both critical thinking skills and near transfer (i.e., the ability to apply knowledge to new contexts similar to those previously experienced [36]). Years of experience are thus likely to reflect the extent of an expert's experience-built knowledge structure. Although participant D was a database troubleshooting expert and may have had strong critical thinking skills, they were not able to narrow the problem space appropriately in the manner of the SEs skilled in network troubleshooting. Participant D could confirm the connections between some devices (Fig. 2d), indicating that they had the knowledge needed for connection confirmation, but could not use the results to narrow down the problem space. This suggests that participant D lacked an appropriately structured knowledge base for network systems because their specialized troubleshooting experience was limited to databases. In addition, the troubleshooting approaches in the different domains of an ICT system, such as the network, database, and application domains, are not "near" each other in the near-transfer sense; this is especially true of the network and database domains.
ICT system troubleshooters generally follow divergent career paths, becoming either generalists or specialists. Specialists acquire deep knowledge of a specific domain within the ICT system and build up abundant experience that is useful for troubleshooting in that area. It is therefore generally difficult for specialists to apply their skills to troubleshooting in another domain of the system.
As mentioned above, the experts were able to hypothesize obstacle causes based on their experience with familiar problems, which is useful for fast troubleshooting [22]. Early in the task, participant A appropriately hypothesized, based on the maintenance procedure, that the web server was experiencing problems. In their interview, the participant stated that they had experienced similar problems in the past. Participant A was thus able to use symptomatic search [22] or recognition-primed decision-making [23] to formulate their hypothesis. Such symptomatic searching, however, should not be taught to novice troubleshooting SEs. Various types of obstacles can occur in large, complex ICT systems, and even similar obstacles can have different causes. Although symptomatic searching sometimes speeds up troubleshooting, it does not always produce correct hypotheses. When a hypothesis formed by symptomatic searching is rejected, the SE should immediately begin a topological search, as participants A, B, and C did. Although topological searching is generally slower, it is more reliable than symptomatic searching. Therefore, unskilled SEs should first learn the topological search method so that they acquire the skills needed to identify obstacle causes more accurately.
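The recommended strategy, trying an experience-based hypothesis first and falling back to a topological search once it is rejected, can be sketched as follows. This is an illustrative assumption rather than a procedure taken from the study: `known_causes` is a hypothetical symptom-to-device lookup standing in for an SE's experience, `is_faulty` and `reachable` are hypothetical checks, and the topological step is implemented here as bisection along an ordered path from the terminal to the service. Bisection requires that a failure at one hop makes every later hop unreachable; a hop-by-hop linear walk would be an equally valid, slower variant.

```python
def first_unreachable(path, reachable):
    """Bisect an ordered path for the first device that fails a connection check.

    Assumes that once a hop fails, every hop beyond it also fails, which holds
    when traffic to later hops passes through the earlier ones. The fault lies
    at the returned device or on the link into it. Returns None if every
    device on the path responds.
    """
    lo, hi = 0, len(path)
    while lo < hi:
        mid = (lo + hi) // 2
        if reachable(path[mid]):
            lo = mid + 1  # fault, if any, lies beyond mid
        else:
            hi = mid  # mid fails; the first failure is at mid or earlier
    return path[lo] if lo < len(path) else None


def troubleshoot(symptom, known_causes, is_faulty, path, reachable):
    """Try an experience-based (symptomatic) guess first, then search topologically."""
    suspect = known_causes.get(symptom)
    if suspect is not None and is_faulty(suspect):
        return suspect, "symptomatic"
    # Hypothesis absent or rejected: fall back to the slower, reliable search.
    return first_unreachable(path, reachable), "topological"
```

If the symptomatic guess points at the wrong device, `troubleshoot` rejects it and the bisection still isolates the first device on the path that fails the connection check.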
Several studies have examined methods for training and instructing engineers in troubleshooting [18], [37], [38]. Darabi et al. [18] demonstrated the learning effect of practice with a simulation model for chemical-plant troubleshooting: learners who practiced troubleshooting with the simulation outperformed learners who had only been instructed in troubleshooting theory in near-transfer skills. This suggests that the mock-up network used in this study could help instruct unskilled SEs in network system troubleshooting. Given that opportunities to experience troubleshooting in on-the-job training (OJT) have decreased with the development of automatic troubleshooting systems, such mock-up systems could be useful for training novices. However, the causes of obstacles in ICT systems are not limited to the software problems simulated in our experiment but also include hardware problems. In future work, it would be useful to assess how SEs identify and address hardware problems.
This study had several methodological limitations. First, the virtual network constructed for the experiment was much simpler than the networks that SEs typically operate and manage, mostly because of constraints on the participants' time. In managing a complex system, an expert divides the overall system at several levels of abstraction to reduce its complexity and the load on their working memory [19], [20]. A skilled SE troubleshooting a complex ICT system would likely divide the system at abstract levels in a similar manner. In future work, the behavior of SEs troubleshooting more complex systems should be examined. Second, the number of participants was small, and the observed behavior was restricted to a single troubleshooting task. To generalize our results, more participants addressing more tasks should be investigated. Finally, we targeted only the troubleshooting of an ICT network system. To achieve a more comprehensive understanding of ICT system troubleshooting, the skills used in troubleshooting other systems, such as database and application systems, should be investigated.

V. CONCLUSION
The purpose of this study was to investigate the behavioral characteristics of skilled and unskilled SEs in troubleshooting a network system. Using a common approach to confirming connections between network devices, the skilled SEs in the study successfully identified obstacle causes. In doing so, they logically and appropriately divided the overall network into normal and abnormal sub-systems. The most skilled SE generated a hypothesis about the obstacle cause using their knowledge of a system context in which they had previous troubleshooting experience. While the skilled SEs used logical reduction of the problem space and/or hypothesis generation to identify the obstacle causes, the SEs who were unskilled or inexperienced in network troubleshooting were unable to narrow the system down to the areas containing the obstacle causes. Instead, they searched randomly for obstacle causes in various network devices. These results suggest that SEs who are unskilled in network system troubleshooting should be taught skills, such as critical thinking, needed to logically segment a system into functional and non-functional sub-systems.

His current research interest is expert knowledge extraction from engineers who perform fault diagnosis of production lines.
SHOTA FUKUDA was born in Kanagawa, Japan, in 1993. He received the B.S. degree in engineering from the Department of Systems Innovation, Faculty of Engineering, The University of Tokyo, and the M.S. degree in engineering from the Department of Systems Innovation, School of Engineering, The University of Tokyo.
Since graduating from the master's program, he has worked at VMware, Inc. During his bachelor's studies, he was involved in research on route search algorithms in traffic flow simulation. In his master's program, he studied traffic flow prediction using traffic flow simulation and graph convolutional deep learning.
He has over ten years of R&D experience in IT operations. He is currently a Senior Research Engineer with the Systems R&D Center, NS Solutions Corporation, Tokyo, Japan. His research interests include knowledge management, knowledge sharing, and intelligence augmentation in IT operations.