Electric Power Fuse Identification With Deep Learning

As part of arc flash studies, survey pictures of electrical installations need to be manually analyzed. A challenging task is to identify fuse types, which can be determined from physical characteristics, such as shape, color, and size. To automate this process using deep learning techniques, a new dataset of fuse pictures from past arc flash projects and data from the web was created. Multiple experiments were performed to train a final model, reaching an average precision of 91.06% on the holdout set, which confirms its potential for identification of fuse types in new photos. By identifying fuse types using physical characteristics only, the need to take clear pictures of the label text is eliminated, allowing pictures to be taken away from danger, thereby improving the safety of workers. All the resources needed to repeat the experiments are openly accessible, including the code and datasets.


I. INTRODUCTION
WHEN performing live work in electrical power systems, two main types of risk are present: 1) shock and 2) arc flash. A shock occurs when current passes through a part of the human body, while an arc flash is defined as the energy blast that occurs when an arcing event is created following a short-circuit between circuit parts of different voltage amplitude or phase. The worst-case intensity of an arc flash event at each piece of equipment in an electrical installation, expressed as the incident energy in calories per square centimeter (cal/cm²), can be evaluated by performing an arc flash study following the requirements and methodology of the IEEE 1584 [1] and IEEE 1584.1 [2] standards.
The first step of an arc flash study is to collect the system and installation data, which includes the identification of all protective devices [1], including fuse types [2]. These data can be obtained during a survey of the installation using a camera to gather pictures of every protective device. Following the surveys, the entire electrical system is modeled in power system studies software, such as EasyPower, ETAP, or SKM. The pictures gathered during the surveys and the installation's single-line diagrams can be used for modeling. These software packages contain detailed libraries of common protective devices and their respective time-current curves.
A critical parameter in the arc flash incident energy calculation is the arc duration, which can be determined from the upstream protection's time-current curve [1]. Each protection device in the network can be of a different type, which can affect the time-current curve in multiple ways.
Fuse types are critical information when performing an arc flash study: for the same ampacity rating and short-circuit current value, a fast-acting fuse can induce arc flash events of a few calories (i.e., 1-2 cal/cm²), while a time-delay fuse can induce arc flash events of up to several tens of calories (i.e., 30-40 cal/cm²), since the arc duration can be much longer with a time-delay fuse than with a fast-acting fuse. The time-current curves for a particular fuse can be obtained by identifying its type, which is often printed directly on the fuse label. However, in some situations, live work approach limits, equipment access limitations, or poor picture quality (i.e., blurriness, low brightness, occlusion) can make it difficult to correctly identify fuse types from the pictures gathered on site. In these situations, the fuse type cannot be determined from the survey pictures [see Fig. 1(a)-(d)]. In other situations, survey pictures are clearer and fuses are easier to identify [see Fig. 1(e)]. In other popular benchmark image classification or object detection datasets, such as ImageNet [3] and COCO [4], object classes (e.g., trees versus cats versus planes) are perhaps more distinct in shape and color and may be easier to discriminate from each other. Their pictures are often taken with the objects themselves in focus and with great clearing distance around the object, which helps produce clear pictures. In our case, the objects are similar in shape (fuses are all cylindrical) and are often captured in bad conditions due to physical limitations and limits of approach imposed by high voltage hazards: out of focus, brightness too low, or brightness too high [see Fig. 1(a)-(d)]. Fuses can also be only partially contained inside pictures [i.e., cut on the borders or obstructed by other objects, see Fig. 1(a)-(d)], but they have to be identified as well. The development of new techniques to help accelerate the work of fuse identification when conducting arc flash studies therefore has great application potential in the industry of electrical power system studies and would be beneficial to the community.
When fuse information cannot be identified clearly from the survey pictures, other methods have to be used to infer it, such as finding other pictures in which the information is clearly visible and inferring from similar physical characteristics that the fuse types are identical. Fuse types have different physical characteristics, such as dimensions, body color, label color, shape, and size, that can help identify them. In other cases, the Google Images search engine can be used to identify the fuse type. For example, if the single-line diagrams show that the fuse is rated at 90 amperes, is used at 600 volts, and is a Class J current-limiting fuse, then this information can be searched in Google Images and the results can be parsed to find a fuse that matches the physical characteristics found in the picture [see Fig. 1(f), where the fuse type Ferraz Shawmut A4J and 200 A rating can be clearly identified from a Google Images search result].
This manual task of inferring fuse information from physical characteristics based on other examples is a clear case where deep learning and object detection can be applied to automate the process. In this article, our objective is to show that object detection neural network models can be applied to this task. In order to maximize the potential benefits of our proposed approach, we want to find the optimal model architecture and hyperparameters to obtain the highest possible performance on this task. Overall, our work led us to create a graphical user interface (GUI) that users can use to perform inference and identify fuse types in new survey pictures using our final optimized model. By using deep learning techniques to automatically identify fuse types, clear pictures of the label text on fuses are no longer required, as the types are identified using physical characteristics only, such as shape and color. Therefore, pictures can now be taken from odd angles and from farther away, outside the danger approach boundaries, which can improve the safety of workers when performing surveys of live electrical equipment where shock and arc flash hazards are present. This work is the first of its kind to develop an object detection model based on deep learning that can improve the safety of workers during arc flash surveys by simplifying the survey process and lessening the need for clear pictures of live equipment label text.
In order to evaluate different model architectures and hyperparameters, object detection evaluation metrics from the COCO challenge [4] were used, which are the metrics widely used in object detection studies [5]. These metrics are in the range [0, 1], with higher values representing higher performance. Notably, the average precision (AP) metric, which averages AP values computed at intersection-over-union (IoU) thresholds between 0.5 and 0.95 in increments of 0.05, was used as the base performance metric in this work, as is commonly done in the literature. For the interested reader, more details on different object detection metrics can be found in Supplementary Section 1.2.
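As an illustration, the following minimal sketch (ours, not from the original evaluation toolchain; the AP routine is a hypothetical placeholder) shows how the IoU of two boxes is computed and how the COCO-style AP averages AP values over the ten IoU thresholds:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def coco_ap(average_precision_at):
    """COCO-style AP: mean of the AP computed at IoU thresholds 0.50, 0.55, ..., 0.95.
    `average_precision_at` is a placeholder for a full matching/PR-curve routine."""
    thresholds = np.arange(0.50, 1.00, 0.05)
    return float(np.mean([average_precision_at(t) for t in thresholds]))
```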
The rest of this article is organized as follows. Section II contains a description of the concepts of deep learning used in the context of object detection. Related work in the area of using deep learning for object detection in industrial settings is presented in Section III. The experiment setup, divided into four major phases, and details on the methodology of this article are presented in Section IV. Results obtained for each experiment phase are presented in Section V. Section VI contains a discussion on the experiments. Finally, Section VII concludes this article.

II. DEEP LEARNING FOR OBJECT DETECTION
Multilayer neural networks, or what is colloquially called deep learning, use the principles of backpropagation to update neuron weights in order to minimize a loss (or objective) function [6]. In this study, we used a supervised learning approach, in which each image sample was manually labeled to indicate the ground truth, which corresponded to the type (or class) of each fuse and its pixelwise bounding box coordinates in the image: x_min, y_min, x_max, and y_max.
In a typical deep learning object detection workflow, the samples are separated into three splits: 1) training, 2) validation, and 3) testing. The training split is used to update the model weights, the validation split is used to track the performance of the model after each epoch, and the testing split is used to confirm that the model can generalize correctly on new data after the last epoch. An epoch corresponds to a full pass of the entire training split through the model once, and neural networks are typically trained for multiple epochs.
The loss functions most commonly used in multiclass object detection tasks are based on the cross-entropy loss function [see (1)], in which $y_i$ are the ground truths, $\hat{y}_i$ the predictions of the model, and $C$ the number of possible output classes. The loss function typically includes a regularization component, such as a Ridge (or L2) component, called weight decay, where $\lambda$ is the weight decay intensity, $w_j$ are the weights of the model architecture, and $W$ is the total number of weights in the model architecture:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) + \lambda \sum_{j=1}^{W} w_j^2. \quad (1)$$

First, a batch of samples passes through the model with a forward pass: predictions for the class are obtained at the classification head output layer, and predictions for the bounding box coordinates are obtained at the regression head output layer of the model. The classification head outputs a class and a score within the range [0, 1] for each predicted box; the latter corresponds to the objectness of the box, which evaluates its membership (or probability of belonging) to the set of object classes versus the background. From these predictions, the loss function is computed, and a backward pass is performed to adjust the weights in the direction opposite to the gradient of the loss function with respect to the weights. This process is called stochastic gradient descent (SGD) [see (2)], where $w$ corresponds to the weights, $t$ to the epoch number, and $\eta$ to the learning rate (the rate of change of the weights at each update, corresponding to the step size on the loss surface of dimensionality $W$):

$$w_{t+1} = w_t - \eta \nabla_w L(w_t). \quad (2)$$

SGD has been greatly improved over the years, and newer methods taking into account the adaptive momentum of the gradient descent have been proposed, such as the Adam optimizer [7]. The learning rate can also vary during training: it is typically higher during the first epochs and decreases based on a learning-rate scheduler following a predetermined pattern, such as cosine annealing. In addition, in order to improve the generalization of the model and reduce the sensitivity of model performance to the seed choice, a stochastic weight averaging (SWA) procedure can be used during the final epochs of training [8].
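To make this loop concrete, here is a minimal, hypothetical PyTorch sketch of one classification-head training step: the cross-entropy of (1) is computed with F.cross_entropy, the L2 weight decay term is applied through the optimizer's weight_decay argument, and optimizer.step() performs the weight update of (2). A real detection model adds a bounding box regression loss at the regression head, omitted here:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, images, class_targets):
    """One hypothetical training step for a classification head."""
    logits = model(images)                         # forward pass
    loss = F.cross_entropy(logits, class_targets)  # cross-entropy term of (1)
    optimizer.zero_grad()
    loss.backward()                                # gradients of the loss w.r.t. the weights
    optimizer.step()                               # weight update as in (2)
    return loss.item()

# Plain SGD with learning rate eta and weight decay lambda, or Adam [7]:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=3e-5)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=3e-5)
```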
Image classification corresponds to the task of identifying the class of an entire image, for example, to classify whether the entire image shows a car or a plane. For this task, multiple architectures based on convolutional neural networks (CNNs) have been proposed. In CNN-based architectures, the model weights learned during the training phase correspond to the convolution kernel filters of each feature map. Object detection corresponds to the task of identifying the class and location of objects that can be found within images, for example, finding each individual car and its position (specified by a bounding box) inside a larger image containing multiple other objects.
Overfitting occurs when the model being trained learns to perform well on the training data only, but lacks the ability to generalize to new data it has never encountered before. To prevent overfitting, multiple solutions have been proposed, such as data augmentation, a technique in which the training samples are subtly altered during the training process, for example, by randomly changing the brightness, saturation, or contrast of the image. The proper use of data augmentation has been shown to greatly reduce overfitting and allow trained models to generalize well to new data [9], [10].
Before training, the neural network model weights can either be initialized randomly or taken from a training performed on a separate dataset, such as ImageNet [3] or COCO [4]. Transfer learning, which corresponds to the application of these pretrained models to different tasks, has been shown to vastly improve model performance, since common concepts, such as textures, corners, or edges, transfer easily from one object classification or detection task to another [11].
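For instance, with torchvision, a Faster R-CNN with a ResNet-50 FPN backbone pretrained on COCO can be loaded and its classification head replaced for a new task; this is a generic transfer-learning sketch, not necessarily the exact initialization used in our pipeline:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a Faster R-CNN (ResNet-50 FPN backbone) pretrained on COCO [4] and
# swap its classification head for ten fuse classes plus the background class.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 10 + 1  # ten fuse types + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
```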

III. RELATED WORK
In the past, automatic fuse type identification has been applied in the automotive industry for small fuses located in fuse boxes rated in milliamperes [12], which differs from the high ampacity and high voltage fuses rated in tens or hundreds of amperes considered in this work. There have been multiple efforts to use object and fault detection in industry, including the detection and differentiation of capacitors of different capacities [13] using a slightly modified version of the YOLOv3 method [14], wherein the DarkNet53 architecture was replaced with the MobileNet architecture [15] to increase the speed and accuracy of the detection.
Others have used Bayesian classification to detect faults in the appearance of corks [16], or statistical features computed using color-texture analysis to efficiently classify industrial objects [17], [18]. The above methods working in tandem could help in the detection and classification of fuses based on their shape and color. Multiple studies have previously been published on object recognition for industrial robot applications. For example, attempts have been made to show how industrial robots in manufacturing processes can use object recognition techniques based on neural networks to automatically detect objects of variable characteristics, shapes, and sizes under various lighting conditions [19].
Models based on CNNs used for image classification, such as AlexNet [20], have been adapted to the task of object detection with architectures such as faster R-CNN [21] and RetinaNet [22]. More recently, models using transformers and attention mechanisms [23] have been applied to object detection tasks, such as DETR [24]. These three architectures are compared in this article, starting with the conventional convolution-based models faster R-CNN [21] and RetinaNet [22] as a baseline and, in addition, the more recent transformer-based DETR [24]. All model backbones use the ResNet-50 architecture, a deep neural network that uses residual connections [25].
In another work, the authors used CNNs to automate the high-speed sorting of objects on a production line performed by robots [26]. That task is similar to fuse detection, since the objects can vary in color, shape, and size. Similarly, a study has shown that techniques using CNNs can be used to classify objects based on reference objects, and the authors were able to distinguish between screws and nuts with a high degree of accuracy [27].
To our knowledge, automating power fuse model detection has never been attempted before in the literature. Moreover, studies on the use of object detection models to improve worker safety when conducting arc flash surveys have never been proposed before. In our case, we use state-of-the-art object detection models and focus on applying a detailed methodology aimed at obtaining a final optimized model. This allows us to identify the best possible hyperparameters in order to reach the highest possible performance, since our objective is to train a final model that can be used in real-world industrial engineering applications, allowing workers to identify fuse types not from label text but from physical characteristics only, therefore reducing the risk of shock and arc flash by enabling workers to take pictures farther away from danger. With greater distance between themselves and the cabinets under scrutiny, workers may be subject to lower severity arc energy per unit area, thus improving their own safety and working conditions, since heavy personal protective equipment requirements can be lightened.

IV. METHODOLOGY

A. Experiment Setup
In order to find the optimal model architecture and hyperparameters that maximize the performance of this object detection task (i.e., fuse detection and classification), the data were split into different sets and a four-phase experiment setup was devised (see Fig. 2).
1) Datasets: A total of 6039 survey images were gathered from past arc flash projects of the CIMA+ engineering-consulting firm. From these images, we selected the ten most common classes in our gathered data, allowing us to analyze more than half of the images. We therefore selected 3189 images to create the "Survey Dataset," covering 52.8% of all gathered images. The "Survey Dataset" was then anonymized using a blurring tool [e.g., see Fig. 1(b)] to remove any potential information that could be linked to specific client equipment identification.
This dataset was then randomly split into two sets, namely, the "Learning Set" containing 90% of the images and the "Holdout Set" containing 10% of the images. The split sizes of 90% and 10% were chosen to ensure that the "Holdout Set" contained a large enough representation of the rarest fuse classes to allow a proper evaluation of the final model's performance on a large enough number of samples for each class, while at the same time ensuring that a maximum number of samples were used to construct a final model with maximum performance. The "Holdout Set" was used at the very end of the experiments to validate that the final model had not overfitted on the "Learning Set" and to evaluate whether the model can be used in practice and generalize to new survey images.
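A minimal sketch of such a stratified 90/10 split is shown below; the per-image labels, placeholder paths, and splitting seed are assumptions (the article only specifies the seed of 54 288 used for its phase splits), and true stratification over images containing multiple objects is more involved:

```python
from sklearn.model_selection import train_test_split

image_paths = [f"survey_{i:04d}.jpg" for i in range(3189)]  # placeholder paths
image_labels = [i % 10 for i in range(3189)]                # placeholder class per image

learning_paths, holdout_paths = train_test_split(
    image_paths,
    test_size=0.10,         # 90% "Learning Set" / 10% "Holdout Set"
    stratify=image_labels,  # keep rare fuse classes represented in both sets
    random_state=0,         # fixed for reproducibility (actual seed not stated)
)
```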
In this article, another dataset was also created by gathering 1116 pictures of the ten selected classes from the Google Images search engine (the "Google Images Dataset") using the google_images_download Python image parsing package. The "Learning Set" and the "Google Images Dataset" were combined to create the "Augmented Learning Set," which was the main set used in each experiment phase, except for Phase B, where the "Learning Set" alone was also used to evaluate whether the "Google Images Dataset" increases the accuracy of the model. In the "Augmented Learning Set," the subpart containing the "Google Images Dataset" was only used for training and was excluded from validation and testing.
Finally, the bounding box location and class of every fuse in each picture were manually annotated by an expert in the field of electrical power installations using the Colabeler tool. The classes are unbalanced, meaning that the number of samples per class varies across the dataset. Each image can contain one or multiple fuses of different classes. In total, 12 109 individual fuses were labeled; see Table I for the number of samples per class.
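For illustration, one annotation record could look like the following; the actual Colabeler export format differs, so this structure, file name, and coordinate values are purely hypothetical:

```python
# Hypothetical annotation record for one survey image: each object carries a
# fuse class and pixelwise box coordinates (x_min, y_min, x_max, y_max).
annotation = {
    "image": "survey_0421.jpg",
    "objects": [
        {"class": "Gould-Ferraz Shawmut A4J", "bbox": [312, 188, 401, 590]},
        {"class": "Gould-Ferraz Shawmut A4J", "bbox": [470, 185, 561, 592]},
    ],
}
```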

Fig. 2. Experiment Setup
Phase A: Base Model Hyperparameters. Find initial model hyperparameters using a stratified random subsampling single split: model architecture, initial learning rate, weight decay, and data augmentation intensity. Phase B: Model Optimization. Find additional optimized model hyperparameters using a single split with fivefold stratified cross-validation: image resizing size, use of a pretrained model or random initial weights, and inclusion of the "Google Images Dataset" or not. Phase C: Sensitivity Analysis. Test the sensitivity of the random initialization seed on performance with ten different initialization seeds using a stratified random subsampling single split. Phase D: Final Evaluation. Perform a final training of the model using a stratified random subsampling single split and test the accuracy of the final model on the "Holdout Set."

TABLE I DATASETS: NUMBER OF SAMPLES PER CLASS
2) Phase A: Base Model Hyperparameters: The initial learning rate values evaluated were chosen on a logarithmic scale [28] and selected not to be too small, which would greatly reduce the rate of error reduction during training, and not so large that divergent oscillations would occur [29], based on tests performed in various preexperiments. The weight decay values were chosen on a logarithmic scale and selected as a mix of typically recommended values for complex and less complex datasets and models [30]; values outside this range were evaluated in various preexperiments as either too large and detrimental to performance or too small and having no impact on performance. The data augmentation values were chosen to yield transformations that were noticeable but not so large as to make input data illegible, in order to find the data augmentation intensity that optimizes the final model's performance.
This phase was performed using a stratified random subsampling method with a single split, using a validation size of 10% and a testing size of 10%. The validation and testing split sizes were chosen to be high enough that the rarer fuse classes would have enough samples being evaluated, which was critical given our very unbalanced datasets (see Table I). However, in order to attain the highest possible final model performance, we chose to include a large portion of the set in the training split (80%) so that the model could be trained with as much data as possible. Section II and Fig. 2-Phase A further depict the definition of the validation and testing splits.
This phase allowed us to find the best possible combination of model architecture, initial learning rate, weight decay, and data augmentation intensity based on the best epoch validation AP. These parameters were then fixed, and further model optimization was performed in the later phases.
3) Phase B: Model Optimization: This phase was performed using a single split with fivefold stratified cross-validation, which allowed us to obtain a mean and a standard deviation on the performance of the possible values for these three hyperparameters, using a validation size of 18% and a testing size of 10%. Section II and Fig. 2-Phase B further depict the definition of the validation and testing splits. This phase allowed us to find the optimal image size, the performance difference when using weights from a pretrained model versus random model weights, and the performance difference when including the "Google Images Dataset" or not to train the model.
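A sketch of this fivefold stratified cross-validation loop is given below, assuming one representative class per image for stratification (a simplification; the image count and labels are placeholders):

```python
from sklearn.model_selection import StratifiedKFold

image_paths = [f"survey_{i:04d}.jpg" for i in range(2871)]  # "Learning Set" placeholder
image_labels = [i % 10 for i in range(2871)]                # placeholder class per image

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=54288)
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, image_labels)):
    # Train one hyperparameter combination on train_idx, compute the
    # best-epoch validation AP on val_idx, then report the mean and
    # standard deviation of the AP over the five folds.
    pass
```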
4) Phase C: Sensitivity Analysis: It is critical to identify whether the random number generation (RNG) initialization seed used in the previous runs had an impact on the performance of the model. Other studies suggest that the random seed used for initialization can have an impact on performance and therefore becomes a hyperparameter in itself [31]. We therefore evaluated the initialization seed sensitivity in this phase by training the model with ten different random seeds used to initialize the RNGs in the code, including those of the PyTorch, numpy, and random packages in Python. This phase does not affect the splitting seed, which was set at 54 288 in all experiments using a stratified random subsampling method, a validation size of 10%, and a testing size of 10%. Section II and Fig. 2-Phase C further depict the definition of the validation and testing splits. This allowed us to find coefficients of variation (CV) for our model's performance and select the optimal initialization seed to train our final model.
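A minimal sketch of how all relevant RNGs can be seeded per run (the list of ten seed values here is hypothetical):

```python
import random
import numpy as np
import torch

def set_initialization_seed(seed: int) -> None:
    """Seed every RNG used by the pipeline: random, numpy, and PyTorch."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

SPLITTING_SEED = 54288       # fixed in all experiments, independent of the loop below
for init_seed in range(10):  # hypothetical choice of ten seed values
    set_initialization_seed(init_seed)
    # ... train the model and record validation/testing AP for this seed ...
```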
5) Phase D: Final Evaluation: Finally, in order to test whether our optimized model can generalize to new data, we trained the model using a stratified random subsampling method with a single split, with a validation size of 10% to allow us to save the model at its best epoch. Section II and Fig. 2-Phase D further depict the definition of the validation and testing splits. After the training, an inference test was performed on the "Holdout Set" and the final metrics were obtained.

B. Global Hyperparameters and Technical Setup
In all experiments, 200 epochs were used without early stopping, all training image RGB values were normalized based on precalculated mean and standard deviation values, and the Adam optimizer [7] was used. A learning rate scheduler with a cosine annealing schedule gradually reduced the initial learning rate for the first 75% of the epochs; the learning rate then stayed constant for the remaining 25% of the epochs [8]. After 75% of the epochs, a stochastic weight averaging (SWA) procedure was applied [8]. The seed used to separate the data between the training, validation, and testing splits was fixed at 54 288 for all experiments (as stated in Section IV-A4).
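This schedule can be sketched with PyTorch's built-in utilities as follows; the placeholder model, initial learning rate, and constant SWA learning rate are assumptions, while the 200 epochs, the 75% switchover point, the Adam optimizer, and the weight decay of 3 × 10⁻⁵ follow the text:

```python
import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, SWALR

model = nn.Linear(8, 2)          # placeholder standing in for the detection model
initial_lr, epochs = 1e-4, 200   # initial_lr is an assumed value
swa_start = int(0.75 * epochs)   # cosine annealing before, SWA afterward [8]

optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr, weight_decay=3e-5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=swa_start)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=initial_lr * 0.01)  # assumed constant SWA lr

for epoch in range(epochs):
    # ... one full pass over the training split (omitted) ...
    if epoch < swa_start:
        cosine.step()                        # decay the lr over the first 75% of epochs
    else:
        swa_model.update_parameters(model)   # average weights over the final 25%
        swa_scheduler.step()                 # hold the lr constant during SWA
```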
Images in the datasets have an original resolution varying from 140 × 105 to 5184 × 3888 pixels. In order to allow mini-batch processing of the images and stacking on the graphics processing unit (GPU), all images were resized to a fixed size of either 512 × 512, 1024 × 1024, or 2048 × 2048 pixels, using zero-padding on the missing pixels. Images larger than these sizes were downsized using a bilinear method. Since our model was trained on images resized to a fixed size, resizing was always performed before the model was used for inference, and this resizing step was built into the training and inference pipelines, including the GUI. The pipelines were developed in Python with the PyTorch deep learning framework. All computations were performed in a Linux-based environment with an Nvidia RTX 3090 for image sizes of 512 × 512 and 1024 × 1024, and an Nvidia Quadro RTX 8000 for an image size of 2048 × 2048. Execution times for all experiments can be found in Supplementary Section 2.
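A sketch of this resize-and-pad step, assuming channel-first tensors (in the real pipeline, the bounding boxes would be rescaled by the same factor):

```python
import torch
import torch.nn.functional as F

def resize_with_padding(image: torch.Tensor, size: int = 1024) -> torch.Tensor:
    """Fit a CxHxW image into a size x size square: downscale larger images
    bilinearly, then zero-pad the missing pixels."""
    _, h, w = image.shape
    scale = min(1.0, size / max(h, w))  # only downsize, never upscale
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    image = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)
    # pad the right and bottom edges with zeros up to the target square size
    return F.pad(image, (0, size - new_w, 0, size - new_h), value=0.0)
```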
The data augmentation method used was the ColorJitter method in PyTorch, which varied the brightness, contrast, and saturation of the image within a range of ± the specified intensity. For example, for a data augmentation intensity of 0.1, the brightness of the training samples was altered randomly within a range of 90% to 110% of the original brightness. Since the fuse types had different colors and the color hue could be a useful indicator of the fuse type, the hue was not altered during data augmentation. For faster R-CNN and RetinaNet, the batch size was fixed at 20, which was the highest possible value allowed by the 24 GB of VRAM of an Nvidia RTX 3090. Since DETR had higher memory requirements, the batch size had to be lowered to 14 to fit in the same 24 GB of VRAM.
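In PyTorch, this corresponds to a ColorJitter transform with the hue fixed at zero, shown here at an example intensity of 0.1:

```python
from torchvision import transforms

# Brightness, contrast, and saturation jittered within +/- the chosen
# intensity (0.1 here); hue is left at 0 since fuse color is discriminative.
augment = transforms.ColorJitter(brightness=0.1, contrast=0.1,
                                 saturation=0.1, hue=0.0)
# augmented = augment(image)  # image: PIL image or CxHxW image tensor
```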
In order to choose hyperparameter values during each phase, the epoch resulting in the highest AP value on the validation split (also referred to as "the best epoch" in the rest of the text) was considered in all cases. Testing AP results, while not used for hyperparameter value selection, are also presented to show that the models do not overfit on the validation split.

V. RESULTS

A. Phase A: Base Model Hyperparameters
The calculations for Phase A allowed us to find the base model hyperparameters for the rest of this study, namely, the model architecture, initial learning rate, weight decay intensity, and data augmentation intensity. The validation results curves for the best combinations of low-level hyperparameters (initial learning rate, weight decay intensity, data augmentation intensity) for each model architecture are shown in Fig. 3, which shows the AP, mean loss, and learning rate per epoch. From the results shown in Fig. 3, we chose the model at the best epoch based on the highest validation AP (faster R-CNN: epoch 186, RetinaNet: epoch 131, DETR: epoch 198); these results are shown in the first three rows of Table II (bold: chosen hyperparameter values). We then evaluated these models on the testing split to validate their generalization potential and show these results in the last three rows of Table II.
The faster R-CNN architecture performed better on the validation split than the DETR architecture, while the inverse was true for the testing split. Since the hyperparameter choices were made on the best validation split performance, the faster R-CNN architecture was selected for the subsequent experiment phases. RetinaNet was the lowest performing architecture in both the validation and testing splits. The combination of hyperparameters yielding the highest validation AP, shown in bold in Table II and including a weight decay of 3 × 10⁻⁵, was therefore chosen for the next phases of this study. Detailed results for this phase can be found in Supplementary Sections 3 to 9. For the interested reader, more details on the faster R-CNN architecture selected for the final model can be found in Supplementary Section 1.3.

B. Phase B: Model Optimization
In this phase, we further optimized the model parameters using stratified fivefold cross-validation to evaluate the impact of the image resizing size, of pretrained models versus training models from scratch, and of the use of the extra "Google Images Dataset" on model performance. The results for Phase B are shown in Table III (bold: chosen hyperparameter values) and represent the mean and standard deviation of the AP metric for all combinations of high-level hyperparameters (image size, pretrained backbone, and dataset augmented with Google Images) in the validation and testing splits (at the best epoch found in the validation split) for the faster R-CNN architecture chosen in Phase A.
The 1024 × 1024 image size attained the highest mean AP in the validation split. In the testing split, 2048 × 2048 had a slight but not significant advantage over 1024 × 1024 in terms of mean AP. Since the GPU VRAM requirements for an image size of 2048 × 2048 were higher than for an image size of 1024 × 1024 for training and inference (the VRAM of the RTX 3090 was not sufficient and an RTX Quadro 8000 was required), the 1024 × 1024 image size was chosen for the next phases, both to limit excessive hardware requirements and to maximize model performance. We observed that including the "Google Images Dataset" and using pretrained models increased the model's performance in both the validation and testing splits. Therefore, the "Google Images Dataset" was included and pretrained models were used in the next phases of this study.
Overall, the following combination of hyperparameters allowed us to obtain the highest validation AP and was therefore chosen for the next phases of this study.
1) Image Resizing Size: 1024 × 1024.
2) Pretrained Model: True.
3) "Google Images Dataset" Included: True.
Detailed results for this phase can be found in Supplementary Sections 10 to 16.

C. Phase C: Sensitivity Analysis

TABLE IV PHASE C RESULTS: SENSITIVITY ANALYSIS
After the optimal model parameters were determined in the previous phases, we evaluated the sensitivity of our chosen faster R-CNN model to variations of the initialization seed. The results for Phase C are shown in Table IV (bold: chosen hyperparameter values) and represent the AP values obtained on the validation and testing splits for ten different initialization seeds. The CV calculated on the validation and testing results are also presented. The CV of 0.60% on the validation split and 0.75% on the testing split, as presented in Table IV, are both less than 1%. In Table IV, the initialization seed for which the best validation AP is obtained is 2003. Therefore, this seed was used for the final training of Phase D.
Detailed results for this phase can be found in Supplementary Sections 17 and 18.

D. Phase D: Final Evaluation
Finally, this phase allowed us to train a final optimized model using the "Augmented Learning Set" based on the hyperparameters determined in all previous phases. We then evaluated the generalizability of this final model on unseen data from the "Holdout Set." The AP results for Phase D are shown in Table V.

TABLE V PHASE D RESULTS: FINAL EVALUATION
The results show that, with an AP_50 of 91.06% on the "Holdout Set" test samples, the fuses were appropriately identified with a predicted box having an IoU greater than 0.5 with regard to the ground truth box 91.06% of the time, which shows that our model has the capacity to generalize to new survey pictures. Fig. 4 shows examples of inference results displayed in the GUI on the "Holdout Set," showing the ground truth and the predicted bounding boxes. In most situations, the final model performed well and correctly identified all fuses when the pictures were at high resolution and in focus [see Fig. 4(a)], but in some cases it could produce bad predictions when the pictures were at low resolution and out of focus [see Fig. 4(b)]. The model also had a high prediction performance in other situations where the picture quality was poor, for example, in situations of low brightness and occlusion, similar to the cases shown in Fig. 1 [see Fig. 4(c)]. Detailed results for this phase can also be found in Supplementary Section 19. Supplementary Section 19.1 also estimates the impact of different split sizes on the accuracy of the final model.

VI. DISCUSSION
A recurrent problem in the electrical industry is the identification of electrical equipment from photos, including the task of identifying fuse types for arc flash studies. The use of modern object detection techniques based on deep learning is a successful way to tackle this issue. Multiple studies dealing with the issue of object detection in industrial settings have been published, proposing different machine learning solutions to automate manual image classification and detection. With this study, we developed a new dataset of fuse pictures to feed into a supervised learning pipeline. We devised a methodology that allowed us to identify which model parameters are best for this specific task. This led us to create a final model that can be used in practice in industrial settings. Our results are fully repeatable, as we openly share the code and datasets with this article.
In Phase A of this study, we wanted to identify the initial model hyperparameters in order to maximize the performance of the model. We found that faster R-CNN [21] was the best-performing architecture for this specific task (see Table II). The poorer performance of DETR when compared to pure CNN models could possibly be attributed to its lower performance on small objects due to the use of a single layer from its CNN backbone [24]. The CNN-based architectures, namely, faster R-CNN and RetinaNet, reached a high AP in a smaller number of epochs than the DETR transformer-based architecture, as shown in Fig. 3. In addition, the mean loss per epoch remained higher throughout training for the DETR architecture than for the RetinaNet and faster R-CNN architectures. This could be explained by the fact that the DETR architecture implemented a Hungarian loss function [24], while the RetinaNet and faster R-CNN architectures both implemented a loss function based on cross-entropy [21], [22]; therefore, the unitless losses were not scaled between the CNN-based and transformer-based architectures.
In Phase B, we performed tests to determine the optimal input image resizing size, as well as ablation experiments with regard to the use of transfer learning and the inclusion of the "Google Images Dataset" in the training samples. We tested different input image sizes and, taking into consideration the cost in terms of GPU requirements versus the increase in performance (see Table III), we chose an optimal image size of 1024 × 1024. This image resizing is automatically done in both the training and inference pipelines; therefore, the models can perform well on input images of any size. Furthermore, we observed that using transfer learning had a large positive impact on model performance. In both the validation and testing splits, using a model pretrained on the ImageNet [3] and COCO [4] datasets yielded a peak AP performance increase in the order of 10%, which was a considerable improvement. This suggests that concepts and features learned on other datasets can transfer over to new datasets. Moreover, the use of additional training data scraped from the web (Google Images) also allowed us to slightly increase the performance of our final proposed model, as presented in Table III. We observed that the inclusion of the "Google Images Dataset" in the training split increased the mean AP in the order of 1-2%, both in the validation and testing splits, which was below what we expected. The expansion of the "Google Images Dataset" could potentially increase the performance of the model, and we encourage the community to help us increase the size of this dataset as new pictures become indexed on the web. In general, images in the "Google Images Dataset" were zoomed-in, focused, high brightness samples, while images in the "Survey Dataset" could often be blurry, with low brightness, and fuses could be scattered in subparts of a larger image, as shown in Fig. 1. Since the objective of our study was to find the optimal model parameters giving the highest performance in the task of identifying fuse types in new survey pictures, the validation and testing splits only contained images from the "Survey Dataset" and did not contain images from the "Google Images Dataset." Therefore, the "Google Images Dataset" contained fuses in a context different from what was being evaluated, which could explain why the inclusion of this dataset did not yield a higher boost in performance. Nonetheless, we believe that the addition of images scraped from the web could potentially be more effective for other datasets of industrial objects where the discrepancy between the images being evaluated (here, the "Survey Dataset") and the images scraped from the web (here, the "Google Images Dataset") is less pronounced.
In Phase C, we tested the sensitivity of our model's performance to changes in the value of the random initialization seed. During this phase, we found the model to be robust, as shown in Table IV, which could potentially be explained by the use of SWA. While SGD generally converges to a point near the boundary of a wide flat region of optimal points, SWA is able to find a point centered in this region, leading to solutions that are wider than the optima found by SGD and more robust to variations of the initialization seed [8].
In Phase D, we trained a final model using the parameters identified in the previous phases of the study. In the end, we observed that the final model could generalize to a holdout set, as presented in Table V, with an AP of 72.51%, AP_50 of 91.06%, and AP_75 of 83.64%. We can therefore expect the model to perform well in practice on never-seen images. Besides, in Table V, we see that the performance for small boxes (AP_S, area below 32 × 32 pixels) was much lower than for medium (AP_M, area between 32 × 32 and 96 × 96 pixels) and large (AP_L, area above 96 × 96 pixels) boxes. This was likely due to the fact that small boxes contained fewer pixels and were harder to predict than medium or large boxes. However, this performance discrepancy between boxes of different sizes was expected, as it is commonly observed for COCO AP metrics in the literature [5]. Ultimately, the final model had a slightly lower performance on the "Holdout Set" when compared to the validation split of the "Augmented Learning Set," which may be due to slight overfitting on the "Augmented Learning Set." This overfitting was in the order of 4% when using AP as a comparison metric, which corresponded to the difference between the AP of 72.51% obtained on the "Holdout Set" and the AP of 76.76% obtained on the validation split of the "Augmented Learning Set," as shown in Table V. As a reminder, the AP metric was obtained by computing the average of the AP for IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, which is the typical benchmark metric used in object detection studies [5]. In fact, object detection models provide two outputs: 1) a classification result (i.e., the fuse type) and 2) bounding box coordinates. However, in an industrial context of fuse type identification, the classification component is critical, while the bounding box IoU score matters less. We estimate that engineers can be satisfied with the model predictions if the classification is accurate and the IoU of the predicted and ground truth boxes is 0.5 or more. In this sense, we consider our true final accuracy (or classification potential) to be the AP_50 of 91.06%. Using the GUI that was developed in this phase, users can now apply our final model to perform inference on pictures to help them identify low-voltage fuse types.
Based on the industrial experience of the authors, it is possible to estimate the efficiency gain of our newly proposed approach. When using the final trained model, the inference time for a single picture is at most in the order of 3 s, which can vary depending on the input image resolution and the computational resources used. Since the model is accurate 91.06% of the time, we also have to include a delay to account for the verification of the model prediction by the engineer, which can be in the order of 30 s per fuse. In around 50% of cases, the fuse label is clear and the engineer can directly identify the fuse in up to 30 s. In the other 50% of cases, when the fuse labels are unclear, a manual analysis of the same image requires the engineer to find a fuse with similar physical characteristics in another survey image or by using the Google Images search engine, which can take 2-10 min per image, depending on the experience of the person performing the analysis. Based on these time metrics, the time required to perform fuse identification for a project can be reduced by 42% to 72% using our newly proposed method when compared to the baseline. Assuming a small electrical installation containing 60 sets of three-phase fuses, the fuse identification task could be reduced from around 4-16 h to around 2-4 h (see Supplementary Section 20 for more details on the efficiency gain estimation). This productivity gain could help reduce the cost and time required to perform arc flash studies for the many types of organizations involved in producing them, such as industrial users, utilities, and engineering-consulting firms.
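A back-of-envelope sketch of this estimate, under the per-fuse time assumptions stated above (the detailed model is in Supplementary Section 20; this is only an illustration):

```python
# Illustrative estimate only; the detailed model is in Supplementary Section 20.
n_fuses = 60 * 3                # 60 sets of three-phase fuses
clear_fraction = 0.5            # fraction of fuses with readable label text
t_clear_min = 0.5               # up to 30 s to read a clear label
t_unclear_min = (2.0, 10.0)     # 2-10 min of manual cross-referencing otherwise
t_assisted_min = 3 / 60 + 0.5   # ~3 s inference + ~30 s engineer verification
accuracy = 0.9106               # final model AP_50

# Baseline: half the fuses read directly, half inferred manually.
baseline_h = [n_fuses * (clear_fraction * t_clear_min
                         + (1 - clear_fraction) * t) / 60
              for t in t_unclear_min]

# Proposed: every prediction is verified; incorrect ones (~9%) fall back
# to the manual procedure.
assisted_h = [(n_fuses * t_assisted_min
               + n_fuses * (1 - accuracy) * t) / 60
              for t in t_unclear_min]

print(f"baseline: {baseline_h[0]:.1f}-{baseline_h[1]:.1f} h")  # ~3.8-15.8 h
print(f"assisted: {assisted_h[0]:.1f}-{assisted_h[1]:.1f} h")  # ~2.2-4.3 h
```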
In terms of limitations of this study, a major time expenditure in this project was manually labeling all samples in the datasets. Instead of a supervised learning approach, a semisupervised learning approach, using active learning for example, could have been implemented. In addition, few-shot learning models could have been used in order to reduce the size of the input datasets. Moreover, MobileNet architectures could have been tested to validate performance on smaller architectures designed to be implemented on mobile devices [15]. Another limitation of this study is the fact that only the ten most common fuse types in our gathered data were considered. In reality, many more fuse types exist in the electrical industry, and it would have been interesting to consider several more classes for this project, including rarer fuse types, and to evaluate the prediction results on common versus rare fuse types. In addition, the "Survey Dataset" was created from surveys performed mostly in Eastern Canada, and therefore the fuse types in this study were biased toward the most common manufacturers in this specific region of the world. Furthermore, as previously discussed, our final model presented light overfitting. While not critical, it is possible that this overfitting could be reduced by adding other data augmentation methods, such as rotations, flips, mirrors, and random crops. In our work, we limited ourselves to the ColorJitter data augmentation method, as it could create new training samples simulating the different camera flash intensities, lighting conditions, and camera specifications that could realistically be encountered when performing arc flash surveys in an industrial setting. This overfitting could also potentially be reduced by reducing the number of training epochs. Finally, additional computing resources would have allowed us to perform a wider hyperparameter search and possibly reach a higher final model performance in the end.
Next, we identified future work that could build upon this study. The new dataset created during this project can become a new benchmark for object detection tasks in industrial settings. By sharing the dataset with the community, our intention is that the dataset can grow and expand over time, ideally including fuses from other regions of the world as well. Note that the class imbalance in our dataset (see Table I) is not a limitation, as it is representative of the reality facing practicing engineers: the types of fuses examined via pictures taken on site during arc flash surveys will generally be considerably unbalanced as well, so our dataset is representative of reality in the field. In addition, it could be interesting to develop a new performance metric similar to AP but based on classification results only, regardless of the predicted bounding box coordinates, for object detection applications where classification matters significantly more than box coordinate prediction. Moreover, it would be interesting to include the identification of the fuse ampere rating from survey pictures using optical character recognition (OCR) techniques to read the ampere ratings printed on the labels when visible. One avenue to explore would be to subdivide all fuse classes by ampacity range, since the physical dimensions of fuses of the same class can vary depending on whether the ampere rating is, for example, 0-50 A versus 50-100 A. Finally, pretraining the model using the "Google Images Dataset" and then fine-tuning it with the "Survey Dataset" could be considered to evaluate whether a transfer learning approach based on web data could improve model accuracy for this specific application. In fact, we believe the potential to increase the size of the "Google Images Dataset" is greater, as the data do not need to be gathered on the site of live electrical installations (unlike the "Survey Dataset," which requires on-site visits) and can be scraped directly from the web. We believe this proposed step is interesting for object detection in industrial settings, since many industrial objects are widely available on image search indexing websites in various catalogs, and this data source could be leveraged to help in the difficult step of data gathering in industrial and potentially dangerous environments, such as live electrical power installations, where dangerous voltage levels are present.
Overall, we proposed a complete experimental design and pipeline allowing the creation of a final model that can be directly used in industrial settings with the help of a GUI. Our final trained model remains imperfect and can, in some edge cases, wrongly classify other objects as fuses, classify fuses as the wrong type, or miss fuses without identifying them. Therefore, our model cannot be used standalone as an automatic system, and an electrical engineer needs to validate its output to ensure that the predicted fuse types are adequate. However, the final model proposed in this work can still significantly accelerate the engineer's work on this task, while at the same time lowering the risk of electrical hazards by allowing workers to take pictures farther away from danger, since fuses can now be identified using physical characteristics only and clear pictures of the label text are no longer needed to identify the fuse types. We believe our work will help in the adoption of state-of-the-art object detection methods in industrial applications.

VII. CONCLUSION
In this article, we presented a common issue in the electrical industry, namely, the difficulty of identifying fuse types from survey pictures when conducting arc flash studies. We proposed a solution based on state-of-the-art object detection methods to construct a final model with high prediction performance. To achieve this goal, we created a new dataset of manually labeled images of fuses and made this new dataset openly accessible to the community. Our code and implementation in PyTorch have also been shared, including a GUI that can be used to directly apply the best optimized model we obtained in our experiments to new survey images. We believe our proposed method and implementation will accelerate the work of electrical engineers while also reducing the risk of human error by leveraging the potential of modern deep learning object detection techniques.
Since taking clear pictures of the text on the fuse labels is no longer necessary, this new application can also help to reduce the potential risk of exposure to electrical hazards when performing arc flash surveys by allowing workers to take pictures farther away from dangerous live electrical equipment.

Fig. 1. (a)-(e) Examples of fuse pictures from the "Survey Dataset." This dataset contains some pictures in which fuses are difficult to identify due to poor quality (i.e., blurriness, low brightness, occlusion), shown here with red arrows, while others are easy to identify. (a) Three fuses of type "Ferraz Shawmut CRS" are partially cut off at the top-left of a picture of a disconnect switch (occlusion). (b) Three fuses of type "Ferraz Shawmut AJT" are barely visible through the center window of a disconnect switch (low brightness, occlusion). (c) Three fuses of type "Ferraz Shawmut AJT" are blocked by plastic parts inside a motor control center cubicle (low brightness, occlusion). (d) Two fuses of type "Gould-Ferraz Shawmut A4J" can be seen inside the partially open door of a disconnect switch (blurriness, occlusion). (e) Three fuses of type "Gould-Ferraz Shawmut A4J" can be clearly identified inside a disconnect switch. (f) Example of a fuse picture from the "Google Images Dataset." This picture, scraped from an online manufacturer catalog found on Google Images, shows a clear, high-resolution front view of the fuse type "Gould-Ferraz Shawmut A4J."

Fig. 3. Phase A: Validation Results Curves for the Best Model Architecture Hyperparameters. From top to bottom, we observe the AP, mean loss, and learning rate per epoch for the best performing hyperparameter combination for each architecture evaluated: faster R-CNN, RetinaNet, and DETR. From the top chart, we chose models at the best epoch based on the highest AP (faster R-CNN: epoch 186, RetinaNet: epoch 131, DETR: epoch 198) and subsequently evaluated those models on the testing split to investigate their generalization potential, as shown in Table II.

Fig. 4. Final Evaluation Inference Results. Green: ground truth bounding boxes and classes. Yellow: predicted bounding boxes, classes, and objectness scores. (a) Example of an in-focus picture with high resolution. In this case, the final model performed well by correctly identifying the three fuse classes. (b) Example of an out-of-focus picture with low resolution. Here, the final model's performance was lower, as it correctly identified only two out of three fuse classes (the top fuse was misclassified). (c) Example of a picture of poor quality with obstruction and low brightness. In this situation, the final model correctly predicted the three fuse classes.