Interpretable Anomaly Detection in the LHC Main Dipole Circuits With Nonnegative Matrix Factorization

CERN's Large Hadron Collider (LHC), with its eight superconducting main dipole circuits, has been in operation for over a decade. During this time, relevant operational parameters of the circuits, including circuit current, voltages across magnets and their coils, and current to ground, have been recorded. These data allow for a comprehensive analysis of the circuit characteristics, the interaction between their components, and their variation over time. Such insights are essential to understand the state of health of the circuits and to detect and react to hardware fatigue and degradation at an early stage. In this work, a systematic approach is presented to better understand the behavior of the main LHC dipole circuits following fast power aborts. Nonnegative matrix factorization is used to model the recorded frequency spectra as linear combinations of common subspectra (basis vectors), which are then related to hardware properties. The loss in reconstructing the recorded frequency spectra makes it possible to distinguish between normal and abnormal magnet behavior. In the case of abnormal behavior, the analysis of the subspectra properties helps to infer possible hardware issues. Following this approach, five dipole magnets with abnormal behavior were identified, of which one was confirmed to be damaged. As three of the other four identified magnets share similar subspectra characteristics, they are also treated as potentially critical. These results are essential for preparing targeted magnet measurements and may lead to preventive replacements.

Fig. 1. Schematic view of the main dipole circuit, including the PC, the CB, and the CL. The QDS triggers an FPA, which deactivates the PC and activates the EE systems. Furthermore, it triggers the discharge of the QH in the respective magnet, if a quench is detected. The two EE systems EE1 and EE2 consist of a switch $S_{EE}$ and an EE resistance $R_{EE}$. The circuit is grounded at the center of the resistor $R_{EE}$ in the EE2 system. The magnet with inductance $L_M$ and the by-pass diode D with a parallel resistance $R_P$ are in a liquid helium cryostat. Magnets are labeled by their physical position (P) from the left to the right. The electrical positions (E) are counted clockwise along the electrical connection starting from the PC. The numbering shown here represents the circuits in sectors 12, 34, 56, and 78. In sectors 23, 45, 67, and 81 the electrical labels are inverted, as the PC is on the left side of the circuit.

along its circumference. These dipole magnets are powered through eight separate circuits of 154 magnets each. To reach the nominal field of 8.0 T and a current of 11.85 kA, each magnet is cooled down to 1.9 K with superfluid helium. At this temperature, the magnet is superconducting. A resistive transition in a superconducting magnet, also called a quench, results in local heating in the superconducting cables and high voltage transients in the magnet, which can damage the magnet if not appropriately managed. In case of a quench or other powering failures in the circuits, a system of protection elements is in place to safely dissipate the energy in the quenched magnets and extract the remaining energy from the circuit [1]. This process is referred to as a fast power abort (FPA) event.
To better understand the data recorded during an FPA event, the LHC main dipole circuits and their protection system are explained in more detail below.
Fig. 1 shows a schematic view of a main dipole circuit with its 154 magnets, each represented by a magnet inductance $L_M$ [2]. For this analysis, the magnets are counted along their physical position from the left to the right or clockwise along the electrical connection starting from the power converter (PC). In case the PC is switched OFF, the current I circulating in the circuit bypasses the PC via the crowbar (CB). The current leads (CL) indicate the transition between the cold superconducting part of the circuit and the warm, normal conducting part of the circuit.
The protection system includes a quench detection system (QDS), which detects a quench and triggers the appropriate protection actions [3], [4]. Upon the detection of a quench in a dipole magnet, the PC is switched OFF and the quench heaters (QH) of this magnet are activated. QHs are resistive strips attached to the outer surface of each magnet coil [5]. They ensure protection by distributing the magnet's stored energy more uniformly over the quenched magnet windings [6]. The by-pass diode D diverts current from the quenched magnet. This restricts the quenching magnet to absorbing only its own stored magnetic energy, not the energy of the entire circuit. The parallel resistance $R_P$, installed across each magnet, smooths transient voltages during this process [7]. To prevent the circuit's energy from discharging solely in the diode of the quenched magnet, the switches $S_{EE}$ in both energy extraction (EE) systems are sequentially activated [8]. They direct the circuit current toward the resistances $R_{EE}$, which extract the circuit's energy within around 300 s. The voltages $U_M$ measured over the 154 magnets during an FPA event are shown in Fig. 2. These signals are voltage transients that contain information about the behavior of the electrical circuit and its components [9].
A quench is a routine occurrence during training periods aimed at increasing the peak magnetic field in the superconducting magnets [10], and rarely occurs during operation. Once a quench emerges, it is frequently accompanied by secondary quenches. Secondary quenches result from electromagnetic perturbations milliseconds after the initial quench [2] or from thermal propagation in the helium tens of seconds later [11]. Following a quench or a secondary quench, the magnet is exposed to local heating, high voltages, and thermal expansion, depending on the circuit's energy level [12]. The magnets are designed with additional margins for this case, but defects can still occur.
Certain hardware failures in the superconducting circuits can notably impact the availability of the LHC, potentially resulting in months of downtime. Understanding normal and abnormal circuit dynamics helps to ensure safe quench mitigation and to detect precursors of hardware failures, so that preventive maintenance can be scheduled.
To better understand the circuit dynamics, the local frequency responses of selected main dipole magnets have been measured and evaluated [13], [14]. FPA events have been deliberately triggered in all main dipole circuits to better understand the voltage transients in the circuits in the absence of a quench [15]. The main dipole circuits have also been extensively studied with electrical simulations, with the simulation of transient effects in accelerator magnets (STEAM) framework [16], [17].
While these simulations account for much of the circuit's behavior and measurements, some aspects, such as the voltage transients observed in the magnified view of Fig. 2 following activation of the EE systems, cannot be captured entirely by simulation. In this time window, secondary quenches frequently occur due to electromagnetic perturbations [2]. A better understanding of the circuit behavior, and in particular of the voltage transients after triggering the EE systems, allows the development of mitigation strategies to reduce the number of these electromagnetically induced quenches and the risk associated with them.
The presented research aims to provide insights into the propagation and physical process explaining the observed frequency spectra of the magnet voltage after activation of the EE systems. Normal and abnormal behavior in these frequency spectra is detected and characterized.
The detection of normal and abnormal behavior and their associated physical processes is carried out by nonnegative matrix factorization (NMF). The choice of NMF is motivated by recent successful applications of data-driven models to predict quenches [18], to classify QH failures [19], or to model the voltage across magnets [20].
NMF aims at providing interpretable results, as the lack of interpretability is a frequent criticism of other data-driven methods [21], [22]. The method was originally used to decompose pictures of human faces into coherent components like eyes, mouth, etc. [23]. The decomposed components are additive and therefore easy for humans to understand. NMF has been successfully applied to discover molecular patterns in genes [24], to separate different sources of a mixed acoustic signal [25], and to derive properties of galaxies from astronomical observations [26]. In the context of this research, NMF is used to decompose the frequency spectra of the voltages recorded in the LHC's main dipole circuits during FPA events (see Fig. 2) and to understand the physical processes causing them. The loss introduced by the frequency decomposition for each FPA event allows abnormal behavior in the circuits to be detected and interpreted.
The rest of this article is organized as follows. In Section II, an overview of the NMF within the context of this study is given. In Section III, the results are presented by showing possible causal relationships between the distinct frequencies in the circuit and the circuit hardware. In addition, five abnormal FPA events are highlighted, and their characteristic frequencies are interpreted. In Section III-E, the risk of abnormal FPA events for the different hardware components of the machine is elaborated. Finally, Section IV concludes this article.

A. Available Data and Preprocessing
This subsection explains the selection and preprocessing of the measured voltage signals. After the FPA is triggered [see Fig. 2(a)], the first EE system is activated 0.1 s later [see Fig. 2(b)]. The second EE system is activated a further 0.5 s later [see Fig. 2(d)]. The two periods analyzed in this study are the two voltage plateaus [0.2; 0.575] and [0.7; 1.075] seconds after the triggering of the FPA. These were chosen because the frequency spectra observed in the magnet voltages are not reproduced by the existing simulation models, which are based on the current knowledge of the circuit's behavior. Each of these plateaus covers P = 154 voltage signals recorded with a sampling rate of 1068 Hz over a length of 0.375 s.
All data used for this study have been recorded after 2017, as the activation times of the EE systems have been kept unchanged since then. In total, Q = 699 distinct FPA events have been used. These events are split into three categories: 48 events do not contain a quench, 494 events contain a single quench of one magnet in the circuit, and 157 events contain at least one secondary quench due to electromagnetic perturbations. The frequency spectra of the latter events deviate strongly from the others; therefore, only the 48 events without a quench and the 494 events with a single quench are compared to derive anomalies in Section II-E. In contrast, the spectral components of all Q = 699 events are interpreted in Section II-D. Secondary quenches due to thermal propagation in the helium do not affect the frequency spectra in the voltage plateaus, as they occur at a later stage. Events with those secondary quenches are treated like events with a single quench.
The 699 events with voltage signals from 154 magnets for each of the two plateaus after the activation of the EE systems yield a total of $M = 2 \cdot 154 \cdot 699 = 215292$ voltage signals. Each of these M = 215292 voltage signals is transformed into a frequency spectrum with N = 200 data points via a fast Fourier transformation (FFT), an efficient algorithm for computing the discrete Fourier transform [27]. The Nyquist criterion allows showing frequencies of up to 534 Hz [28]. In order to mitigate spectral leakage, seven window functions, each with a window-specific amplitude correction, are compared. The window functions are: rectangular, Hanning, Hamming, Bartlett, Blackman, flat-top, and Tukey [29]. An exponential trend $\hat{x}$ of the form

$$\hat{x}(t) = A\,e^{-t/\tau} + C \qquad (1)$$

is fitted with least squares [30] and subtracted from each individual voltage signal $x$ with timestamps $t$. $A$ corresponds to the amplitude of the decay, $\tau$ to the decay's time constant, and $C$ to the offset. This exponential trend, which is best visible in the voltage signal with the highest amplitude in the magnified view of Fig. 2, corresponds to nonlinear effects in the magnets [2]. Applying these preprocessing steps yields a dataset composed of M frequency spectra from Q FPA events, with N data points in each frequency spectrum, which is then processed by NMF.
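The preprocessing chain described above (exponential detrending, windowing, FFT) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid scan over the time constant, the Hanning window with a coherent-gain amplitude correction, and dropping the DC bin to arrive at N = 200 points are all assumptions, and the function names are hypothetical.

```python
import numpy as np

FS = 1068.0  # sampling rate in Hz (from the text)


def fit_exponential_trend(t, x, taus=None):
    """Least-squares fit of A * exp(-t / tau) + C.

    Assumption: tau is scanned on a grid and (A, C) are solved linearly for
    each candidate; the text only states that least squares is used.
    """
    if taus is None:
        taus = np.geomspace(1e-2, 1e1, 400)
    best = None
    for tau in taus:
        basis = np.column_stack([np.exp(-t / tau), np.ones_like(t)])
        coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
        sse = np.sum((basis @ coef - x) ** 2)
        if best is None or sse < best[0]:
            best = (sse, coef[0], tau, coef[1])
    _, amplitude, tau, offset = best
    return amplitude, tau, offset


def amplitude_spectrum(x, fs=FS):
    """Detrend one voltage signal, window it, and return its amplitude spectrum."""
    n = len(x)
    t = np.arange(n) / fs
    a, tau, c = fit_exponential_trend(t, x)
    detrended = x - (a * np.exp(-t / tau) + c)
    window = np.hanning(n)
    spectrum = np.fft.rfft(detrended * window)
    # Window-specific amplitude correction: divide by the mean of the window.
    amplitude = 2.0 * np.abs(spectrum) / (n * window.mean())
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Assumption: the DC bin is dropped, leaving 200 bins for 400 samples.
    return freqs[1:], amplitude[1:]


# Synthetic example: decaying trend plus a 50 Hz oscillation.
t = np.arange(400) / FS
x = 1.0 * np.exp(-t / 0.1) + 0.5 + 0.2 * np.sin(2 * np.pi * 50 * t)
freqs, amp = amplitude_spectrum(x)
```

With 400 samples at 1068 Hz, the highest returned frequency is the Nyquist frequency of 534 Hz, in line with the text.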

B. Nonnegative Matrix Factorization
Using N and M defined above, let $V$ be the input matrix, with entries $v_{i,j}$ for $i = 1, \ldots, N$ and $j = 1, \ldots, M$. NMF decomposes the $N \times M$ matrix $V$ into a product of an $N \times K$ matrix $W$ and a $K \times M$ matrix $H$ such that

$$V \approx \hat{V} = W H. \qquad (2)$$

Here, $W$ represents the spectral components and $H$ their weights. The parameter $K$ defines the number of spectral components. All elements $w_{i,k}$ and $h_{k,j}$ of the matrices $W$ and $H$ are constrained to be nonnegative, leading to the additive nature of the NMF decomposition [23]. Fig. 3 illustrates this behavior for a simplified example: the K = 2 spectral components can be added to reconstruct the M = 3 frequency spectra. For simplicity, the shorthand representation ":" is used to identify all entries from one dimension, e.g., $[W]_{i=[1,\ldots,N],\,k=1} = w_{:,1}$. The two spectral components $w_{:,1}$ and $w_{:,2}$ have their maximum of 1 at i = 2 and i = 20, respectively.
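The additive structure of the factorization can be made concrete with a toy example in the spirit of Fig. 3. The matrices below are hypothetical: the components are assumed to be single peaks at i = 2 and i = 20, and the weights in H are invented for illustration, not read from the figure.

```python
import numpy as np

N, K, M = 20, 2, 3

# Two spectral components with their maximum of 1 at i = 2 and i = 20
# (1-based, as in the text); all other entries are assumed to be zero.
W = np.zeros((N, K))
W[1, 0] = 1.0
W[19, 1] = 1.0

# Hypothetical nonnegative weights: column j tells how much of each
# component is present in frequency spectrum j.
H = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])

# Reconstruction of all M spectra at once.
V_hat = W @ H

# The same reconstruction written as a sum of K rank-one contributions,
# one per spectral component.
contributions = [np.outer(W[:, k], H[k, :]) for k in range(K)]
```

Because every entry of W and H is nonnegative, each contribution can only add amplitude to a spectrum, never cancel another component.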
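Optimizing $W$ and $H$ involves minimizing an elementwise similarity metric $d_*(\cdot)$ between the input $v_{i,j}$ and the reconstructed input $\hat{v}_{i,j} = \sum_{k=1}^{K} w_{i,k} h_{k,j}$. Three widely used similarity metrics are the squared Euclidean (Eu) distance, the generalized Kullback-Leibler (KL) divergence, and the Itakura-Saito (IS) divergence.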
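As a reference, the three metrics can be written down in their common textbook forms. This is a sketch under an assumption: the normalization of the squared Euclidean distance (here without a factor 1/2) may differ from the one used by the authors. The function names are hypothetical.

```python
import math


def d_eu(v, v_hat):
    """Squared Euclidean distance (assumed without a 1/2 factor)."""
    return (v - v_hat) ** 2


def d_kl(v, v_hat):
    """Generalized Kullback-Leibler divergence, for v, v_hat > 0."""
    return v * math.log(v / v_hat) - v + v_hat


def d_is(v, v_hat):
    """Itakura-Saito divergence, for v, v_hat > 0."""
    return v / v_hat - math.log(v / v_hat) - 1.0


# The Eu distance depends only on the absolute difference ...
same_absolute_difference = d_eu(1, 2) == d_eu(100, 101)
# ... while the IS divergence depends only on the ratio (scale invariance).
same_ratio = math.isclose(d_is(1, 2), d_is(100, 200))
# KL and IS penalize under-estimation more than over-estimation.
kl_asymmetry = d_kl(1, 0.5) / d_kl(1, 1.5)
is_asymmetry = d_is(1, 0.5) / d_is(1, 1.5)
```

All three metrics are zero when the reconstruction equals the input, and the asymmetry ratio is larger for the IS divergence than for the KL divergence.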
Fig. 4. Squared Eu distance, the generalized KL divergence, and the IS divergence as a function of $\hat{v}_{i,j}$, assuming $v_{i,j} = 1$.
Fig. 4 shows the three metrics as a function of $\hat{v}_{i,j}$ with $v_{i,j} = 1$. The Eu-distance squares the absolute difference, as demonstrated by this example: $d_{Eu}(1, 2) = d_{Eu}(100, 101)$. Fig. 4 (Eu) therefore shows a quadratic function, equally sensitive to $\hat{v}_{i,j}$ greater or less than one. The KL-divergence reflects the relative entropy, corresponding to the energy in a system. This causes an increased sensitivity to under-estimation and a decreased sensitivity to over-estimation by the reconstruction $\hat{v}_{i,j}$ [32]. This is reflected in Fig. 4 (KL) by a larger distance metric for $\hat{v}_{i,j} < 1$ as compared to $\hat{v}_{i,j} > 1$. The IS-divergence is scale-invariant [34] as it compares relative differences, illustrated by $d_{IS}(1, 2) = d_{IS}(100, 200)$. The asymmetry observed for the KL-divergence is amplified for the IS-divergence, as is visible in Fig. 4 (IS) [33]. The advantages and weaknesses of these properties will be discussed in Section III-A.

Using $d_*(\cdot)$, the reconstruction loss is obtained by accumulating $d_*(v_{i,j}, \hat{v}_{i,j})$ over all entries $i$ and $j$. All values of $W$ and $H$ are initialized with an average nonnegative double singular value decomposition (SVD) [35]. The term "double" is derived from the use of SVD in approximating both matrices $W$ and $H$. This method leads to faster convergence and more robust spectral components compared to random initialization. Any zero values derived by SVD are replaced by the global average of $V$, as they would otherwise remain at zero during the subsequent multiplicative updates. These multiplicative updates of $w_{i,k}$ and $h_{k,j}$ are specific to each of the three similarity measures [25].
4) Squared Eu Distance:
5) Generalized KL Divergence:
6) IS Divergence:

No NMF regularization [36], [37] is applied, both to avoid the risk of regularizing spectral components with small amplitudes and to minimize the number of parameters to be optimized, which regularization would unavoidably increase.
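The multiplicative updates for the three measures can be written compactly through the beta-divergence family, where beta = 2, 1, and 0 correspond to the squared Eu distance, the generalized KL divergence, and the IS divergence. The following is a minimal sketch of this well-known scheme, not the authors' exact update rules: it assumes random initialization instead of the SVD-based initialization described above, omits the 1/2 factor in the Eu loss, and uses the exponent 1/(2 - beta) for beta < 1, a common choice that keeps the loss from increasing. The function names are hypothetical.

```python
import numpy as np


def beta_divergence(V, V_hat, beta):
    """Total loss summed over all entries; beta in {2, 1, 0}."""
    if beta == 2:
        return np.sum((V - V_hat) ** 2)
    ratio = V / V_hat
    if beta == 1:
        return np.sum(V * np.log(ratio) - V + V_hat)
    if beta == 0:
        return np.sum(ratio - np.log(ratio) - 1.0)
    raise ValueError("beta must be 0, 1, or 2")


def nmf_multiplicative(V, K, beta, n_iter=200, seed=0, eps=1e-12):
    """Factorize V (N x M, strictly positive) into W (N x K) and H (K x M)."""
    rng = np.random.default_rng(seed)
    N, M = V.shape
    W = rng.random((N, K)) + eps  # assumption: random init, not NNDSVD
    H = rng.random((K, M)) + eps
    gamma = 1.0 / (2.0 - beta) if beta < 1 else 1.0
    losses = [beta_divergence(V, W @ H + eps, beta)]
    for _ in range(n_iter):
        V_hat = W @ H + eps
        numer = W.T @ (V * V_hat ** (beta - 2))
        denom = W.T @ V_hat ** (beta - 1) + eps
        H *= (numer / denom) ** gamma
        V_hat = W @ H + eps
        numer = (V * V_hat ** (beta - 2)) @ H.T
        denom = V_hat ** (beta - 1) @ H.T + eps
        W *= (numer / denom) ** gamma
        losses.append(beta_divergence(V, W @ H + eps, beta))
    return W, H, losses


# Toy data: an exactly rank-2, strictly positive matrix.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 15)) + 0.01
results = {beta: nmf_multiplicative(V, K=2, beta=beta) for beta in (2, 1, 0)}
```

Because the updates only multiply by nonnegative factors, entries that start at zero stay at zero, which is why zero values from the initialization are replaced before the updates start.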

C. Spectral Component Identification
In this subsection, the methodology to derive the final spectral components W, their number K, and their corresponding weights H is described. Nineteen different numbers of spectral components are investigated (K = 2, ..., 20). The exact spectral components to which the NMF algorithm converges also depend on the choice among the three distance measures from (3)-(5) and the seven window functions. These three parameters (number of components, distance measure, and window function) are referred to as hyperparameters in the remainder of this article. In total, 19 · 3 · 7 = 399 possible combinations of hyperparameters exist.
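The hyperparameter grid can be enumerated directly, which also reproduces the counts quoted in the text. The labels below are plain strings chosen for illustration; the restriction to the Eu and KL measures corresponds to the subset that the anomaly-detection subsection uses.

```python
from itertools import product

K_VALUES = range(2, 21)  # K = 2, ..., 20
DISTANCES = ("Eu", "KL", "IS")
WINDOWS = ("rectangular", "Hanning", "Hamming", "Bartlett",
           "Blackman", "flat-top", "Tukey")

# Full grid used to derive the spectral components for interpretation.
grid = list(product(K_VALUES, DISTANCES, WINDOWS))

# Subset without the IS-divergence, as used for anomaly detection.
detection_grid = [combo for combo in grid if combo[1] != "IS"]

print(len(grid), len(detection_grid))
```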
The number of spectral components K determines the resolution of the factorization. Choosing a larger K results in a reduced reconstruction loss. However, in the context of this project, separating the measured frequency spectra into common spectral components aims at representing different physical processes. Hence, more spectral components are only desirable if they can be mapped to separate physical processes. Ideally, one physical process should be represented by one spectral component. To choose K accordingly, an additional performance measure is introduced, based on prior research [39], [40].
This performance measure calculates the mean pairwise Chebyshev distance between column pairs of spectral components and column pairs of their weights. The Chebyshev distance is the maximum absolute difference between the entries of two vectors [38]. For the spectral components, the average Chebyshev distance over all $K(K-1)/2$ possible pairs of spectral components is used as the performance measure $d_{Ch}$. As an example, consider the spectral components in Fig. 3(b). Since W has only two columns, there is one possible column pair of spectral components, resulting in $d_{Ch} = \max(|w_{:,1} - w_{:,2}|) = 1$. This is evident when the columns in Fig. 3(b) are subtracted, as the absolute difference reaches its maximum at $w_{2,1} = w_{20,2} = 1$. Suppose K is increased by one, and the additional component $w^{new}_{:,1}$ happens to be identical to $w_{:,1}$. In this case, $w^{new}_{:,1}$ represents the same physical process as $w_{:,1}$. Consequently, $\max(|w_{:,1} - w^{new}_{:,1}|)$ is zero, while the distance between $w_{:,2}$ and either of the two identical components remains 1, leading to a decreased $d^{new}_{Ch} = (1 + 0 + 1)/3 = 2/3$. This example shows that adding more spectral components, which are not expected to come from different physical processes, is penalized by the introduced performance metric.
The performance measure is derived similarly for the spectral component weights H in Fig. 3(c).

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
For the existing H in Fig. 3(c), there are three possible column pairs, where the Chebyshev distance of, e.g., the pair of the first and third columns is calculated as $\max(|h_{:,1} - h_{:,3}|) = 1$. If the spectral component $w^{new}_{:,1}$ is added, the spectral component weights are adjusted to obtain the same reconstructions. The Chebyshev distance $\max(|h_{:,1} - h_{:,3}|)$ is then reduced to 0.5, lowering the average to $d^{new}_{Ch} = 5/6$. Again, the performance measure indicates that K should not be increased. For computational efficiency, this performance metric is not evaluated for all $M(M-1)/2$ column pairs of the M = 215292 frequency spectra, but for $M^* = 1000$ randomly chosen frequency spectra. In addition to the performance metric, the final choice of spectral components is based on a manual inspection of the identified components. This will be discussed in Section III-A.
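The mean pairwise Chebyshev distance and the effect of a duplicated component can be checked numerically. This is a minimal sketch: the function name is hypothetical, the components are assumed to be single peaks at i = 2 and i = 20, and the weight matrix H is an invented example chosen so that the averages match the values quoted above; it is not read from Fig. 3(c).

```python
import numpy as np
from itertools import combinations


def mean_pairwise_chebyshev(X):
    """Mean Chebyshev distance over all unordered column pairs of X."""
    pairs = combinations(range(X.shape[1]), 2)
    return float(np.mean([np.max(np.abs(X[:, a] - X[:, b])) for a, b in pairs]))


# Spectral components: two columns peaking at i = 2 and i = 20.
W = np.zeros((20, 2))
W[1, 0] = 1.0
W[19, 1] = 1.0
# Add a third component that duplicates the first one.
W_new = np.column_stack([W, W[:, 0]])

# Hypothetical weights for M = 3 spectra (rows: components, columns: spectra).
H = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
# With the duplicate, the weight of the first component is split in half
# so that the reconstruction W_new @ H_new equals W @ H.
H_new = np.array([[0.5, 0.5, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0]])

d_w, d_w_new = mean_pairwise_chebyshev(W), mean_pairwise_chebyshev(W_new)
d_h, d_h_new = mean_pairwise_chebyshev(H), mean_pairwise_chebyshev(H_new)
```

In both cases the measure drops when the redundant component is added (from 1 to 2/3 for W and from 1 to 5/6 for H). For the full dataset, the same function could be applied to a random subset of columns, for example 1000 columns drawn without replacement.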

D. Spectral Component Interpretation
This subsection describes the method of identifying the physical process behind a spectral component. For this purpose, the location of the maximum, the average of the maximum amplitude, and the weight propagation of the spectral components are analyzed and discussed. These allow relating a spectral component to hardware behavior in the LHC main dipole circuits [41].
For each FPA event q, the location of the maximum is defined as the magnet position with the highest of the P = 154 weights of a spectral component k. The average weight at this position over a selection of FPA events is defined as the average maximum amplitude. A distinction is made between maxima near the quenched magnet, the PC, or the two EE systems.
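These two quantities can be computed in a few lines once the weights of one spectral component are arranged per event and magnet position. The array layout (one row per FPA event, one column per magnet position) and the function name are assumptions made for this sketch.

```python
import numpy as np


def maximum_location_and_amplitude(weights):
    """weights: array of shape (n_events, P) for one spectral component.

    Returns the 1-based magnet position of the maximum for each event and
    the average of these maxima over the selected events.
    """
    weights = np.asarray(weights, dtype=float)
    index = np.argmax(weights, axis=1)
    maxima = weights[np.arange(weights.shape[0]), index]
    return index + 1, float(maxima.mean())


# Toy example with two events and three magnet positions.
toy_weights = [[0.0, 3.0, 1.0],
               [2.0, 0.0, 5.0]]
positions, average_maximum = maximum_location_and_amplitude(toy_weights)
```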
From the magnet position at which the maximum occurs, the spectral component can propagate to its physical or electrical neighbors. This results from the fact that physical magnet neighbors can experience electromagnetic coupling due to gaseous helium flow between adjacent cryostats, or instrumentation cables and other equipment being installed in their close vicinity, even if they are not directly electrically connected. The magnets are therefore labeled both in their physical and electrical order (see Fig. 1).
The type and direction of the propagation give insight into the mutual interaction of circuit components during an FPA. A distinction is made between the following propagation types.
1) Propagation along the electrical position of the circuit: If the weight H of a spectral component decreases continuously in both directions with the electrical positions, the spectral component propagates through the electrical wiring.
2) Propagation along the physical position of the circuit: If the weight H decreases with the physical position, the spectral component propagates through the helium or the mechanical connection between the magnets. Propagation in helium is usually limited to three physically close magnets, which are installed together in adjacent cryostats belonging to the same cryogenic cell. It is further possible that the instrumentation wires or power cables of nearby physical neighbors cause interference and generate noise, which also propagates along the physical position.
3) Artifact of the QDS measurement unit: Lastly, the propagation can depend on effects in the QDS measurement unit. One QDS measurement unit measures the voltage signals on one to three electrically close magnets and one reference magnet. If the QDS measurement unit's input position affects the spectral component, it is assumed that the voltage signal is generated by the measurement unit's electronics and is not present directly at the magnet [3], [4].

In addition to the characteristics of the spectral components, several correlation parameters are considered to interpret the spectral components. The most relevant of these are the following.
1) The magnet manufacturer: Each of the three manufacturers used slightly different materials and fabrication methods to produce the magnets, which results in slightly different magnet behavior.
2) The sector number: The circuit layout and hardware component manufacturers vary slightly across each of the eight sectors.
3) The amplitude and the ramp rate of the circuit current at the time of the FPA trigger: These relate to the stored energy in the circuit and affect the voltage amplitude after the triggering of the FPA. The ramp rate dI/dt refers to the change of the current over time.
4) The FPA event type: The presence of a magnet quench during the analyzed time period affects the frequency spectra significantly. The activation of the QHs induces new voltages, and the diode opening leads to additional transient voltages visible in the frequency spectra.

In the results section, the listed propagation types and correlation parameters are determined and relationships are established for each spectral component to identify the potential physical processes underlying the spectral components.

E. Anomaly Detection
An FPA event is considered abnormal when one of its frequency spectra cannot be reconstructed well with the learned spectral components. For this purpose, for each FPA event q = 1, ..., Q, the maximum reconstruction loss $d_q$ over all signals in the FPA event is calculated. This reconstruction loss depends on the chosen hyperparameters, i.e., the combination of distance measure $d_*(\cdot)$, number of spectral components K, and FFT window function. Hence, anomalies are estimated across all hyperparameter combinations.
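The per-event maximum reconstruction loss can be sketched as follows. Two assumptions are made here that the text does not spell out: the loss of a single signal is taken as the sum of the elementwise metric over its frequency bins, and an array `event_ids` (a hypothetical name) maps each column of V to its FPA event.

```python
import numpy as np


def max_loss_per_event(V, V_hat, event_ids, metric):
    """Maximum per-signal reconstruction loss for each FPA event.

    V, V_hat: arrays of shape (N, M); event_ids: length-M array of labels;
    metric: elementwise function applied to (V, V_hat).
    """
    per_signal = metric(V, V_hat).sum(axis=0)  # one loss per column j
    event_ids = np.asarray(event_ids)
    events = np.unique(event_ids)
    losses = np.array([per_signal[event_ids == q].max() for q in events])
    return events, losses


def squared_eu(v, v_hat):
    return (v - v_hat) ** 2


# Toy example: four signals belonging to two events.
V = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 1.0]])
V_hat = np.zeros_like(V)
events, d_q = max_loss_per_event(V, V_hat, [0, 0, 1, 1], squared_eu)
```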
To make the maximum reconstruction loss $d_q$ of different combinations of hyperparameters comparable, the probability distribution over $d_q$ is calculated. The probability distribution over $d_q$ for each hyperparameter combination is assumed to be gamma distributed,

$$f(d; \alpha, \theta) = \frac{d^{\alpha - 1} e^{-d/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}},$$

where $\Gamma$ is the Gamma function. The parameters $\alpha$ and $\theta$ are derived through a maximum likelihood estimation [42] to fit the distribution of the maximum reconstruction losses over all FPA events for one combination of hyperparameters.
For this distribution, a p-value $z$ of an event can be defined as the probability of obtaining a loss $d$ at least as high as the observed $d_q$,

$$z = P(d \geq d_q).$$

The distribution fit is performed for all possible combinations of hyperparameters. Therefore, each event has as many p-values as there are combinations of hyperparameters. Abnormal events are then considered as those for which the median of all p-values is low. This yields an anomaly identification strategy that is robust to the choice of hyperparameters.
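The p-value strategy can be sketched with SciPy's gamma distribution. This is an illustration under assumptions, not the authors' code: the location parameter of the gamma fit is fixed at zero so that only the shape and scale are estimated, the function name is hypothetical, and the losses are synthetic.

```python
import numpy as np
from scipy import stats


def median_p_values(losses):
    """losses: array of shape (n_combinations, Q) holding the maximum
    reconstruction loss of every event for every hyperparameter combination.
    Returns the median p-value of each event across all combinations."""
    losses = np.asarray(losses, dtype=float)
    p_values = np.empty_like(losses)
    for c, d in enumerate(losses):
        # Maximum likelihood fit of shape (alpha) and scale (theta).
        alpha, _, theta = stats.gamma.fit(d, floc=0)
        # Survival function: probability of a loss at least as high as d_q.
        p_values[c] = stats.gamma.sf(d, alpha, loc=0, scale=theta)
    return np.median(p_values, axis=0)


# Synthetic example: 5 hyperparameter combinations, 200 events,
# with event 0 given an unusually high loss in every combination.
rng = np.random.default_rng(0)
toy_losses = rng.gamma(shape=2.0, scale=1.0, size=(5, 200))
toy_losses[:, 0] = 40.0
z = median_p_values(toy_losses)
most_abnormal_event = int(np.argmin(z))
```

Taking the median rather than the minimum means that an event must look unusual for at least half of the hyperparameter combinations before it is flagged.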
As mentioned earlier, the 48 events without a quench and the 494 events with a single quench are considered for anomaly detection. Another restriction concerns the selection of distance measures. Here, the IS-divergence is not considered for anomaly detection. Anomalies that indicate critical hardware faults are expected to have dominant amplitudes, but the IS-divergence compares relative amplitudes. The IS-divergence is, however, still relevant for identifying components explaining physical processes behind anomalies. Thus, for anomaly detection, only two distance measures are used: the Euclidean distance and KL-divergence. Together with the seven FFT window functions and the 19 different numbers of spectral components, 2 · 7 · 19 = 266 combinations of hyperparameters are used for anomaly detection.

F. Anomaly Interpretation
Anomalies are interpreted by considering the spectral components' weights in the FPA voltages. If the weights of certain spectral components are higher in an FPA event with high reconstruction loss, these spectral components might be associated with the anomaly. Thus, with the knowledge gained in Section II-D, the physical process underlying the spectral component can also be attributed to the anomaly. This is used to check whether the anomaly points to a hardware problem.

A. Spectral Component Selection
In contrast to anomaly detection, all 399 combinations of hyperparameters and all 699 FPA events are used to derive the spectral components for interpretation. Fig. 5 shows the resulting mean pairwise Chebyshev distance $d_{Ch}$ of (a) the spectral components W and (b) their corresponding weights H, as a function of the number of spectral components K for the three distance measures in (3)-(5). The average $d_{Ch}$ values across the seven distinct FFT windows are shown, with the first and third quartiles forming the lower and upper bounds of the confidence intervals, respectively. At around seven components, a local maximum is visible, except for the Eu-distance and KL-divergence in Fig. 5(a). Although the $d_{Ch}$ curves in Fig. 5(a) increase further after this extreme point, they decrease in Fig. 5(b). The IS-divergence shows the best performance for fewer than eleven spectral components in Fig. 5(a) and for more than four spectral components in Fig. 5(b). In these regions, the effect of considering relative amplitudes by the IS-divergence is reflected: frequencies with small amplitudes are also well reconstructed, which increases the diversity of the spectral components and their weights. A similar behavior is observed for the KL-divergence.
Compared to the number of spectral components or the distance measure, the impact of the FFT window function on $d_{Ch}$ is lower, as shown by the relatively narrow confidence intervals in Fig. 5(a) and (b). To choose the ideal FFT window function, the FFT window function with the highest $d_{Ch}$ at the local maxima of the curves at K = 7 is selected. At K = 7, the Chebyshev distance is highest if the FFT is calculated using a Hanning window function (not shown in the plot). Hence, frequency spectra derived with a Hanning FFT window function and reconstructed with seven components are shown in more detail in Fig. 6.
To illustrate the propagation of frequencies in an FPA event, frequency-position maps (FPMs) are used. An example of such an FPM is shown in Fig. 6(a), where the frequencies occurring in the 154 voltage signals, measured [0.2; 0.575] seconds after the triggering of the FPA, are plotted as a function of the electrical position for the FPA event on March 31, 2021 in sector 78. The FPM in Fig. 6(a) shows the processed and Fourier-transformed voltage signals from Fig. 2 as frequency spectra values $v_{i,j}$. During this event, a quench occurred at the electrical position 141 [white solid arrow in Fig. 6(a)]. The electrical positions 14 and 15 are the physical circuit neighbors of the quenched magnet [white empty arrow in Fig. 6(a)].
show electrical positions and frequencies with a reconstruction difference. The darker the color of the points, the larger is the reconstruction difference. If the reconstruction is identical to the input and the reconstruction difference is zero, the plots would be completely white.
In Fig. 6(b), small reconstruction differences are visible in the low-frequency range for the Eu-distance. Voltage amplitudes in this range are generally higher. Reconstructions optimized with the Eu-distance reconstruct lower frequencies better than those optimized with the two other measures. At 110, 150, and 180 Hz, significant reconstruction differences are visible as dark spots, highlighted by dashed ellipses. At these frequencies, the amplitudes are lower and therefore contribute little to the Eu-distance.
In comparison, for the KL-divergence, more significant reconstruction differences are visible as dark spots in the low-frequency range in Fig. 6(c). In turn, the reconstruction differences at 150 and 180 Hz are smaller than for the Eu-distance measure. This can be explained by the fact that the KL-divergence reflects the relative entropy. Hence, instead of reconstructing the low frequencies with high amplitude more accurately, the entropy is optimized by reconstructing frequencies with lower amplitudes as well.
In Fig. 6(d), the IS-divergence leads to reconstructions where both the 110 and 150 Hz oscillations have been captured, as there are no significant reconstruction differences visible there. The scale invariance [34] of the IS-divergence makes it well suited to reproducing frequencies with low amplitude. However, the low-frequency range, where significant reconstruction differences are visible, has been reconstructed poorly by the IS-divergence.
Based on the interpretation of Fig. 6 and further visual inspections, four Eu-distance components and three IS-divergence components that capture the frequency spectra best were selected for further analysis. Fig. 7 shows how the selected spectral components are used to reconstruct the frequency spectra in the voltage signal, measured [0.2; 0.575] seconds after the triggering of the FPA event. Fig. 7(b)-(h) shows the contribution $\hat{v}_{i,j}$ of each of the seven selected spectral components j to the reconstruction of the frequency spectra $v_{i,j}$ of the FPA event in Fig. 7(a). In all FPMs, the frequencies and amplitudes of the voltage signals are shown as a function of the electrical position. The amplitude is displayed logarithmically as a color in the range $10^{-4}$ V to $10^{x}$ V, where x is the maximum amplitude of the spectral component in this event. This x is indicated in the caption of each FPM with the spectral component number j. For the reconstruction of the subspectra with the different components, the following can be observed: high amplitudes in the low-frequency range, with their maxima at (b) 3 Hz, (c) 6 Hz, (d) 20 Hz, and (e) 66 Hz, are reconstructed by the Eu-distance spectral components. Lower amplitudes in the high-frequency range are reconstructed with the IS-divergence spectral components, having their maxima at (f) 150 Hz and (g) 478 Hz. Lastly, a broadband spectrum, spanning vertically over the whole frequency range, can be reconstructed by the IS-divergence spectral component (h). Due to the additive nature of NMF, the Eu-distance reconstructions in Fig. 7(b)-(e) and the IS-divergence reconstructions in Fig. 7(f)-(h) can be added to reconstruct the input in Fig. 7(a). The propagation and physical processes of each spectral component are discussed in the next subsection.

B. Spectral Component Interpretation
Seven spectral components have been identified and are considered important to describe the overall frequency response of the LHC's main dipole circuits during FPAs.In the following, their characteristics and the potential underlying physical processes are discussed one by one.A summary of the discussed spectral components is given in Table I, where the columns show the characteristics described in Section II-D.
TABLE I CHARACTERISTICS OF SPECTRAL COMPONENTS

1) Spectral component one (SC1) is visible in Fig. 7(b) as the bright horizontal frequency band at 3 Hz. There are particularly bright spots at positions 15 and 141, which have a different explanation than the remaining bright spots. The magnets at positions 15 and 141 are the physical and electrical neighbors of the quenched magnet, respectively. Considering all FPA events with a quench, the average maximum amplitude at the physical and electrical neighbors of the quenched magnet is 62 mV. In FPA events without a quench, the average maximum amplitude is 5 mV. It can be concluded that the physical process causing SC1 is the quench of a magnet. It can be assumed that the quenching magnet causes electromagnetic perturbations, which propagate through the instrumentation wires and the connected QDS measurement units. Interestingly, the 3 Hz amplitude is one order of magnitude larger if the quenched magnet was produced by manufacturer one. These perturbations are important because they are most likely the origin of high-energy secondary quenches in neighboring magnets tens of milliseconds after the primary quench. The remaining bright spots are introduced in the preprocessing steps. Deviations of the signal x from its exponential trend [see (1)] are interpreted as oscillations by the FFT. These oscillations are part of SC1 but do not originate from a physical process.
2) Spectral component two (SC2) is visible in Fig. 7(c) as two bright spots at 6 Hz at the electrical positions around 15 and 141. These are the locations of the physical and electrical neighbors of the quenched magnet. As for SC1, it can be assumed that the physical processes causing SC2 are electromagnetic perturbations induced by the quenched magnet. This assumption is supported by the fact that the average maximum voltage of FPA events with a quench is 36 mV, while for events without a quench, the average voltage is 100 µV. Similarly to SC1, the frequency amplitudes are one order of magnitude larger if the quenched magnet was produced by manufacturer one.
3) Spectral component three (SC3), in Fig. 7(d), shows a similar pattern to SC2. Bright spots are visible at the physical and electrical neighbors of the quenched magnet. However, additional physical processes appear when comparing events in which no quench occurs, a single quench occurs, or a diode opens due to a secondary quench in the analyzed time window.
In events with an additional diode opening during an EE plateau, the average maximum amplitude is 14 mV, with little variance. Here, the diode opening induces a 20 Hz oscillation in the circuit, which is constant across all magnets. No diode was opened during the FPA event shown in Fig. 7(d), which is why the bright spots are not visible across all electrical positions.
In events with a single quench, the average maximum amplitudes are highest at the physical and electrical neighbors, with 4 mV. This effect can be seen in Fig. 7(d). As for SC1 and SC2, the physical process causing SC3 is electromagnetic perturbations originating from the quenched magnet.
In events with no quenches, the amplitude of SC3 is highest at the magnet close to the PC, with an average maximum amplitude of 1 mV. From there, the amplitude gradually decreases and is lowest at the first EE system. This can be traced back to the quench-independent leftover voltage waves traveling along the chain of magnets as governed by the magnet impedance. This effect is observed for all events and is proportional to the amplitude and the ramp rate of the circuit current at the moment of the FPA trigger [17]. In Fig. 7(d), the frequencies caused by the quench are more prominent, so this process is not observable in the given color range.
4) Spectral component four (SC4) is visible in Fig. 7(e) and shows a bright spot at 66 Hz at positions 14 and 15. Its voltage maxima are thus located at the physical neighbors of the quenched magnet. From there, the bright spot gradually becomes darker in both directions, indicating that the oscillation propagates along the electrical direction. A similar pattern is observed at 184 Hz, but in a darker color. In addition, SC4 is high at 302 Hz. The amplitudes of these approximate third and fifth harmonics scale inversely with the harmonic number n.
While the exact physical process of SC4 remains elusive, it is expected that it is emphasized by a quench. This expectation is supported by comparing FPA events with and without a quench. In events without a quench, the average maximum amplitude is two orders of magnitude lower than in events with a quench (730 µV versus 73 mV).
[...] 14 and 15, and one line with low amplitude at the electrical positions around 140 and 142. Both lines have interruptions at frequencies already reconstructed by other spectral components. These vertical lines indicate a broadband spectrum in magnets physically close to the quench. In the time domain, this broadband spectrum corresponds to a spike. In a previous analysis of faults in a subsystem of the LHC protection system, such spikes were used as indicators of intermittent short circuits [19]. Hence, SC7 might also be a critical indicator of an intermittent short circuit in the magnet.
In FPA events with a single quench, the average maximum amplitude of this broadband spectrum is 1 mV. For FPA events without quenches, the average maximum amplitude of SC7 is significantly smaller (<100 µV). This shows that SC7 depends on the quench. A different physical process is observed during events with secondary quenches, where the average maximum amplitude is 4 mV if the QHs of one of the additionally quenched magnets are activated during the EE plateaus. In this case, SC7 propagates along the physical and electrical positions of all magnets in the circuit and is likely induced by the inductive part of the QH strips. No QHs were activated during the FPA event shown in Fig. 7(h), which is why this physical process cannot be observed there.
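The correspondence between a spike in the time domain and a broadband spectrum in the frequency domain can be checked with a minimal sketch. The signal length, spike position, and tone frequency are hypothetical and unrelated to the sampling of the QDS measurement units.

```python
import numpy as np

n_samples = 1024

# A single-sample spike at an arbitrary (hypothetical) position.
spike = np.zeros(n_samples)
spike[100] = 1.0

# For comparison: a pure tone with exactly 64 cycles in the window.
tone = np.sin(2 * np.pi * 64 * np.arange(n_samples) / n_samples)

spike_spectrum = np.abs(np.fft.rfft(spike))
tone_spectrum = np.abs(np.fft.rfft(tone))

# The spike has the same magnitude in every frequency bin (broadband),
# whereas the tone is concentrated in a single bin.
print(spike_spectrum.min(), spike_spectrum.max())
print(int(np.argmax(tone_spectrum)))
```

In a frequency-versus-position map, a magnet whose signal contains such a spike therefore appears as a vertical line over the whole frequency range, while an oscillation appears as a localized spot.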

C. Anomaly Detection
In this subsection, selected anomalies from FPA events with low median p-values, following the definitions given in Section II-E, are presented. FPA events with low median p-values are labeled with the LHC-specific four-digit identifier of the quenched magnet, marked by a "#". Fig. 8 shows a boxplot, a conventional method to illustrate statistical data characteristics [43], of the ten FPA events with the lowest median p-value. For each anomalous FPA event, the quenched magnet is given on the x-axis, and a box represents the statistical distribution of p-values over 266 different combinations of hyperparameters (the IS-divergence is not used). The box spans the range between the first and the third quartile, and the line in the middle represents the median. The outer limits further indicate the variability of the p-values; they are obtained by subtracting 1.5 times the box length from the first quartile and adding it to the third quartile, respectively. Because the y-axis is logarithmic, the lower outer limit is cut off if it is zero.
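The box statistics described above can be computed as in the following sketch. The function name `box_stats` and the synthetic p-values are illustrative assumptions; only the quartile and 1.5-box-length definitions are taken from the text.

```python
import numpy as np

def box_stats(p_values):
    """Quartiles, median, and outer limits at 1.5 times the box length."""
    q1, median, q3 = np.percentile(p_values, [25, 50, 75])
    box_length = q3 - q1
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "lower_limit": float(q1 - 1.5 * box_length),
        "upper_limit": float(q3 + 1.5 * box_length),
    }

# Synthetic p-values for one event, one per hyperparameter combination.
rng = np.random.default_rng(1)
p_values = rng.uniform(0.0, 0.02, size=266)
stats = box_stats(p_values)
print(stats)
```

A lower limit computed this way can be zero or negative, which cannot be drawn on a logarithmic axis; this is the reason the lower limit is cut off in such a plot.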
The four FPA events with the quenched magnets #2038, #1225, #1146, and #1291 have a median p-value smaller than 1%. In the boxplots of those FPA events, the first and third quartiles lie below the p-value of 0.01 (99% confidence level), and the outer limits lie below the p-value of 0.05 (95% confidence level). For the six other FPA events in Fig. 8, the quartiles and outer limits lie above these respective thresholds. Based on this classification, only the four events with a median p-value of less than 1% are referred to as anomalies. In the following subsection, these anomalies are discussed in more detail.
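The resulting decision rule reduces to a threshold on the median p-value, as the following sketch shows. The event names and p-values are invented for illustration and do not correspond to any recorded FPA event.

```python
import numpy as np

def is_anomalous(p_values, threshold=0.01):
    """Flag an event if its median p-value lies below the threshold (1 % here)."""
    return float(np.median(p_values)) < threshold

# Hypothetical p-values over a few hyperparameter combinations per event.
events = {
    "event_a": [0.002, 0.004, 0.008, 0.03],
    "event_b": [0.02, 0.2, 0.4, 0.6],
}
flags = {name: is_anomalous(p) for name, p in events.items()}
print(flags)
```

Using the median rather than a single p-value makes the label robust against individual hyperparameter combinations that happen to reconstruct an event unusually well or poorly.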

D. Anomaly Interpretation
To better understand the characteristics of the anomalies, the spectral component weights in the anomalous FPA events are compared to those in normal FPA events. Notably, the weights of SC7 are elevated in three of the four identified FPA events. As discussed above, SC7 represents a broadband spectrum and has an average maximum amplitude of 1 mV for FPA events with a single quench (see Table I). However, only the anomalous FPA event in which magnet #1146 quenched has a maximum amplitude, 820 µV, that is similar to this average. For the FPA events in which the magnets #2038, #1225, and #1291 quenched, the amplitudes of this spectral component are 240, 80, and 210 mV, respectively. These amplitudes are more than 80 times higher than the average maximum amplitude for other FPA events. In no other FPA event with a single quench are the values this high. It is inferred that a high SC7 in the quenched magnet is a sufficient criterion for identifying an anomaly.
Having identified the criticality of a high SC7 amplitude in the quenched magnet, the previously excluded 157 FPA events with secondary quenches are also examined for this characteristic. One FPA event with an amplitude of 1200 mV, more than 1200 times higher than the average maximum, stands out. Therefore, this FPA event, where a quench occurred at magnet #2421, is also referred to as an anomaly in the course of this analysis.
In magnet #1146, the SC7 amplitude is not elevated during the quench. Instead, the low p-value in this FPA event is associated with a high SC1 amplitude of 1500 mV in the physical neighbors of the quenched magnet. As discussed in Section III-B, this component represents electromagnetic perturbations propagating through instrumentation wires and the connected QDS measurement units. Based on current knowledge, these electromagnetic perturbations are not associated with a hardware fault.
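The size of these deviations can be put side by side with a short calculation. The amplitudes are the rounded values quoted in the text, and the 1 mV baseline is the quoted average maximum for events with a single quench; using that same baseline for the event with secondary quenches is an assumption made here for comparison only.

```python
# SC7 amplitude at the quenched magnet, in mV (rounded values quoted in the text).
sc7_amplitude_mv = {
    "#1146": 0.82,
    "#2038": 240.0,
    "#1225": 80.0,
    "#1291": 210.0,
    "#2421": 1200.0,
}
baseline_mv = 1.0  # quoted average maximum amplitude for single-quench events

# Ratio of each amplitude to the baseline, sorted from largest to smallest.
ratios = {magnet: amp / baseline_mv for magnet, amp in sc7_amplitude_mv.items()}
for magnet, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(magnet, ratio)
```

With these rounded numbers, magnet #1146 is the only one of the five whose SC7 amplitude stays below the baseline, which matches its separate treatment in the discussion.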

E. Recommended Maintenance Actions
One of the four quenched magnets with a significantly increased SC7 amplitude during an FPA event developed an intermittent short circuit during the FPA event on April 25, 2021. Because such an intermittent short circuit is a critical event, the other three magnets, #1225, #1291, and #2421, are also treated as potentially critical and will be checked by transient voltage measurements. If an intermittent short circuit cannot be excluded during the transient voltage measurements, these magnets could be replaced in one of the next maintenance stops of the LHC. In any case, the electronics of the QDS measurement units of these magnets should be exchanged in order to exclude measurement errors.
Transient measurements will also be performed on magnet #1146. These measurements will provide further information about the electromagnetic perturbations.
Table II summarizes the five discussed anomalies, sorted by their median p-value in the first column. The second column shows the quenched magnet, the affected circuit, and the date of the related FPA event. The third column summarizes the main findings of the FPA event and states the recommended maintenance actions.

IV. CONCLUSION
In this study, the voltages measured across the 1232 magnets in the eight LHC main dipole circuits were analyzed to understand the normal and abnormal behavior of the circuits. Specifically, the amplitude and propagation of the frequency spectra measured at the magnets in 699 FPA events were investigated using NMF.
This allowed the extraction of seven spectral components that define the normal behavior of the measured voltages during an FPA event. Analyzing the distribution and propagation of the spectral components across the circuit and across FPA events provided a deeper understanding of the mutual interaction of hardware components and allowed the potential physical processes causing the spectral components to be identified. It was shown that spectral components one to three, with maxima at 3, 6, and 20 Hz, are induced by the quench through electromagnetic perturbations. Their amplitudes are one order of magnitude higher when the quenched magnet was produced by manufacturer one. SC4 shows a 66 Hz oscillation induced by the quench. With maxima at 150 and 478 Hz, components five and six are independent of the quench and were attributed to artifacts of the QDS measurement unit and to passive elements in the PC of one individual circuit. SC7 shows a broadband spectrum induced by the quench. As previous studies showed that such broadband spectra can be an indicator of short circuits [19], SC7 could also indicate an intermittent short circuit in the magnet.
Five magnets with abnormal behavior during FPA events were detected using the reconstruction loss of NMF and the SC7 amplitude at the quenched magnet. One of these magnets was replaced on April 25, 2021 after a short circuit was detected following the FPA event. Similar to the replaced magnet, three of the four remaining magnets showed an elevated SC7 amplitude during their quench, more than 80 times higher than normal. Dedicated transient measurements will be performed on these magnets, and the electronics of their QDS measurement units should be replaced. If an intermittent short circuit still cannot be excluded, the three magnets could be replaced in one of the next maintenance stops of the LHC to prevent weeks of unplanned LHC downtime. For the one magnet that did not show a high SC7 amplitude during its quench, the data do not indicate a hardware fault. Instead, an abnormally high SC1 was observed, which will also be evaluated by transient measurements. The presented methodology has proven to be a powerful tool to describe the normal behavior of the circuits systematically and to detect abnormal behavior indicating potential fatigue and degradation of hardware components in the circuit.

Fig. 2. Voltages $U_M$ across the 154 main dipole magnets of sector 78 following a quench in the magnet with the electrical position 141 on March 31, 2021, with its different phases. (a) FPA is triggered at 0 s, the QHs for magnet 141 are activated, and the PC is deactivated. (b) After around 0.03 s, the bypass diode of the quenched magnet becomes conductive. (c) First EE system EE 1 is activated about 0.1 s after the FPA trigger. (d) Second EE system EE 2 is activated approximately 0.5 s after the first one. The blue curve shows the voltage across the quenched magnet, while the remaining curves represent the voltages across the other 153 magnets of the circuit. The voltage signals of one of the analyzed plateaus are shown in the magnified view in the top right corner.

Fig. 3. Example of the NMF decomposition V ≈ WH. In this example, M = 3 frequency spectra with N = 30 data points are approximated with K = 2 spectral components. The color spectrum indicates the voltage amplitudes and ranges from black (0) to white (1). The vertical index count i starts at the bottom.
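A decomposition with the dimensions of this example can be sketched with the classical multiplicative update rules for the squared Eu-distance. This is a generic textbook scheme on synthetic data, used here only to illustrate V ≈ WH; the function name, iteration count, and random initialization are assumptions and do not describe the solver used in this work.

```python
import numpy as np

def nmf_multiplicative(V, K, n_iter=200, seed=0, eps=1e-12):
    """Approximate a nonnegative V (N x M) as W (N x K) @ H (K x M)."""
    rng = np.random.default_rng(seed)
    N, M = V.shape
    W = rng.random((N, K)) + eps
    H = rng.random((K, M)) + eps
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates keep all entries nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        losses.append(float(np.sum((V - W @ H) ** 2)))
    return W, H, losses

# Synthetic input: M = 3 spectra with N = 30 points, built from K = 2 components.
rng = np.random.default_rng(42)
V = rng.random((30, 2)) @ rng.random((2, 3))

W, H, losses = nmf_multiplicative(V, K=2)
print(W.shape, H.shape, losses[0], losses[-1])
```

Each column of W plays the role of a spectral component, and each column of H holds the weights with which the components are combined for one spectrum.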

Fig. 5. Mean pairwise Chebyshev distance $d_{\mathrm{Ch}}$ for (a) the spectral components W and (b) their weights H as a function of the number of extracted spectral components K. The following NMF distance measures are compared: the squared Eu-distance, the generalized KL-divergence, and the IS-divergence. The curves indicate the mean over seven different FFT windows, with the first and third quartiles defining the lower and upper confidence intervals, respectively.
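A mean pairwise Chebyshev distance can be computed as in the following sketch, where the Chebyshev distance between two vectors is the largest absolute difference between their entries. Taking the pairs among the columns of a single matrix is an assumption made for illustration; which vectors are paired in the analysis is defined elsewhere in the paper.

```python
import numpy as np
from itertools import combinations

def mean_pairwise_chebyshev(X):
    """Mean Chebyshev distance over all pairs of columns of X."""
    columns = X.T
    distances = [np.max(np.abs(a - b)) for a, b in combinations(columns, 2)]
    return float(np.mean(distances))

# Toy matrix with three column vectors: (0, 0), (1, 2), and (3, 0).
X = np.array([[0.0, 1.0, 3.0],
              [0.0, 2.0, 0.0]])
d_ch = mean_pairwise_chebyshev(X)
print(d_ch)
```

A larger value indicates that the compared vectors differ strongly in at least one entry, that is, that they are well separated from each other.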

Fig. 6. Comparison of three different distance measures with an FPM. An FPM shows the frequency and amplitude of the voltage signal as a function of the electrical position. (a) FPA event with a quench in sector 78 on March 31, 2021 with an FPM. The solid white arrow marks the quenched magnet, while the empty arrow marks its physical neighbors. This FPA event was reconstructed with K = 7 and (b) the Eu-distance, (c) the KL-divergence, and (d) the IS-divergence. To facilitate visual comparison, the values from the input FPA event in (a) are subtracted from those of the reconstructed events. The absolute values of the subtractions are shown as an FPM in (b)-(d). Circled areas highlight incomplete reconstructions.
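The difference maps in (b)-(d) amount to an elementwise absolute difference on a restricted frequency range, as sketched below. The frequency axis, the random input, and the perturbed stand-in for the reconstruction are hypothetical; only the 154 electrical positions and the 220 Hz limit are taken from the text.

```python
import numpy as np

freqs = np.arange(0.0, 500.0, 10.0)   # hypothetical frequency axis in Hz
n_positions = 154                      # magnets per circuit

rng = np.random.default_rng(0)
V = rng.random((freqs.size, n_positions))            # stand-in for the input FPM
V_hat = V + 0.01 * rng.standard_normal(V.shape)      # stand-in for a reconstruction

# Keep only frequencies below 220 Hz and take the absolute difference.
keep = freqs < 220.0
abs_diff = np.abs(V_hat[keep] - V[keep])

# Locate the frequency and electrical position with the largest difference.
worst_bin, worst_position = np.unravel_index(np.argmax(abs_diff), abs_diff.shape)
print(freqs[keep][worst_bin], worst_position)
```

A perfect reconstruction would give a map that is zero everywhere, so any structure left in the map points to frequencies or positions the chosen distance measure did not capture.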

Fig. 6(b)-(d) shows the absolute difference $|v_{i,j} - \hat{v}_{i,j,*}|$ between the NMF reconstructions $\hat{v}_{i,j,*}$ and the input $v_{i,j}$ for the different distance measures discussed above. Only frequencies below 220 Hz are included for better visibility. The example aims at comparing the reconstructions for the Eu-distance, KL-divergence, and IS-divergence using K = 7 and a Hanning FFT window function. The colored spots in the FPMs [see Fig. 6(b)-(d)] show electrical positions and frequencies with a reconstruction difference. The darker the color of the points, the larger the reconstruction difference. If the reconstruction were identical to the input and the reconstruction difference were zero, the plots would be completely white. In Fig. 6(b), small reconstruction differences are visible in the low-frequency range for the Eu-distance. Voltage amplitudes in this range are generally higher. Reconstructions optimized with the Eu-distance demonstrate superior performance in reconstructing lower frequencies compared to the two other measures. At 110, 150, and 180 Hz, significant reconstruction differences are visible as dark spots, highlighted by dashed ellipses. At these frequencies, the amplitudes are lower and are therefore not taken into account by the Eu-distance. In comparison, for the KL-divergence, more significant reconstruction differences are visible as dark spots in the low-frequency range in Fig. 6(c). In contrast, the reconstruction differences at 150 and 180 Hz are smaller than for the Eu-distance measure. This can be explained by the fact that the KL-divergence reflects the relative entropy. Hence, instead of reconstructing the low frequencies with high amplitude more accurately, the entropy is optimized by reconstructing frequencies with lower amplitudes as well.

Fig. 7. Frequency and amplitude of the identified seven spectral components as a function of the electrical position. (a) FPM of the frequencies occurring in the voltage signal, measured [0.2; 0.575] seconds after the triggering of the FPA of sector 78 on March 31, 2021. (b)-(h) show $\hat{v}_{i,j}$ for each spectral component j, used to reconstruct the initial FPM in (a). For better visibility, the maximum of the color axis is scaled with $10^{x}$ V. Additionally, the frequency range is restricted to 0-220 Hz, in which the majority of the spectral components occur.

Fig. 8. Boxplot of the 10 FPA events with the smallest median p-values. For each event, the quenched magnet is given on the x-axis. Each box extends from the first to the third quartile, with a horizontal line at the median. The lines extending from the box further show the variability of p-values. They cover a range of 1.5 times the box length, either subtracted from the first quartile or added to the third quartile. The orange dashed line indicates the confidence interval of 95% at a p-value of 0.05, while the red line indicates the 99% confidence interval at a p-value of 0.01.

TABLE II LIST OF DETECTED ANOMALIES WITH RECOMMENDED MAINTENANCE ACTIONS IN THE REMARKS COLUMN