Principal Components of Neural Convolution Filters

Convolutions in neural networks are still essential in various vision tasks. To develop neural convolutions, this study focuses on the Structured Receptive Field (SRF), which represents a convolution filter as a linear combination of designed components, each acting over a wide area of the filter. Although SRF can represent convolution filters with fewer components than the number of filter bins, N-Jet, the sole implemented component system, requires ten trainable parameters per filter to improve accuracy, even for $3 \times 3$ convolutions. Hence, we aim to formulate a new component system for SRF that can represent valid filters with fewer components. Our component system, named “OtX,” is based on Principal Component Analysis of well-trained filter weights, because the extracted components are expected to be principal for neural convolution filters in general. In addition to proposing the component system, we develop a component scaling method that absorbs the massive scale differences among the coefficients in a linear combination of OtX components. In the experimental section, we train image classification models on the CIFAR-100 dataset under the hyperparameters tuned for the original models with standard convolutions. For the NFNet-F0 classifier, OtX with six components performs 0.5% better than the standard convolution, 3.1% better than N-Jet with six components, and only 0.1% worse than N-Jet with ten components. Besides, OtX with nine components provides more stable training than N-Jet, performing 0.5% better than the standard convolution for NFNet-F0. OtX is well suited to replacing standard convolutions because it performs at least comparably to N-Jet with better parameter efficiency and training stability.


I. INTRODUCTION
Neural Networks (NNs) are essential in vision tasks because of their outstanding performance. Convolutional Neural Networks (CNNs), which mainly consist of convolution layers, are a common form. CNNs sequentially convolve feature maps and extract features suitable for each task, and their versatility has made them predominant in the image processing field. Recent studies actively develop non-CNN architectures because a single convolution cannot merge two related but distant features. Vision Transformer (ViT) [1] and its derivatives consist of multi-head self-attentions (MSAs) [2] and multi-layer perceptrons (MLPs). MSAs directly compare features regardless of their distance and determine the output value based on the comparison result. MLP-Mixer [3] and its derivatives adopt spatial MLPs instead of MSAs to capture long-range dependencies. Although this trend may seem to push convolution layers out of NNs, their importance is being re-acknowledged. Convolution, a simple aggregation of local features, extracts features better than MSA, especially from less processed maps [4]. Besides, although convolution layers seem to be absent from ViT derivatives, most patch aggregations are equivalent to convolution layers whose kernel size and stride both equal the patch size. Implementations indeed use convolution layer classes for this step, except when standardization and an affine transform are applied to each patch. Thus, even though the current trend focuses on capturing long-range dependencies, the convolution layer is an essential component of NNs for vision tasks.
The associate editor coordinating the review of this manuscript and approving it for publication was Chaker Larabi.
Designing convolution filters is a way of developing convolution layers in neural networks. In this paper, the word ''filter'' means a two-dimensional convolution filter composed of weights for signals on every offset in its receptive field. A filter affects a channel in the input three-dimensional feature map and maps to a channel of the output feature map. Convolution filter design (CFD) aims to develop convolution filters in hopes of improving the quality or reducing the trainable parameters by modifying the structure of a filter from an array of trainable parameters. This study is a CFD work.
Let us first define variables for designing filters. A convolution layer convolves an input feature map $X \in \mathbb{R}^{C_\mathrm{in} \times D_1 \times D_2}$ with height $D_1$, width $D_2$, and $C_\mathrm{in}$ channels. Assuming that downsampling with strides larger than one is done after the convolution as needed, the output of the convolution $Y$ has $C_\mathrm{out} \times D_1 \times D_2$ bins. Although the numbers of channels $C_\mathrm{in}$ and $C_\mathrm{out}$ are fixed for each convolution layer, the spatial resolutions $D_1$ and $D_2$ are arbitrary. The filter weights then consist of $C_\mathrm{out} C_\mathrm{in}$ filters with a particular kernel size of $K_1 \times K_2$. With these variables, a convolution is described as:

$$Y_{c_\mathrm{out},h,w} = \sum_{c_\mathrm{in}=1}^{C_\mathrm{in}} \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} W_{c_\mathrm{out},c_\mathrm{in},i,j} \, X_{c_\mathrm{in},\,h+\Delta h_i,\,w+\Delta w_j} \tag{1}$$

where $c_\mathrm{out}$ and $c_\mathrm{in}$ are the indices of the output and input channels, $h$ and $w$ are the vertical and horizontal positions in the maps, $i$ and $j$ are the vertical and horizontal positions in the filter, and $W_{c_\mathrm{out},c_\mathrm{in}}$ is the filter for the channel pair. For each pair $(c_\mathrm{out}, c_\mathrm{in})$, a convolution is a weighted summation of $X_{c_\mathrm{in}}$ based on the importance of the position that is $(\Delta h_i, \Delta w_j)$ away from $(h, w)$. Note that positions outside $X$ are filled with a value such as zero. In this paper, each two-dimensional filter $W_{c_\mathrm{out},c_\mathrm{in}}$ is regarded as a linear combination of two-dimensional filter components $\{F_\lambda\}_{\lambda=1,2,\cdots,\Lambda}$:

$$W_{c_\mathrm{out},c_\mathrm{in}} = \sum_{\lambda=1}^{\Lambda} \theta_{c_\mathrm{out},c_\mathrm{in},\lambda} F_\lambda \tag{2}$$

where $\theta \in \mathbb{R}^{C_\mathrm{out} \times C_\mathrm{in} \times \Lambda}$ is the array of coefficients for the filter components $\{F_\lambda\}$. This representation is seen in [5]. The coefficients correspond to the (sometimes scaled) trainable weight parameters of a convolution layer. In the case that a filter is represented as an array of $K_1 \times K_2$ parameters, each filter component $F_\lambda$ is described as:

$$(F_\lambda)_{i,j} = \delta_{i,i_\lambda} \delta_{j,j_\lambda} \tag{3}$$

where $\delta_{\cdot,\cdot}$ is the Kronecker delta and $(i_\lambda, j_\lambda)$ is the filter position assigned to the $\lambda$-th component. Conventional convolution layers implicitly assume these standard bases for each filter, and the coefficients for the standard bases are optimized through training. This primitive approach makes every $K_1 \times K_2$ filter representable. However, the effective area of each parameter is strictly restricted.
This restriction decreases the worth of every parameter, making it highly dependent on its offset. In contrast, Structured Receptive Field (SRF) [5] employs filter components each of which can act as a meaningful filter. Introducing SRF changes the role of each parameter in a convolution layer from the importance of an offset to the importance of the corresponding component. When a filter is a Gaussian filter, for example, the conventional components require cooperatively tuned parameters, whereas only one parameter is required when the component system contains the Gaussian filter component. This example implies that a well-designed system reduces the number of non-zero parameters, making the optimization simpler and easier. Thus, how to construct filter components is essential, and this is the main topic of this study. N-Jet [6] was introduced as an SRF component system in [5] and is the sole component system formulated for SRF; namely, all SRF applications have been implemented with N-Jet components. N-Jet components act as local differential operators whose orders are m for the vertical axis and n for the horizontal axis. With these local differential filter components, N-Jet approximates local signals by their Taylor expansion and weights each polynomial signal component. Although extracting features from the Taylor expansion is a versatile approach in natural science, N-Jet still has two problems as a system of SRF components. The first problem is that SRF on N-Jet performs worse than the conventional convolution when trained on sufficiently large-scale datasets such as the ImageNet dataset [7]. This problem, although it is not addressed in this paper, shows that there is room for investigating another component system for SRF. The second problem is that N-Jet components are highly correlated with one another. Although this property guarantees that N-Jet practically requires at most fifteen components, it also reduces the efficiency of each component.
For example, N-Jet on SRF with six components cannot outperform the conventional convolution. Composing an SRF with more efficient and less correlated components will make training easier and improve performance with fewer components. This paper proposes a new SRF component system, ''OtX,'' 1 as the second formulation of SRF. OtX is formulated by modeling the implicit principal components of well-trained neural convolution filters. Thus, OtX has orthogonality; in other words, the components of OtX are not correlated with one another. We also define a rule for ordering OtX components, which helps in picking a finite number of efficient components to train. Since OtX reveals efficient components that characterize neural convolution filters, SRF on OtX requires fewer components than SRF on N-Jet to outperform the standard representation. Like N-Jet, OtX is formulated based on the Hermite polynomials and the Gaussian function. The main change is the employment of radially symmetric components and line-symmetric components rotated by $\pi/4$. This change admits inseparable components, in contrast to N-Jet components, which are all separable. In addition, we propose a component scaling to make the training of SRF on OtX easier. Whereas neither of the two performs well on its own, the combined use of OtX and the component scaling generates a synergistic effect. The component scaling makes OtX comparable with N-Jet as an SRF component system.
1 The name ''OtX'' is only a sequence of symbol characters indicating three symmetry types. The character ''O'' symbolizes radial symmetry, ''t'' symbolizes line symmetry with $\phi = 0, \pi/2$, and ''X'' symbolizes line symmetry with $\phi = \pi/4$, where $\phi$ is introduced in Section IV.
VOLUME 10, 2022
Summarizing the description above, the contributions of this paper are as follows.
• We analyze the implicit principal components of neural convolution filters and reasonably formulate them. In the formulation, we also generalize the rule for the ordering of components.
• We apply the formulated OtX system as a new candidate for SRF components, and OtX provides a denser filter representation than N-Jet. Namely, training with OtX can obtain better filters with fewer component candidates than with N-Jet.
• We propose a component scaling method that helps the training of OtX filters.
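As an illustration of the linear-combination representation in Eq. 2, the following NumPy sketch composes filters from coefficients and a bank of components. It is not the authors' implementation: the function names `compose_filters` and `standard_basis` and the row-major assignment of one-hot components to filter positions are assumptions made for this example. With one-hot (Kronecker delta) components, the composed filters reduce to an ordinary array of trainable weights.

```python
import numpy as np

def compose_filters(theta, components):
    """Combine components into filters.

    theta:      coefficients, shape (C_out, C_in, L)
    components: component bank, shape (L, K1, K2)
    returns:    filters, shape (C_out, C_in, K1, K2)
    """
    return np.einsum("oil,lhw->oihw", theta, components)

def standard_basis(k1, k2):
    """One-hot components: component lam is 1 at (lam // k2, lam % k2), else 0."""
    return np.eye(k1 * k2).reshape(k1 * k2, k1, k2)

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 3, 9))            # C_out=4, C_in=3, nine components
filters = compose_filters(theta, standard_basis(3, 3))
```

Replacing `standard_basis` with any other bank of shape `(L, K1, K2)` changes what each coefficient means without changing the convolution itself.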

II. RELATED WORKS
This work aims to improve neural convolution layers. In this section, we introduce related works that have nominally similar purposes. Not all related works conflict with this study, and our proposals can be applied to some of them. Works that can be combined with OtX are introduced in the latter part of this section. Neural convolution filter designs that conflict with this work are categorized into two types. Works of the first type align trainable parameters symmetrically, so that one parameter of a convolution filter affects multiple offsets in the filter. This sharing reduces the number of trainable parameters. The work in [8] develops horizontally symmetric alignments of parameters. SymKer [9] replicates a unit of parameters to form a filter. SymNet [10] uses three types of radially symmetric filters. Filters in these works are not spatially biased. However, the available filter shapes in these works are too restricted to fully maintain the performance of neural networks. Hence, the works conclude only that their proposals can reduce trainable parameters without salient performance degradation. Besides, these works cannot keep the number of parameters small for large receptive fields because the number scales linearly with $K_1 K_2$. In contrast, works of the second type succeed in small-scale experiments, and their number of parameters is independent of the receptive field size. These works design the filter components of the linear combination representation and train the coefficients. N-Jet [6] was introduced as the first implementation of a component system (originally called ''bases'') for the Structured Receptive Field [5]. N-Jet components are tensor products of the vertical and the horizontal Gaussian derivatives. Precisely, a continuous N-Jet component $J_{m,n}$, whose order for $x$ is $m$ and whose order for $y$ is $n$, is defined as:

$$J_{m,n}(x, y) = \frac{\partial^m}{\partial x^m} G(x) \, \frac{\partial^n}{\partial y^n} G(y)$$

where $x$ and $y$ denote the offsets in the horizontal and the vertical directions, respectively, and $G$ denotes the one-dimensional Gaussian function.
A Gaussian derivative is correlated with another that has the same order parity. Thus, the available pairs of Gaussian derivatives for an N-Jet component are limited to those whose total order is at most 4. N-Jet components are similar to $\{\Psi^{(0)}_{m,n}\}$ in OtX, but OtX components are not correlated with one another because of their orthogonality. SRF can choose an upper limit on the total order of the two Gaussian derivatives, so the number of parameters for a filter can be 1, 3, 6, 10, or 15. This limitation makes it difficult to reduce the total number of trainable parameters for $3 \times 3$ convolutions, which are the most widely used. To use fewer than nine parameters per filter, OtX can keep up to eight components, whereas N-Jet must drop to six. N-Jet on SRF outperforms standard convolution filters on tasks of smaller scale than the ImageNet classification with 1000 image classes. FracSRF [11] forms a filter as a tensor product of two approximated fractional derivatives of the Gaussian. FracSRF needs three parameters per filter: the derivative orders for the vertical and the horizontal directions and the scale of the tensor product. FracSRF performs comparably to SRF despite having fewer parameters. Because all filters in FracSRF are separable, applying OtX to FracSRF is difficult.
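The component counts available to N-Jet can be checked by enumeration. The sketch below assumes one component per pair of derivative orders $(m, n)$ with $m + n$ not exceeding the chosen upper limit; the helper name `njet_orders` is hypothetical. Under that assumption the count for an upper limit $N$ is $(N+1)(N+2)/2$, which gives six, ten, and fifteen components for limits of 2, 3, and 4.

```python
def njet_orders(max_total_order):
    """All (m, n) derivative-order pairs with m + n <= max_total_order."""
    return [(m, total - m)
            for total in range(max_total_order + 1)
            for m in range(total + 1)]

# Number of components for upper limits 0, 1, 2, 3, 4 on the total order
counts = [len(njet_orders(limit)) for limit in range(5)]
```

Because the counts jump from six to ten, no N-Jet setting lies between six and nine parameters per filter, which is the gap discussed above for $3 \times 3$ convolutions.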
Not all N-Jet applications conflict with OtX. N-Jet Net [12] is an expansion of SRF such that the receptive field scales are changed, corresponding to the scale parameter of Gaussian. The nature that SRF components have non-zero values in a particular area of a receptive field enables this expansion. Since OtX components also have this nature, OtX can be expanded similarly to N-Jet Net. For the same reason, OtX can apply to SESN [13], which uses a filter on multiple scales. Note that components in SESN are implemented based on {ψ n } which are not Gaussian derivatives.
Weight standardization [14] is a powerful development for convolution weights. Weight standardization adjusts the mean and the variance of weight values for every output channel to be 0 and 1, respectively. Scaled weight standardization in [15] scales the standardized weights with trainable gain parameters defined for output channels of convolutions. Applying OtX to these causes a problem: subtracting the mean from the filters loses the role of the components. To avoid this, we slightly modify the way of standardizing weights. We divide weights by their second moment around 0 instead of their standard deviations without subtracting their means. This modified scaled weight standardization is used in our experiments' NFNet [16] implementation.
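The modified standardization described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the divisor is taken to be the square root of the second moment around zero (so that it has the same units as a standard deviation), the statistics are computed per output channel, and any additional fan-in scaling used by NFNet is omitted. The function name `modified_scaled_ws` and the `eps` term are hypothetical.

```python
import numpy as np

def modified_scaled_ws(weight, gain, eps=1e-8):
    """Scale weights per output channel without subtracting the mean.

    weight: array of shape (C_out, C_in, K1, K2)
    gain:   trainable gain per output channel, shape (C_out,)
    """
    # Root of the second moment around zero for each output channel
    rms = np.sqrt(np.mean(weight ** 2, axis=(1, 2, 3), keepdims=True))
    return gain[:, None, None, None] * weight / (rms + eps)

rng = np.random.default_rng(0)
weight = rng.normal(loc=2.0, size=(8, 16, 3, 3))   # deliberately non-zero mean
gain = np.linspace(0.5, 2.0, 8)
standardized = modified_scaled_ws(weight, gain)
```

Since no mean is subtracted, a filter that is a pure multiple of one component stays a multiple of that component after the scaling.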
Deformable convolution [17], [18] has flexible offsets, reference positions for convolutions. The offsets are calculated by looking around each position. Then, standard convolution layers are used as the explorers. We can replace the standard convolution with OtX, which is a way of applying OtX to Deformable convolution.

III. ANALYZING IMPLICIT PRINCIPAL COMPONENTS OF NEURAL CONVOLUTION FILTERS
In this section, we analyze convolution filter weights in a well-trained CNN and present our OtX design policy. The CNN model used for the analysis is a VGG16 classifier [19] trained on the ImageNet dataset [7]. We use the weights from a model zoo, PyTorch Image Models [20]. Figure 1 shows the result of Principal Component Analysis (PCA) [21], [22] of the filter weights for each convolution layer. At first glance, the principal components of the layers share almost the same characteristics. PCA extracts orthonormal bases from a set of vectors, and the bases are efficient for low-rank approximation of the vectors in the set. Note that the sign of each component can be ignored because the component is used in a linear combination. The bar graphs in Figure 1 visualize the standard deviations along the components; components with larger standard deviations approximate the filters with smaller errors. Namely, components further to the left in the figure are more critical for characterizing the filter weights of the convolution layers. This order is almost the same among all the layers, which may imply the existence of a general, efficient design of components for characterizing convolution filters.
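A per-layer analysis of this kind can be sketched with a singular value decomposition. The example below is an assumption-laden illustration: it uses random numbers as a stand-in for trained weights, flattens each two-dimensional filter into a vector, and centers the vectors before the decomposition (standard PCA; whether the original analysis centers the data is not stated). The function name `filter_pca` is hypothetical.

```python
import numpy as np

def filter_pca(weight):
    """PCA over the 2-D filters of one convolution layer.

    weight: array of shape (C_out, C_in, K1, K2)
    returns (components, stds): components has shape (K1*K2, K1, K2) and is
    sorted by decreasing standard deviation; stds has shape (K1*K2,).
    """
    c_out, c_in, k1, k2 = weight.shape
    vectors = weight.reshape(c_out * c_in, k1 * k2)
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, ordered by singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    stds = singular_values / np.sqrt(len(vectors) - 1)
    return vt.reshape(-1, k1, k2), stds

rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 32, 3, 3))   # stand-in for trained layer weights
components, stds = filter_pca(weight)
```

With real trained weights in place of the random stand-in, `components` would play the role of the images in Figure 1 and `stds` the role of the bar graphs.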
Formulating and applying these components will bring some benefits. For example, optimizations with carefully selected components will reduce the number of trainable parameters without significant deterioration. Thus, the remaining part of this section analyzes the PCA results and designs the OtX formulation policy.
(1) Every component is symmetric around its center point or about two lines through its center.
(2) The absolute values at outer points of a component tend to be smaller than those at inner points, except on the symmetry axes of a component that has odd symmetry. This result follows the intuition that signals at closer positions are more critical, since convolution is a way of aggregating neighboring information. The decay rate of the weights becomes smaller as the layer becomes farther from the input. Thus, this decay rate should be adjustable for each layer.
(3) Some components appear next to a rotated copy of themselves. There are two types of angle differences: $\pi/2$ and $\pi/4$. Components with odd symmetry along either the horizontal or the vertical direction are of the former type, and pairs of components in which one has odd symmetry in both the horizontal and the vertical directions are of the latter type. The other component of a latter-type pair has almost the same odd symmetry in two directions. In contrast, some components have no pair and coincide with their own $\pi/2$ rotation. These components can be regarded as radially symmetric components, which are isotropic: each weight is determined by the distance from the center of the component.
(4) The standard deviations of the components seem to follow a Gaussian curve. Layers closer to the input show a slower decay of the standard deviations. Notably, the ninth components have at most approximately $1/10$ of the standard deviation of the first components, so their contributions will be slight.
Let us summarize the above discussion. The components are orthogonal to one another, and positions farther from the center have smaller absolute values. The components are categorized into two types, radially symmetric and line-symmetric, and each component of the latter type has a partner that coincides with it when rotated. A line-symmetric component has at least one direction of odd symmetry, and the rotation angle for the overlap is $\pi/2$ if it has only one such direction or $\pi/4$ if it has two. For a line-symmetric component with two odd symmetries, the two symmetries are the same. In the next section, we concretely formulate filter components that have these properties.

IV. FORMULATION
In this section, we formulate filter components and their priorities. First, we introduce the one-dimensional orthogonal Hermite-Gaussians $\{\psi_n\}$ and note their basic properties. Second, we extend $\{\psi_n\}$ into two-dimensional functions in two ways.

A. ONE-DIMENSIONAL ORTHOGONAL HERMITE-GAUSSIANS
In this paper, we use the ''physicist's Hermite polynomials.'' Let $n$ be a non-negative integer. Then, the $n$-degree Hermite polynomial $H_n(x)$ is defined by the recurrence relation

$$H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)$$

where $H_0(x) = 1$ and $H_1(x) = 2x$. Hermite polynomials $\{H_n(x)\}$ are orthogonal with respect to the weight function $\exp(-x^2)$, namely

$$\int_{-\infty}^{\infty} H_m(x) H_n(x) \exp(-x^2) \, dx = \sqrt{\pi} \, 2^n n! \, \delta_{m,n}.$$

Thus, for the one-dimensional Hermite-Gaussians $\{\psi_n\}$ to form an orthogonal system of functions, we have

$$\psi_n(x) = H_n(x) \exp\left(-\frac{x^2}{2}\right).$$

Note that N-Jet components [5] are constructed based on Gaussian derivatives $\frac{\partial^n}{\partial x^n} \exp(-x^2)$, and that function system does not have orthogonality. Figure 3 shows graphs of normalized $\psi_n(x)$ and Gaussian derivatives for comparison.
If n is even, ψ n is an even function, and if n is odd, ψ n is an odd function. Therefore, ψ n (x) is either symmetric or antisymmetric about x = 0. This symmetric nature guarantees that filter components' weights are aligned symmetrically.
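The orthogonality and parity of the Hermite-Gaussians can be checked numerically. The sketch below assumes the unnormalized form $\psi_n(x) = H_n(x)\exp(-x^2/2)$ and uses `numpy.polynomial.hermite.hermval`, which evaluates physicists' Hermite series. The integration range, the grid spacing, and the Riemann-sum approximation of the inner products are choices made for this example only.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def psi(n, x):
    """Unnormalized Hermite-Gaussian H_n(x) * exp(-x^2 / 2)."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                      # select the single term H_n
    return hermval(x, coeffs) * np.exp(-x ** 2 / 2)

x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]
basis = np.stack([psi(n, x) for n in range(5)])

# Riemann-sum approximation of all pairwise inner products
gram = basis @ basis.T * dx

# Closed-form squared norms sqrt(pi) * 2^n * n! on the diagonal
expected = np.diag([math.sqrt(math.pi) * 2 ** n * math.factorial(n)
                    for n in range(5)])
```

The off-diagonal entries of `gram` vanish up to rounding error, while the same check with Gaussian derivatives in place of `psi` would leave non-zero entries between orders of equal parity.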

B. ORTHOGONAL SYSTEM OF HERMITIAN FILTER COMPONENTS
1) EVEN DEGREE RADIAL SYMMETRIC FILTER COMPONENTS
We introduce a two-dimensional filter component system $\{\Psi_n\}$ by rotating $\psi_n$ around the origin of an $\mathbb{R}^2$ plane. $\Psi_n$ is defined if and only if $n$ is even because each $\Psi_n$ must be unchanged when rotated by $\pi$. We assume an $x$-$y$ plane whose origin is the center of the filter components and on which the shorter side of the components corresponds to $[-1, 1]$. $\Psi_n(x, y)$ is then defined radially, through $\psi_n$ and the distance $\sqrt{x^2 + y^2}$ from the origin, where $\sigma$ is a trainable parameter that represents the spatial scale of the components. Each convolution layer has one $\sigma$. For larger $\sigma$, the filter components are truncated by the filter boundary, and for sufficiently small $\sigma$, the components can have sufficient non-zero parts within the filter. Components shown as (a), (d), and (i) in Figure 2 are instances of $\Psi_n$. Note that $\Psi_0$ is a Gaussian filter component. If $n \geq 2$, $\Psi_n$ cannot be represented as $\Psi^{(\phi)}_{m,n}$ because $\Psi_n$ is not separable for $n \geq 2$. In addition, $\{\Psi_n\}$ is an orthogonal system of $L^2(\mathbb{R}^2)$, as straightforwardly derived from the orthogonality of $\{\psi_n\}$. Therefore, each $\Psi_n$ can be a component obtained from a PCA if we ignore region truncation errors.

2) LINE SYMMETRIC FILTERS
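We assume the same $x$-$y$ plane as in Section IV-B1. Let a line-symmetric filter component be

$$\Psi^{(\phi)}_{m,n}(x, y) = \psi_m\left(\frac{x_\phi}{\sigma}\right) \psi_n\left(\frac{y_\phi}{\sigma}\right)$$

where $(x_\phi, y_\phi)$ is the point obtained by rotating the original position $(x, y)$ by $\phi$ around the origin of the $x$-$y$ plane. Namely,

$$\begin{pmatrix} x_\phi \\ y_\phi \end{pmatrix} = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$$

As with $n$ of $\{\Psi_n\}$, there are conditions on the available $(m, n, \phi)$ combinations of $\{\Psi^{(\phi)}_{m,n}\}$. Note that these are not limitations derived from the nature of $\psi_n$ but conditions imposed so that the components match the analysis result in Section III. The available $(m, n, \phi)$ combinations are the elements of the set

$$\left\{(m, m+1, \phi) \mid m \geq 0,\ \phi \in \{0, \pi/2\}\right\} \cup \left\{(m, m, \phi) \mid m \text{ is odd},\ \phi \in \{0, \pi/4\}\right\}.$$

The available combinations of $m$ and $n$ are strictly limited to $n = m$ and $n = m + 1$, and only two values of $\phi$, namely 0 and one other value, are defined for each relationship. The existence of $\Psi_n$ prohibits the combinations with even $m = n$. If $m = n$ is odd, the relationship

$$\Psi^{(\pi/2)}_{m,m} = -\Psi^{(0)}_{m,m}$$

holds, and that is the reason why $\phi = \pi/4$ is used for odd $m = n$ combinations. Components shown as (b), (c), (e), (f), (g), and (h) in Figure 2 are instances of $\Psi^{(\phi)}_{m,n}$. The system $\{\Psi^{(\phi)}_{m,n}\}$ over the available $(m, n, \phi)$ combinations is an orthogonal system, and each line-symmetric component in the system is orthogonal to all radially symmetric components. Therefore, elements of $\{\Psi^{(\phi)}_{m,n}\}$ can be components of a PCA result simultaneously with the $\{\Psi_n\}$ components if we ignore region truncation errors.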
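The role of the rotation angle can be illustrated numerically. The sketch below rests on several assumptions that are not confirmed by the text: a line-symmetric component is taken to be the product of two one-dimensional Hermite-Gaussians evaluated at the rotated coordinates divided by $\sigma$, the rotation is counterclockwise, and the grid is a dense stand-in for the continuous plane. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def psi(n, x):
    """Unnormalized Hermite-Gaussian H_n(x) * exp(-x^2 / 2)."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    return hermval(x, coeffs) * np.exp(-x ** 2 / 2)

def line_symmetric(m, n, phi, x, y, sigma=1.0):
    """Assumed form: psi_m(x_phi / sigma) * psi_n(y_phi / sigma)."""
    # Rotate the evaluation point (x, y) by phi around the origin
    x_phi = np.cos(phi) * x - np.sin(phi) * y
    y_phi = np.sin(phi) * x + np.cos(phi) * y
    return psi(m, x_phi / sigma) * psi(n, y_phi / sigma)

grid = np.linspace(-6.0, 6.0, 241)
x, y = np.meshgrid(grid, grid)

odd_0 = line_symmetric(1, 1, 0.0, x, y)
odd_90 = line_symmetric(1, 1, np.pi / 2, x, y)    # only a sign flip of odd_0
odd_45 = line_symmetric(1, 1, np.pi / 4, x, y)    # a genuinely new component
even_0 = line_symmetric(2, 2, 0.0, x, y)
even_90 = line_symmetric(2, 2, np.pi / 2, x, y)   # identical to even_0

inner_product = np.sum(odd_0 * odd_45) * (grid[1] - grid[0]) ** 2
```

For odd $m = n$, a rotation by $\pi/2$ therefore adds nothing new, whereas the $\pi/4$ rotation yields a component orthogonal to the unrotated one.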

C. ORDER OF FILTER COMPONENTS
In Section IV-B, we defined two types of filter component systems, $\{\Psi_n\}$ and $\{\Psi^{(\phi)}_{m,n}\}$, as candidates for each $F_\lambda$ in Eq. 2. In practice, we must pick a finite number of components from infinitely many candidates. Although this problem is, in general, one of finding the optimal combination, in this paper we give a score to each component and greedily pick the components with the smallest scores. We define the score as the total degree of the one-dimensional orthogonal Hermite-Gaussians in the formulation of a component: the score is $n$ for $\Psi_n$ and $m + n$ for $\Psi^{(\phi)}_{m,n}$. Note that, in principle, we prefer radially symmetric components over line-symmetric ones with the same score. There is an exception in the case of $2 \times 2$ filters, where the order of the components runs from $\Psi_0$ to $\Psi_{1,1}$ without $\Psi_2$ because the sampling result of $\Psi_2$ equals that of $\Psi_0$. This ordering by score aligns with the component order in the PCA, as shown in Figure 2.
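The greedy selection by score can be sketched in plain Python. The enumeration below assumes the availability conditions of Section IV-B (radial components for even degrees; line-symmetric components with orders $(m, m+1)$ at angles 0 and $\pi/2$, or with odd $m = n$ at angles 0 and $\pi/4$) and breaks ties in favor of radial components. The string labels and the function name `otx_order` are hypothetical, and the $2 \times 2$ exception is not handled.

```python
def otx_order(count):
    """Return labels of the first `count` components, ordered by score."""
    candidates = []
    # Every score contributes at least one component, so scores below
    # `count` always supply enough candidates.
    for score in range(count):
        if score % 2 == 0:
            # Radial component of even degree; rank 0 wins ties
            candidates.append((score, 0, f"radial n={score}"))
            m = score // 2
            if m % 2 == 1:
                candidates.append((score, 1, f"line m={m} n={m} phi=0"))
                candidates.append((score, 1, f"line m={m} n={m} phi=pi/4"))
        else:
            m = (score - 1) // 2
            candidates.append((score, 1, f"line m={m} n={m + 1} phi=0"))
            candidates.append((score, 1, f"line m={m} n={m + 1} phi=pi/2"))
    candidates.sort(key=lambda item: (item[0], item[1]))   # stable sort
    return [label for _, _, label in candidates[:count]]

first_nine = otx_order(9)
```

Under these assumptions the first nine labels follow the pattern radial, two line-symmetric, radial, four line-symmetric, radial, which matches the assignment of panels (a) to (i) in Figure 2 described in Section IV-B.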

D. SAMPLING FROM THE CONTINUOUS FILTER COMPONENTS
The proposed filters are no longer arrays of trainable parameters, but they are still $K_1 \times K_2$ numerical arrays. Thus, the continuous functions defined in Section IV-B must be sampled into $K_1 \times K_2$ arrays. In our definition of the $x$-$y$ plane, the shorter side of a filter corresponds to $[-1, 1]$. We pick the sample points $\{(x_j, y_i)\}$ for the array indices $\{(i, j)\}$ at regular intervals on the $x$-$y$ plane, with the interval determined by $K = \min\{K_1, K_2\}$. After the filter component value is calculated for each $(x_j, y_i)$, the array is divided by its $L^2$ norm. This normalization guarantees that the array norm stays at 1 even if the component is truncated by the filter boundary.
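The sampling and normalization steps can be sketched as below. The exact sample coordinates are not recoverable from the text, so this example assumes pixel-center sampling, in which index $k$ along a side of length $K_s$ maps to $(2k + 1 - K_s)/K$ with $K = \min\{K_1, K_2\}$. The Gaussian stand-in component, its `sigma` value, and the function names are likewise hypothetical.

```python
import numpy as np

def sample_component(func, k1, k2):
    """Sample a continuous component on a K1 x K2 grid and L2-normalize it.

    func: callable taking arrays (x, y) and returning component values
    """
    k = min(k1, k2)
    # Assumed pixel-center coordinates; the shorter side spans [-1, 1]
    ys = (2 * np.arange(k1) + 1 - k1) / k
    xs = (2 * np.arange(k2) + 1 - k2) / k
    x, y = np.meshgrid(xs, ys)           # x varies along columns, y along rows
    values = func(x, y)
    return values / np.linalg.norm(values)

def gaussian(x, y, sigma=0.5):
    """Stand-in for the Gaussian filter component."""
    return np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))

kernel = sample_component(gaussian, 3, 5)
```

Dividing by the norm after sampling keeps every sampled component on the same scale, regardless of how much of the continuous function falls outside the filter.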

V. COMPONENT SCALING
The PCA result shows that the components in a layer have quite different coefficient variances. Thus, we attempt to create such differences in the variances among components. While a coefficient is usually one trainable parameter, we represent a coefficient as a product of two kinds of trainable parameters. Specifically,

$$\theta_{c_\mathrm{out},c_\mathrm{in},\lambda} = \alpha_{c_\mathrm{out},c_\mathrm{in},\lambda} \, \beta_\lambda$$

where $\alpha_{c_\mathrm{out},c_\mathrm{in},\lambda}$ is a trainable parameter almost compatible with $\theta_{c_\mathrm{out},c_\mathrm{in},\lambda}$ and $\beta_\lambda$ is a trainable parameter defined for the $\lambda$-th component in a layer. The newly introduced parameter $\beta_\lambda$ adjusts the scale of the coefficients for its corresponding component. This modification acts as a restriction so that coefficients, especially those for components that do not have much effect, do not become too large. The increase in the number of trainable parameters is $\Lambda$. Since the number of $\{\theta_{c_\mathrm{out},c_\mathrm{in},\lambda}\}$ in a layer is $C_\mathrm{out} C_\mathrm{in} \Lambda$, this increase is negligibly small. Even in the case of $C_\mathrm{out} = 32$ and $C_\mathrm{in} = 3$, which is one of the worst examples, the increase rate is only approximately 1%. We initialize $\beta_\lambda$ according to $s_\lambda$, where $s_\lambda$ is the score of the $\lambda$-th component. Scores for N-Jet components are defined similarly to those for OtX in our ablation study.
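The factorization of a coefficient into a per-filter parameter and a per-component scale, together with the parameter overhead, can be sketched as follows. The initial values of the per-component scales are placeholders chosen only for this example (the initialization rule based on the scores is not reproduced here), and the function name `scaled_coefficients` is hypothetical.

```python
import numpy as np

def scaled_coefficients(alpha, beta):
    """theta[o, i, l] = alpha[o, i, l] * beta[l] via broadcasting."""
    return alpha * beta[None, None, :]

c_out, c_in, n_components = 32, 3, 9
rng = np.random.default_rng(0)
alpha = rng.normal(size=(c_out, c_in, n_components))
beta = np.linspace(1.0, 0.1, n_components)      # placeholder initial scales
theta = scaled_coefficients(alpha, beta)

# Relative increase in trainable parameters caused by the extra scales
increase_rate = beta.size / alpha.size          # n_components cancels out
```

The overhead equals $1/(C_\mathrm{out} C_\mathrm{in})$, which is $1/96$ for the quoted worst case and shrinks further for wider layers.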

VI. EXPERIMENTS
In this section, we keep the model structure of neural classification models but replace the convolution weights and compare validation results. Getting a better result from a method indicates that the method can find better convolution weights. Learning rates for training are tuned for the original models that obtain their convolution weights as arrays of trainable parameters. We use hyperparameters tuned for the original even when training weights on other representations. All experiments are done on three random seeds, and we show the average value over the three trials.

A. SETTINGS
We use TIMM [20] implementations as baseline classifier models. For filter architectures other than the baseline, we initialize convolution filter parameters so that the convolution weights have the same variance as the baseline. We experiment with classification on the CIFAR-100 benchmark dataset [23]. Training images are resized to 192 × 192 pixels with distorted bounding box crops and are randomly flipped horizontally [19]. Validation images are center-cropped to 224 × 224 pixels after being resized to 256 × 256 pixels by bicubic interpolation. After the transformations, we standardize image values for each channel by the mean and the standard deviation of the channel over the dataset. The mini-batch size is set to 32, and we train the models for 90 epochs. The loss function is the cross-entropy loss with label smoothing [24], and the smoothing parameter is set to 0.1. We employ the standard stochastic gradient descent (SGD) optimizer with Nesterov's accelerated gradient method, with the momentum set to 0.9. Learning rates follow the cosine decay schedule after a linear warmup over 5 epochs. We apply a Sharpness Aware Minimization (SAM) [25] step once every 5 training steps.
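The learning-rate schedule described above can be sketched in plain Python. The peak learning rate, the number of steps per epoch, the per-step (rather than per-epoch) update of the rate, and the choice of which step in every group of five receives the SAM update are all assumptions for this illustration; the function names are hypothetical.

```python
import math

def learning_rate(step, steps_per_epoch, base_lr, epochs=90, warmup_epochs=5):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

def is_sam_step(step, period=5):
    """Assumed convention: the first step of every `period` steps uses SAM."""
    return step % period == 0

steps_per_epoch = 1563      # hypothetical: 50,000 images / batch size 32, rounded up
base_lr = 0.1               # hypothetical peak learning rate
schedule = [learning_rate(s, steps_per_epoch, base_lr)
            for s in range(90 * steps_per_epoch)]
```

Because the schedule depends only on the step index, the same function can be reused unchanged when the standard convolutions are swapped for another filter representation.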

B. COMPARISON AGAINST THE BASELINE AND N-JET
We compare the OtX architecture against the baseline and a modified N-Jet with trainable Gaussian scales. The initial filter scale $\sigma$ is set to 1. We denote the modified N-Jet with $\Lambda$ components as N-Jet-$\Lambda$, and the notation ''OtX-$\Lambda$'' is defined similarly for the OtX architecture. We choose $\Lambda = 10$ for N-Jet, which is the best $\Lambda$, and $\Lambda = 9$ for OtX, which uses almost the same number of trainable parameters as standard $3 \times 3$ filters. We experiment on four classification models: VGG16 [19], DenseNet-121 [26], EfficientNetV2-S [27], and NFNet-F0 [16]. Table 1 shows the results. Models with OtX convolutions outperform those with standard convolutions in the cases where the models could be trained. Additionally, training on OtX is more stable than on N-Jet when the hyperparameters are tuned for standard convolutions. Accordingly, standard convolutions can be replaced with OtX ones without modifying the hyperparameters, whereas employing N-Jet requires further hyperparameter tuning even if good hyperparameters for the model are known. Despite this stability, OtX performs at least comparably with N-Jet. Thus, OtX can be a candidate for an SRF component system to obtain better filters without tuning any hyperparameters.

C. EFFICIENCY OF COMPONENTS
We train NFNet-F0 on N-Jet and OtX with various $\Lambda$. This experiment is intended to reveal two things: the efficiency of the components of each system, and the trade-off between the expansion of the representable filter space and the difficulty of optimizing filters as the number of components increases. Table 2 shows the results. Whereas N-Jet-6 performs worse than N-Jet-10, OtX-6, which has the same number of components as N-Jet-6, performs almost as well as OtX-9 and N-Jet-10. This comparison indicates that the six OtX components are more efficient for neural convolutions than the six N-Jet components. Regarding the second point, the trade-off, OtX-6 seems to have the best number of components, and settings with a larger $\Lambda$ obtain worse filters. However, OtX-9 has two advantages over OtX-6. The first advantage is the maximum accuracy. The maximum accuracy with OtX-9 is 82.7 %, and that with OtX-6 is 82.4 %. This indicates that the exploration of coefficients for better filters sometimes succeeds even with a larger $\Lambda$. The second advantage is the stability of the training. Among the three trials, OtX-6 failed to train in one trial, while OtX-9 was trained stably in all three. Thus, OtX-9 is more suitable when one wants to obtain better filters through multiple trials or to avoid training failures.

D. ABLATION STUDY
We conduct an ablation study to reveal which differences between N-Jet and OtX contribute to the performance. There are three main differences between N-Jet and OtX: orthogonality of the components, employment of radially symmetric or rotated line-symmetric components, and application of the component scaling. We run experiments on all combinations to test the efficacy of each property. Table 3 shows the results. Although each of the three properties influences the results, finding a general rule is difficult since none of the three improves the accuracy regardless of the other two properties. For example, employing the component scaling works well in the OtX experiments, but it does not on N-Jet. In particular, comparing the results of Row 4 and Row 8 indicates that applying only the new types of symmetry to N-Jet decreases the accuracy, and that the combined use with the other two features, orthogonality and the component scaling, recovers the accuracy (Row 5). Thus, the three properties work inseparably, and their combination may be suitable for finding better filters.

VII. CONCLUSION
In this paper, we design a new SRF component system named ''OtX'' for better optimization of neural convolution filters. OtX is a modeled formulation of the implicit principal components of neural convolution filters, based on a PCA result of well-trained convolution filters. The designed components are radially symmetric or line-symmetric and approximately orthogonal to one another. Training convolution filters with OtX can find more efficient weights than the standard filters, which are represented as arrays of trainable parameters, without tuning the hyperparameters of most of the original standard models. When the original hyperparameters are used, OtX is more stable than the existing N-Jet, but further stability is required. Since OtX components are the principal ones of neural convolution filters, OtX is more suitable than N-Jet for representing filters with fewer components. More OtX components provide more stable training and a higher probability of finding better filters. We also propose a component scaling for OtX that improves OtX optimization to a degree comparable to a successfully trained N-Jet. Thus, our proposed method can be a useful SRF component system and can be applied even when applying SRF on N-Jet is difficult.