Study of Restrained Network Structures for Wasserstein Generative Adversarial Networks (WGANs) on Numeric Data Augmentation

Some recent studies have suggested using Generative Adversarial Networks (GANs) for numeric data over-sampling, i.e. generating synthetic samples to rebalance imbalanced numeric datasets. Compared with conventional over-sampling methods, taking SMOTE as an example, the recently proposed GAN schemes fail to generate augmentation results that distinguishably help classifiers. In this paper, we discuss the reasons for such failures and, based on them, theoretically study the restrained conditions between G and D.


I. INTRODUCTION
At present, multiple Generative Adversarial Network (GAN) schemes [1], [2] have achieved significant progress in generating images and enhancing the accuracy of classifiers, where some GANs can produce images that are almost indistinguishable under human visual examination. In the past two years, several GAN models have been proposed for numeric data augmentation, where a numeric dataset may be sampled so imbalanced that classifiers perform poorly because the numbers of training samples in different categories vary greatly, i.e. the positive samples outnumber the negative samples or vice versa [3]. Nowadays, such GANs have been applied to credit card fraud datasets [4]-[6] and a telecom fraud dataset [7].

(The associate editor coordinating the review of this manuscript and approving it for publication was Hao Luo.)
However, compared with conventional augmentation methods, taking the Synthetic Minority Over-Sampling Technique (SMOTE) [8] as an example, the GAN-based methods have not exhibited many advantages. Table 1 reports the AUC from a Random Forest classifier (RFC) as an example: the SMOTE-augmented data leads to a higher AUC under RFC on all four datasets [9]. In the experiments of Section IV, we find that conventional GANs cannot distinguishably improve the AUC under several classifiers.
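For reference, the SMOTE baseline used throughout this paper interpolates between a minority-class sample and one of its k nearest minority neighbours. Below is a minimal, illustrative sketch of that interpolation idea in plain Python (not the reference implementation; parameter names and the brute-force neighbour search are our own simplifications):

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate synthetic minority samples by interpolating between a
    point and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)

    def dist(a, b):
        # squared Euclidean distance is enough for ranking neighbours
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority)
```

Since each synthetic point lies on a segment between two minority samples, the generated data stays inside the convex hull of the minority class, which is both SMOTE's strength and its limitation compared with generative models.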
Through the study of GAN models [10], [11], we propose two hypotheses for the possible causes of the failures of GANs in generating numeric data. 1) Numeric datasets have lower dimensionality and stronger correlation between the values of each dimension. Numeric data is usually low dimensional, e.g. the Pima Indians Diabetes dataset has 8 dimensions and the SPECT Heart dataset has 22, in contrast to image datasets such as CIFAR-10 with 32 × 32 × 3 dimensions and MNIST with 28 × 28 × 1. When processing numeric datasets, GANs are very likely to overfit, hence a stronger restraint in the GAN may be required to avoid such overfitting. 2) Dimensions generally carry concrete meaning in a numeric dataset; the Pima Indians Diabetes dataset, for example, has columns for age and gender. In contrast, each value in an image is just a pixel with very little practical significance by itself. Hence, GAN-based augmentation requires a much stronger generator (G) that captures a more compact and accurate probability distribution in every dimension.
In exploring the causes of the failure of GAN-based methods, we further analyze the structure of GAN through a directed graphical model (DGM) representation [12]. The basic training process of GAN is shown in Figure 1 (a). We denote the generator and discriminator of GAN by G and D, both widely regarded as tensors. The loss function of GAN measures the distance between the generated data distribution and the real data distribution; the expectations over the generated and real data distributions are represented by E_G and E, respectively (see the related work for the concrete mathematical definitions). Mathematically, E and E_G are functions of D and G and hence can be considered functions of tensors. We may therefore simplify the tensors G, D and the functionals E, E_G to variables (as matrices) in a Hilbert space, on which every tensor and function is considered a transmission matrix transforming input vectors to outputs in different dimensions. Subsequently, in applying a DGM to describe GANs, the nodes are the variables D, G, E and E_G, and the edges represent the training optimization process. Similarly, we may define the restraint as a function on the networks, i.e. on the tensors G and D, and therefore as a node F_res. We will explain how the edges between G, D and F_res can improve network performance. We consider that such a specific restriction can effectively 1) prevent the over-fitting of G and D, and 2) train a stronger G that yields clearer distributions of the numeric samples. In the following, we present the DGM-based analysis of GAN in Section III.A, then propose how to quantify such a descriptive restraint in Section III.B. Finally, in Section III.C, we give several GANs with available restrained structures: isomorphic (IWGAN), mirror (MWGAN), and self-symmetric WGAN (SWGAN).
In Section IV.A, we confirm our two conjectures through experiments on four widely studied datasets.
1) The restrained structure can greatly improve the performance of GANs. We compare the restrained WGANs (SWGAN, MWGAN, IWGAN) with three other GANs: the conventional WGAN [11], the adapted GAN proposed in 2017 [6], and GAN-DAE from 2018 [4]. Besides, the most widely used over-sampling method, SMOTE [8], is employed in the evaluation. Experiments show that the restrained WGANs improve on WGAN in 17/20 groups of experiments. Moreover, IWGAN outperforms all others in 15/20 groups; in the remaining five groups, the AUC of IWGAN takes three second-best and two third-best positions. Besides, multidimensional scaling (MDS) [13] is introduced to eliminate the impact of individual datasets and evaluate the AUC as a composite index. IWGAN generally decreases the MDS distance by 20% to 40%. The convergence speed of IWGAN is also increased, and the initial error of the loss function is reduced.
2) The SRC can effectively measure the strength of the restraint between G and D in a GAN. We first compute the SRCs of three WGANs with differently restrained G-D networks and then compare their performance in the experiments. We find that the SRCs of the three restrained WGANs are higher than that of the conventional WGAN, and that when G and D are isomorphic, the SRC is the highest and the generated data best improves the classifier. That is, the mutual restraint of G and D is strongest in the isomorphic construction.
The remainder of the paper is organized as follows. Section II gives an overview of previous related work on GANs and data augmentation. Section III presents the proposed restrained GAN and its analysis through the DGM in detail, together with a quantitative method to measure the SRC between G and D in GANs. Section IV shows the improved performance of the restrained WGANs, especially IWGAN, against the original data, SMOTE, and other GANs, using five classifiers on four representative datasets. Finally, Section V presents the conclusions and outlines possible directions for future research.

II. RELATED WORKS
GAN consists of two models, a generator model defined as G and a discriminator model defined as D, and is designed on the idea of competition [9]. The objective of G is to confuse D, while the objective of D is to distinguish the instances generated by G from the true instances in the dataset. More precisely, G: Z → X, where Z is the noise space of arbitrary dimension d (a hyper-parameter) and X is the data space, aims to capture the data distribution; D: X → [0, 1] estimates the probability that a sample came from the data distribution rather than from G. G and D compete in a two-player min-max game with value function

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))]. \quad (1)

GANs are mainly used in the field of images to enhance the accuracy of classifiers [1], [2]. However, the above methods suffer from a series of problems, such as mode collapse and a generator loss function that does not converge. In the literature on improving the GAN model, most works discuss 1) the structure of the model, i.e. multiple discriminators D and multiple generators G trained against each other, such as DualGAN [14], CycleGAN [15], and DiscoGAN [16]; 2) improved loss functions, such as WGAN [11] and LSGAN [17]; 3) conditioning on additional information to generate samples of a given class, such as cGAN [18] and infoGAN [19]. The Wasserstein GAN (WGAN) proposed by Arjovsky et al. [11] largely solved the problem of GAN training instability. WGAN uses the Wasserstein (Earth-Mover) distance

W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|], \quad (2)

where \Pi(P_r, P_\theta) denotes the set of all joint distributions \gamma(x, y) whose marginals are respectively P_r and P_\theta. Through the Kantorovich-Rubinstein duality, the loss function of WGAN is

\min_G \max_{\|D\|_L \le 1} \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{z \sim P_z}[D(G(z))]. \quad (3)

In the process of GAN training, we write \mathbb{E}_{x \sim P_r} as E and \mathbb{E}_{z \sim P_z} as E_G in this paper. Compared with GAN, the mode-collapse problem is largely solved by WGAN, ensuring the diversity of the generated samples.
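The WGAN objective above reduces to simple sample averages of the critic's scores. A minimal numeric sketch of the two losses (our own illustrative function names; gradient steps and the Lipschitz constraint on D are omitted):

```python
import numpy as np

def critic_loss(d_real, d_fake):
    """The WGAN critic maximises E[D(x)] - E[D(G(z))]; written as a
    loss to minimise, the expression is negated."""
    return -(np.mean(d_real) - np.mean(d_fake))

def generator_loss(d_fake):
    """The generator minimises -E[D(G(z))], i.e. it pushes the critic's
    scores on generated samples up."""
    return -np.mean(d_fake)

# toy critic outputs on a mini-batch of real and generated samples
d_real = np.array([0.9, 0.8, 1.1])
d_fake = np.array([0.1, 0.2, 0.0])
```

Note that in the paper's notation the E term corresponds to `np.mean(d_real)` and the E_G term to `np.mean(d_fake)`; the critic is constrained to be 1-Lipschitz (by weight clipping in the original WGAN), which this sketch does not show.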
Recently, GANs have been used to generate samples to improve classifier performance in credit card fraud detection [4]-[6] and on other imbalanced datasets [20]. Zheng et al. [7] adopted a deep denoising autoencoder to learn the complicated probabilistic relationships among the input features effectively and employed adversarial learning, establishing a min-max game between a discriminator and a generator to accurately discriminate between positive and negative samples in the data distribution. Larsen et al. [21] presented an autoencoder that leveraged learned representations to better measure similarities in data space; by combining a variational autoencoder with a generative adversarial network, it can use the learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective. However, current GAN methods yield unsatisfactory classifier performance improvements on numeric datasets compared with SMOTE [8], which is widely applied to data generation. Through discussion of the advantages and disadvantages of the GAN model [10], [11] and reflection on our experimental observations, the lower effectiveness of GANs could be due to two factors, the dimensional differences and the representations of each dimension, as discussed in the introduction.
Since GAN was put forward, judging its generation quality has remained a problem. Initially, the generation quality of a GAN was evaluated by humans judging whether the generated samples resemble the real samples. However, across different GANs it is not suitable to subjectively compare which method produces better results, so a quantitative evaluation method is needed. In recent years, evaluation indexes such as the Inception Score [22], Mode Score [23], and Fréchet Inception Distance (FID) [24] have been proposed to evaluate GANs. However, these indicators all evaluate the quality of GAN-generated images and cannot be used to evaluate the quality of numeric data generation.

III. PROPOSED METHOD
This section analyzes the structure of GAN through directed graphical model (DGM) representation [12] and describes the proposed method for generating high-quality numeric data based on the restrained structure.

A. RESTRAINED WGANs AND IMPROVEMENT
In this section, we abstract the function space of the tensors G and D and the loss function into a Hilbert space, because G and D, as generator and discriminator, are widely considered tensors, and E, E_G are the expectations of the real and generated data distributions computed by those tensors in the loss function. Without loss of generality, we consider it reasonable to simplify the tensors G and D, the restrained function F_res, and E and E_G into the same Hilbert space, so that a calculation, represented as an edge in a DGM, can be drawn, and a graphical model can be set up to represent the relationships and training process of the GANs. It is noticeable that the DGM in this section is only used to discuss the influence of restrained network structures in G-D training and does not involve numerical calculation; we only aim to analyze, theoretically, why a restrained structure in a G-D pair can improve the performance of a GAN.
In the training process of GAN, we consider adding a relationship between G and D, for example requiring G and D to conform to some structural relationship. In this way, G and D are defined as satisfying an affine transformation in the Hilbert space. Therefore, we can define a restrained function acting on the G → D Hilbert space, given by

D = F_res(G),

which makes the distribution of the generated data closer to the real distribution, especially for structural data with strong restraints. Correspondingly, in the DGM we introduce a hidden variable res; F_res on the DGM is shown in Figure 2 (a), and the discriminator function becomes res(G). Note that we do not change the learning process of GAN: the GAN still tries to solve for the optimal D (only now D = res(G)), not for the hidden variable res, during training. By drawing in res, two flows D ↔ res and G ↔ res are added to the DGM, and the whole process of GAN can be represented by Figure 2 (b).
We further employ the DGM to expound the training mechanism of GAN, whose whole process can be represented by the DGM in Figure 1 (a). For the learning of G and D, we define a function f to represent the learning process of G and D given the observed random variables. After factorization in the DGM, the learning functions are f(G|D) and f(D|G). By adding the restriction between G and D to the learning process, the learning functions become the restrained forms f_res(G|D) (Equation (9)) and f_res(D|G) (Equation (10)). Due to the edge F_res → res between G and D, f(G|D) for learning G and f(D|G) for learning D each contain an extra restraint of D ↔ G, namely the terms f_res(G|D) in Equation (9) and f_res(D|G) in Equation (10) added to the learning process of the DGM. From this point of view, by adding mutual restraints of D ↔ G to G-D training, we consider that over-fitting can be avoided to a certain extent and a stronger G can be trained, which is also verified by the experiments in Section IV.

B. QUANTITATIVE INDICATORS OF RESTRAINED STRUCTURE
As discussed in Section III.A, we would like to study the restraints on a network structure, rather than a specific loss function for a particular network. Such restraints are obviously difficult to quantify. In other words, we can design many kinds of restraints, as presented in Section III.C below, but we can hardly judge which restraints are more effective except by exhaustively carrying out parameter-search and cross-validation experiments. Hence, in this paper we propose a quantitative method, motivated by Tian et al. [25], to evaluate the Similarity of the Restrained Condition (SRC) between the G and D in a GAN. Let f_j and f_{j'} be the weight vectors of each node in the neural networks of D and G, respectively, and let \hat{f}_j and \hat{f}_{j'} be their normalizations,

\hat{f}_j = \frac{f_j}{\|f_j\|}, \qquad \hat{f}_{j'} = \frac{f_{j'}}{\|f_{j'}\|}, \qquad \rho_{jj'} = \langle \hat{f}_j, \hat{f}_{j'} \rangle,

where the correlation coefficient \rho_{jj'} is obtained as the inner product of \hat{f}_j and \hat{f}_{j'}. The SRC of a restrained network structure is then computed from these correlation coefficients over the corresponding node pairs of D and G.
The higher the SRC, the stronger the restraining effect between D and G in the GAN. The experiments in Section IV also show that when G and D are isomorphic, the SRC is the highest and the generated data best improves the classifier; that is, the mutual restraint D ↔ G of the isomorphic construction in GAN is the strongest. The experiments further show that the SRC can effectively measure the strength of restraints: the higher the SRC, the stronger the restraint in the GAN and the better the effect on data augmentation. See the evaluation of the SRC experiments in Section IV.B.
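The normalize-then-inner-product computation described above can be sketched in a few lines of numpy. The averaging over node pairs is our assumption about how the per-pair coefficients are aggregated into a single SRC value; the function name and data layout are illustrative:

```python
import numpy as np

def src(weights_d, weights_g):
    """Sketch of the SRC: normalise each corresponding node's weight
    vector in D and G, take their inner product (the correlation
    coefficient rho), and aggregate by averaging over the pairs."""
    rhos = []
    for fd, fg in zip(weights_d, weights_g):
        fd = fd / np.linalg.norm(fd)  # \hat{f}_j
        fg = fg / np.linalg.norm(fg)  # \hat{f}_{j'}
        rhos.append(float(np.dot(fd, fg)))
    return sum(rhos) / len(rhos)

# identical weight vectors (an isomorphic-like pairing) give the
# maximal per-pair coefficient of 1.0
w = [np.array([1.0, 2.0]), np.array([3.0, 1.0])]
```

Under this sketch, a perfectly matched (isomorphic) G-D pair attains the maximal value, while unrelated weight vectors drive the score toward zero, consistent with the paper's claim that the isomorphic structure yields the highest SRC.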

C. RESTRAINED STRUCTURAL CANDIDATE SOLUTIONS
We design several restrained network structures between G and D in WGAN for data generation. The structures of the isomorphic (IWGAN), mirror (MWGAN), and self-symmetric WGAN (SWGAN) are shown in Figure 3, where we define the isomorphic, mirror, and self-symmetric structures for the G-D pair. The isomorphic structure is defined such that the two networks have the same number of layers, each layer has the same number of nodes, and every two neighboring layers have the same connections. The mirror structure is defined as a mirror-symmetrical network structure, and the self-symmetric structure as a network structure symmetric to itself.
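Describing each network by its list of layer widths, the three structural conditions can be stated as simple predicates. This is an illustrative sketch of the definitions only (function names are ours; it ignores the same-connection requirement, which layer-width lists cannot capture):

```python
def is_isomorphic(g_layers, d_layers):
    """IWGAN condition: same depth and the same width in every layer."""
    return g_layers == d_layers

def is_mirror(g_layers, d_layers):
    """MWGAN condition: D's layer widths are G's in reverse order."""
    return g_layers == list(reversed(d_layers))

def is_self_symmetric(layers):
    """SWGAN condition: a network whose layer widths read the same in
    both directions."""
    return layers == list(reversed(layers))
```

For example, a G with widths [8, 16, 8] paired with an identical D is isomorphic (and, being a palindrome, each network is also self-symmetric), while G = [8, 16, 32] with D = [32, 16, 8] forms a mirror pair.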
In the candidate solutions, the three restrained network structures satisfy a one-to-one mapping from the G space to the D space. Assuming the G and D functions follow some distributions in the two spaces, the three restrained network structures are, in theory, the simplest structures for transforming between these two distributions. Other structures may not satisfy the one-to-one mapping, so that finding the transformation probability distributions (i.e. f(G|D) and f(D|G)) of the G ↔ D pair is likely to become an NP-hard problem without a stable optimal solution. Moreover, an uncertain G leads to instability of the distribution P_x it determines, i.e. an unstable output structure. These candidates do not exhaust all possible restraints, but they do have very standard and enlightening structures; in follow-up studies, researchers can design new structures according to these structures and the SRC we give.

IV. EXPERIMENTS AND RESULTS
This section presents the experimental study to evaluate the performance of the restrained WGANs for generating numeric data. The experiment setup is introduced first, then the results and discussions are presented.

A. EXPERIMENTS
The experiments run on the following specification: an i7 CPU, 32 GB RAM, and a TESLA P100 GPU with 11 GB memory, using Python 2.7 and TensorFlow 1.0. The evaluation study is designed to determine the relative performance of the three restrained WGANs, SMOTE, and the other GANs, using a variety of classifiers and the AUC metric, on four datasets. To this end, experiments are conducted as follows:

1) DATASETS
We conduct experimental analyses based on four datasets, which are obtained from the University of California Irvine (UCI) machine learning repository [9]. The four binary classification datasets with various imbalance ratios are Australian Credit Approval data, German Credit data, Pima Indians Diabetes data, and SPECT heart data, as shown in Table 2.

2) EVALUATION METRICS
Inspired by the papers [13], [26], [27] on evaluating data generation performance in classification, we generate data for the four datasets and evaluate the data generation effect via AUC. We then introduce MDS [13] to eliminate the effect of different datasets. In a similar way to the work by Caruana and Niculescu-Mizil [28], for each metric we build a 25 × 4 table, where 4 is the number of datasets in our experiments, and each entry (i, j) represents the score of model i on dataset j. We calculate the Euclidean distance between each pair of rows and then perform multidimensional scaling on the matrix of these pairwise distances between models to obtain a projection onto a 2-dimensional space [29]. Moreover, we calculate the Euclidean distances to the optimal point in the MDS space for the four datasets and integrate them into one scalar, to output a general description regardless of dataset dimensions. All classifiers are applied with the default parameter settings recommended by scikit-learn [30] for all evaluated models.
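The MDS projection step described above can be sketched with classical (Torgerson) MDS: square the distance matrix, double-centre it, and take the top eigenvectors. A minimal numpy sketch under the assumption of classical MDS (the paper does not specify which MDS variant [13] uses), with a toy model-by-dataset score table standing in for the 25 × 4 table:

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Classical (Torgerson) MDS: embed points so that their pairwise
    Euclidean distances approximate the given distance matrix."""
    n = dist.shape[0]
    d2 = dist ** 2
    j = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    b = -0.5 * j @ d2 @ j                 # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)        # ascending eigenvalues
    order = np.argsort(vals)[::-1][:dim]  # keep the largest `dim`
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# toy score table: rows are models, columns are datasets
scores = np.array([[0.9, 0.8],
                   [0.7, 0.6],
                   [0.5, 0.9]])
# pairwise Euclidean distances between the models' score rows
dist = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
emb = classical_mds(dist)  # 2-D coordinates for each model
```

The Euclidean distance from each embedded model to the embedded optimal point then yields the single scalar per model that Table 4 reports.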

4) EVALUATED MODELS
We compare the proposed IWGAN with the state of the art in augmenting data by GANs, including the adapted GAN proposed in 2017 [6], GAN-DAE from 2018 [4], and the conventional WGAN [11], with SMOTE [8] as the baseline. Among these methods, WGAN without the isomorphic structure is used as another baseline for IWGAN.

5) STRUCTURE SETTING STRATEGY
Finding the best network structure with appropriate hyper-parameters is a challenging and exhaustive task in practice; in our study, in particular, a suitable restrained G-D pair needs to be well designed. Moreover, the best parameter setting differs for each dataset, limited by its data distribution. In our work, we have set up and tried a large number of parameter designs; Table 3 lists the most representative parameters of each model on the SPECT Heart dataset. Partitions: in order to evaluate the performance of the algorithms and tune the hyper-parameters of the GAN algorithms based on the parameter set of [31], 10-fold cross-validation is applied. IWGAN outperforms all others in 15/20 groups of experiments; in the remaining five groups, the AUC of IWGAN takes three second-best and two third-best positions. Among the datasets and classifiers, the SPECT dataset is the least sensitive to the classifier and augmentation methods, while the other three datasets show distinguishable results. Regarding classifiers, GBC outputs relatively the best results on all datasets, while KNN produces the worst. Affected by the datasets and classifiers, GAN-DAE and GAN generate unstable augmented data: for example, on all datasets classified by RFC, the data generated by GAN-DAE and GAN give worse results than those generated via SMOTE, whereas under SVM, the GAN-DAE- and GAN-generated data give better results than SMOTE on three of the four datasets. In our experiments, IWGAN outperforms all other GANs in all test cases except RFC on the Pima dataset. Meanwhile, IWGAN also outperforms the SMOTE baseline, except for RFC on the Pima dataset and SVM on the SPECT dataset.
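The AUC values compared throughout this section can be computed with a simple rank-based estimator: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal, illustrative sketch (our own function; production code would use a library implementation):

```python
def auc_score(labels, scores):
    """Rank-based AUC: the fraction of positive/negative pairs in which
    the positive sample receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the AUC looks only at the ranking of scores, it is insensitive to the class imbalance of the test set, which is why it is the natural metric for comparing augmentation methods on these imbalanced datasets.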
In the above experiments, in order to eliminate the impact of the datasets, we introduce MDS [13] and calculate the Euclidean distance from the optimal point in MDS space for AUC. The distances reported in Table 4 corroborate that IWGAN performs better than the other GANs and SMOTE for all classifiers; that is, IWGAN gives the optimal values under the various classifiers. The other GANs give sub-optimal values under ANN and KNN, while SMOTE gives sub-optimal values under RFC, GBC, and SVM. Meanwhile, compared with the sub-optimal solution, IWGAN generally decreases the MDS distance by 20% to 40%; compared with the other three GANs, it generally decreases it by more than 50%.

2) EVALUATION ON THE SRC
We select the final trained baseline WGAN and the three restrained WGANs, calculate the SRC, and obtain Table 5 on the four datasets. Table 5 compares the various related WGANs in terms of SRC on the four datasets: the three restrained WGANs score higher in SRC than the conventional WGAN baseline. Besides, we find that when G and D have the isomorphic structure, the SRC is the highest among the WGAN structures. In particular, IWGAN is 37% and 22.5% higher than the other WGAN structures on Dataset 3 (Pima Indians Diabetes) and Dataset 4 (SPECT Heart), respectively. Analyzing the experimental results, the mutual restraint of D ↔ G explains why adding a restrained structure can generate stronger data for classifiers, improve classification performance, and yield faster convergence. Moreover, the mutual restraint of D ↔ G is strongest in the isomorphic construction.

3) EVALUATIONS OF CONVERGENCE SPEED
In the 10-fold cross-validation experiments, GAN and GAN-DAE occasionally do not converge, which is a widely discussed disadvantage [10], [11]. Therefore, in the discussion of convergence, we only compare the convergence of the generator G of WGAN and IWGAN, as shown in Figure 5. IWGAN produces smaller initial loss values on the Australian Credit Approval, German Credit, and SPECT Heart datasets; the initial error of IWGAN is only about 1/10 of that of WGAN, while the initial error on the Pima Indians Diabetes dataset is similar to WGAN's. In terms of speed, the convergence rate on the Pima Indians Diabetes dataset increases significantly, the convergence on the Australian Credit Approval and SPECT Heart datasets improves slightly and is relatively stable, and the German Credit dataset's convergence rate is slightly reduced. In general, on the Australian Credit Approval, German Credit, and SPECT Heart datasets, the initial learning of G in IWGAN is close to the global optimum; on the Pima Indians Diabetes dataset, although the initial G is not close enough to the optimum, it can quickly approach it through the restraints. In summary, IWGAN with the isomorphic structure can enhance convergence performance.
These two types of convergence behavior arise because the G-D pairs, f(G|D) and f(D|G), have different restraint performance, as explained by the DGM. In one case, f(G|D) results in larger optimization steps, which leads to fast convergence but makes the loss function oscillate near the convergence point, as shown in Figure 5 (b). In the other case, the optimum is approached quickly even though the convergence steps are small. Both cases reflect the validity of the f(G|D) restraints in learning G; we are still studying the relation between the network structure and the optimization step of f(G|D). It is noticeable that the blue WGAN line in Figure 5 is lower than the red line, which means a smaller E_G loss but not the optimal value, only a smaller feedback error to D; it does not correspond to the better solution of that iteration. From Equation (3), E_G is the latter term, but the optimal solution corresponds to the global minimum of the distribution error, i.e. of the whole formula. Our previous experiments have proved that IWGAN is more effective; in this experiment, we observe the convergence rate and stability.

4) EVALUATION OF ISOMORPHIC STRUCTURE
Some isomorphic functions may exist even if G and D do not satisfy the requirements of the same number of layers and the same layer sizes. We set up different IWGANs with relatively isomorphic structures that have the same number of layers but whose node counts in D and G differ by ±10%, ±20%, and ±30%. In Figure 6, the spots represent the AUC of each data augmentation algorithm under various classifiers on the German Credit dataset, and the dotted lines represent the trends of the different IWGANs. RFC shows an evident trend: the smaller the difference in the number of nodes between G and D, the higher the AUC of the RFC classifier. The other three classifiers (KNN, GBC, ANN) show a less evident trend, while SVM has a relatively stable AUC across the different IWGANs; the reason for SVM's stable performance is still being explored. In conclusion, the experiments show that the same layers and nodes best realize the isomorphic structure, and that settings which are not entirely isomorphic can still improve the AUC under various classifiers. In other words, referring to Section III, if the theoretical function decomposition corresponds to the isomorphism of some sub-functions, it may also generate better data. This motivates us to design a more effective G-D pair, to improve performance on numeric data augmentation and even image generation, in future work.

V. CONCLUSION AND FUTURE EXTENSIONS
This paper proposed restrained GANs based on isomorphic (IWGAN), mirror (MWGAN), and self-symmetric (SWGAN) WGAN structures for data augmentation on four publicly available UCI datasets. The DGM analysis theoretically shows that the restrained structure between G and D provides an additional restriction when learning G from D, and vice versa. Besides, for the otherwise non-quantifiable restrained network structures, we propose a quantitative method to measure the SRC between the G and D of a GAN.
Under the common metric, AUC on four datasets with five classifiers compared against three other GANs and the conventional SMOTE method, the evaluation adds up to 20 groups of experiments. The experiments show that the restrained WGANs improve on WGAN in 17/20 groups, and IWGAN outperforms all others in 15/20 groups; in the remaining five groups, the AUC of IWGAN takes three second-best and two third-best positions. The convergence rate of IWGAN is increased, and the initial error of the loss function is reduced. MDS [13] is also introduced to eliminate the impact of the datasets and evaluate the AUC as a composite index; IWGAN generally decreases the MDS distance by 20% to 40%. Subsequently, we set up different restrained WGANs as candidate solutions with isomorphic, mirror, and self-symmetric structures. Through several experiments, we find that the restrained WGANs have a higher SRC than the conventional WGAN, with IWGAN's SRC the highest among the WGAN structures. Also, some isomorphic functions may exist even if G and D do not have the same numbers of layers and layer sizes: setting up different IWGANs with relatively isomorphic structures, we find that the smaller the difference between the numbers of nodes in G and D, the better the effect. If the theoretical function decomposition corresponds to the isomorphism of some sub-functions, it may also yield improvement; even if the isomorphism is technically imperfect or corresponds to other mapping relationships, it is still partially valid. This motivates us to design a more effective G-D pair in future work, and relevant follow-up studies may inspire us to create other forms of GAN.
In future work, we will further study other restrained structures over partial layers of GANs, to improve the performance of GAN-based models on numeric data augmentation and even image generation.