Face Recognition via Multi-Level 3D-GAN Colorization

Rapid development in sketch-to-image translation methods has boosted investigative procedures in law enforcement agencies. However, the large modality gap between manually generated sketches and photographs makes this task challenging. Generative adversarial networks (GANs) and encoder-decoder approaches are usually employed to accomplish sketch-to-image generation with promising results. This paper targets sketch-to-image translation with heterogeneous face angles and lighting effects using a multi-level conditional generative adversarial network (cGAN). The proposed multi-level cGAN works in four different phases: three independent cGAN networks are incorporated separately into each stage, followed by a CNN classifier. The Adam stochastic gradient descent mechanism was used for training with a learning rate of 0.0002 and momentum estimates $\beta_1$ and $\beta_2$ of 0.5 and 0.999, respectively. The multi-level 3D-convolutional architecture helps to preserve spatial facial attributes and pixel-level details. The 3D convolution and deconvolution guide G1, G2, and G3 to use additional features and attributes for encoding and decoding, which helps to preserve the direction and posture of targeted image attributes and the spatial relationships among the whole image's features. The proposed framework processes the 3D convolution and 3D deconvolution using vectorization; this takes the same time as 2D convolution but extracts more features and facial attributes. We used pre-trained ResNet-50, ResNet-101, and Mobile-Net to classify the high-resolution images generated from sketches. We have also developed a state-of-the-art Pakistani Politicians Face-sketch Dataset (PPFD) for experimental purposes. Results reveal that the proposed cGAN framework outperforms existing methods with respect to accuracy, structural similarity index measure (SSIM), signal-to-noise ratio (SNR), and peak signal-to-noise ratio (PSNR).


I. INTRODUCTION
Crime has undergone a drastic revolution, which demands enhanced security of forensic files and records. There is an increased need to use technological measures to identify, detect, and recognize suspects. For safety- and security-related purposes, biometric recognition is necessary. One of the most common biometric techniques is face recognition, as the face is the most convenient and reliable means of identification. Face-sketch recognition is a strong face identification domain when a photograph is not available. Face recognition systems have been evolving over the past few decades, particularly with the availability of large-scale databases and access to sophisticated hardware. Large-scale face recognition challenges such as MegaFace [1] and the IARPA Janus Benchmark [2] provide further opportunities for bridging the gap between unconstrained and constrained face recognition. Sketch recognition is also an emerging trend in law enforcement agencies for identifying suspects [3], [4]. Sketch recognition problems involve automated matching and generating coloured images from sketches [5]. There are two ways to recognize suspects from sketches: 1) convert all the database images to sketches and compare the query sketch with the sketch-database images, or 2) colourize the sketch and then find the colourized face image in the database. The first way is easy and less complex, but too much information is lost during the conversion of images to sketches, so good and accurate results cannot be obtained. On the other hand, converting the sketch into a coloured face image is complex and challenging, but it is more effective for identifying the suspect.
Generative Adversarial Networks (GANs) have been used to colour images and may also create sketches from coloured images. Due to the rapid development of GAN models [6], [7], [8], [9], the quality and efficiency of sketch-to-coloured-image translation have improved significantly [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. Currently, translations from sketch-to-image or image-to-sketch are extensively used in law enforcement agencies and digital image entertainment [20], [21], [22], [23], [24], [25], [26], [27], [28]. Zhang et al. [21] developed an architecture based on the dual-transfer face sketch technique to improve the identification performance of sketched images. The dual-transfer sketch approach comprises an intra-domain and an inter-domain transfer process, used to address identity-specific information loss and to retrieve common facial structures. Unlike dictionary-based traditional approaches, Zhang et al. [29] developed an end-to-end deep convolutional neural network (CNN) model for image-to-image translation, while Isola et al. [30] worked on conditional GANs by adding a new condition y to traditional GANs; the condition y is used along with the input layer to handle the mapping between the generated image and the input image. Zhang et al. [20] developed a generator for face sketching that addresses the problem of sketch generation using smoothing properties; this work performs well in reducing high-frequency loss. Zhang et al. [22] developed an automatic sketch generator comprising rough, fine, and finer face parts; the model colours the face sketch using these parts with gentle and deep detail features. Probabilistic graphical models were used by Zhang et al. [23] to develop a face sketch architecture; they considered the generated sketch pixels and the ground truth from training data to generate faces with fine detail features. Zhang et al. [24] addressed heterogeneous lighting effects by developing a cascaded face sketch synthesis model comprising a cascaded low-rank representation and numerous feature generators. The feature generators extract finely detailed features under different illumination, while the cascaded low-rank representation reduces the distance between the synthesized facial sketch and the corresponding ground truth. To improve the efficiency of face sketching, Wang et al. [25] used new random sampling instead of an online KNN search method; results show that this technique performs well in terms of quality and efficiency. Current face sketching techniques cannot select neighbouring features during face synthesis; to overcome this, Bayesian techniques were used by Wang et al. [26] with a weight computation model and a neighbour selection model. This method competes with existing techniques in terms of subjective perception and objective evaluation. However, recent research [3], [4], [28], [31], [32], [33], [23], [34], [35] ignores the 3D-convolutional process for sketch-to-image colorization and the control of pixel-level facial attributes during sketch-to-coloured-face translation. Existing sketch colorization and face sketching techniques cannot cope with heterogeneous lighting effects and fail to take into account neighbouring features during face synthesis. Researchers have focused on achieving a balance between the target image and the generated image so that the latter looks more realistic and natural.
However, due to limited facial feature selection, realistic and natural image generation has not been achieved effectively. If the number of features is increased directly, the desired results may be achieved, but this increases the complexity and depth of the model, which requires more computing power and other computational resources. An alternative way of increasing the features is to increase the depth of the convolutional layers and apply 3D convolutions instead of 2D. The existing research focuses on 2D convolutions [36], [37] instead of 3D, which reduces the ability of the learned loss function to preserve the spatial facial attributes of the input image. Moreover, existing research works do not provide ground truth that could authenticate performance using cross-match analysis.
Most of the time, the photos of suspects obtained from surveillance cameras are poor in quality, so forensic experts draw face sketches of suspects and colour them to retrieve them from the database. To enhance retrieval performance and efficiency, we can synthesize face sketches from photos in the database and then match them with the suspect's sketch.
To overcome the problems mentioned above, a multi-level 3-dimensional conditional generative adversarial network (3D-cGAN) is proposed to translate and colourize sketches into realistic images. The proposed model translates and colours hand-drawn sketches into high-resolution RGB realistic images. It also controls spatial features and pixel-level details without affecting realistic attributes by imitating the condition. In addition to generating high-resolution RGB realistic images from sketches, the proposed model can also classify and recognize the input images. This architecture comprises four phases, i.e., three cGANs followed by an image classifier. Each cGAN comprises a generator and a discriminator. The generator handles the 3D facial attributes during face sketch colourization and translation based on a conditional encoder-decoder network; this is achieved by decoding the optimum features extracted by the encoder under the given conditions. The framework converts the sketch into a high-resolution RGB image and classifies it. The whole process works in four different steps: in the first step, the input sketch is converted into a grayscale image. Secondly, the grayscale image is converted into an RGB image with consideration of facial attributes. In the third step, the RGB image is converted to a high-resolution RGB image using a pixel modifier. In the fourth step, the high-resolution RGB image is classified and labelled with respect to the relevant class.
We have developed a face dataset for experimental purposes that consists of 1000 face images of 100 people (10 images per person). Each image was preprocessed into four versions: the original RGB image, a manually drawn sketch, a grayscale image, and a high-resolution image for cross-match analysis. As a result, we have developed a fine-tuned, state-of-the-art 4000-image face dataset comprising 1000 original RGB images with different face positions for extracting spatial facial attributes, 1000 manually drawn sketches, 1000 grayscale images, and 1000 high-resolution RGB images. These manually generated images and sketches serve as training data and ground truth for cross-match analysis to authenticate the proposed model's performance.
The key contributions of our research work are as follows:
1. We developed a multi-level 3-dimensional conditional generative adversarial network (3D-cGAN) that colours and translates sketches into realistic images while preserving spatial facial attributes and pixel-level details.
2. We process the 3D convolution and 3D deconvolution using vectorization, which trains more attributes and parameters without extra time consumption.
3. We generate high-resolution RGB colour images from sketches that are more realistic.
4. The proposed technique also accounts for heterogeneous lighting effects in the spatial domain and neighbour feature selection.
5. We introduce a face dataset (PPFD) that consists of 1000 face images of 100 people in four categories.
6. This work provides ground truth for each image at multiple stages, which authenticates the performance of the proposed architecture using cross-match analysis.

II. RELATED WORK
Face recognition or person identification has been achieved by jointly using soft and hard biometric traits [39]. It is well known that sketch information combined with facial attributes gives more authentic results than a sketch alone, owing to the non-availability of complementary information in sketches such as skin, eye, and hair colour, and ethnicity. Furthermore, other attributes like eyeglasses or wearing a hat can be considered secondary information to narrow down the results. In [40], Klare et al. proposed a direct approach for suspect identification using facial attributes without a sketch. Mittal et al. [41] tried to increase the accuracy of their proposed algorithm by fusing multiple sketches and considering soft biometric traits like skin colour, ethnicity, and gender to reorder the ranked list of suspects. Another framework was developed by Ouyang et al. [42] to reduce the gap between photo and sketch by combining low-level features with facial attributes. GANs have been widely used in image generation [7], [24], [29], [43], [44], [45], image translation [46], [47], and image synthesis [12], [13]. Recent literature on deep learning approaches [48], [49], [50], [51], [52], [53], [54] focuses more on face recognition and classification problems than classical methods [55], [56], [57]. These approaches can also be used for sketch-photo recognition problems. Face recognition through sketches is more complicated and challenging than classical face recognition problems, mainly because of the heterogeneous nature of photo and sketch modalities and the non-availability of large datasets. For example, most datasets provide only a single sketch per face, making it challenging for a deep model to learn robust features [58]. Another CNN-based work with a new optimization objective function was introduced by Zhang and Lin [29] for end-to-end face image-to-sketch translation, targeting the preservation of input image features during translation. Zhu et al. [45] addressed the problem of the non-availability of paired training data by introducing a new architecture, CycleGAN, which translates input images into target images without using paired training samples. Li et al. [59] proposed a deep CNN model named VGG-Face to preserve facial attributes during translation; this model generates the expected output image based on the desired facial attributes. Att-GAN was introduced by Zuo et al. [60]; it works as an attribute classifier and tries to guarantee the generation of correct faces based on desired facial attributes. Recently, conditional GAN networks [43] have greatly advanced the image generation domain; these networks operate based on conditions given as input. Based on cGANs, Karras et al. [61] introduced a substitute for the generator in GAN networks that can isolate stochastic variation and high-level facial attributes; this generator helps to produce high-quality facial images. In [30], Isola et al. developed the pix2pix GAN architecture for image colourization, sketch-to-image creation, and semantic segmentation. An improved version of [30] was proposed by Wang et al. [62], named pix2pixHD; this network demonstrates the application of cGANs with semantic label maps in the image generation domain. Hand-drawn sketches have been colourized by Sangkloy et al. [19] by taking user-centred sparse colour strokes as conditions.
Researchers have explored component-based methods [16] for human face image generation by exploiting high-level features of human faces. Wu and Dai [63] introduced a three-step mechanism for sketch-photo-sketch conversion: sketches are taken as input in the first step and matched with a face image dataset; the second step colours the sketch with the best-fit face image; and in the last step, the sketch is re-drawn from the generated image to authenticate the output. The problem with this technique is that it requires a well-drawn sketch as input. Gu et al. [59] enabled component-level controllability of facial attributes using auto-encoders that learn embedding features from individual face components; they used mask-guided generative networks to fuse the component feature tensors. cGAN networks have also been used to localize facial images using either facial sketches [8], [64] or semantic label masks [65], [66]. Semantic label mask-based editing is more flexible with respect to style transfer and component transfer, while the former approaches give fine and direct control of facial components. Portenier et al. [67] proposed a method to overcome the errors in manually generated sketches.
However, in these works, pixel-level details of facial attributes were not preserved accurately due to the 2D convolutions of the loss learning functions.

III. METHODOLOGY/RESEARCH PROCESS
A. DATASET PREPARATION
For experimental purposes, we have developed a face dataset comprising a total of 100 participants. Ten facial images with different face positions and angles were collected from each participant, for a total of 1000 images. These images, with different facial positions and angles along with heterogeneous lighting effects, are shown in Figure 1.
After that, a preprocessing phase was initiated, and four versions of each image were generated: the original RGB image, a manually drawn sketch, a grayscale image, and a high-resolution image. Sketching and manual enhancement were performed under the supervision of expert artists and photographers. In this way, our fine-tuned facial image dataset comprises 4000 images in four categories: 1000 original RGB images, 1000 manual sketches, 1000 grayscale images, and 1000 super-resolution images, as shown in Figure 2.
The original image size is 256 × 256 × 3 with normal quality, and the high-resolution image is 256 × 256 × 3 with high quality.

B. PROPOSED FRAMEWORK ARCHITECTURE
The proposed model comprises four major phases. The first three phases are cGAN networks that generate images from sketches to high-resolution images step by step. In the proposed framework, every GAN network is a modified form of U-Net [70]. The final output of the first three phases acts as input to the fourth phase for classification and recognition. The fourth phase of the network contains a state-of-the-art CNN network. We re-trained three CNN networks for classification and recognition, i.e., ResNet-50, ResNet-101, and Mobile-Net; based on the classification and recognition model selection, one of these networks is selected to classify and recognize the input. Figure 3 shows the general framework of the proposed work. In this figure, the sketch image is the input of the first GAN, G1, which encodes the sketch input and generates a grayscale image. The grayscale output of G1 is the input of the second GAN, G2, which processes the grayscale image and generates an RGB image as output. The RGB image is given to the third GAN, G3, to generate a high-resolution image. Image encoding and decoding are completed at this stage, and the high-resolution image is passed to the CNN network for classification and recognition.
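The end-to-end flow can be summarized with a minimal sketch (hypothetical names: `g1`, `g2`, `g3`, and `classifier` stand in for the trained networks described above):

```python
import torch

# Minimal sketch of the four-phase inference pipeline, assuming
# already-trained models; names are placeholders.
@torch.no_grad()
def sketch_to_identity(sketch, g1, g2, g3, classifier):
    gray = g1(sketch)              # Phase 1: sketch -> grayscale
    rgb = g2(gray)                 # Phase 2: grayscale -> RGB
    high_res = g3(rgb)             # Phase 3: RGB -> high-resolution RGB
    logits = classifier(high_res)  # Phase 4: classification / recognition
    return high_res, logits.argmax(dim=1)
```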

1) GAN ARCHITECTURE
G1, G2, and G3 have the same architecture for sketch-to-grey, grey-to-RGB, and RGB-to-high-resolution functionality, respectively. Each GAN network of the proposed framework includes a generator and a discriminator. The internal architecture of the generator and discriminator is as follows:

a: GENERATOR ARCHITECTURE
The generator of the GAN network consists of two blocks: encoding and decoding.

i) ENCODING
The encoding block extracts the features from the input image and encodes them into optimum features. The encoding block of the proposed GAN network's generator consists of eight 3D-encoding sub-blocks. Every sub-encoding block contains convolutional layers with a stride size of 2, Leaky ReLU, and batch normalization. During the preprocessing phase, the input image is resized to 256 × 256 × 3. The generator of the GAN network accepts images of size 256 × 256 × 3, and the encoding blocks reduce them to a 1 × 1 × 512 code.
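A single sub-encoding block might look like the following sketch (PyTorch; the kernel shapes and the choice to stride only over the spatial axes are our assumptions, since the paper does not list them):

```python
import torch.nn as nn

# One 3D encoding sub-block: 3D convolution (stride 2 over the spatial
# axes), batch normalization, and Leaky ReLU. Eight such blocks reduce
# a 256 x 256 x 3 input to a 1 x 1 x 512 code.
def encoder_block(in_ch, out_ch, use_bn=True):
    layers = [nn.Conv3d(in_ch, out_ch, kernel_size=(3, 4, 4),
                        stride=(1, 2, 2), padding=(1, 1, 1))]
    if use_bn:
        layers.append(nn.BatchNorm3d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)
```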

ii) DECODING
The decoding block of the generator has seven sub-decoding blocks. The decoding block upsamples the input from 1 × 1 × 512 to 256 × 256 × 3. Each sub-decoding block applies transposed 3D convolution, batch normalization, and ReLU. A dropout of 0.5 was also applied, but only to the initial three sub-decoding blocks; this dropout was applied after convolution and batch normalization, before ReLU, to achieve suitable noise removal while maintaining the original face texture and features. Each sub-decoding block gets input from the previous block and from the corresponding same-sized sub-encoding block. This approach helps to improve feature selection and texture preservation during encoding and decoding. The decoding part of the generator tries to generate a target image closer to the ground truth. The complete process of the encoding and decoding phases of the proposed cGAN is described in Figure 4.
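A sub-decoding block with a U-Net-style skip connection could be sketched as follows (again with assumed kernel shapes; the dropout position matches the description above):

```python
import torch
import torch.nn as nn

# One 3D decoding sub-block: transposed 3D convolution, batch norm,
# dropout (first three blocks only), then ReLU. The skip input from the
# same-sized encoder block is concatenated on the channel axis, so
# in_ch counts both the previous block's and the skip's channels.
class DecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch, use_dropout=False):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=(3, 4, 4),
                                     stride=(1, 2, 2), padding=(1, 1, 1))
        self.bn = nn.BatchNorm3d(out_ch)
        self.drop = nn.Dropout(0.5) if use_dropout else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.up(torch.cat([x, skip], dim=1))  # fuse encoder features
        return self.act(self.drop(self.bn(x)))    # BN -> dropout -> ReLU
```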

b: DISCRIMINATOR ARCHITECTURE
The discriminator also comprises seven different sub-blocks. The initial three sub-blocks of the discriminator are similar to the sub-encoding blocks of the generator. After these three sub-blocks, two separate convolutions are applied with a stride size of 1 for feature purification and preservation of the input, followed by batch normalization and Leaky ReLU. The discriminator receives two inputs: 1) the image generated by the generator and 2) the target image as ground truth. The primary function of the discriminator is to discriminate between generated and ground-truth images, i.e., to find how much the generated images differ from the ground truth. Finally, the discriminator's output of size 30 × 30 × 1 is used to judge generation quality, as shown in Figure 5.
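The 30 × 30 × 1 output indicates a patch-based discriminator in the style of pix2pix [30]. A 2D sketch that reproduces the stated output shape is given below; the channel counts and the concatenation of the conditioning image with the candidate image are assumptions:

```python
import torch.nn as nn

# Patch-based discriminator sketch: the conditioning image and the
# candidate (generated or ground-truth) image are concatenated on the
# channel axis (3 + 3 = 6 channels), and the network emits a
# 30 x 30 x 1 map of real/fake scores for a 256 x 256 input.
def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

discriminator = nn.Sequential(
    nn.Conv2d(6, 64, 4, stride=2, padding=1),   # 256 -> 128
    nn.LeakyReLU(0.2, inplace=True),
    conv_block(64, 128, stride=2),              # 128 -> 64
    conv_block(128, 256, stride=2),             # 64 -> 32
    conv_block(256, 512, stride=1),             # 32 -> 31, feature purification
    nn.Conv2d(512, 1, 4, stride=1, padding=1),  # 31 -> 30: patch scores
)
```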

2) CNN NETWORK
The final phase of the proposed network comprises three pre-trained CNN networks, as shown in Figure 3: 1) ResNet-50, 2) ResNet-101, and 3) Mobile-Net. These networks are used for the classification and recognition of the colourized high-resolution images. We adopted the transfer learning technique to train these CNN networks.
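A minimal transfer-learning setup is sketched below, assuming ImageNet-pretrained weights and 100 identity classes (one per PPFD participant); the paper does not specify which layers, if any, were frozen:

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classification
# head with one sized for the 100 PPFD identities.
num_classes = 100
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)
```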

C. TRAINING DETAILS
1) PARAMETERS USED
The generator of the proposed model used 163,577,577 parameters across all GAN stages (sketch-to-grayscale, grayscale-to-RGB, and RGB-to-high-resolution conversion). The discriminator used a total of 8,311,299 parameters to judge the originality and quality of the generated image. The CNN models, i.e., ResNet-50, ResNet-101, and MobileNet, used 197,525,588, 216,552,832, and 175,393,895 parameters, respectively, for classification purposes. The complete details of the total, trainable, and non-trainable parameters of all stages are given in Table 1.

2) TRAINING PROCESS
Traditional convolutional neural networks (CNNs) cannot capture the spatial relationship between features and the whole image, so they lose some of the target's attribute information, such as direction and posture. To utilize the optimum attributes of the target image, the proposed multi-level 3D GAN applies 3D convolution to encode the input image into vectors, as shown in Figure 6. The output vector of the encoder is given as input to the decoder to reconstruct the guided coloured-face image. The 3D convolution and deconvolution guide G1, G2, and G3 to use additional features and attributes for encoding and decoding. This helps to preserve the direction and posture of targeted image attributes and the spatial relationships among the whole image's features. The 3D convolution and 3D deconvolution are handled by a vectorization process that takes the same processing time as 2D convolution but extracts more features and texture information.
Generally, GAN networks generate the final image y from scratch, i.e., from a random noise vector z, G : z → y [43]. In the case of the proposed GAN, the network gets two inputs, i.e., a random noise vector z and a conditional vector x (the sketch image), to construct the final output image y, G : {x, z} → y.
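For reference, this corresponds to the standard conditional-GAN objective of [30], [43], which the generator tries to minimize and the discriminator to maximize:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right]$$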
The discriminator D is trained adversarially to differentiate generated from real images; as long as the generator is trained, it produces good-quality images that are indistinguishable from natural images. The training mechanism of the proposed GAN network is demonstrated in Figure 7. We used the Adam stochastic gradient descent [71] mechanism for training with a learning rate of 0.0002 and momentum estimates β1 and β2 of 0.5 and 0.999, respectively. The learning rate was reduced to 0.00001 after 150 epochs to fine-tune the model weights.
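In PyTorch terms, the stated optimizer settings amount to the following sketch (the generator and discriminator arguments stand in for the actual networks):

```python
import torch.optim as optim

# Adam with lr = 0.0002 and betas (0.5, 0.999), as stated in the text.
def make_optimizers(generator, discriminator):
    opt_g = optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    return opt_g, opt_d

# After 150 epochs, the learning rate is reduced to 1e-5 for fine-tuning.
def drop_lr(optimizer, new_lr=1e-5):
    for group in optimizer.param_groups:
        group["lr"] = new_lr
```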
Before training, a preprocessing phase was initiated to resize the images according to the underlying framework; images were then randomly cropped to the target size with horizontal flipping. For the training of the proposed GAN, G1 was trained for 250 epochs for the sketch-to-grey transformation, G2 was trained for 300 epochs for grey-to-RGB conversion, and G3 was trained for 450 epochs to transform RGB images to high resolution; the larger number of epochs for G3 is due to texture enhancement and feature improvement.
The batch size was set to 1 for all three GAN networks. Training took 120 minutes for G1, 145 minutes for G2, and 230 minutes for G3 on a P100 GPU. At inference time, each GAN needs approximately 0.35 s to transform its input to output on the same GPU, so the proposed GAN model needs only 1.25 s to convert a sketch into a super-resolution coloured image. For classification, the input size of ResNet-50, ResNet-101, and MobileNetV2 is 224 × 224 × 3; for the training of the CNN networks, we used random 224 × 224 × 3 crops from the high-resolution images. The ResNet-50 and ResNet-101 models were trained for 45 epochs with a batch size of 128 and a learning rate of 0.0001, while the same batch size of 128 and learning rate of 0.0001 were used for MobileNetV2 with 70 epochs.

IV. RESULTS AND EVALUATION
A. IMAGE GENERATION
The whole proposed multi-level GAN network generates high-resolution images from sketches in three phases. Three GAN networks are cascaded to generate the high-resolution images, and during the training process each GAN network was trained independently. The details of each image generation phase are given below.
Phase-I: The input to the first GAN (G1) is a sketch image, as shown in Figure 3. For training purposes, we use the proposed PPFD dataset with a train-test ratio of 7:3 for G1. A total of 500 epochs were carried out with a conditional training procedure on 1400 images. These 1400 images include 700 sketches and 700 grayscale images. The output of G1 is a grayscale image, as shown in Figure 8. As training starts, G1 generates noisy and blurry images, but the noise is removed gradually as the number of epochs increases. At epoch 210, the generated picture is more precise, and at epoch 300, the generated image is closer to the ground truth, as shown in Figure 8.
Phase-II: The phase-II GAN network (G2) received 700 grayscale images and 700 coloured images for training. The training of G2 comprised 420 epochs with a learning rate of 0.0002. The output of G2 is a coloured image, as shown in Figure 9. Initially, at epoch 15, the generated image shows a blurry pattern, but by epoch 110 the image looks more realistic. As execution proceeds from epoch 210 to 300, the facial expression of the generated image shows an increasingly realistic pattern close to the ground truth.
Phase-III: The phase-II output is a normal coloured image. To generate a high-resolution image with more realistic facial attributes, we incorporated G3 into the network. The G3 network enhances regions with blurry and noisy patterns to convert normal images into high resolution using conditional attributes. The visual results of G3 at epochs 7, 111, 285, and 390 are shown in Figure 10.

1) GAN EVALUATION
The generator produces an output for every input. In the first step, the discriminator evaluates the input image and the generated image; in the second step, it evaluates the input image and the target image. Finally, the generator and discriminator losses were calculated, with gradients computed with respect to the generator's and discriminator's inputs. In this way, all the results were optimized.

a: GENERATOR LOSS
Generator loss is the sigmoid cross-entropy between the discriminator's output on generated images and an array of ones. The generator loss also includes an L1 term, the mean absolute error (MAE) between the generated image and the target image; the L1 loss helps the generator produce images that are more faithful to the target. The total generator loss is evaluated by equation 1:

Total generator loss = gen_loss + λ · L1_loss, where λ = 100. (1)

Initially, the highest generator losses of the proposed model for G1, G2, and G3 were 3.31, 5.47, and 4.33, respectively, as shown in Figure 11. These losses approach zero as accuracy improves with an increasing number of epochs. To fool the discriminator, the generator's loss function pushes the generated images toward the ground truth. As the number of epochs increases, learning proficiency increases and the generator loss decreases to 0.35, 0.06, and 0.02 for G1, G2, and G3, respectively, as shown in Figure 11. The proposed model achieved its best training results at 300 epochs for G1 and G2, while for G3 the best training results were achieved at epoch 400 due to the generation of texture details and high-resolution facial attributes.
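A sketch of equation 1 in PyTorch, following the pix2pix-style losses [30] that the description matches (tensor names are illustrative):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy
l1 = nn.L1Loss()              # mean absolute error
LAMBDA = 100                  # lambda = 100, as stated in the text

def generator_loss(disc_generated_output, generated_image, target_image):
    # Cross-entropy of D's output on generated images vs. an array of ones.
    gan_loss = bce(disc_generated_output,
                   torch.ones_like(disc_generated_output))
    # L1 (MAE) distance between the generated and target images.
    l1_loss = l1(generated_image, target_image)
    return gan_loss + LAMBDA * l1_loss  # equation 1
```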

b: DISCRIMINATOR LOSS
Two inputs were given to the discriminator loss function: 1) the real image and 2) the generated image. The real loss is the sigmoid cross-entropy between the discriminator's output on real images and an array of ones, while the generated loss is the sigmoid cross-entropy between its output on generated images and an array of zeros. The total discriminator loss is therefore the sum of the real loss and the generated loss, as shown in equation 2:

L_DT = L_R + L_G. (2)

The L_DT curves of G1, G2, and G3 decrease from 1.42, 1.65, and 1.42, respectively, as shown in Figure 11. The trend line decreases as the number of epochs increases. The behaviour of the graph reveals that initially the discriminator beats the generator and classifies the generated image as fake, but as the generator's learning increases up to 150 epochs, it produces increasingly realistic images. Finally, the losses decrease after 220 epochs, and the generated image looks more realistic and closer to the ground truth.
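Equation 2 can be sketched the same way (a self-contained counterpart to the generator-loss sketch above; tensor names are again illustrative):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy

def discriminator_loss(disc_real_output, disc_generated_output):
    # Real loss: D's output on ground-truth images vs. an array of ones.
    real_loss = bce(disc_real_output, torch.ones_like(disc_real_output))
    # Generated loss: D's output on generated images vs. an array of zeros.
    generated_loss = bce(disc_generated_output,
                         torch.zeros_like(disc_generated_output))
    return real_loss + generated_loss  # L_DT = L_R + L_G
```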

2) IMAGE QUALITY EVALUATION
To evaluate the quality of the generated images, we used the SNR, PSNR, and SSIM metrics.

a: SNR
The signal-to-noise ratio (SNR) is used in imaging to characterize image quality; the sensitivity of a (digital or film) imaging system is typically described as the signal level that yields a threshold level of SNR, as shown in equation 3. To evaluate the quality of the generated image at the pixel level, the peak signal-to-noise ratio (PSNR) is used, formulated as in equation 4: it is the ratio between the maximum power of the signal (the original target image) and the power of the noise in the generated image. Here, MSE is the mean square error between g and p, where p is the predicted (newly generated) image and g is the ground-truth image.
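For reference, the standard definitions of SNR and PSNR, consistent with the descriptions of equations 3 and 4, are:

$$\mathrm{SNR} = 10\log_{10}\frac{P_{\text{signal}}}{P_{\text{noise}}},\qquad \mathrm{PSNR} = 10\log_{10}\frac{MAX_g^2}{\mathrm{MSE}(g,p)}$$

where $MAX_g$ is the maximum possible pixel value (255 for 8-bit images) and $\mathrm{MSE}(g,p)$ is the mean squared difference between the ground truth $g$ and the prediction $p$.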
The proposed multi-GAN network attains I-include SSIM values of 0.9251, 0.9891, and 0.9940 for G1, G2, and G3, respectively, while the I-exclude SSIM values were calculated as 0.9084, 0.9875, and 0.9661, as shown in Table 2.
A comparative analysis of the proposed multi-level 3D-GAN with existing GAN networks is given in Table 3. The results reveal that the proposed model outperforms them with respect to SSIM, SNR, and PSNR. The proposed model outperforms due to its use of multi-level GANs with 3-dimensional convolutions and deconvolutions: 3D convolution extracts optimal features and attributes along with each pixel's direction and position, and the relations among the attributes are preserved, which helps decode the image with each object's actual position and relation.

B. IMAGE CLASSIFICATION
We proposed a multi-level GAN network with three phases to generate a high-resolution image from a sketch.
The CNN classifier is used to recognize the generated high-resolution image using the transfer learning technique with the help of three pre-trained models, i.e., ResNet50, ResNet101, and MobileNetV2.
The original high-resolution images were used for the training of these models. For testing purposes, we used the generated high-resolution images.
The confusion matrix is the most common and comprehensive way to represent classification evaluation. It includes four classes: 1) True Positive (TP), 2) True Negative (TN), 3) False Positive (FP), and 4) False Negative (FN). The confusion matrices of ResNet50, ResNet101, and MobileNetV2 are given in Figure 12.

The other classification evaluation metrics used to evaluate the proposed work include accuracy, precision, recall, and F1-score.

1) ACCURACY
Accuracy measures how often the proposed model produces correct results; it is the ratio of correctly classified images to the total number of images evaluated. The accuracy of the proposed model is calculated by equation 6.
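The standard definition, consistent with equation 6:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$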

2) PRECISION
Precision measures how many of the values predicted as positive are actually positive. It is beneficial when we have labelled data for our predictions. The formula for precision is given in equation 7.
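The standard definition, consistent with equation 7:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$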
Due to their deeper structural architecture, the proposed model obtained higher precision values with ResNet-50 and ResNet-101 than with Mobile-Net. The precision of MobileNet, ResNet-50, and ResNet-101 was 91.77%, 94.74%, and 97.43%, respectively.

3) RECALL
Another way to evaluate classification is recall, the ratio of correctly classified positive values to the total number of positive values. Recall is formulated in equation 8.
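The standard definition, consistent with equation 8:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$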

4) F1 SCORE
The overall picture of precision and recall can be captured with the F1-score, the harmonic mean of recall and precision. The formula for the F1-score is given in equation 9.
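The standard definition, consistent with equation 9:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$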
ResNet-101 achieved the highest F1-score at 0.9733, while ResNet-50 and Mobile-Net did not perform as well, obtaining F1-scores of 0.9465 and 0.9165, respectively.

5) SPECIFICITY
It helps us to find the ratio of correctly classified negative values to the total number of negative values. Specificity is formulated in equation 10.
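The standard definition, consistent with equation 10:

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$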
ResNet-101 performed best in terms of specificity at 99.70%, while ResNet-50 achieved 99.41% and MobileNet 99.07%. The values of the evaluation metrics are also shown in Table 4.

C. ADVANTAGES AND LIMITATIONS
This approach has the following advantages over existing techniques.
• The proposed conditional GAN works in three phases, i.e., sketch to grayscale, grayscale to colour, and colour to high-resolution RGB image.
• The proposed 3D-cGAN can translate sketches into more realistic images by preserving more spatial facial attributes and pixel-level information while using the same processing time as conventional 2D-Convolution.
• We have also developed a state-of-the-art facial PPFD dataset that contains 4000 images in four distinct categories, with heterogeneous, multi-colour, and varied luminance effects.
• However, the proposed 3D-cGAN cannot generate full high-definition images (e.g., 1024 × 1024 or larger) due to limited computational resources and the complexity of convolutional neural networks.

V. CONCLUSION
This work proposed a framework with a multi-level 3D cGAN network to generate high-resolution images from sketches, along with a classification network to recognize the images. We developed a state-of-the-art PPFD dataset comprising 4000 images collected from 100 people for experimental purposes, and generated the ground truth of each image to authenticate the proposed framework's results. The framework integrates three conditional GAN networks for sketch-to-image generation, followed by pre-trained ResNet-50, ResNet-101, and Mobile-Net for classification. We implemented the 3D-convolutional process for all GANs using vectorization, which extracts more features and texture information from images at the same computational cost as 2D convolution. We used the Adam stochastic gradient descent mechanism with a learning rate of 0.0002 and momentum estimates β1 and β2 of 0.5 and 0.999, respectively, during training. Multiple statistical measures were considered to authenticate the performance of the proposed framework, which achieved 97.33% accuracy and over 99% structural similarity (SSIM), with high SNR and PSNR.
In the future, we will enhance the quality of the generated images while using fewer parameters, so that high-quality image generation becomes possible on low-processing devices. We will also try to generate images from textual data.

VI. FUNDING
No funding is available for this study.

VII. AUTHORSHIP CONTRIBUTION
Zakir Khan: Conceptualization; Methodology; Formal analysis; Data curation; Code execution; Data collection.

ZAKIR KHAN received the M.S. degree in computer science from Hazara University Mansehra, Pakistan, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Information Technology. He is also a Lecturer in computer science and information technology with Hazara University Mansehra. His research interests include artificial intelligence, machine learning, deep learning, medical image processing, steganography, data hiding, and information security.

ARIF IQBAL UMAR is currently an Associate Professor with the Department of Computer Science and Information Technology, Hazara University Mansehra, Pakistan. His research interests include data mining, data encryption, neural networks, medical image processing, IT security, algorithms, and machine learning.
SYED HAMAD SHIRAZI is currently an Assistant Professor with the Department of Computer Science and Information Technology, Hazara University Mansehra, Pakistan. His research interests include computer vision, texture analysis, neural networks, object recognition, pattern recognition, digital image processing, machine learning, and wavelet transformation.
MUHAMMAD SHAHZAD received the M.S. degree in computer science from the Virtual University of Pakistan. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Information Technology, Hazara University Mansehra, Pakistan. His research interests include data mining, machine learning, deep learning, medical image processing, and natural language processing.