A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos

The exponential growth of videos on YouTube has attracted billions of viewers, the majority of whom belong to a young demographic. Malicious uploaders also see this platform as an opportunity to spread upsetting visual content, for example by using animated cartoon videos to share inappropriate content with children. Therefore, an automatic real-time video content filtering mechanism should be integrated into social media platforms. In this study, a novel deep learning-based architecture is proposed for the detection and classification of inappropriate content in videos. The proposed framework employs an ImageNet pre-trained convolutional neural network (CNN) model known as EfficientNet-B7 to extract video descriptors, which are then fed to a bidirectional long short-term memory (BiLSTM) network to learn effective video representations and perform multiclass video classification. An attention mechanism is also integrated after the BiLSTM to apply an attention probability distribution in the network. These models are evaluated on a manually annotated dataset of 111,156 cartoon clips collected from YouTube videos. Experimental results demonstrated that EfficientNet-BiLSTM (accuracy = 95.66%) performs better than the attention mechanism-based EfficientNet-BiLSTM (accuracy = 95.30%) framework. Secondly, the traditional machine learning classifiers perform relatively poorly compared with deep learning classifiers. Overall, the architecture of EfficientNet and BiLSTM with 128 hidden units yielded state-of-the-art performance (f1 score = 0.9267). Furthermore, the performance comparison against existing state-of-the-art approaches verified that a BiLSTM on top of a CNN captures better contextual information from video descriptors, and hence achieves better results in child inappropriate video content detection and classification.


I. INTRODUCTION
The creation and consumption of videos on social media platforms have grown drastically over the past few years. Among the social media sites, YouTube predominates as a video sharing platform with a plethora of videos from diverse categories. According to YouTube statistics [1], the global user base of YouTube is over 2 billion registered users, and more than 500 hours of video content are uploaded every minute. Consequently, billions of hours of videos are available where users of all age groups can explore generic as well as personalized content [2]. Considering such a large-scale crowdsourced database, it is extremely challenging to monitor and regulate the uploaded content as per platform guidelines. This creates opportunities for malicious users to indulge in spamming activities by misleading audiences with falsely advertised content (i.e., video, audio or text). The most disruptive behavior by malicious users is to expose young audiences to disturbing content, particularly when it is fabricated as safe for them. Children today spend most of their time on the Internet, and YouTube has distinctly established itself for them as an alternative to traditional screen media (e.g., television) [3], [4]. The YouTube press release [5] also confirmed the high popularity of this social media site among younger audiences compared to other age groups, and this high level of approval is attributed to fewer restrictions [6]. Unlike television, children can be presented with any type of content on the Internet due to a lack of regulations. Exposing children to disturbing content is considered one of several Internet safety threats (such as cyberbullying, cyber predators, hate, etc.) [7]. Bushman and Huesmann [8] confirmed that frequent exposure to disturbing video content may have a short-term or long-term impact on children's behavior, emotions and cognition. Many reports [9]-[12] identified the trend of distributing inappropriate content in children's videos. This trend got people's attention when mainstream media reported on the Elsagate controversy [13], [14], where video material was found on YouTube featuring famous childhood cartoon characters (i.e., Disney characters, superheroes, etc.) portrayed in disturbing scenes; for instance, performing mild violence, stealing, drinking alcohol and engaging in nudity or sexual activities.
The associate editor coordinating the review of this manuscript and approving it for publication was Aasia Khanum.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In an attempt to provide a safe online platform, laws like the Children's Online Privacy Protection Act (COPPA) impose certain requirements on websites to adopt safety mechanisms for children under the age of 13. YouTube has also included a ''safety mode'' option to filter out unsafe content. Apart from that, YouTube developed the YouTube Kids application to allow parental control over videos that are approved as safe for a certain age group of children [15]. Regardless of YouTube's efforts to control unsafe content, disturbing videos still appear [16]-[19], even in YouTube Kids [20], due to the difficulty of identifying such content. One explanation may be that the rate at which videos are uploaded every minute makes YouTube vulnerable to unwanted content. Besides, the decision-making algorithms of YouTube rely heavily on the metadata of a video (i.e., video title, video description, view count, rating, tags, comments, and community flags). Hence, filtering videos based on metadata and community flagging is not sufficient to assure the safety of children [21]. Many cases exist on YouTube where safe video titles and thumbnails are used for disturbing content to trick children and their parents. The sparse inclusion of child inappropriate content in videos is another common technique followed by malicious uploaders. Fig. 1 displays one such case, where the video title and some video clips are safe for children (as shown in Fig. 1(a)) but the video also includes inappropriate scenes (as shown in Fig. 1(b) and Fig. 1(c)). The concerning thing about this example, and many similar cases, is that these videos have millions of views with more likes than dislikes, and have been available for years. Many other cases (as shown in Fig. 1(d)) were also identified where the video or the YouTube channel is not popular, yet contains child unsafe content, especially in the form of animated cartoons.
It is evident from these examples that the problem persists irrespective of channel or video popularity. Furthermore, YouTube has disabled the dislike feature of videos, so viewers can no longer get indirect feedback on video content from these statistics. Since YouTube metadata can be easily manipulated, it is better to use video features rather than the metadata features associated with videos for the detection of inappropriate content [22].
Prior techniques [23]-[28] addressed the challenge of identifying disturbing content (i.e., violence, pornography, etc.) in videos by using traditional hand-crafted features on frame-level data. In recent years, the state-of-the-art performance of deep learning has motivated researchers to employ it in image and video processing. The most frequent applications of image/video classification employed convolutional neural networks [29]-[31]. Apart from that, the long short-term memory (LSTM), a special type of recurrent neural network (RNN) architecture, has proven to be an effective deep learning model in time-series data analysis [32]. Hence, this study targets the YouTube multiclass video classification problem by leveraging a CNN (EfficientNet-B7) and LSTM to learn effective video representations for the detection and classification of inappropriate content. We targeted two types of objectionable content geared towards young viewers: one that contains violence, and a second that includes sexual-nudity connotations. The main contributions of this study are threefold:
1. We propose a novel CNN (EfficientNet-B7) and BiLSTM-based deep learning framework for inappropriate video content detection and classification.
2. We present a manually annotated ground truth video dataset of 1860 minutes (111,561 seconds) of cartoon videos for young children (under the age of 13). All videos are collected from YouTube using famous cartoon names as search keywords. Each video clip is annotated as either safe or unsafe. For the unsafe category, fantasy violence and sexual-nudity explicit content are monitored in videos. We also intend to make this dataset publicly available for the research community.
3. We evaluate the performance of our proposed CNN-BiLSTM framework. Our multiclass video classifier achieved a validation accuracy of 95.66%. Several other state-of-the-art machine learning and deep learning architectures are also evaluated and compared for the task of inappropriate video content detection.
To summarize, this work can assist any video sharing platform to either remove an unsafe video or blur/hide any portion of a video involving unsafe content. Secondly, it may also help in the development of parental control solutions on the web via plugins or browser extensions that filter child inappropriate content automatically. The remaining sections of the article are outlined as follows: Section II covers the related work in this research area. The methodology of our proposed system is explained in Section III. The experimental setup of the proposed system is presented in Section IV. The results obtained from the experimental setup are analyzed and discussed in Section V, and finally, Section VI concludes the work and outlines future scope for improvements.

II. RELATED WORK
The explosion of multimedia data on YouTube has presented a lot of opportunities for researchers [33]. However, the challenging task is to find an optimal technique to understand the context of videos. In both computer vision and machine learning, video classification is studied extensively as one of the fundamental approaches for video understanding [34]. The problem of detecting inappropriate video falls in the category of video classification or event detection problem.

A. MACHINE LEARNING METHODS
Most of the earlier studies used hand-crafted features on images or video-level data in identifying the discriminative patterns of inappropriate content. The skin (i.e., skin color) [35] and motion information based features were used for nudity or pornography detection [26], [36]. The multimodal approach was also followed by fusing different modalities of data (i.e., audio, video) with skin and motion-based features. Rea et al. [37] proposed a periodicity-based audio feature extraction method which was later combined with visual features for illicit content detection in videos.
Machine learning algorithms are usually employed as classifiers. Liu et al. [38] classified the periodicity-based audio and visual segmentation features through a support vector machine (SVM) algorithm with a Gaussian radial basis function (RBF) kernel. Later on, they extended the framework [39] by applying energy envelope (EE) and bag-of-words (BoW)-based audio representations and visual features. Ulges et al. [23] used MPEG motion vectors and Mel-frequency cepstral coefficient (MFCC) audio features with skin color and visual words. Each feature representation is processed through an individual SVM classifier and combined in a weighted sum of late fusion. Ochoa et al. [40] performed binary video genre classification for adult content detection by processing the spatiotemporal features with two types of SVM algorithms: sequential minimal optimization (SMO) and LibSVM. Jung et al. [41] worked with the one-dimensional signal of spatiotemporal motion trajectory and skin color. Tang et al. [42] proposed a pornography detection system, PornProbe, based on a hierarchical latent Dirichlet allocation (LDA) and SVM algorithm. This system combined unsupervised clustering in LDA and supervised learning in SVM, and achieved higher efficiency than a single SVM classifier. Lee et al. [43] presented a multilevel hierarchical framework by taking multiple features of different temporal domains. Lopes et al. [44] worked with the bag-of-visual features (BoVF) for obscenity detection. Kaushal et al. [21] performed supervised learning to identify child unsafe content and content uploaders by feeding machine learning classifiers (i.e., random forest, K-nearest neighbor, and decision tree) with video-level, user-level and comment-level metadata of YouTube. Reddy et al. [45] handled the explicit content problem of videos through text classification of YouTube comments. They applied bigram collocation and fed the features to the naïve Bayes classifier for final classification.

B. DEEP LEARNING METHODS
In contrast to machine learning algorithms, there is a growing trend of using deep learning architectures to learn video-based feature representations in video classification. The study of Ngiam et al. [46] is the first deep learning-based research that processed different data modalities such as video, image, audio and text, and performed greedy layer-wise training of a restricted Boltzmann machine (RBM) model. Karpathy et al. [29] showed multiclass video classification results on a large-scale video dataset (Sports-1M) by using a convolutional neural network. This study was followed by the research of Yue-Hei Ng et al. [32], in which information across the full-length duration of videos is examined. The LSTM model is employed on top of frame-level CNN activations for better video classification. Some other studies have also taken advantage of the CNN-LSTM model to capture the context of a given sequence of video frames [47]. Wu et al. [31] presented a CNN-LSTM hybrid model for obtaining short-term spatial-motion patterns through CNNs and long-term temporal clues by employing LSTMs. Simonyan and Zisserman [30] built a two-stream CNN architecture for action detection in videos. Wehrmann et al. [48] reported better accuracy with CNN-LSTM than with other baselines on the NPDI pornography dataset [24]. Perez et al. [49] classified pornography videos by deploying CNNs with static and motion information (i.e., MPEG motion vectors). This study yielded the same accuracy scores when static and motion information are combined by late fusion or mid-level fusion (96.4%), but the performance degraded with early fusion (90.5%). Aldahoul et al. [50] explored and compared different state-of-the-art deep learning approaches and found that features from the pre-trained CNN model (EfficientNet-B7) combined with an SVM classifier performed better in unsafe video detection.
In the literature, many studies explored different YouTube data modalities (i.e., text, audio, image, and video) individually for inappropriate content detection. Yenala et al. [51] proposed a novel CNN-BiLSTM model for automatic detection and filtering of inappropriate text in query suggestions of YouTube search. Trana et al. [52] investigated naïve Bayes, SVM and CNN classifiers for harassment detection in YouTube comments. Dadvar and Eckert [53] identified cyberbullying in YouTube comments by experimenting with four deep learning models (i.e., CNN, LSTM, BiLSTM and BiLSTM with attention) using GloVe, SSWE and random word embeddings. In this study, the BiLSTM with an attention mechanism performed better than conventional machine learning algorithms. Mohaouchane et al. [54] performed automatic detection of offensive YouTube comments using the CNN-BiLSTM model. Alshamrani [55] used an ensemble classification model to detect age-inappropriate comments posted on YouTube videos. Later on, they extended the work [17] by leveraging natural language processing in an ensemble classifier. They detected five types of age-inappropriate remarks (toxic, obscene, insult, threat and identity hate) which appear in YouTube comments. Alshamrani et al. [56] applied a neural network model on YouTube comments for toxicity detection and an LDA model on video captions for topic modeling.
The combination of different YouTube modalities is also explored in deep learning architectures. Mariconti et al. [57] showed an ensemble method for the detection of proactive remarks in YouTube videos. Features from three different sources (video metadata, audio transcripts and thumbnails) are extracted and fed to individual classifiers: metadata to an SVM with linear kernel, thumbnails to a random forest, and audio transcripts to an RNN-based gated recurrent unit (GRU) classifier. Hou et al. [10] recognized bloody videos by combining the audio-visual features and passing them to the CNN-LSTM model. Ali and Senan [11] also explored the audio-visual features of YouTube videos with a DNN model for violence classification. Sumon et al. [58] proposed the CNN-LSTM model for identifying violent crowd flow in YouTube videos. Alghowinem [59] showed a multimodal approach for inappropriate content filtering in the YouTube Kids application. Papadamou et al. [20] studied the Elsagate phenomenon in the YouTube Kids application by using YouTube video metadata such as video title, tags, statistics (likes/dislikes, view count, comments count), style features (same as proposed in [21]) and thumbnails. The authors deployed a fully connected neural network model for statistics and style features, a CNN for thumbnail features, and LSTMs for video title and tag features. Vitorino et al. [60] explored transfer learning and fine-tuning with CNNs for sexually exploitative imagery of children (SEIC) detection. Ishikawa et al. [61] proposed an end-to-end pipeline for combating Elsagate content in YouTube videos. They evaluated the classification performance of different mobile-platform-based deep learning architectures such as GoogLeNet, SqueezeNet, NASNet, and MobileNetV2. Tahir et al. [22] processed multimodal data with the CNN-LSTM framework for disturbing and fake embedded content in videos. Lastly, Singh et al. [62] performed a fine-grained approach for child unsafe video content detection. They employed an LSTM autoencoder to learn video representations from CNN (VGG-16) feature descriptors.
An abundance of literature is available on inappropriate video content detection using hand-crafted or deep learning feature extraction techniques. These studies have performed binary classification to classify videos as safe or unsafe. However, research is lacking on detecting or classifying different categories of disturbing content in real-time YouTube videos, particularly in children-oriented videos. Secondly, the existing studies have not explored the BiLSTM network for inappropriate video content detection. This paper addresses the aforementioned problems by working with an EfficientNet-B7 and BiLSTM-based deep learning framework for child inappropriate video content detection and classification.

III. PROPOSED METHODOLOGY
The proposed methodology provides a system to address the problem of disturbing content in videos. This work employs a deep learning architecture of a kind that has already been applied successfully to video classification problems in several applications. As shown in Fig. 2, the proposed system is divided into three main modules, namely (1) video preprocessing, (2) deep feature extraction, and (3) video representation and classification. In the video preprocessing stage, the collected YouTube videos are preprocessed to remove all irrelevant or missing video information. This stage also rescales the extracted frames of each video clip to fixed dimensions (224 × 224). The preprocessed video frames of each video clip are forwarded as input to an ImageNet pre-trained EfficientNet-B7 model for feature extraction. The extracted features are passed to the BiLSTM network to learn effective video representations, which are subsequently passed to the fully connected and softmax layers for final video classification. The following subsections describe each step in detail.

A. VIDEO PREPROCESSING
Video preprocessing plays an important role in deep learning techniques, as it helps in acquiring the relevant features for better video classification. In this work, a video ($V_i$) is first represented as $N$ small segments referred to as video clips ($c^i_1, c^i_2, \ldots, c^i_N$), each one second long. These video clips are labeled through a manual annotation process in which all clips with incomplete information or video context are ignored. After splitting and labeling the video clips, it was noticed that the initial 3-4 frames of each video clip carry information from the previous clip of the same video. Therefore, considering an average video frame rate of 23-24 frames per second (fps), each $j$-th video clip ($c^i_j$) is sampled at a frame rate of 22 fps by ignoring some initial frames. The last frame is duplicated for all video clips containing fewer frames than the average frame rate of a video. Overall, the frames are represented as $f^{i,j}_k$, where $k$ indexes the sampled frames of clip $c^i_j$. Finally, the selected frames of each video clip are rescaled to a fixed resolution of 224 × 224 pixels, which corresponds to the input size of the pre-trained convolutional neural network model.
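The frame selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say exactly which initial frames are dropped, so the sketch assumes the final 22 frames of each clip are kept, and the function name `sample_clip_frames` is hypothetical.

```python
def sample_clip_frames(frames, target=22):
    """Return exactly `target` frames from a one-second clip.

    Assumption (not stated in the paper): when a clip has more than
    `target` frames, the earliest ones are dropped and the last
    `target` frames are kept. Short clips are padded by repeating
    the final frame, as described in the text.
    """
    if not frames:
        raise ValueError("clip has no frames")
    if len(frames) >= target:
        # Drop the earliest frames, which may overlap with the previous clip.
        return list(frames[len(frames) - target:])
    # Pad short clips by duplicating the last frame.
    return list(frames) + [frames[-1]] * (target - len(frames))
```

With a 24-frame clip, the first two frames are discarded; with a 20-frame clip, the last frame is appended twice. Rescaling to 224 × 224 would follow as a separate step.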

B. DEEP FEATURE EXTRACTION
In this module, the deep features of preprocessed video frames are extracted by using a deep learning model with advanced architecture. Instead of training an entire CNN model from scratch, this study employed a pre-trained CNN architecture known as EfficientNet for the extraction of visual representations from video frames.

1) EFFICIENT-NET
EfficientNet is a convolutional neural network model and scaling method that uniformly scales network depth, width and resolution through a compound coefficient. It is trained on the large-scale ImageNet dataset with 1.3 million images from 1000 object classes [63]. Tan and Le [64] reported state-of-the-art accuracy of EfficientNet on the ImageNet dataset while being much smaller and faster at inference than the best existing CNN models. The baseline network of EfficientNet is referred to as B0, whereas the scaled networks are B1, B2, B3, B4, B5, B6 and B7. In general, the scaled networks show better accuracy, but at the cost of more FLOPs.
The proposed framework includes EfficientNet-B7, which takes the preprocessed extracted frames ($f^{i,j}_k$) as input. The EfficientNet module performs feature extraction with the transfer learning technique, in which each input frame with dimensions of 224 × 224 × 3 (image_width × image_height × RGB_channel) is processed through a stack of 813 layers. The last three layers, including the fully connected layer generating 1000 ImageNet output labels, are discarded, so that EfficientNet-B7 generates an output of feature descriptors $X^{i,j}_k$ with 7 × 7 × 2560 dimensions for each frame $f^{i,j}_k$. These feature descriptors are used as input to the BiLSTM model for video representation and classification.

C. VIDEO REPRESENTATION AND CLASSIFICATION
The third stage of the pipeline trains a bidirectional LSTM network with supervised learning so that effective video representations can be learnt from the feature descriptors of video clips. Subsequently, two fully connected layers are added to the proposed system to obtain the final video classification results.

1) BiLSTM NETWORK
Recurrent neural networks perform well in modeling the hidden sequential patterns of time-series data. However, the vanishing gradient problem hampers the update of network parameters during the backpropagation process. It is usually resolved by using one of two variations of RNNs: LSTM or the gated recurrent unit (GRU). Conceptually, the network structure of LSTM is the same as that of an RNN, but a special unit, the ''memory cell'', is introduced in LSTM to replace the update process of the RNN. The memory cell of LSTM maintains information for a longer duration of time. Considering the current input vector $x_t$, the last hidden state $h_{t-1}$, and the last memory cell state $c_{t-1}$, the following equations are used to implement an LSTM model:
\[
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]
where $\sigma$ represents the sigmoid activation function; $i$, $f$, $c$, and $o$ denote the input gate, forget gate, memory cell state and output gate at time $t$, respectively; $W$ and $b$ denote the weights and bias vectors; and $\odot$ denotes element-wise multiplication. For the video classification problem, one potential drawback of LSTM is that it captures the past context only. To get the full context of a video, it is important to consider both directions, i.e., the past and future context of the video. Therefore, the bidirectional LSTM is a suitable option for video classification as it preserves information in both directions, as shown in Fig. 3. In BiLSTM, there are two distinct hidden layers referred to as the forward hidden layer ($\overrightarrow{h}_t$) and the backward hidden layer ($\overleftarrow{h}_t$), which process the input sequence in the forward and backward directions, respectively. It is noticed that adding an excessive number of BiLSTM layers increases the network complexity and slows down the training process. Hence, this work employed two layers of BiLSTM to learn the video representations.
FIGURE 2. The proposed framework architecture works in three stages for child inappropriate video content detection and classification. In the first stage, video clips are preprocessed to discard irrelevant video frames and transform the selected frames into fixed-size images (224, 224, 3). Next, the frames are fed to EfficientNet-B7 to get feature vectors. All feature vectors are reshaped and passed into the two-layer stack of BiLSTMs for video representation. A fully connected layer followed by an output layer with softmax activation is integrated to return the probabilities of each video clip against the three classes: safe (class 0), fantasy violence (class 1) and sexual-nudity (class 2).
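To make the gating and the two reading directions concrete, the following is a minimal sketch of a bidirectional LSTM with a single scalar hidden unit. It assumes the standard LSTM formulation (input, forget and output gates plus a tanh candidate); the parameter layout and the names `lstm_step`, `run_lstm` and `run_bilstm` are illustrative and are not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; `p` maps gate name -> (w_x, w_h, b)."""
    def gate(name, act):
        w_x, w_h, b = p[name]
        return act(w_x * x + w_h * h_prev + b)

    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    o = gate("o", sigmoid)      # output gate
    g = gate("c", math.tanh)    # candidate cell value
    c = f * c_prev + i * g      # new memory cell state
    h = o * math.tanh(c)        # new hidden state
    return h, c


def run_lstm(xs, p):
    """Run the cell over a sequence and collect the hidden states."""
    h, c, out = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        out.append(h)
    return out


def run_bilstm(xs, p_fwd, p_bwd):
    """Pair the forward and backward hidden states for every time step."""
    fwd = run_lstm(xs, p_fwd)
    bwd = run_lstm(xs[::-1], p_bwd)[::-1]  # re-align to the original order
    return list(zip(fwd, bwd))
```

Each time step therefore carries one state summarizing the frames before it and one summarizing the frames after it, which is the property the text relies on for capturing past and future context.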

2) BiLSTM NETWORK WITH AN ATTENTION MECHANISM
A neural network architecture with an attention mechanism decides where to look in the data (in this case, segments of videos) by automatically giving more focus to the feature vectors with the most valuable information than to those with less valuable information. The architecture of BiLSTM with an attention mechanism is depicted in Fig. 4. Consider the final hidden state of the $i$-th BiLSTM as $h_{it}$, which is computed as:
\[
h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]
\]
Then, the attention mechanism is computed by using the following equations:
\[
e_{it} = \tanh(W_a h_{it} + b_a)
\]
\[
a_{it} = \frac{\exp(e_{it})}{\sum_{j} \exp(e_{jt})} \tag{11}
\]
\[
v_t = \sum_{i} a_{it} h_{it}
\]
The attention mechanism assigns the attention weight $a_{it}$ to the $i$-th BiLSTM output vector at time $t$, as calculated in equation (11). $W_a$ and $b_a$ represent the weight and bias of the attention layer. Finally, the attention layer generates an attention vector $v_t$, which is calculated as the weighted sum of the products of the attention weights $a_{it}$ and the $i$-th BiLSTM output vectors at time $t$.
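As an illustration of the weighting described above, the sketch below scores each BiLSTM output vector, normalizes the scores with a softmax, and returns the weighted sum. The scoring form (a tanh over a learned linear projection) is a common choice and is assumed here; the function name `attention_pool` and the toy values are hypothetical.

```python
import math


def attention_pool(hidden_states, w_a, b_a):
    """Attention-weighted sum over a sequence of hidden-state vectors.

    hidden_states: list of equal-length vectors (one per step)
    w_a, b_a: weight vector and scalar bias of the attention layer (assumed form)
    """
    # One scalar score per step: tanh(w_a . h + b_a)
    scores = [math.tanh(sum(w * h for w, h in zip(w_a, hs)) + b_a)
              for hs in hidden_states]
    # Softmax over the steps gives the attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention vector: weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(a * hs[d] for a, hs in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context
```

When all scores are equal the weights are uniform and the result reduces to a plain average, which makes the effect of learned, non-uniform weights easy to see by comparison.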

3) SOFTMAX CLASSIFIER
A softmax activation function in the output layer of a deep learning model acts as a softmax classifier. To classify each video clip into one of three classes (i.e., fantasy violence, sexual-nudity and safe), the proposed model integrates a softmax activation function in the last fully connected layer to determine the relative probability of the three output units. The softmax activation function ($\sigma$) is calculated as:
\[
\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K
\]
where $z$ is the input vector of the output layer and $K = 3$ is the number of classes.
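A small sketch of the softmax step follows. It uses the usual max-subtraction trick for numerical stability; the class order (safe, fantasy violence, sexual-nudity as classes 0, 1, 2) follows the Fig. 2 caption, while the helper names and the example logits are made up for illustration.

```python
import math

# Class indices as given in the Fig. 2 caption.
CLASS_NAMES = ["safe", "fantasy violence", "sexual-nudity"]


def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    m = max(logits)                        # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the most probable class name together with all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs
```

For example, `predict_class([0.2, 2.5, -1.0])` assigns the clip to the fantasy violence class because the second output unit has the largest score.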

D. OVERALL PROPOSED FRAMEWORK
The general architecture of the proposed framework for multiclass video classification is illustrated in Fig. 2. The input is a sequence of 22 frames with a fixed resolution of 224 × 224 × 3 pixels. These frames are processed through EfficientNet-B7 to extract features, generating a feature tensor of shape 22 × 7 × 7 × 2560. The high-level features are reshaped and directed to the two-layer stack of the BiLSTM network. A flattening layer is added to transform the feature representations into a 1-dimensional vector. Subsequently, a fully connected layer (or dense layer) of 4096 neurons with the rectified linear unit (ReLU) activation function is added. Because a fully connected layer connects all inputs of one layer to every activation unit of the next layer, it is prone to overfitting; therefore, a dropout of 0.3 is applied. Finally, the softmax output layer with 3 neurons gives the final classification scores. The algorithmic steps for training and validation of the proposed framework are presented in Algorithm 1.

IV. EXPERIMENTAL SETUP
A. DATASET
YouTube has a huge collection of videos and video metadata (i.e., likes, dislikes, view count, comments, etc.) that has been explored in numerous studies. Several benchmark video datasets have been introduced (including [72] and MSR-VTT [73]). However, none of these existing benchmarks targets the proposed video classification problem. The datasets closely related to our problem are the NPDI cartoon dataset [50], the Elsagate dataset [61], and the dataset of Singh et al. [62]. Comparatively, the NPDI dataset is the smallest, with only 900 images, and is not suitable for our deep learning-based video classification task. The Elsagate dataset is a publicly available dataset of cartoon videos from sensitive and non-sensitive classes. In this dataset, a whole video is considered either safe or unsafe, so the clean frames of an unsafe video are also labeled as unsafe. Secondly, it lacks complex behaviors of sensitive content. The videos in this dataset are targeted at toddlers. Lastly, the dataset of Singh et al. [62] includes Japanese anime videos intended for mature viewers. For this reason, a manually annotated video dataset is presented for identifying disturbing content. Because the intention is to focus on content for children, this study included cartoon videos only. The videos are searched and collected through the YouTube Data API using four popular cartoon names (Tom and Jerry, Gravity Falls, Simpsons and Sponge Bob) and ''cartoon'' as keywords. Once the list of videos is obtained and downloaded, the next step is video filtering, in which all irrelevant videos (such as non-cartoon videos) are discarded. The filtering process resulted in 1126 videos with durations ranging from 2 to 600 seconds. All collected videos are split into one-second clips using FFmpeg. Each video clip is manually annotated as belonging to either the safe, fantasy violence or sexual-nudity class. Clips showing any other act (i.e., extreme bloodshed or violence, smoking, use of drugs, frightening or horror scenes, etc.) are not included in this dataset.
We also intend to make this dataset publicly available for the research community. Table 1 summarizes the overall distribution of manually annotated cartoon videos according to the three classes.
[Algorithm 1. Training and validation of the proposed framework: validate the model at each epoch, then calculate accuracy, precision, recall and F1 score using the confusion matrix of that epoch.]
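The paper states only that FFmpeg was used to cut videos into one-second clips; the exact command is not given. The sketch below shows one plausible way to build such a command with FFmpeg's segment muxer. The option choice, the function name `ffmpeg_split_command` and the file names are assumptions for illustration.

```python
def ffmpeg_split_command(video_path, out_pattern, clip_seconds=1):
    """Build an FFmpeg command that cuts a video into fixed-length clips.

    Assumed approach: the segment muxer with forced keyframes at every
    clip boundary, so that cuts can land close to each boundary.
    """
    return [
        "ffmpeg",
        "-i", video_path,
        # Force a keyframe every `clip_seconds` so segments can start there.
        "-force_key_frames", f"expr:gte(t,n_forced*{clip_seconds})",
        "-f", "segment",                      # segment muxer
        "-segment_time", str(clip_seconds),   # target clip length in seconds
        "-reset_timestamps", "1",             # each clip starts at t = 0
        out_pattern,                          # e.g. "clips/video_%05d.mp4"
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Segment boundaries depend on keyframe placement, which may be one reason the first few frames of a clip can carry content from the previous clip, as noted in the preprocessing description.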

B. NETWORK HYPERPARAMETERS TUNING
1) FRAME SAMPLING
Due to the frame overlap issue, each video clip is sampled at a frame rate of 22 fps by ignoring some of the starting frames. Video clips containing fewer frames than the average frame rate (23-24 fps) are padded with the last frame of the same clip. The frame sampling rate of 22 fps is adopted in all training, validation and testing experiments of the neural network models.

2) BILSTM PARAMETER
Because an ImageNet pre-trained CNN model is employed for video frame feature extraction through transfer learning, hyperparameter tuning in the proposed framework is required only for the BiLSTM and the subsequent fully connected (dense) layer. Various design choices of bidirectional LSTM layers are evaluated, and experiments confirmed that stacking two bidirectional LSTM layers performs better in video classification than using a single layer or more than two layers. Apart from that, different numbers of hidden units (i.e., 64, 128, 256, 512) in each bidirectional LSTM layer are tested. For consistency, the same number of hidden units is used in both bidirectional layers. Experiments showed that the highest validation accuracy is achieved by using 128 hidden units in the two BiLSTM layers. It is also noticed that a fully connected (FC) layer of 4096 units with the ReLU activation function is better at selecting the most relevant labels in the classification layer than a fully connected layer of 2048 units with the same activation function. To avoid overfitting, a dropout layer (value = 0.3) is added before the last fully connected output layer (dense layer). Finally, the softmax classifier is applied in an output layer of 3 units to get the final probability scores. Table 2 lists the complete layer configuration of our best proposed model used in inappropriate video content detection and classification. This model has nine layers, each with its own output size and number of learnable parameters. Overall, the proposed model is trained with about 152 million trainable parameters (weights and biases) that are updated during the backpropagation process. All the parameters of the pre-trained EfficientNet-B7 model are non-trainable, which means that they are not optimized during model training.
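The reported figure of roughly 152 million trainable parameters can be checked with a short calculation. The sketch below assumes that each frame's 7 × 7 × 2560 feature map is flattened into a 125,440-dimensional vector, that both BiLSTM layers return full sequences, and that the standard LSTM parameter count (four gates, one bias vector per gate) applies; the function names are illustrative and Table 2 remains the authoritative layer listing.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * units * (input_dim + units + 1)


def bilstm_params(input_dim, units):
    # Forward and backward LSTMs have separate parameters.
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    return input_dim * units + units


def count_trainable_parameters(frames=22, feat=7 * 7 * 2560, units=128,
                               fc_units=4096, classes=3):
    """Parameter count of the BiLSTM head under the stated assumptions."""
    counts = {
        "bilstm_1": bilstm_params(feat, units),
        "bilstm_2": bilstm_params(2 * units, units),
        # Flattening 22 steps of 2 * 128 features gives 5632 inputs.
        "fc": dense_params(frames * 2 * units, fc_units),
        "output": dense_params(fc_units, classes),
    }
    counts["total"] = sum(counts.values())
    return counts
```

Under these assumptions the total comes to 152,061,955, which is consistent with the figure quoted in the text; almost 85% of it sits in the first BiLSTM layer because of the very wide per-frame input.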

3) COST FUNCTION AND OPTIMIZER PARAMETERS
The cost function measures the error between the predicted and actual values. The optimizer is responsible for reducing this error, i.e., the overall loss of the neural network model, to improve model accuracy. In this study, the categorical cross-entropy loss function is used for multiclass video classification, which is calculated as:

$$\text{Loss} = -\sum_{c=1}^{N} y_{i,c} \log\left(\hat{y}_{i,c}\right) \tag{14}$$

In equation (14), $c$ represents a particular class index from $N$ number of classes, $y_{i,c}$ is the binary indicator (0 or 1) which indicates whether $c$ is the actual class for instance $i$, and $\hat{y}_{i,c}$ represents the predicted probability of instance $i$ for class $c$. The Adam optimizer, an extension of the stochastic gradient descent (SGD) algorithm, is used with a learning rate of 1e-5 to minimize the error of the cost function in the proposed model.
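As a small worked example of the categorical cross-entropy loss for a single instance, the sketch below evaluates it for a one-hot label over three classes. The function name and the clipping constant `eps` are illustrative assumptions; deep learning frameworks apply their own numerical safeguards.

```python
import math
from typing import Sequence


def categorical_cross_entropy(
    y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-12
) -> float:
    """Cross-entropy of one instance: -sum_c y_c * log(p_c)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction vectors must have equal length")
    # eps (an assumption of this sketch) guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot label for the second of three classes, with softmax-like scores.
loss_confident = categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
loss_uncertain = categorical_cross_entropy([0, 1, 0], [0.4, 0.3, 0.3])
```

Because the label is one-hot, only the predicted probability of the true class contributes, so the loss shrinks as that probability approaches 1.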

4) TRAINING PARAMETER
Because of memory and computational constraints, the training dataset is loaded into memory in small subsets rather than all at once for model training.
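Loading the data in small subsets can be sketched with a simple generator. The batch size used here is a hypothetical value, since this section does not report one, and `iter_batches` is an illustrative helper rather than the authors' data loader.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive subsets so only one batch is materialized at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical example: 10 clip identifiers served in batches of 4.
clip_ids = list(range(10))
batch_sizes = [len(batch) for batch in iter_batches(clip_ids, batch_size=4)]
```

The final batch may be smaller than the others when the dataset size is not a multiple of the batch size.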

C. EXPERIMENTAL ENVIRONMENT
The proposed model is implemented in Python using Keras, a high-level deep learning application programming interface (API) that runs on top of Google's open-source TensorFlow library [74], [75]. The Google Colab integrated development environment (IDE) is used, which offers a cloud-based Jupyter notebook for writing and running Python code for deep learning architectures. For all experiments, Google Colab Pro+ is used to train the model, which offers 50 GB of T4 or P100 GPU and 255 GB of disk. Moreover, Google Drive with 2 TB of space is used to store the YouTube cartoon dataset. This dataset is partitioned with an 80:20 split such that 80% of the data is allocated for training and 20% for evaluation and testing of models.
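An 80:20 partition of this kind can be sketched as below. The shuffling, the fixed seed, and the function name are assumptions for illustration; the section does not state how clips were assigned to each portion or how the 20% was divided between evaluation and testing.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_80_20(
    items: Iterable[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into training and held-out parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example with 10 clip identifiers.
train_ids, held_out_ids = split_80_20(range(10))
```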

D. EVALUATION METRICS
The performance of the multiclass video classification models is evaluated by calculating the accuracy, precision, recall, and f1 score using confusion matrices. Accuracy is the ratio of the number of correct predictions for each class to the total number of predictions of all classes, and is calculated as:

$$\text{Accuracy} = \frac{\sum_{c=1}^{N} \left(T_{p_c} + T_{N_c}\right)}{\sum_{c=1}^{N} \left(T_{p_c} + T_{N_c} + F_{p_c} + F_{N_c}\right)} \tag{15}$$

In equation (15), $c$ represents a particular class index from $N$ number of classes, $T_p$ denotes the true positives, $T_N$ denotes the true negatives, $F_p$ denotes the false positives, and $F_N$ denotes the false negatives. Precision is the ratio of the total number of correct predictions of positive instances to the total number of predictions with positive instances. It is calculated as:

$$\text{Precision} = \frac{T_p}{T_p + F_p} \tag{16}$$

The recall (also known as sensitivity) is the ratio of the total number of correct predictions of positive instances to the total number of instances in an actual class. The recall and f1 score are calculated by using equations (17) and (18):

$$\text{Recall} = \frac{T_p}{T_p + F_N} \tag{17}$$

$$\text{f1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$
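The following sketch shows one way these metrics can be read off a multiclass confusion matrix. It assumes rows are actual classes and columns are predicted classes, uses made-up counts, and computes overall accuracy as the fraction of instances on the diagonal; how the per-class values are aggregated into the single scores reported in the tables is not specified here, so no averaging is applied.

```python
from typing import Dict, List


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of all instances that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """Precision, recall and f1 for each class (rows = actual, columns = predicted)."""
    n = len(cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append({"precision": precision, "recall": recall, "f1": f1})
    return results


# Illustrative counts only; these are not results from the paper.
example_cm = [[5, 1, 0],
              [1, 3, 1],
              [0, 0, 4]]
example_accuracy = accuracy(example_cm)
example_metrics = per_class_metrics(example_cm)
```

A low recall for a class means many of its instances end up off the diagonal in that class's row, which is the failure mode of most concern when unsafe content must not go undetected.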

V. RESULTS AND DISCUSSION
In this section, the results obtained through experimental evaluations of different machine learning and deep learning approaches for video classification are presented and discussed. Afterwards, the best-proposed approach is compared with existing state-of-the-art methods from literature.

A. ANALYSIS OF PRE-TRAINED CNN MODEL VARIANTS
At first, three pre-trained convolutional neural network models, namely Inception-V3, VGG-19, and EfficientNet-B7, are employed as video classifiers to determine how well these ImageNet pre-trained CNN models perform on our multiclass video classification problem. For each model, the last three layers of the pipeline are removed and replaced with a fully connected layer with the softmax activation function and three output nodes. The transfer learning approach is implemented such that the weights of all layers in the model are fixed except the last fully connected layer. After training each pre-trained convolutional neural network model using the transfer learning approach, as shown in Table 3, it is observed that the EfficientNet-B7 model performs comparatively better than VGG-19 and Inception-V3 on the YouTube cartoon video dataset. It achieved the highest recall score, which means that the EfficientNet-B7 model retrieves more relevant instances than the other two pre-trained CNN models. Hence, further experiments are carried out with EfficientNet-B7 as the base classifier.

B. ANALYSIS OF EFFICIENT-NET FEATURES WITH DIFFERENT CLASSIFIER VARIANTS
In this section, the performance of different classifiers trained on EfficientNet visual features is evaluated. For this purpose, some machine learning algorithms are considered for the video classification task. The experimental evaluation of Xu et al. [76] also showed that even a simple machine learning algorithm can play an effective role in video classification, provided the features are distinctive enough.
This study applied three machine learning algorithms, namely SVM, KNN, and random forest, as video classifiers by training them on EfficientNet features. The evaluation results in Table 4 show that among the three machine learning classifiers, SVM with the RBF kernel achieved the highest accuracy of 72.48% on EfficientNet visual features. It is followed by the random forest and KNN (neighbors = 3) classifiers with accuracy values of 68.69% and 60.12%, respectively. SVM with the RBF kernel also outperformed the random forest and KNN classifiers on the other evaluation metrics, i.e., precision, recall, and f1 score. Apart from the machine learning classifiers, an experiment is performed where EfficientNet-B7 itself is treated as the sole classifier by replacing the last three layers of the architecture with a fully connected layer (units = 512) followed by an output layer (activation = softmax) of three units. Two main methods, transfer learning and fine-tuning of EfficientNet-B7, are implemented for the video classification task. In transfer learning, the weights of all layers of EfficientNet are fixed except the last fully connected layer, whereas fine-tuning updates the weights of the entire model.
From Table 4, it can be observed that all machine learning algorithms perform worse than transfer learning or fine-tuning of the EfficientNet-B7 model. In comparison, the EfficientNet model using the transfer learning approach performed slightly better (accuracy = 89.07%) than the fine-tuned model (accuracy = 87.89%). The main reason for this behavior is that the ImageNet image classification dataset is much larger in scope (14 million images) than our self-curated cartoon video dataset (2.5 million video frames) used for fine-tuning the EfficientNet model. By further examining the evaluation results of the other classifier variants, it is noticed that although EfficientNet with the transfer learning method yields better results than the other classifiers, it still has a high ratio of false negatives (recall = 79.54%). Hence, training a single fully connected layer on top of the pre-trained CNN architecture is not sufficient. A deeper neural network classifier is required to effectively model the hidden sequences of video representations and return high precision and recall values for video classification. Thus, further experiments are conducted using EfficientNet-B7 as a feature extractor with other neural network models so that no child-inappropriate content goes undetected in video classification.

C. ANALYSIS OF EFFICIENT-NET WITH BiLSTM AND ATTENTION-BASED BiLSTM CLASSIFIER VARIANTS
The experiments in the previous sections revealed that the ImageNet pre-trained EfficientNet-B7 works better as a feature extractor, and that this architecture in conjunction with a deep learning algorithm can successfully detect and classify unsafe video content. The bidirectional LSTM, a supervised deep learning algorithm, is chosen for developing the deep learning-based framework because it preserves contextual information in both directions of time-series data, which makes it a suitable choice for our video classification problem. The experiments are conducted using a two-layer stack of BiLSTMs followed by fully connected (units = 4096, activation = ReLU), dropout (value = 0.3), and softmax (output units = 3) layers. Details of the complete architecture are given in section III. This study implemented and evaluated different numbers of hidden units (i.e., 64, 128, 256, and 512) in each BiLSTM layer. For simplicity and consistency, the same number of hidden units is used in both BiLSTM layers. You and Korhonen [77] reported that adding an attention
Table 5 presents the experimental results of the EfficientNet-BiLSTM models with and without the attention mechanism for different numbers of hidden units (i.e., 64, 128, 256, and 512) in each bidirectional LSTM layer of the proposed framework.
The first observation from the evaluation results in Table 5 is that all EfficientNet-BiLSTM networks perform comparatively better than the attention-based EfficientNet-BiLSTM networks. For the attention-based models, the f1 score improves when the number of hidden units in each BiLSTM layer is increased from 64 to 128 and 256, as this affects the trainable parameters of the network during backpropagation. However, it is also found that an excessive number of hidden units (i.e., units = 512) decreases the overall network performance. Secondly, the overall behavior of the attention mechanism-based neural network models differs from that of the models with no attention blocks. In the EfficientNet-BiLSTM network, increasing the hidden units in the BiLSTMs from 64 to 128 immediately resulted in the best-performing model of all experiments, with the highest f1 score (0.9267). However, adding more hidden units to the BiLSTMs (i.e., 256 and 512) drastically decreases performance. A detailed performance comparison between the EfficientNet-BiLSTM models with and without the attention mechanism is presented through confusion matrices for all three classes. The diagonal values represent the number of correctly classified instances in each class, whereas any off-diagonal value indicates incorrectly classified instances. The evaluation results of the EfficientNet-BiLSTM model with and without attention blocks for the YouTube cartoon video dataset are illustrated in Fig. 5. Overall, the EfficientNet and BiLSTM network with 128 hidden units in each bidirectional layer achieved the highest validation (f1 score = 0.9274) and testing (f1 score = 0.9267) scores.
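For readers unfamiliar with attention over recurrent outputs, the sketch below shows a generic attention pooling step: each time step's hidden state receives a score, the scores are normalized with softmax into a probability distribution, and the hidden states are averaged with those weights. The dot-product scoring and the names used are assumptions for illustration; this section does not specify the exact form of the attention block used in the experiments.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden_states: Sequence[Sequence[float]], weight_vector: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return attention weights per time step and the weighted context vector."""
    # Assumed scoring function: dot product with a learned vector.
    scores = [sum(h * w for h, w in zip(state, weight_vector)) for state in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(a * state[d] for a, state in zip(alphas, hidden_states)) for d in range(dim)
    ]
    return alphas, context


# Two time steps with 2-dimensional states; a zero weight vector gives equal attention.
alphas, context = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

The attention weights form a probability distribution over time steps, so the context vector is a convex combination of the BiLSTM outputs.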

D. PERFORMANCE COMPARISON WITH EXISTING STATE-OF-THE-ART CLASSIFICATION METHODS
We compare the performance of the proposed EfficientNet-BiLSTM model with existing state-of-the-art models and methods employed for inappropriate content classification using different YouTube data modalities. Table 6 summarizes the results and quality scores of the existing and proposed classification methods. It is worth noting that existing studies explored different YouTube modalities (i.e., text, audio, video, and metadata) for different classification tasks. The most common strategy in existing studies for unsafe content classification is to use pre-trained CNN models with either LSTM-based classifiers [10], [20], [22], [48], [62] or machine learning-based classifiers [50], [60], [61]. Compared with the approaches that use pre-trained CNN features with machine learning classifiers, our EfficientNet-BiLSTM classifier yielded higher accuracy than the GoogLeNet-SVM [60], fine-tuned NASNet-SVM [61], and EfficientNet-SVM [50] approaches by significant margins of 3.06%, 7.79%, and 10.1%, respectively. In comparison with base models using pre-trained CNNs and LSTMs, the Inception-V3 with LSTM approach [20] reported an f1 score of 0.828, which is much lower than that of our BiLSTM classifier variants with attention (f1 score = 0.9195) and without attention (f1 score = 0.9267). It is also worth mentioning that the ResNet-LSTM model in existing studies [10], [48] attained accuracy comparable to that of our proposed technique. This can be explained by the fact that those studies performed binary video classification, which is much simpler than multiclass video classification. Note that the proposed model still outperformed some existing approaches to multiclass video classification using VGG-LSTM-based models [22], [62], which shows that BiLSTM is highly robust in modeling time-series data. In addition, some studies [11], [17], [21] used simple convolutional neural networks and reported the lowest classification accuracy and f1 scores.
Hence, it is deduced that simple CNNs are not sufficient to understand the complexities of YouTube data modalities. Overall, the performance comparison showed that the proposed EfficientNet using BiLSTM (hidden units = 128) surpassed the existing studies in inappropriate video content detection and classification.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel deep learning-based framework is proposed for child-inappropriate video content detection and classification. Transfer learning using the EfficientNet-B7 architecture is employed to extract the features of videos. The extracted video features are processed through the BiLSTM network, where the model learns effective video representations and performs multiclass video classification. All evaluation experiments are performed using a manually annotated cartoon video dataset of 111,156 video clips collected from YouTube. The evaluation results indicated that the proposed EfficientNet-BiLSTM framework (with hidden units = 128) exhibits higher performance (accuracy = 95.66%) than the other models tested, including EfficientNet-FC, EfficientNet-SVM, EfficientNet-KNN, EfficientNet-Random Forest, and the attention mechanism-based EfficientNet-BiLSTM models (with hidden units = 64, 128, 256, and 512). Moreover, the performance comparison with existing state-of-the-art models also demonstrated that our BiLSTM-based framework surpassed other existing models and methods by achieving the highest recall score of 92.22%. The advantages of the proposed deep learning-based child-inappropriate video content detection system are as follows:
1) It accounts for real-time conditions by processing video at a speed of 22 fps using the EfficientNet-B7 and BiLSTM-based deep learning framework, which helps in filtering live-captured videos.
2) It can assist any video sharing platform to either remove a video containing unsafe clips or blur/hide any portion with unsettling frames.
3) It may also help in the development of parental control solutions on the Internet through plugins or browser extensions where child-unsafe content can be filtered automatically.
Furthermore, our methodology for detecting child-inappropriate content on YouTube is independent of YouTube video metadata, which can easily be altered by malicious uploaders to deceive audiences. In the future, we intend to combine the temporal stream, using optical flow frames, with the spatial stream of RGB frames to further improve model performance through a better understanding of the global representations of videos. We also aim to increase the number of classification labels to target the different types of child-inappropriate content in YouTube videos.