Deep Learning-Based Detection of Inappropriate Speech Content for Film Censorship

Audible content has become an influential medium for shaping personality and character, owing to the ease of access to a vast amount of audible material, whether standalone audio files or the audio tracks of online videos, movies, and television programs. There is therefore a pressing need to filter the inappropriate audible content of easily accessible videos and films that are likely to contain offensive speech. To this end, broadcasting and online video/audio platform companies hire considerable manpower to detect foul voices prior to censorship. This process is costly in terms of manpower, time, and financial resources, and detection is often inaccurate because human visual and auditory attention degrades during long, monotonous tasks. This paper therefore proposes an intelligent deep learning-based system for film censorship, offering fast and accurate detection and localization of offensive speech using advanced deep Convolutional Neural Networks (CNNs). A foul language dataset containing isolated word samples and continuous speech was collected, annotated, processed, and analyzed for the development of automated detection of inappropriate speech content. The results indicated the feasibility of the proposed system, which detected a high proportion of inappropriate spoken terms. The proposed system outperformed state-of-the-art baseline algorithms on the novel foul language dataset in terms of macro average AUC (93.85%), weighted average AUC (94.58%), and all other metrics, including F1-score. Additionally, the proposed acoustic system outperformed an ASR-based system for profanity detection on evaluation metrics including AUC, accuracy, precision, and F1-score, and was shown to be faster than human manual screening of audible content for film censorship.


I. INTRODUCTION
With the increased exposure to portable and immediate screen-time sources such as televisions, computers, and smartphones, filtering of audio and visual content is becoming crucial. Media commonly include offensive and sensitive content, e.g., foul language, nudity, and sexually explicit material, which could attract the attention of users in entertainment videos, games, and movies available through broadcasting channels or on online platforms. Tuttle [1] stated that most movies incorporate profanity, which could negatively affect society [2], and expressed the belief that this frequency would increase over the years. Broadcasting companies and media-sharing platforms are responsible for ensuring the appropriateness of the content shared with the public through their respective channels. In the case of language, censorship is a complex filtering process that delivers language content appropriate to consumers, subject to restrictions.

One of the earlier applications of KWS involved large-vocabulary continuous speech recognition (LVCSR) systems [13], [14]. Such systems were deployed to decode the speech signal so that keywords could be identified in the generated lattices (i.e., in the representations of the different likely phonetic-unit sequences given the speech signal). This approach is superior in the sense that it offers flexibility in handling changing or non-predefined keywords [15], [16], [17], although often with a performance drop when keywords are out of vocabulary [18].

The main weakness of LVCSR-based KWS systems lies in their computational complexity. Specifically, these systems require substantial computational resources to generate complex lattices [16], [19], which introduces latency [20]. This approach is therefore unsuitable for real-time speech recognition and monitoring. For applications such as voice assistants and machine wake-up words, the high computational and memory requirements also constrain the usage of LVCSR systems [19], [21], [22].

As deep learning techniques have matured over the years, the usage of deep spoken KWS systems [23], [24], [25], [26] has increased due to progressively improving efficiency and accuracy, in voice assistants for instance. The sequence of word posterior probabilities generated by deep neural networks is processed to identify the possible existence of keywords directly, without the intervention of any Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM). This deep KWS method has been attracting attention because the complexity of the DNN generating the posteriors, i.e., the acoustic model, can be adapted to the available computational resources [27], [28], [29].
A deep spoken keyword spotting system [30], [31], [32] typically contains three main blocks [9]: 1) a speech feature extractor that converts the input signal into a compact speech representation; 2) a deep learning-based acoustic model that generates posteriors over the keyword and filler (non-keyword) classes based on the speech features; and 3) a posterior handler that processes the temporal sequence of posteriors to determine the possible existence of keywords in the input signal.

Mel-scale-related features, low-precision features, learnable filter-bank features, and other features are the most relevant speech features used in deep KWS systems [9]. Speech features based on the perceptually motivated Mel-scale filter bank, e.g., log-Mel spectral coefficients and Mel-frequency cepstral coefficients (MFCCs), have been commonly utilized in the areas of ASR and KWS. Despite the many attempts to learn optimal, alternative representations from speech signals, Mel-scale-related features remain a safe, solid, and competitive choice to date [33].

In most deep KWS systems, the speech features are normalized to have zero mean and unit standard deviation before being input to the acoustic model, in order to stabilize and accelerate training and improve model generalization [34]. MFCCs with temporal context are used in [34], [35], [36], [37], and [38]; in particular, MFCCs are obtained by applying the discrete cosine transform to the log-Mel spectral coefficients.

In the posterior handling stage, the raw posteriors are commonly smoothed [56] prior to processing. Next, the smoothed word posteriors are utilized to decide whether a keyword is present, either through comparison with a sensitivity threshold [57] or by selecting the class with the highest posterior within a sliding time window [58]. One disadvantage of streaming-mode processing is that a false detection may occur when the same keyword realization is detected more than once in the smoothed posterior sequence, as consecutive input segments may cover parts of the same keyword realization. Post-processing techniques must be employed to avoid this problem [26].
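As a minimal illustration of the normalization step just described, the Python sketch below (with illustrative names; this is not code from the cited systems) standardizes feature matrices using statistics estimated on the training set:

```python
import numpy as np

def fit_normalizer(train_features):
    """Estimate per-coefficient mean and standard deviation on the
    training set (train_features: list of [n_frames, n_coeffs] arrays)."""
    stacked = np.concatenate(train_features, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0) + 1e-8  # guard against zero variance
    return mean, std

def normalize(features, mean, std):
    """Zero-mean, unit-variance normalization applied before the acoustic model."""
    return (features - mean) / std
```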

The current trend involves the usage of KWS for voice activation in voice assistants [59] and for voice control of hearing-assistive devices [54]. Hence, the literature on automated speech recognition models using deep learning techniques has mostly revolved around inoffensive language identification. For instance, conversational and read-speech datasets clear of profane utterances, such as LibriSpeech [60], Google's voice search traffic dataset [61], the Google commands dataset [52], a spoken digits dataset [62], and speech emotion datasets of conversational speech dialogues [63], [64], have been explored in recent years.

In 2020, [65] researched the efficiency of foul language detection using pre-trained CNNs (e.g., AlexNet and ResNet-50). The proposed solutions suffered from inaccurate detection and high computational cost due to the large number of network parameters, causing the system to fail to meet the requirements of real-time usage, i.e., real-time monitoring for profanity filtering in videos. Another work studied the categorization of isolated foul words versus isolated normal speech using a novel foul language dataset. Despite the acceptable performance on the tested dataset, the detection and localization performance of the proposed methods (CNN and RNN) within audio samples of other datasets consisting of continuous conversational speech was not explored [66], [67]. In brief, the feasibility of spoken profanity detection and localization within audio files has not been proven for real-time audio filtering applications.

This experiment was carried out on English profanities and their derivatives. The model utilizes the acoustic features of profanities to detect profane words and localize them within a continuous audio sample, unlike Automatic Speech Recognition (ASR) models, which transcribe any spoken words based on the language model used as part of the whole ASR system. However, the use of ASR systems incurs a huge computational cost owing to the use of a large dataset. Furthermore, ASR systems consist of several sequenced stages, including acoustic models and language models; the scenario of detecting and localizing inappropriate speech content within a continuous audio input additionally requires a text detection model. Consequently, ASR-based systems for the detection of profanities suffer from latency. Additionally, the use of sequenced models could degrade overall performance, as a failure in one stage propagates to the following stages.

The datasets of English profanities utilized in this study are described in this section. Next, the methodology is explained in detail. Firstly, the feature extraction process, applying Log-Mel spectrogram methods to the raw audio samples, is performed. Secondly, an E2E CNN is used for feature learning. Thirdly, posterior handling methods are applied for further processing. A short review of each method and its function is given in the following subsections.

The TAPAD dataset was augmented to increase the number of samples eightfold, from 4511 foul samples to 36088 foul samples, in order to enhance the models' robustness to noise, avoid over-fitting, and improve generalization. The augmented dataset was then used to train the proposed and baseline models. The augmentation was performed using the same approaches used for the MMUTM dataset, as described in the previous part.
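The augmentation procedures themselves are described in the earlier dataset section, which is not reproduced here; purely as a hedged illustration of how an eightfold expansion might be realized, the Python sketch below applies three common audio augmentations (additive noise, time shifting, and pitch shifting). These transforms are assumptions, not the paper's documented recipe:

```python
import numpy as np
import librosa

def augment_sample(y, sr=16000):
    """Produce seven augmented variants of one clip; together with the
    original, this yields the eightfold expansion reported for TAPAD."""
    variants = []
    for noise_level in (0.005, 0.01):                # additive white noise
        variants.append(y + noise_level * np.random.randn(len(y)))
    for shift in (int(0.1 * sr), -int(0.1 * sr)):    # +/- 100 ms time shift
        variants.append(np.roll(y, shift))
    for steps in (-2, -1, 1):                        # pitch shift (semitones)
        variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=steps))
    return variants
```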

This dataset is a novel, challenging database used only for testing and model evaluation purposes. It consists of six real-world audio tracks retrieved from videos available on the internet: four of the samples were retrieved from YouTube videos, while the other two are full films. Full films are used in the evaluation because this research is designed to propose a solution for real-time monitoring and censorship of inappropriate speech content in films. As described in Table 1, the total length of the testing videos is about four hours, seven minutes, and nineteen seconds, i.e., ~247.32 minutes in total. The testing dataset is dense in foul language within normal conversational speech, containing 1322 profanities, all of which also exist in the MMUTM and TAPAD training datasets.

The rate of foul words per minute is what makes this dataset challenging: it contains about 5.345 offensive words per minute. Additionally, this dataset consists of real-world material used directly to test and evaluate the trained model, which adds to its difficulty. The only pre-processing applied to this dataset concerned the audio file properties, which were set to a 16-kHz sampling rate, a single channel, and 16-bit PCM. This dataset was purposely created for this research. We therefore labeled the entire dataset by manually finding the foul words within each audio file and the corresponding timestamps at which each profane word occurs. The annotations of this dataset thus consist of the foul words and their timestamps, as this work aims to detect a foul word and localize it within a long audio file. The parts of the audio samples that were not labeled as foul are considered normal conversational speech by default.

Feature extraction was performed using 101 Log-Mel frequency spectrogram coefficients. Inappropriate and safe speech spectrogram analysis was performed using the following parameters: 0.03 s frame duration, 1 s segment duration, 0.015 s overlap window between frames, and 40 frequency bands. Furthermore, a lightweight model with small-sized filters was proposed in order to minimize the computational resource requirements and enable the target application of real-time film audio filtering. The generated Log-Mel spectrogram images were accordingly small, 40-by-101 in size, where 40 is the number of frequency bands (40 bands of 400 Hz each cover the 16-kHz range) and 101 is the number of spectrogram frames. An example of the raw signals of two profane words and their corresponding spectrograms is shown in Figure 1.
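A minimal sketch of this feature extraction step using librosa is given below; the parameter values are taken from the text, while the exact frame count (101) depends on the padding and framing conventions of the original implementation, which are not specified here:

```python
import numpy as np
import librosa

SR = 16000                    # 16-kHz sampling rate
FRAME_LEN = int(0.03 * SR)    # 0.03 s frame duration -> 480 samples
HOP_LEN = int(0.015 * SR)     # 0.015 s hop between frames -> 240 samples
N_MELS = 40                   # 40 Mel frequency bands

def logmel_image(segment):
    """Convert a 1-second audio segment into a log-Mel spectrogram image
    (40 bands by n_frames) for the CNN input."""
    mel = librosa.feature.melspectrogram(y=segment, sr=SR, n_fft=FRAME_LEN,
                                         hop_length=HOP_LEN, n_mels=N_MELS)
    return librosa.power_to_db(mel, ref=np.max)
```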
In the case of the supervised CNN model, E2E learning is performed to fine-tune the parameters of the whole CNN. Since spectrogram images and labels were available during the training process, supervised learning was applied. The CNN is composed of convolutional, fully connected, pooling, and batch normalization layers. To detect distinct signals, the CNN filters were passed over the input images along horizontal and vertical directions. The image feature portions of the signals were then mapped, and the classifiers were trained on the target task. The convolution layers extract features from small squares of input data, preserving the pixel relationships of the input images. A mathematical operation involving two inputs, i.e., the image matrix and a filter/kernel, was applied for the extraction.
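The exact layer configuration is given in Table 3; purely as an illustration of a lightweight CNN of this kind, the Keras sketch below combines small-filter convolution, batch normalization, pooling, and fully connected layers. All layer sizes here are assumptions, not the published architecture:

```python
from tensorflow.keras import layers, models

def build_lightweight_cnn(input_shape=(40, 101, 1), n_classes=2):
    """Lightweight small-filter CNN for foul/normal classification of
    log-Mel spectrogram images; layer widths are illustrative."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),  # class posteriors
    ])
```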

The pooling layers reduce the number of parameters of a given image. A common instance is spatial pooling, i.e., downsampling or sub-sampling, which retains vital information while reducing the dimensionality of each map. This pooling type can be categorized into (i) max pooling and (ii) average pooling. To classify outputs for the target task, an activation such as SoftMax or sigmoid can be applied. Table 3 shows the details of the proposed CNN model architecture.

Each windowed sub-sample was input to the CNN model, and class predictions were made based on the posterior probabilities, e.g., by selecting the class with the highest posterior probability, or by declaring a positive detection if a decision threshold was exceeded. The predicted class of the sub-sample was then assigned to the corresponding timestamps generated during the windowing phase. Localization of a recognized keyword within a long input audio sample is thus tied to the timestamps of the sub-samples containing an identified profane word. Although continuous speech was used as input, the windowing process means that inference on the windowed samples operates in static mode. This mode is used for its simplicity and because it produces fewer false positives than dynamic mode, which requires additional post-processing approaches to avoid an increased false positive rate [9].
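A minimal sketch of this static-mode, threshold-based decision over windowed sub-samples might look as follows (assuming the foul class occupies index 1 of the posterior vector; all names are illustrative):

```python
import numpy as np

def detect_profanity(model, segments, timestamps, threshold=0.5):
    """Static-mode inference: each windowed sub-sample is classified
    independently; timestamps of positive detections localize the
    profanity within the long input audio."""
    detections = []
    for image, (t_start, t_end) in zip(segments, timestamps):
        x = image[np.newaxis, ..., np.newaxis]       # shape (1, 40, n_frames, 1)
        posterior = model.predict(x, verbose=0)[0]
        if posterior[1] >= threshold:                # index 1 assumed = foul class
            detections.append((t_start, t_end, float(posterior[1])))
    return detections
```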

The experimental setup, performance metrics, and testing results of the proposed system are discussed in this section, together with the experimental settings and procedures used for the automated detection of profane speech content in film censorship. The architecture of the proposed foul language detector system is illustrated in Figure 2. Feature extraction was performed on isolated English-language samples to obtain the Log-Mel spectral features, which were then fed into the CNNs for model training. Similarly, test features were obtained from real, long audio files and used to evaluate the performance of the trained models.

The expected outputs of the system were the prediction probabilities of recognized profanities and the corresponding timestamps, allowing localization of each foul word detection within the test samples for film filtering. Hence, this work is not Automatic Speech Recognition (ASR), where speech content is transcribed into the corresponding words. Nor is the proposed work simple audio recognition, where a single spoken term drawn from the same dataset pool is fed into a model and classified into the corresponding label: the test samples used here are continuous, real-world audio inputs from outside the training dataset pool.

In the equations, N_tp, N_fp, N_fn, and N_total refer to the number of true positives, false positives, false negatives, and total samples over all segments, respectively. Furthermore, the performance was evaluated using the area under the curve (AUC) and the detection error trade-off (DET) curve. AUC was computed after plotting the receiver operating characteristic (ROC) curve, which uses FPR as the horizontal axis and TPR as the vertical axis; this measurement reflects the robustness of a binary classifier as the sensitivity threshold is varied. The DET curve, by contrast, is a graphical plot of the error rates of a binary classification system, i.e., a graph of the false rejection rate (FNR) against the false alarm rate (FPR).
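The equations referenced here are not reproduced in this excerpt; presumably they are the standard segment-level definitions, which, writing N_tn for the number of true negatives (a symbol assumed here), read:

```latex
\begin{align}
\mathrm{TPR} &= \frac{N_{tp}}{N_{tp}+N_{fn}}, &
\mathrm{FNR} &= \frac{N_{fn}}{N_{tp}+N_{fn}},\\
\mathrm{FPR} &= \frac{N_{fp}}{N_{fp}+N_{tn}}, &
\mathrm{Precision} &= \frac{N_{tp}}{N_{tp}+N_{fp}},\\
\mathrm{Accuracy} &= \frac{N_{total}-N_{fp}-N_{fn}}{N_{total}}, &
\mathrm{F1} &= \frac{2\,\mathrm{Precision}\cdot\mathrm{TPR}}{\mathrm{Precision}+\mathrm{TPR}}.
\end{align}
```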

The audio-based foul word recognition model proposed in this research was designed for automated censorship of the audio channels of films. The experimental results were obtained by running the novel test dataset, comprising continuous video files with high rates of inappropriate words per minute, through the trained models. The performance of the model was measured using metrics such as accuracy, F1-score, TPR, FPR, and AUC. The results below discuss the model's performance as a function of segment length, probability threshold, and processing time.

The experiment includes a windowing and segmentation process for the lengthy continuous test samples before the feature extraction, inference, and detection stages. The segment length therefore affects the detection and evaluation metrics. Hence, all the test samples were evaluated using three different segment lengths of 0.3, 0.4, and 0.5 seconds to find the segment length that produces the best system metrics for the detection of foul language. Although all the test samples were tested with the different segment lengths, this paper demonstrates the effect of segment length on foul language detection within continuous audio by highlighting the performance metrics of two samples, sample 1 and sample 2, at a single probability threshold (th = 0.50) and the three segment lengths. Table 4 and Table 5 present the foul language detection model performance on the two samples, while Figure 3 and Figure 4 highlight their performance in terms of average accuracy and F1-score, respectively.

Following Table 4 and Table 5, the proposed model performed well in the detection of foul language, with high average accuracy, TPR, precision, and F1-score and low FNR and FPR. For example, sample 1 achieved 20.75%, 11.32%, and 3.83% FNR for segment lengths of 0.3, 0.4, and 0.5 seconds, respectively. As shown in Table 4 and Table 5, all the performance metrics improved with larger segment lengths.

For example, TPR/recall and precision increased drastically with longer window lengths, while FNR and FPR dropped sharply at the 0.5-second segment length.

TABLE 6. Overlap effect on performance metrics of sample 1 at 0.5 confidence score and 0.5 segment length.

TABLE 7. Overlap effect on performance metrics of sample 2 at 0.5 confidence score and 0.5 segment length.

The F1-score and average accuracy charts show that increasing the segment length improves the model performance metrics, i.e., accuracy, recall, precision, and F1-score. Consequently, the proposed system achieved its best profane language detection performance at the 0.5-second segment length, where the model achieved high F1-scores of 95.33% and 85.93% for the sample 1 and sample 2 test samples, respectively. Similarly, the model produced high average accuracies of 98.31% and 95.71% for sample 1 and sample 2, respectively. Therefore, 0.5 seconds is considered the optimal window length for the developed system, and the detailed results that follow were obtained using this window duration.

The experiment includes an automated windowing and segmentation process for the continuous test samples. The fixed segment length therefore affects the detection and evaluation metrics for words longer than the window length, in addition to keywords that might be split across two segments by the automated, fixed windowing process. Hence, an overlap time was introduced to mitigate the errors arising from this issue and to find the optimal performance for profanity detection in a continuous sample under automated, fixed windowing. Although all the test samples were tested with and without overlap time, this paper demonstrates the effect of overlap length on foul word detection within continuous audio by detailing the performance metrics of two samples, sample 1 and sample 2, at a single probability threshold (th = 0.50) and a 0.5-second segment length. As shown in Table 6 and Table 7, all the performance metrics improved with the overlap applied.

The performance assessment of the proposed models on the detection of foul language for the six test samples is presented in Table 8 through Table 13, for sample 1 through sample 6. Although the model was tested using thresholds from zero through one, the tables present the 0.1, 0.25, and 0.5 through 0.9 probability thresholds, reflecting the common interest in performance above the conventional 0.5 confidence score. However, all the thresholds starting from zero were used when evaluating the model via the ROC and DET curves highlighted in the subsequent section. The results of all samples are presented individually because different real-world samples exhibit different characteristics, such as audio quality, noise, pitch, and speed, and these characteristics produce different model responses in terms of target keyword detection. Therefore, the samples contribute to the average figures according to the weight of the foul words within each sample relative to the total foul words in the whole dataset. The average metrics were computed for all thresholds; here we highlight the same thresholds as in the threshold-analysis tables, so that the model's varying performance can be examined across thresholds, for instance how precision is affected by the threshold. An operating threshold can then be chosen according to the metrics to be optimized for the detection of profanities, such as minimizing FNR or minimizing FPR.
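As a concrete sketch of the fixed windowing with overlap described at the start of this subsection (the overlap fraction shown is illustrative; its value is not fixed in this excerpt):

```python
def window_audio(y, sr=16000, seg_len=0.5, overlap=0.25):
    """Fixed-length segmentation with overlapping windows so that a
    keyword split across a segment boundary is still fully covered by
    some window. Returns segments and their (start, end) timestamps."""
    width = int(seg_len * sr)
    step = int(seg_len * (1.0 - overlap) * sr)
    segments, timestamps = [], []
    for start in range(0, max(len(y) - width, 0) + 1, step):
        segments.append(y[start:start + width])
        timestamps.append((start / sr, (start + width) / sr))
    return segments, timestamps
```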

Looking at the average metrics, it is noteworthy that increasing the threshold causes a slight drop in average accuracy (from 97.47% at the 0.1 threshold to 95.34% at the 0.9 threshold, weighted average) and in TPR/recall (from 95.68% to 92.16%, weighted average); correspondingly, FNR increases with the threshold (from 4.32% to 7.84%, weighted average). In contrast, precision increases markedly with the threshold (from 85.12% at 0.1 to 93.75% at 0.9, weighted average), and the false detection rate (FPR) drops substantially (from 14.88% at 0.1 to 6.25% at 0.9, weighted average).

On the other hand, the F1-score, which is calculated from precision and recall, varies with the threshold between 89.96% and 93.25%. The choice of operating point depends on the rates the user wishes to achieve; for example, if the F1-score matters more than all other metrics, the 0.7 confidence score is the best operating point, as it yields the highest weighted-average F1-score of around 93.25%. The ROC curve, AUC, and DET curve offer another way of visualizing the performance of the model at all operating points. Figure 5 presents the ROC curves for all samples and the averaged figures, in which the operating curves and the relationship between TPR and FPR can be visually interpreted.

Table 17 shows the inference time of the CNN model and the overall process time of the system from the input of continuous speech, through segmentation, to detection and time estimation. The proposed CNN has an inference time of 2.63 ms (0.00263 seconds), measured from the time step at which the spectrogram image sample is applied at the input to the time step of the model's prediction. This speed is due to the small number of parameters of the lightweight CNN, with its small filters and few layers. According to Table 17, the average processing time per second of long audio, defined as the average time taken to process one second of the input sample through all steps from segmentation to automated detection, is 0.46 seconds. This means each second of long audio is processed completely in 0.46 seconds, which makes the process real-time and even faster than human manual detection, filtering, and censorship of inappropriate speech content in films. For example, sample 1 is 371 seconds long in total, yet the average time taken to pass it through the developed automated detection process for film censorship is around 170.66 seconds, less than half the length of the original sample. The proposed system therefore saves time compared with the manual detection and censorship process.

Comparing Table 20 and Table 14, it can be noted that the current CNN model outperforms the baseline 2 model on all the evaluation metrics. Based on the macro average metrics for both models, the current model outperformed baseline 2 by around 6% in average accuracy, 2% to 5% in recall/TPR and FNR, 1% to 3% in precision, and about 1% to 3% in F1-score. Table 21 presents the weighted average metrics of baseline 2.

In Table 22, we show the outperforming results of the proposed system in terms of the AUC metric, which is significantly better than the baseline systems: the proposed model outperformed the baseline 1 algorithms by 2.55% in macro average AUC and 0.64% in weighted average AUC, and outperformed the baseline 2 algorithms by 4.58% in macro average AUC and 2.22% in weighted average AUC. Thus, the current model outperformed the baseline models in terms of AUC and all other metrics.

Given the scarcity of experiments on inappropriate speech content detection, the first part of this subsection presented a comparative analysis against acoustic-based systems for profanity detection.
This part now benchmarks the current work against previous work that uses ASR systems for the detection of profanities. Recent research proposed a solution for analyzing video that identifies profane content through text detection approaches after the videos are transcribed by an ASR system [70]. The audio samples were extracted from the input video and converted into text using a Speech-to-Text library for the detection and localization of profane words, and the resulting text was checked against a profanity word list. The proposed system was tested with 50 videos collected from various sources such as Facebook and YouTube; additionally, some of the videos were made by the authors and contained profane keywords. The total length of the test samples was only 1734 seconds (~28.9 minutes). The profanity detection system built on ASR and text detection approaches achieved an accuracy of around 85.03% on the reported dataset [70].
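As a rough sketch of such a two-stage ASR-plus-text-detection pipeline, using the SpeechRecognition package for transcription (this is an illustrative reconstruction, not the implementation of [70]; the profanity list must be supplied):

```python
import speech_recognition as sr

PROFANITY_LIST = {"..."}  # supply the actual word list; withheld here

def asr_profanity_check(wav_path):
    """Two-stage ASR-based detection: transcribe the audio, then match
    the transcript against a profanity word list. A failure in the ASR
    stage propagates: nothing can be detected downstream."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        transcript = recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return []
    return [word for word in transcript.split() if word in PROFANITY_LIST]
```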

The reported ASR-based system, comprising two stages, a Speech-to-Text phase and a text detection approach, was retrained on the list of profanities proposed in this work in order to benchmark the current work against an ASR-based system. The ASR-based system was then tested using the six video samples. Note that the proposed system requires a single inference of the acoustic model, while the ASR-based system requires two inferences, for the Speech-to-Text and text detection models.

This experiment was performed on a particular dataset of spoken English profane words, with positive outcomes on any derivation of the profanities. Nevertheless, the performance of the proposed system may vary with a different range of English verbal words or spoken utterances from a different language, since the proposed model uses the direct acoustic features of utterances for detection, unlike ASR systems, where spoken terms are transcribed based on the language models used in the ASR pipeline, accommodating a wider range of keywords. However, ASR models suffer from two major issues, the need for a large dataset and a large computational cost, both of which are addressed in this work in developing the profane word detector. Additionally, ASR systems use several model stages, such as acoustic models and language models, and in this context an additional text detector must be applied to locate the inappropriate speech content. Therefore, ASR-based systems for the detection of profane words suffer from a drop in performance metrics due to the sequenced models, as a failure in one stage leads to a performance drop in the following stage.

The proposed CNN model for profanity detection and censorship was further analyzed and compared with four different pre-trained CNN models, namely MobileNet [71], Inception-v3 [72], AlexNet [73], and ResNet-50 [74], as detailed in Table 24 and Table 25.

It is recommended that several future developments be considered for censorship and film-rating research. The context in which a keyword is uttered is crucial to defining a set of words that could represent the keyword; therefore, considering the sequence and context of uttered words is recommended for future work.

This research implemented a CNN model for the detection and localization of spoken foul language in continuous speech samples, tested in static keyword detection mode, for automated video/audio/film censorship. The current work utilizes a novel dataset of foul language to train the model: the MMUTM and TAPAD datasets were manually labeled with two annotations (Foul vs. Normal). The CNN model was trained to classify the labels of pre-segmented isolated samples, whereas the model was tested on continuous incoming audio samples for offensive language identification. The novel test dataset consists of several real-world video samples with a high rate of offensive words per minute. The model input was the extracted features of the audio samples in the form of Log-Mel spectrogram images, while the output of the whole system comprises the detected foul words and the timestamps of the profanity occurrences within lengthy audio samples.

The proposed system performed differently depending on the properties and characteristics of the test samples. Nevertheless, the overall foul language detection system performed well, with macro average accuracy ranging from 95.11% to 97.67% and weighted average accuracy from 95.34% to 97.47% across all threshold operating points. Furthermore, the reported F1-score showed a balance between sensitivity and specificity of the proposed CNN, ranging from 88.54% to 90.45% macro averaged and from 89.96% to 92.91% weighted averaged. Additionally, the current model achieved a high AUC for the ROC curve, around 93.85% macro averaged and 94.58% weighted averaged.

The proposed lightweight CNN model was benchmarked against two baseline models that use only acoustic features on the novel offensive language dataset, and it outperformed both acoustic baselines on the performance metrics. In terms of AUC, the proposed model outperformed the baseline 1 algorithms by 2.55% in macro average AUC and 0.64% in weighted average AUC, and outperformed the baseline 2 model by 4.58% in macro average AUC and 2.22% in weighted average AUC. Thus, the current model outperformed the baseline models in terms of AUC and all other metrics. Additionally, the proposed acoustic system outperformed the ASR-based system for profanity detection on evaluation metrics including AUC, accuracy, precision, and F1-score.

This work also demonstrated the efficiency of the proposed system for audible speech content processing and detection in terms of inference and overall processing speed. The proposed CNN has an inference time of 2.63 ms (0.00263 seconds), which is attributable to the lightweight structure of the developed model. Furthermore, the average time taken to process each second of the input sample through all steps, from segmentation to automated detection, is 0.46 seconds; each second of long audio is thus processed completely in 0.46 seconds, making the process real-time and even faster than human manual detection, filtering, and censorship of inappropriate speech content in films. This speed is attributable to the lightweight CNN architecture, which makes the processing and inference fast and suitable for content screening, filtering, and censorship.