Movie Popularity and Target Audience Prediction Using the Content-Based Recommender System

Movies are an integral part of everyday entertainment. The worldwide movie industry is one of the fastest-growing and most significant industries, and it captures the attention of people of all ages. Recent studies have observed that only a few movies achieve success. This uncertainty puts immense pressure on film production stakeholders. Moviemakers and researchers consistently see the need for expert systems that predict a movie's probability of success before its production with reasonable accuracy. Most research has predicted movie popularity at the post-production stage. To help moviemakers assess an upcoming film and make necessary changes, the prediction must be made at an early stage of production and must provide specific observations about the upcoming movie. This study proposes a content-based (CB) movie recommendation system (RS) using preliminary movie features such as genre, cast, director, keywords, and movie description. Using the RS output and the rating and voting information of similar movies, we created a new feature set and proposed a CNN deep learning (DL) model to build a multiclass movie popularity prediction system. We also proposed a system to predict the popularity of the upcoming movie among different audience groups. We divided the audience into four age groups: junior, teenage, mid-age, and senior. This study used publicly available Internet Movie Database (IMDb) data and The Movie Database (TMDb) data. We implemented a multiclass classification model and achieved 96.8% accuracy, which outperforms all the benchmark models. This study highlights the potential of predictive and prescriptive data analytics in information systems to support industry decisions.


I. INTRODUCTION
The worldwide movie industry is a fast-moving, revenue-generating industry in which billions of dollars are involved. A large number of people are associated with this industry, and it requires massive investment, both qualitative and quantitative. In 2019, the total box office revenue of the United States and Canada was $11.32 billion [59]. In reality, however, few movies achieve success. Film producers and researchers constantly feel it essential to have expert systems that predict a movie's chance of success ahead of its production with appropriate accuracy. The movie industry is massive and diversified. A significant number of parameters from different dimensions are involved in creating a movie. Predicting an upcoming movie's success or degree of success is a highly complex task. Research works [32], [33], [60] have been conducted to predict movie popularity. Earlier, several works addressed post-production or post-release forecasting. However, such forecasts are of limited benefit because the investors have already committed their funds to the film production. Prediction at the pre-production or early production stage with satisfactory accuracy is beneficial for securing investment. A forecast made soon after the cast, director, and storyline have been finalized would assist the investor in making a financial decision.
The associate editor coordinating the review of this manuscript and approving it for publication was Haiyong Zheng.
After a rigorous study, we have seen significant research on movie hit prediction before the official release. Predictions performed shortly before [32], [56] or following [33], [60], [34] the official release (the last stage in film production) may have additional data to use and produce a more precise prediction [66]. Still, they come too late for investors to make any critical decision. Early-stage (production) forecasting [39], [40] of movie success is the most beneficial. Very little work has been performed to forecast movie success at an early stage of movie production, and the accuracy of early-stage forecasts in previous works is not particularly good. Most works focus only on the success probability of the upcoming movie. Some of them treat the problem as a binary problem (hit/flop), and others treat it as a multiclass problem. Most of the time, moviemakers start creating a new movie while targeting a specific audience group or groups. Audience age is one of the essential criteria for the target audience [61]. Some movies are created for the junior audience group, some target teenage audiences, some target the mid-age and senior audience groups, and some movies are for all. Moviemakers would benefit if we could predict whether the upcoming movie would be popular among the target audience group and measure its influence among all the age groups at the early stage of production. The moviemaker could then make necessary changes if needed. Movie hit forecasting and target audience prediction of the upcoming movie at the early stage of production are interrelated and meaningful. The outcome of this work could reduce the risk involved in the movie industry.
Our research proposes a system to predict movie success at an early stage of movie production and to perform movie target audience prediction. Both tasks are carried out using the CB movie recommendation system. Our research work can be divided into three major parts. In our framework, the first module is a movie recommendation system [2]. We have considered only five essential features of the movie (genre, cast, director, keywords, and movie description) as the feature set to build the recommendation system. All these basic features are available at the early stage of movie production. The proposed recommendation system provides a set of movies similar to a given upcoming movie. The second module accepts the similar movies from the first module, uses their rating and voting information, and creates a novel feature set. Next, we propose a CNN model and use the newly created data set to predict the movie's popularity. We have divided the popularity of a movie into six classes: super-duper hit (SDH), super hit (SH), hit (H), above average (AA), average (A), and flop (F). In the third and final phase, we build a module to estimate the target audience. We have divided the audience into four age groups: junior, teenage, mid-age, and senior. We used the similar-movie set from the first module and created a new feature set for each age group from the movie rating and voting information. Using the new data set, we built a model based on fuzzy c-means and cosine similarity to estimate the popularity of the upcoming movie among all age groups.
The primary contributions of this study are as follows.
1. This research work is among the first studies to use a recommendation system to predict the popularity of an upcoming movie at its early production stage.
2. We propose a model to estimate the popularity of an upcoming movie among different age groups using fuzzy c-means and cosine distance.
The rest of this paper is arranged as follows. Section II summarizes the related work on RS and film forecasting. Section III outlines our proposed framework in detail, illustrates all the features, and introduces the movie hit success criterion. Section IV describes the proposed model in detail. Section V presents the experimental results together with a comparative study of other statistical models. Finally, Section VI discusses the research contributions, their limitations, and further research directions.

II. RELATED WORKS
In this section, we present a detailed survey of the past related works. Our work is divided into three interrelated parts. We first highlight some of the previously proposed recommendation system models and then discuss movie popularity prediction. In the proposed work, movie recommendation and movie popularity are interrelated: our proposed work uses recommended movies to predict the upcoming movie's popularity and to predict the movie's target audience. Recommendation systems are primarily divided into three types [1]- [3], [18]: collaborative filtering (CF) [4]- [8], [17], content-based filtering (CBF) [8], [9], [11]- [13], [31], and hybrid filtering [14]- [16]. CF is a procedure that filters the items a user might prefer based on the responses of similar users. It searches a broad group of people and finds a smaller circle of users with tastes comparable to those of a particular user. It looks at the items they like and combines them to form a ranked list of suggestions.

A. CONTENT-BASED RECOMMENDATION
In some application situations, the recommended items must be content-wise comparable to a reference item, e.g., for similar item recommendations [19]. Content information also allows better descriptions to be produced [20], which is becoming increasingly important in fair and open recommender systems. Content-based recommender systems utilize the metadata of items or textual items [21]. The Linked Open Data (LOD) initiative suggests new ideas to extend item information with outside knowledge sources [22], [23]. Movie recommendation using CBF is a widely used research paradigm. A content-based movie recommender has been proposed in which user and movie features are used [24] to predict movie ratings from the movie feature set. Content-based movie recommender systems consider different movie attributes, such as the movie genre, the names of the actors, the names of the directors, and other attributes, to build a recommender system. The movie genre that users prefer to watch has been used to build a recommender system on the MovieLens dataset [25]. Correlations between content or attributes are measured to find the similarity between items. A multi-attribute network has been proposed to calculate the correlations and recommend items to users [26]; the similarity between directly or indirectly correlated items is calculated using network analysis. A hybrid model has been proposed in which genomic tags of the movie are used with CBF to recommend movies with similar tastes [27]. The proposed model reduces the computational complexity by using principal component analysis (PCA) and Pearson correlation to reduce redundant tags and discard tags with low variance [64]. Another work bridges the gap between high-level and low-level features [28]; it uses low-level features such as colors, motion, and lighting from the film to build a hybrid recommendation system.
A new movie recommendation system has been proposed that addresses the cold start problem for new items [29]. It uses audio and visual descriptions extracted from movie videos to develop a video genome. A hybrid movie recommendation system has been proposed that incorporates sentiment analysis with collaborative filtering (CF) [30]. Movie tweets from microblogging sites have been used to understand public sentiment, current trends, and user response. The sparsity of data is one of the significant challenges for recommendation system algorithms. In one work, a generative adversarial network (GAN) is used to deal with the sparsity of review and rating data: the authors propose Rating and Review Generative Adversarial Networks (RRGAN), an innovative framework for recommendation [67]. GANs have also been used to rank movies according to the preferences of the users; LambdaGAN has been used for recommending top-N movies [68].

B. MOVIE HIT PREDICTION
Movie hit forecasting is a well-known research problem. The problem is broadly divided into two primary groups based on forecasting time. Significant work has been proposed where predictions were made very late in the production stage, before the movie's release, or just after the movie's official release [32]- [38], [60]. Limited work has been carried out where movie hit forecasting is executed at the initial or early stage of production [39]- [43]. Late prediction can draw on more movie attributes to increase the forecasting accuracy. For early prediction, on the other hand, only a few attributes or features are available, which makes the problem much more difficult.
One of the most significant parts of our problem is defining the success of a movie. No benchmark model exists that defines the success of a movie. A few works have focused on total box office revenue [40], [43]- [48], while some have adopted the number of admissions [34], [49]; the underlying assumption is that revenue or the number of admissions is the parameter of success. Some of the earlier works measured success as profitability, either as a numeric value of revenue [50] or as the return on investment (ROI) [39], [51], [52]. Several works divided movies into two classes (success or not) and used binary classification; some treated the forecast as a multiclass classification problem and tried to classify films into multiple discrete classes [47]. Predictions are also made on continuous values of profit metrics [32], [39], [53], with the values of these metrics log-transformed in some works [48], [50], [54].
Movie hit forecasting has relied on machine learning models, since these learning techniques have produced prediction models with reasonable levels of accuracy [55], [52], [56], [65]. For instance, [56] presented machine learning models such as discriminant analysis, logistic regression (LR), decision tree (DT), and artificial neural network (ANN) and measured their performance in predicting a movie's success. The authors of [57] proposed a multi-layer back-propagation architecture that improves on the neural network model proposed in [56]. The authors of [58] obtained movie data from websites such as Rotten Tomatoes and IMDb and applied machine learning methods such as support vector machine (SVM) and linear and logistic regression. The authors of [38] introduced the Cinema Ensemble Model (CEM), which comprises seven machine learning models and focuses on attribute selection, to enhance forecast accuracy. The research in [40] proposed a few new features to predict the box-office success of a movie and adopted a voting system that predicts by averaging the output of various machine learning classifiers.
In most works, an RS computes similar movies for an existing movie. Recommendation systems for upcoming movies are rare. A content-based movie RS could be used to find movies similar to an upcoming movie, and the box-office information of all these similar movies could be used to analyze the upcoming movie. Our research work uses a content-based movie RS to find movies similar to an upcoming movie. We analyze the output of the RS and build a model to forecast movie popularity. We then move one step further and also predict the target audience from the information provided by the RS. Most research works treat the movie hit prediction problem as a binary classification problem. Very few works [40] treat it as a multiclass problem, and they have sacrificed accuracy in the process. Our research work classifies movie popularity into six different classes and achieves high accuracy. Most works target only the movie popularity prediction problem; they predict only the popularity of the upcoming movie. Research on forecasting the target audience of an upcoming movie is rare.

III. MATERIALS AND METHODS
This research study aims to develop a model that predicts movie popularity and its age-wise preference using movie recommendations. Our first objective is to classify movie popularity into the six classes {SDH, SH, H, AA, A, F} at the early movie production stage. Our next objective is to find the movie's target audience and determine its influence on audience groups. We group the audience into four age demographics {Junior, Teenage, Mid-Age, Senior}. The final output of the system is an age-wise movie popularity prediction.
In this study, we use a content-based movie recommendation system to find similar movies. In the next step, we use the voting information and rating of each recommended movie. All these data are used to train the 1-D CNN deep learning model. The output of the CNN model is the classification of the film into one of six classes, which predicts the scale of popularity of the movie.
The third module of the system takes the recommended movie information and the age-wise voting information. We have grouped the voting information into four age groups. We use fuzzy c-means to calculate the movie preference for each age group.
The framework of our work has three major steps, which are listed below (Fig. 1).
1. Acquire movie data and intrinsic movie features from the TMDb data set and compute similar movies using a content-based movie recommendation system.
2. Use the similar-movie information and voting data from the IMDb data set to predict the movie popularity with a deep learning approach.
3. Compute the target audience prediction using fuzzy c-means.
• We have introduced a new data set containing voting and rating information of the recommended movies, used to predict the movie popularity class.
• We have proposed a new parameter, called the global centroid, for each age group.
• We have also presented a new approach for estimating the interest in, or popularity of, an upcoming movie among distinct age groups.

A. DATASET DESCRIPTION
The proposed system has three modules. The first module is a content-based (CB) recommendation system (RS), which uses the TMDb database. The second and third modules make use of the IMDb database. Multiple public databases are available, and they are widely used for movie recommendation and movie popularity prediction systems. The proposed content-based movie recommendation system uses the tmdb_5000_movies and tmdb_5000_credits datasets, which are publicly available [62]. The movie hit prediction and target audience prediction modules make use of the IMDb rating dataset. Using two different databases (TMDb and IMDb) creates synchronization problems, since they use two separate movie IDs. The proposed system uses the link small dataset to merge the two databases.
The tmdb_5000_movie data set consists of 4803 movies and contains 20 movie attributes. It covers movies released from 1916 to 2017. From the 20 attributes, we have chosen only 4: keywords, overview, tagline, and genre. Attributes like budget, revenue, and release year are very much dependent on time, and since we are considering movies spanning more than a hundred years, these attributes could not be selected. Other attributes like runtime, spoken language, original title, and homepage are also irrelevant to the proposed system.
The tmdb_5000_credits data set consists of 4813 movies. The data set has 4 attributes: movie_id, title, cast, and crew. We have extracted the names of the three primary cast members and each movie's director from the data set. The director is one of the most influential figures of a movie, and the first 3 cast members are similarly critical. The movie attributes selected for the proposed content-based movie system are shown in Table 1. The IMDb database is publicly available for research work. The imdb_rating dataset used in the proposed work consists of rating data for 85856 movies [63]. The data set contains the rating and voting details of each movie. It has a total of 49 attributes, which include gender-wise and age-demography-wise voting and rating details. The TMDb data set uses tmdb_ID and the IMDb data set uses imdb_id, so they use two different movie IDs. The links data set maps one movie ID to the other; without it, using two different movie databases would create a synchronization problem. Multiple databases are used in the proposed work, and each data set has a different size. After combining all the data sets and synchronizing the IMDB_ID with the tmdb_ID, the number of movies used in the work is 3100.

B. DATASET PREPROCESSING
The attributes used in the proposed CB movie recommendation system are listed in Table 1. The values of attributes like genre, cast, director, and keywords are stored as JSON data. We convert the JSON data into strings, which removes all the metadata and keeps only the attribute values. Next, for each attribute, one list is generated with all the unique values of that attribute found in the dataset. One-hot encoding is then applied to all attributes for each movie.
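The encoding step described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes each attribute cell is a JSON list of objects with a `name` field, and the helper names (`names_from_json`, `build_vocabulary`, `multi_hot`) and the toy ids are hypothetical.

```python
import json

def names_from_json(cell):
    """Return the 'name' values from a JSON list string, dropping ids and other metadata."""
    return [entry["name"] for entry in json.loads(cell)]

def build_vocabulary(rows):
    """Sorted list of every unique value seen across all movies."""
    return sorted({value for values in rows for value in values})

def multi_hot(values, vocabulary):
    """Binary vector with a 1 for each vocabulary entry present in values."""
    present = set(values)
    return [1 if term in present else 0 for term in vocabulary]

# Toy genre cells (ids are made up for illustration).
raw_genres = [
    '[{"id": 1, "name": "Action"}, {"id": 2, "name": "Thriller"}]',
    '[{"id": 3, "name": "Drama"}]',
    '[{"id": 1, "name": "Action"}, {"id": 3, "name": "Drama"}]',
]

genres = [names_from_json(cell) for cell in raw_genres]
vocabulary = build_vocabulary(genres)
encoded = [multi_hot(g, vocabulary) for g in genres]
```

Because a movie can carry several genres, cast members, or keywords, each row has one 1 per value present; the same three helpers can be reused for every JSON-valued attribute.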
The tagline of a movie is appended to the overview attribute. The combined text goes through several steps before the similarity between two descriptions is computed.
Step-1: Clean the data. This reduces sentences, paragraphs, and ultimately documents to a set of single words.
Step-2: Apply stemming, which reduces inflected words to their word stem, root, or origin form.
Step-3: Remove stop words from the set of words. An English word dictionary has been used to eliminate stop words.
Step-4: After producing the clean and filtered data, implement the core functionality. The term frequency $tf_{t,d}$ is the number of times the term $t$ occurs in the document $d$.
Step-5: The inverse document frequency (IDF) pulls down the weight of frequent terms and scales up the rare ones. It is the log of the total number of documents $N$ in the corpus $D$ divided by the number of documents $df_t$ that include the term $t$: $idf_t = \log(N / df_t)$.
Step-6: Finally, the weight of a term $t$ in a particular document $d$ is the product of the two preceding quantities: $w_{t,d} = tf_{t,d} \times idf_t$.
Step-7: Compute the similarity between two movie descriptions using cosine similarity.
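Steps 4 to 7 can be illustrated with a short standard-library sketch. It assumes the documents have already been cleaned, stemmed, and stripped of stop words (Steps 1 to 3), uses the raw count as term frequency and the natural logarithm for IDF, and the function names and toy descriptions are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # df[t]: number of documents that contain term t at least once.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy, already-preprocessed descriptions.
docs = [
    "cyborg assassin sent back time".split(),
    "cyborg protect boy future war".split(),
    "suburban father midlife crisis".split(),
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])  # share the term "cyborg"
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no terms
```

Here the first two descriptions share one term and therefore get a small positive similarity, while descriptions with no terms in common get a similarity of zero.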

C. MOVIE POPULARITY LABELLING (HIT TO FLOP)
The motion picture is one of the branches of art. The movie industry is complicated and extensive, and many elements are associated with it. Several parameters can be used to assess a movie's success from different perspectives. Box office revenue could be one of the parameters. However, the budget and the revenue of a movie vary with the movie industry. For some industries, the budget may generally be higher than for other, smaller movie industries, and it also varies from film to film. Defining a revenue range [40] to classify movie success does not apply to all movies. Fixing the profit margin [39] also does not solve the classification problem, because the scale of the profit margin is lower for low-budget films and higher for high-budget films. Considering all this, the IMDb rating is one of the significant criteria for movie success prediction, and it is a globally accepted rating. Figure 2 presents the histogram of the IMDb ratings of all the films considered in our database. It shows that the rating is close to a normal distribution, which contributes to the robustness of the model's predictions. In our work, we have used the IMDb rating as the primary parameter to determine movie success. In this work, movie popularity is classified into six classes. In our early-production-stage movie classification problem, the prediction module predicts the range in which the IMDb rating falls. According to the predicted IMDb rating, the upcoming movie is classified into one of six classes.
The classes are super-duper hit (SDH), super hit (SH), hit (H), above average (AA), average (A), and flop (F). We have prepared the movie data set and labeled each movie with a class according to its IMDb rating. Table 2 shows how movies are classified into the different classes according to their IMDb rating.

IV. PROPOSED WORK
The proposed system has three major interrelated modules. The first module is a content-based movie recommendation system model, which produces the 10 most similar movies for each movie listed in the data set. The second module is a movie hit prediction module; the output of the first module is the input of the second module. The third module groups the audience according to age and predicts the most suitable target audience group for each movie. Finally, the whole system provides an age-wise movie popularity prediction of the upcoming movie. The overall process flow diagram is shown in Figure 3.

A. CONTENT-BASED MOVIE RECOMMENDATION
A content-based filter is used to find similar movies; it uses movie attributes to measure the similarity between two movies. Let the feature set be $F = (F_1, F_2, \ldots, F_m)$. The similarity between any two movies $m_i$ and $m_j$ with respect to the feature $F_k$ is computed as in (1), where $dist_{i,j}$ is the distance vector between the two movies $m_i$ and $m_j$. The objective of this module is to compute the $N$ most similar movies of movie $m_i$. The similarity between any two movies is measured using $m$ different features, so the distance between any two movies is an $m$-dimensional vector. The nearest neighbour algorithm is used to find the $N$ most similar movies. The overall similarity between two movies is computed using a similarity measure such as cosine similarity. In (4), we compute the similarity measure between any movie $m_i$ and all the other $n$ movies present in the data set. The $m$-dimensional distance vector is reduced to a one-dimensional distance using the cosine similarity measure. Figure 4 presents the block diagram of the CB movie recommendation module.
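A rough sketch of the nearest-neighbour search over a per-feature distance vector is given below. It is an illustration under stated assumptions, not the paper's exact procedure: each feature is a binary vector, the per-feature distance is one minus the cosine similarity, and the $m$-dimensional distance vector is collapsed to a single number by a plain sum (the source does not spell out this reduction). The catalog and all names are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm

def distance_vector(movie_a, movie_b, features):
    """One distance per feature: the m-dimensional vector between two movies."""
    return [cosine_distance(movie_a[f], movie_b[f]) for f in features]

def recommend(target, catalog, features, n):
    """Return the n titles closest to `target`, with their overall distance."""
    scored = []
    for title, movie in catalog.items():
        if title == target:
            continue
        # Assumption: the per-feature distances are collapsed by a plain sum.
        overall = sum(distance_vector(catalog[target], movie, features))
        scored.append((overall, title))
    scored.sort()
    return [(title, round(dist, 4)) for dist, title in scored[:n]]

# Toy catalog of one-hot encoded features (values are made up).
FEATURES = ["genre", "cast", "director"]
catalog = {
    "A": {"genre": [1, 1, 0], "cast": [1, 0, 0, 1], "director": [1, 0]},
    "B": {"genre": [1, 1, 0], "cast": [1, 0, 0, 0], "director": [1, 0]},
    "C": {"genre": [0, 0, 1], "cast": [0, 1, 1, 0], "director": [0, 1]},
}
top = recommend("A", catalog, FEATURES, n=2)
```

In this toy catalog, movie B shares its genre and director with A and differs only in part of the cast, so it is returned first; movie C shares nothing with A and receives the maximum distance of 1 on every feature.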

B. MOVIE HIT PREDICTION
The purpose of the second module is to build a movie hit prediction system. This module accepts the output of the earlier module: the set of recommended movies of each movie $m_i$ is used as input, i.e., $RM = \bigcup_{i=1}^{n} \{ rm_i^j \mid j = 1, 2, \ldots, N \}$. The IMDb rating data set is used as an additional input. Movie hit prediction is a multiclass classification problem. The input data are processed and fitted to a deep learning model, and the output is a multiclass classification. In this research problem, movies are classified into six classes: super-duper hit (SDH), super hit (SH), hit (H), above average (AA), average (A), and flop (F). Figure 4 presents the framework of the movie hit prediction module.
In this module, we have used the rating and the voting details of the $N$ recommended movies $rm_i^j$ of the movie $m_i$. Let $v^i_{r,k}$ represent the number of votes with rating $r$ for the $k$th recommended movie of $m_i$.
VOLUME 10, 2022
Algorithm 1 Movie recommendation using multidimensional KNN.
Input: …
2. … where $k_i$ is transferred to the binary word bin $w\_bin_i$
3. … where $g_i$ is transferred to the binary genre bin $g\_bin_i$
4. DT_BIN … where $dt_i$ is transferred to the binary director bin $dt\_bin_i$
5. … where $c_i$ is transferred to the binary cast bin $c\_bin_i$
6. for each movie $m_i \in M$
7.   for each movie $m_j \in M$, $m_j \neq m_i$
8.   …
In (6), $V_i$ represents the voting details of the movie $m_i$. In (7), the rating $R^i_j$ of each recommended movie is also considered.
To predict the class of the upcoming movie, the data set $X$ is used as the input. The input data set contains the details of 3,100 movies and is labelled with six different classes. It is divided into a training part and a test part: the training part contains 2,325 movies, and the test part contains 775 movies. A convolutional neural network classification model is used to classify movies into 6 different classes according to popularity. Figure 5 presents the block diagram of the movie hit prediction module.
The proposed deep learning model is a 1D-CNN architecture. It consists of three convolutional layers and one dense layer. The structure of the 1D-CNN was selected experimentally by a trial-and-error approach. The first layer of the CNN takes as input a 22×1 array comprising all features. We have used 128 filters in the first layer with kernel size 5, the ReLU activation function, and a dropout of 0.1, together with a max-pooling size of 2. The next layer keeps 128 filters, the same ReLU activation function, and a dropout of 0.1, and this layer is repeated two times. Finally, a flatten layer is used. The last layer is a dense layer with six units, one for each of the 6 predicted classes. Multiple Keras optimizers were tried, such as Adam, SGD, and RMSprop, and we finally selected the RMSprop optimizer because of its better accuracy. Figure 6 depicts the topology of the proposed 1D-CNN.
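To make the layer arithmetic concrete, the sketch below traces the tensor shapes through the described stack in plain Python. Several details are assumptions that the text does not state: every convolution uses kernel size 5 with "same" padding and stride 1, and each of the three convolutions is followed by max-pooling of size 2. The padding assumption matters: with unpadded convolutions the length would shrink 22 → 18 → 9 → 5 → 2, and a third kernel of size 5 would no longer fit.

```python
def conv1d_same(length, filters):
    """Conv1D with 'same' padding and stride 1: length is kept, channels = filters."""
    return length, filters

def max_pool1d(length, channels, pool_size):
    """Non-overlapping max pooling: length is floor-divided by pool_size."""
    return length // pool_size, channels

def trace_shapes(input_length=22, filters=128, pool_size=2,
                 n_conv_blocks=3, n_classes=6):
    """Return (layer name, output shape) pairs for the assumed layer stack."""
    shapes = [("input", (input_length, 1))]
    length, channels = input_length, 1
    for block in range(1, n_conv_blocks + 1):
        length, channels = conv1d_same(length, filters)
        shapes.append((f"conv{block}", (length, channels)))
        length, channels = max_pool1d(length, channels, pool_size)
        shapes.append((f"maxpool{block}", (length, channels)))
    shapes.append(("flatten", (length * channels,)))
    shapes.append(("dense", (n_classes,)))
    return shapes

for name, shape in trace_shapes():
    print(f"{name:10s} {shape}")
```

Under these assumptions the sequence length goes 22 → 11 → 5 → 2, so the flatten layer hands 2 × 128 = 256 values to the six-unit dense layer. ReLU and dropout do not change shapes and are therefore omitted from the trace.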

C. TARGET AUDIENCE PREDICTION
The third and final module forecasts the target audience's preference according to age demography. To predict the upcoming movie's target audience, we have considered all the recommended movies delivered by our first module.
Here, $v^i_{r,k}$ is the number of votes with rating $r$ for the $k$th recommended movie of $m_i$, and $V_i = \{V^i_r \mid r = 1, 2, \ldots, 10\}$ holds the voting details of each movie $m_i$.
The system uses the user ratings from each age group and the number of votes that each group gave to each recommended movie. It analyses all the voting and rating information from each group for all the recommended movies. Ultimately, the module predicts the popularity of the upcoming movie for each age group. The target audience prediction module takes the output of the first module as input, together with the IMDb rating data set. The movie recommendation module produces the similar movies ($rm_i^j \mid j = 1, 2, \ldots, N$) of a given movie $m_i$. We have used the set of recommended movies of all movies present in the data set, $RM = \bigcup_{i=1}^{n} \{ rm_i^j \mid j = 1, 2, \ldots, N \}$. We have also taken the IMDb rating data set, including the voting and rating information of all movies present in the data set. The proposed system divides the audience into four groups according to age demography. For an upcoming movie, it forecasts how preferable the movie would be for each group: which group would prefer the film most and which group would not like it. Table 5 presents the viewers' age groups. We have separated each group $Gr_j$ and created a separate data set for it using the recommended movie data set and the IMDb rating data set. Each data set consists of the rating and voting information of all recommended movies of a movie $m_i$.
In (8), … To compare two groups, we need an overall preference for each group. The centroid of a cluster measures the performance of the cluster. Each group consists of several data points. Fuzzy c-means is used to determine the cluster centroid of each group.
The cluster centroid is the parameter that measures the audience group's performance, that is, its liking for the movie $m$. Experimentally, we have set a global centroid for all movies. The global centroid is unique for each group. The distance of the cluster centroid from the global centroid is measured. Figure 7 presents the block diagram of the target audience prediction model.
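For reference, a compact standard-library sketch of the usual fuzzy c-means iteration is shown below. It is a generic illustration of the clustering technique and not the authors' implementation: the fuzzifier `m = 2`, the fixed iteration count, the random initialisation, and the example (rating, voting ratio) points are all assumptions.

```python
import random

def fuzzy_c_means(points, c, m=2.0, iterations=100, seed=0):
    """Plain fuzzy c-means on a list of equal-length numeric tuples.

    Returns (centroids, memberships); memberships[i][j] is the degree to
    which point i belongs to cluster j.
    """
    rng = random.Random(seed)
    dim = len(points[0])

    # Random initial memberships; each point's row is normalised to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([value / total for value in row])

    centroids = []
    for _ in range(iterations):
        # Centroid j is the mean of all points weighted by membership ** m.
        centroids = []
        for j in range(c):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))

        # Membership update: closer centroids receive higher membership.
        exponent = 2.0 / (m - 1.0)
        for i, p in enumerate(points):
            dists = [
                max(1e-12, sum((p[d] - ctr[d]) ** 2 for d in range(dim)) ** 0.5)
                for ctr in centroids
            ]
            u[i] = [
                1.0 / sum((dists[j] / dists[k]) ** exponent for k in range(c))
                for j in range(c)
            ]
    return centroids, u


# Hypothetical (mean rating, voting ratio) points for one age group.
points = [(6.1, 0.40), (6.3, 0.50), (5.9, 0.45),
          (8.2, 0.90), (8.0, 1.00), (8.4, 0.95)]

single, _ = fuzzy_c_means(points, c=1)   # one cluster: centroid is the mean
pair, _ = fuzzy_c_means(points, c=2)     # two clusters: low- and high-rated
```

With a single cluster every membership equals 1, so the centroid reduces to the arithmetic mean of the group's points; with more clusters, each point contributes to every centroid in proportion to its membership.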
In the earlier section, we defined the popularity of a movie in terms of the movie rating: a movie is considered popular if its rating $mr_i > 7$. In our global centroid parameter, we therefore also set a global rating G_R = 7. We require the supporting number of votes to be greater than or equal to the median value, which is different for each group. We consider the voting ratio and fix the global voting ratio G_Vr = 1. Next, combining the global parameters, we measure the similarity between the global centroid and the movie centroid for a group.
If $C\_R^{i}_{Gr_j} > G\_R$, we set $C\_R^{i}_{Gr_j} = G\_R$; similarly, if $C\_Vr^{i}_{Gr_j} > G\_Vr$, we set $C\_Vr^{i}_{Gr_j} = G\_Vr$. The distance between the movie centroid $C\_centroid^{i}_{Gr_j}$ of the group and the global centroid $G\_centroid$ determines the movie's popularity within the group: the lower the distance, the higher the popularity, and the higher the distance, the lower the popularity. The distance measure is converted to a popularity percentage. If a group centroid has $C\_R^{i}_{Gr_j} \geq G\_R$ and $C\_Vr^{i}_{Gr_j} \geq G\_Vr$, the popularity measure is 100%. Finally, the module predicts the popularity of an upcoming movie among each age group.

for each group $Gr_j$, $j = 1, 2, 3, 4$:
    ...
    Elements in cluster
    Centroid of each cluster using fuzzy c-means
    ...
    Calculate percentage similarity $Per\_similarity^{i}_{Gr_j}$ of each group $Gr_j$
Return Per_similarity
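The clipping rule and the distance-to-percentage conversion described above can be sketched as follows. The text does not state the distance metric or the conversion formula, so this sketch assumes a Euclidean distance on axes normalised by G_R and G_Vr, scaled linearly so that a centroid at the origin maps to 0%. The function name and the group centroid values are hypothetical.

```python
import math

G_R = 7.0   # global rating from the text
G_VR = 1.0  # global voting ratio from the text


def popularity_percent(c_r, c_vr, g_r=G_R, g_vr=G_VR):
    """Convert a group centroid (rating, voting ratio) to a popularity %.

    Assumptions (not from the source): Euclidean distance on normalised
    axes, mapped linearly to the range 0..100.
    """
    # Clip the centroid to the global centroid, as described in the text.
    r = min(c_r, g_r)
    v = min(c_vr, g_vr)
    # Normalise each axis so rating and voting ratio weigh equally.
    d_r = (g_r - r) / g_r
    d_v = (g_vr - v) / g_vr
    distance = math.hypot(d_r, d_v)
    max_distance = math.sqrt(2.0)  # distance from (0, 0) to the global centroid
    return 100.0 * (1.0 - distance / max_distance)


# Hypothetical centroids for the four age groups of one upcoming movie.
group_centroids = {
    "junior": (6.1, 0.4),
    "teenage": (7.4, 1.2),
    "mid-age": (6.8, 0.9),
    "senior": (6.5, 0.7),
}
per_similarity = {
    group: popularity_percent(rating, ratio)
    for group, (rating, ratio) in group_centroids.items()
}
```

Under these assumptions, any group whose centroid meets or exceeds both global parameters receives 100%, and the percentage falls as the clipped centroid moves away from the global centroid.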

V. EXPERIMENTAL RESULTS AND ANALYSIS
The research problem involves three modules, each with its own responsibility. The first is the recommendation system module, which finds similar movies in the data set. The second module uses the first module's output and classifies the upcoming movie into one of six classes according to its predicted popularity. The third is the target audience prediction module. The experimental study of each module is critical. The recommendation system computes similar movies for a given movie from the data set. Movie hit prediction is a multiclass problem, and its accuracy depends heavily on the quality of the first module, so it is imperative to find and recommend the most similar movies when predicting the upcoming movie's class. The target audience prediction module estimates how much each age group will like the movie, using the rating and voting data of all the recommended movies.

A. MOVIE RECOMMENDATION
The proposed content-based movie recommendation system uses features such as genre, cast, director names, keywords, and movie description to measure the similarity between two movies. The overall similarity between two movies is computed from the similarity distance calculated for each feature. Table 8 shows the recommended films for the movie ''The Terminator'', and Table 9 shows those for ''The Avengers''. The tables present the top 10 most similar movies for each selected movie, the overall similarity distance from the selected movie to each recommended movie, and the genre of each recommended movie. The similarity distance between two movies defines the similarity between them: as the distance decreases, the similarity increases, and vice versa. According to the computation, ''Terminator 2'' is the most similar movie to ''The Terminator''; its overall similarity distance is 1.2698, the minimum among all the distances. Similarly, ''Avengers: Age of Ultron'' is the most similar movie to ''The Avengers'', with an overall similarity distance of 0.5031. The genres of ''The Terminator'' are action, thriller, and science fiction, and the recommended movies have almost the same genres. The genres of ''The Avengers'' are science fiction, action, and adventure, and again the recommended movies have almost the same genres. Table 10 shows the recommended movies for the film ''American Beauty'', whose genre is drama. Here, too, all the recommended movies have the same genre. The overall calculated distances are also shown in the table. According to the system, ''Revolutionary Road'' is the most similar movie to the selected movie, with an overall similarity distance of 1.5076.
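The ranking by overall similarity distance can be illustrated with a small sketch. The per-feature distance measures used by the proposed system are not restated in this section, so the sketch substitutes a Jaccard distance on sets for every feature and sums the five distances. The movie titles, feature values, and function names are hypothetical.

```python
FEATURES = ("genre", "cast", "director", "keywords", "description")


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; a stand-in for the per-feature distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def overall_distance(m1, m2):
    """Sum of the per-feature distances (assumed equal weights)."""
    return sum(jaccard_distance(m1[f], m2[f]) for f in FEATURES)


def recommend(target, catalogue, n=10):
    """Return the n movies closest to target as (distance, title) pairs."""
    scored = [
        (overall_distance(target, movie), movie["title"])
        for movie in catalogue
        if movie["title"] != target["title"]
    ]
    return sorted(scored)[:n]


# Hypothetical catalogue; descriptions are reduced to token sets.
catalogue = [
    {"title": "Movie A", "genre": {"action", "scifi"}, "cast": {"x", "y"},
     "director": {"d1"}, "keywords": {"robot", "future"},
     "description": {"machine", "war"}},
    {"title": "Movie B", "genre": {"action", "scifi"}, "cast": {"x", "z"},
     "director": {"d1"}, "keywords": {"robot", "future"},
     "description": {"machine", "war", "time"}},
    {"title": "Movie C", "genre": {"drama"}, "cast": {"p"},
     "director": {"d2"}, "keywords": {"family"},
     "description": {"suburb"}},
]
top = recommend(catalogue[0], catalogue, n=2)
```

A smaller summed distance means a more similar movie, which matches the ordering used in Tables 8 to 10; the absolute values in those tables depend on the system's own distance measures and are not reproduced by this sketch.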

B. MOVIE HIT PREDICTION
This subsection discusses movie hit prediction, which is a multiclass classification problem with six popularity classes {SDH, SH, H, AA, A, F}. The recommendation system module first provides N movies similar to the upcoming movie. The system then uses the voting and rating parameters of all the recommended movies to predict the upcoming movie's popularity. The accuracy of the prediction model therefore depends principally on the accuracy of the recommendation system model.
The data set contains 3310 movies. We split it in an approximately 80:20 ratio: 2684 movies for training and the remaining 626 for testing. We used this newly created data set to experiment with different machine learning models in search of better accuracy. Table 11 compares the machine learning models using several evaluation parameters. The proposed CNN model performs significantly better than all the machine learning models on every parameter. Fig. 8 shows the accuracy of the machine learning models. The experimental analysis indicates that the proposed convolutional neural network model outperforms all the baseline conventional machine learning models. Figures 9 and 10 show the accuracy curve and the loss curve of the proposed CNN model, respectively.
The proposed movie hit prediction module also outperforms the prediction systems presented in the past. We present a comparative study of our work against relevant earlier research. Ahmed et al. (2019) [22] used a hybrid voting system to obtain an early production stage forecast. They introduced new features to improve prediction efficiency, classified movies into eight success classes, and achieved 85% accuracy. Abidi et al. (2020) [27] examined each movie's attributes and selected the features relevant to early-stage movie prediction. They evaluated five machine learning models with binary classification and reached a highest accuracy of 76.6% with the generalized linear model (GLM). Michael and Kang (2016) [10] introduced novel features and predicted movie success at an early production stage, evaluating machine learning models multiple times with distinct success measures. They achieved a maximum of 90.4% accuracy with the Random Forest model using binary classification and 84.7% accuracy for multiclass classification. Verma and Garima (2019) [24] proposed ''music rating'' as one of the significant features for forecasting movie hits and attained 87.0% accuracy with the Random Forest model using binary classification. Our proposed model achieved 96.8% accuracy and outperforms all of these previous models. Table 12 presents the comparative analysis.
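To make the comparison of models on a six-class problem concrete, the sketch below computes accuracy and macro-averaged precision and recall from predicted labels. These metrics are common choices for multiclass evaluation and are used here as an assumption; the exact evaluation parameters reported in Table 11 are not restated in this section. The label lists are hypothetical.

```python
from collections import Counter

CLASSES = ("SDH", "SH", "H", "AA", "A", "F")  # popularity classes from the text


def multiclass_report(y_true, y_pred, classes=CLASSES):
    """Accuracy plus macro-averaged precision and recall."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    true_pos = Counter(t for t, p in pairs if t == p)
    pred_count = Counter(y_pred)
    true_count = Counter(y_true)
    precisions, recalls = [], []
    for c in classes:
        # A class that is never predicted (or never present) scores 0.
        precisions.append(true_pos[c] / pred_count[c] if pred_count[c] else 0.0)
        recalls.append(true_pos[c] / true_count[c] if true_count[c] else 0.0)
    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / len(classes),
        "macro_recall": sum(recalls) / len(classes),
    }


# Hypothetical test labels: one "F" movie is misclassified as "A".
y_true = ["SDH", "SH", "H", "AA", "A", "F"]
y_pred = ["SDH", "SH", "H", "AA", "A", "A"]
report = multiclass_report(y_true, y_pred)
```

Macro averaging gives each of the six classes equal weight, so a model that ignores a rare class is penalised even when its overall accuracy is high.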

C. PREDICT PREFERRED AUDIENCE GROUP
This subsection discusses the results and analysis of the final module, movie target audience prediction. Audiences are grouped into four age groups {junior, teenage, mid-age, senior}. To predict the upcoming movie's target audience, we consider all the recommended movies delivered by our first module. The system then uses each age group's rating of each recommended movie, together with the number of votes from each group, and analyses all the voting and rating information from each group across all the recommended movies. Ultimately, the module predicts the popularity of the upcoming movie for each age group.
Some movies are created specifically for the junior age group; animation movies in particular are generally targeted at this group. Our proposed system focuses on how popular the upcoming movie would be among all age groups, so the moviemaker can estimate the popularity of an upcoming animation movie among the junior age group. Hit movies usually make a good impression on all age groups. Table 13 shows the popularity of the five leading animation movies among all age groups; the junior group likes animated movies the most. Figure 12 shows the popularity of animation movies within the junior group and also reveals a stepwise decrease in popularity as age increases. Higher age groups commonly favour genres such as comedy, drama, and romance, and moviemakers usually target the mid-age and senior groups with comedy and drama movies. Table 14 shows the popularity of five movies from the comedy and drama genres among all age groups; the senior and mid-age groups like such movies the most. Figure 13 shows the popularity of comedy movies, excluding science fiction, adventure, and animation movies. Popularity is highest among the senior group, followed by the mid-age group, and lowest among the junior group.
Science fiction movies are usually liked by all age groups, depending on the storyline, and the teenage group in particular likes science fiction movies the most. Table 15 shows the popularity of the five leading science fiction movies among all age groups; the teenage group likes them the most. All five movies are hits and are therefore popular among all age groups. Figure 14 shows the popularity of science fiction movies among all groups. Popularity is highest among the teenage group, while popularity within the junior group varies to a large extent. The average popularity of science fiction movies is usually high.

VI. CONCLUSION
A substantial amount of financing goes into every box-office movie, yet most movies fail to achieve success. Most earlier work has addressed post-production or post-release forecasting. Such an estimate has little influence, because the investor has already spent their funds on the film production. A forecast at the pre-production or early production stage needs high accuracy and comes at the best time to secure investment. The objective of our study is to propose an expert system that helps the moviemaker make necessary changes at the appropriate time. Our system can forecast the level of popularity of the upcoming movie before production has started, or at the earliest stage of production, with significant accuracy. Our system focuses not only on the overall popularity of the upcoming movie but also on its popularity among all age groups. The moviemaker can estimate the target audience and assess how the different audience groups would respond to the upcoming movie. Further, our aim is to build a robust system applicable to all movie industries. We have used the last hundred years (1915-2016) of movie data from the TMDb and IMDb databases. Our approach to forecasting movie popularity and finding the target audience of an upcoming movie is unique: we use a recommendation system to find movies similar to a given movie and use those similar movies for forecasting. Moreover, it has been challenging to use two separate databases (TMDb and IMDb) simultaneously. The size of the TMDb data set is 4803, and the size of the IMDb rating data set is 85855. Since we are using both data sets, the size of the merged data set comes down to 23332 only. We had to judge which features are available at the beginning of movie production, and we carefully picked only five movie attributes for our recommendation system. We have used the total votes for each movie.
It has been observed that old movies have relatively fewer votes than new movies. Moreover, the number of votes from the junior group is significantly lower than that from the other groups. Using the voting and rating information to create a new feature set is challenging in this scenario. The proposed system is an excellent tool for the movie industry. In future work, multimedia data such as audio and video could be incorporated, and the poster of the upcoming movie could be used for better results. Recent trends could be analyzed using sentiment analysis of social media data. Information about recent trends in market expectations of the movie industry would be beneficial for moviemakers. The audience could be divided not only by age but also by demography or profession, which would make it much easier to target and promote an upcoming movie.

ACKNOWLEDGMENT
The work of Jana Shafi was supported by the Deanship of Scientific Research, Prince Sattam Bin Abdulaziz University.
JANA SHAFI is affiliated with the Department of Computer Science, Prince Sattam Bin Abdulaziz University, Saudi Arabia. She has more than eight years of teaching and research experience. She has published in numerous journals, such as Sensors, IEEE ACCESS, Diagnostics, Symmetry, Mathematics, and Wireless Communications and Mobile Computing. Her research interests include online social networks, wearable technology, artificial intelligence, machine learning, deep learning, smart health, and the IoMT.
YOGESH KUMAR received the Ph.D. degree in CSE from Punjabi University, Patiala. He is currently working as an Associate Professor-CSE with Indus University, Ahmedabad. He has a total of 15 years of experience (including teaching and research) and a post Ph.D. experience of two years and 11 months. He has published 66 research articles, including 14 articles in high-impact SCI journals, 31 articles in Scopus and peer-reviewed journals, 14 papers in international conferences in India and abroad, and eight high-impact book chapters. He has also published two books and been granted seven patents. His research interests include artificial intelligence, deep learning, speech recognition systems, and data science.