A Novel Multi-Module Approach to Predict Crime Based on Multivariate Spatio-Temporal Data Using Attention and Sequential Fusion Model

Forecasting crime is complex because many interrelated factors contribute to a crime. Predicting crime becomes even more challenging because of the enormous number of daily crime episodes across varied places. Although many machine learning and deep learning techniques are well established, law enforcement officers still face challenges in preventing crime promptly. An efficient law enforcement approach is required to lower crime rates. This paper proposes an effective multi-module method for predicting crime using deep learning techniques. Our proposed method has two modules: Feature Level Fusion and Decision Level Fusion. The first module employs a temporal-based Attention LSTM, a Spatio-temporal based Stacked Bidirectional LSTM, and a Fusion model. The Fusion model leverages the training data of the prior two models. The temporal-based model serves as the source model for the transfer learning technique applied to the dataset of a different city. Applying this technique reduces the model's training time. In the second module, the Spatio-temporal based Attention-LSTM, the Stacked Bidirectional LSTM, and the result of the feature-level fusion module are used to get the final prediction. The proposed architecture predicts the next hour based on the data from the past twenty-four hours. The estimated number of crimes in any category for a particular location can be obtained as the output of our suggested model. It also enables law enforcement to get insight into future crime occurrences based on category, time, and location. This work concentrates mainly on the USA's San Francisco and Chicago cities for the experimental analysis. For the San Francisco and Chicago datasets, our model has Mean Absolute Errors of 0.008 and 0.02, Coefficients of Determination of 0.95 and 0.94, and Symmetric Mean Absolute Percentage Errors of 1.03% and 0.6%, respectively. The proposed model outperforms numerous other well-known models.


I. INTRODUCTION
Analyzing time-series data to extract meaningful statistics and other characteristics is the main target of time series analysis. This information is crucial for predicting future values based on previously observed data. Due to continuous urbanization and growing populations, violent crimes and accidents are rising. Extracting information and analyzing the hidden patterns and correlations within these vast amounts of data through Big Data Analysis (BDA) is one of the trending approaches nowadays [1], [2]. BDA can help revolutionize how authorities maintain order and protect people from crime. As the population grows, crime activity patterns become more varied and complex, which requires new approaches to understand and tackle them. Finding patterns, predicting future crime, and matching these patterns with newly available data through data analysis are the main strategies to tackle the crime problem. Thus, predicting the likelihood of crime at a specific time or place becomes convenient through time-series-based BDA. The system can predict this likelihood by looking for precipitating factors such as geographical location, economic condition, time of the year, and environment [3], [4].
Crime is one of the most predominant and alarming adversities in our society, and its prevention is a vital task [4]. Crime analysis and prediction is a systematic approach for analyzing and identifying different patterns, relations, and trends in crime and disorder [2]. Crime occurrence is common all over the world, mainly in cities. It hinders fundamental human rights and brings collapse to the structure of society. It is not always possible for law enforcers to manually find out in which area the crime rate is high at a specific time, because the motives behind crimes are dynamic and existing crime data are not properly utilized. If an automated system with high prediction accuracy were available to law enforcers, they could take the necessary precautions to reduce crime to a specific rate. Some commonly used techniques to predict and forecast crime data are logistic regression, support vector machine (SVM), Naive Bayes, k-nearest neighbors (KNN), decision tree, multilayer perceptron (MLP), random forest, eXtreme Gradient Boosting (XGBoost), and time series analysis using LSTM and autoregressive integrated moving average (ARIMA) models. In recent years, deep-learning (DL) based models have also been used for forecasting purposes such as crime forecasting [1], [5], [9], air quality forecasting [6], wind speed forecasting [7], etc. Safat et al. [5] conducted an experiment employing all of these models, predicting more than 35 types of crime and providing a year-long crime forecast. Feng et al. [1] did such a study using the Prophet model, a neural network, and LSTM along with data visualization. Their model works better only with the data from the previous three years. Sometimes the LSTM arrangement makes it impossible for the memory cell to retain information over numerous time steps [8]. Moreover, the model of Feng et al. needs more training data and data mining techniques to understand crime patterns better. Data mining is one of the fundamental techniques of BDA. It is an innovative and growing research area that helps deduce useful information and hidden patterns from data. It not only helps us discover new knowledge but also enhances our mastery of known knowledge [2]. Rayhan et al. [9] performed a Spatio-temporal Attention-based study to predict crimes of the top 4 categories. Their model cannot predict categories having a small amount of data.
Since the prediction of crime can assist authorities in maintaining social order, a deep learning-based multi-module generic method for predicting crime in various cities is presented. This study forecasts the number of crimes of various categories in several districts of each city. It also analyzes how trends impact crime occurrence. Locations were considered to better picture the crime hot spots. We processed the characteristics to obtain the required form and then chose the characteristics based on their correlation. After that, our proposed model was trained and evaluated. There are three types of information in the dataset: categorical, temporal, and spatial. We developed the Attention-LSTM (ATTN-LSTM) model to process the categorical-temporal data and the Stacked Bidirectional LSTM (St-Bi-LSTM) model to process the spatial information. Predicting crime with the same model for multiple cities may result in a significant loss since the attributes of one location may not have the same quantity of unique data as another. Hence, feature level fusion (FLF) and decision level fusion (DLF) modules were applied to overcome all the drawbacks of the current state-of-the-art. Our goal throughout the study was to predict crime more effectively than previous methodologies employing time and particular location. Figure 1 shows an abstract depiction of this model. Our contributions for this work are the following:
• We proposed a robust multi-module approach for dealing with categorical, geographical, and temporal information. A method for combining these characteristics into a single model utilizing two degrees of fusion was developed.
• LSTM cells were fabricated with Swish activation to deal with the vanishing gradient issue and negative inputs in backward propagation to emphasize the utilization of LSTM.
• A weighted system was implemented to determine the loss for the DLF module. The calculation is based on the inputs of this module.
• A method for combining three models was devised in DLF. We utilized the majority vote approach and all-pairs shortest distance to forecast each crime category based on the outputs of the three models. Hence, the suggested model predicted crime with minimal inaccuracy.
Our code is uploaded at https://github.com/NowshinTasnim/Spatio_Temporal_Crime_Prediction.git (the code will be made publicly available after acceptance).
The remaining portions of this paper are organized as follows. We review similar works in Section II and analyze the data in Section III. The motivation of this work is provided in Section IV. An explanation of the developed architecture is given in Section V, and the working procedure for this study is discussed in Section VI. In Section VII, the results of the suggested work are analyzed. Finally, the paper is concluded in Section VIII.

II. LITERATURE REVIEW
With the ever-changing society, crime patterns are also changing. Moreover, crime occurrence is increasing. The traditional approaches are mostly out of date in this regard. Many deep learning-based supervised, semi-supervised, and unsupervised techniques are used to predict crimes and check the trends. The integration of modern technology in crime prediction helps the authority take necessary precautions.

A. BASELINE NETWORKS
A feed-forward ANN or multi-layer perceptron is most successful in time-series forecasting as it does not require the data distribution beforehand [10]. Using DL in time-series forecasting, the temporal dependence and structure can be learned easily [11]. Integrating the recurrent neural network (RNN) configuration into deep learning has changed how data are processed in forward-dependency networks. Moreover, this configuration can solve many real-world problems. LSTM can process an entire sequence, not only a single data point, so the LSTM network is quite useful when working with time-series data [5], [12]-[14]. Schuster et al. [15] built on the concept of LSTM and improved it by proposing the Bidirectional LSTM (Bi-LSTM). The model can better understand the context by including feed-forward and feed-backward networks. Said et al. [16] and Kim et al. [17] showed that using Bi-LSTM makes it possible to obtain the correlations and changing values of the series' variables simultaneously while processing multivariate time series data. The variation of each time series plays an important role in forecasting, and using Bi-LSTM, we can analyze these variations.
Said et al. [16] described the use of several stacked Bi-LSTM layers for prediction on multivariate time-series data. By using such layers, the model can learn spatial-temporal features from the dataset and predict the next timestep more accurately [18], [19]. Sometimes these stacked layers reduce unnecessary information while processing the series of data. St-Bi-LSTM performed better in predicting the future from the sequence than several stacked LSTM layers [20]. So far, Bi-LSTM has performed outstandingly in diverse fields, including time-series-based forecasting, COVID-19 case prediction, wind power forecasting, and many more [16], [17], [21].
A more appropriate way to reduce redundant data during processing is to use an Attention-based model. It takes the relevant parts of the input data and removes the redundant parts while performing a task. An attention-based model works well for sequence modeling, as the distance between the input and output layers does not affect the modeling of dependencies [22], [23]. In most contexts, this type of mechanism is used with an RNN to create an encoder-decoder based sequence-to-sequence architecture [22], [24]-[28]. Though all the above mechanisms work smoothly for a short sequence, they are not effective enough for a long one. In such circumstances, the self-attention or intra-attention mechanism works better. This method creates relations between several positions of a single data sequence to represent that input. This representation helps in maintaining long-range dependency [29]. The path length of long-range dependency is generally the shortest, so the sequence can be learned easily [30]. Moreover, the execution of self-attention layers is faster than that of RNNs in most cases [29]. Attention and self-attention mechanisms are mostly used for computer vision and natural language processing tasks like speech enhancement, text prediction, and summarization [25], [31]-[34]. Nevertheless, a single approach is not enough for dependent data with varied features; it requires the fusion of models [35], [36].

B. CRIME PREDICTION AND CLASSIFICATION
BDA has shown tremendous results in criminological aspects for finding trends and relationships between data [1]. Feng et al. [1] described how the stateful LSTM and Prophet models work in crime analysis. They tracked crimes and predicted their likelihood by following related facts and patterns. Their model works better with three years of data but cannot produce similar results for data covering a more extended period.
A Spatio-temporal based RNN was introduced by Wang et al. [37], [38] to predict the total number of crimes. This method used a multi-factor crime prediction model consisting of adaptive hierarchical structured residual convolution units and non-convolution models. The layers are independent of each other in this model. They achieved good accuracy but could not detect categories separately. They also used the internalization technique to address the resource consumption issue for deployment in the real world [37]. However, the spatial data mining model proved to be an excellent approach to detect crime hotspots [39].
Agarwal et al. [40] and Tayal et al. [41] used K-means clustering-based models to show year-based crime patterns. These have the same problem as the Spatio-temporal based system: the clustering models can work only with categories having more than a certain amount of data and cannot perform time-based analysis.
Rayhan et al. [9] developed an attention-based deep learning model capturing the non-linear spatial dependence and temporal patterns of a particular crime category. The self-attention network can emphasize the dependencies among the features and obtain better results than a uniform LSTM for predicting time-series-based data. They kept the model's fundamental structure interpretable. This model dynamically establishes a spatial-temporal association for each crime category based on prior crime occurrences and repeating crime trends. The limitation of this work is that the crime categories must have a large amount of training data.
Kumar et al. [42] proposed a Naive-Bayes-based model for crime classification. Their method combines the history of some crime occurrences with incident-level crime data to identify the most likely criminal in a given incident. It works well for some categories, but the cumulative accuracy was near 50%. A similar approach was taken by ToppiReddy et al. [43] for crime classification. They used KNN and Naive Bayes classification systems, taking the day information along with the location to determine whether a crime would occur and which category it would belong to.
An improved method of crime classification using clustering approaches was proposed by Sivaranjani et al. [44] and Pednekar et al. [45]. They used the K-means, Agglomerative, and DBSCAN clustering techniques to make crime clusters. Then they merged the information of the three resultant clusters to predict the class. This approach has high accuracy, but calculating the exact time of the crime with it is impossible.
The gradient boosted decision tree worked better than KNN, Naive Bayes, and Random Forest classification for correlation-based selected features and for categories with a large amount of data. Feng et al. [2] used forecasting methods to generate data using crime trends of recent years. Their model predicts crime categories for a given time and location using tree classification, and it performs better than the KNN and Naive Bayesian approaches. They merged the crime categories having a small amount of data. However, for relating predictions to trends, time-based analysis is necessary.
Different deep learning models perform better in a given sector for a specific criterion. However, for crime forecasting, the success rate of those models has not yet reached a satisfactory level. A summary of the current state of the art is presented in Table 1. There is still work to be done to address the deficiencies of today's state-of-the-art models.

III. DATA ANALYSIS

A. DATASET COLLECTION AND DESCRIPTION
We collected the crime data of San Francisco and Chicago, respectively, from the San Francisco city-county data portal [46], and the Chicago data portal [47] from 2003 to 2017.
There are 2,115,112 crime incidents for San Francisco and 5,547,827 crime incidents for Chicago. However, we are using data from 2004 to 2017 due to data deficiency in 2003. Hence, 1,970,039 crime incidents for San Francisco and 5,071,866 crime incidents for Chicago are in our dataset. The existing attributes are listed in Table 2.
The daily crime occurrences from 2004 to 2017 for San Francisco and Chicago are shown in Fig. 2. From the figures, we observe that the crime rate changes of the two cities are not the same. Moreover, the Chicago dataset shows seasonality. The crime rate in San Francisco has remained almost the same over the 14 years, whereas for Chicago it has been decreasing over the 14 years, although it is still higher than in San Francisco. Seasonal trends in data impact the whole model in time-series forecasting. In the case of crime prediction, these kinds of trends help to understand the pattern and find the crime hot spots.
To anticipate crime, we need information from prior instances, such as the time and location of the crime and the sort of crime committed. So, for each city, the following existing features from the dataset were used:
1) Date - Date of the crime occurrence.
2) Time - Timestamp of the crime incident.
3) Categories / Primary type - Type of the crime that took place.
4) PdDistrict / District - Police Department District name where the crime took place.

The weather data of San Francisco and Chicago were taken from Wonder Weather Forecast [48] and used to check the temperature trends on the Kelvin scale. The publicly available dataset consists of hourly-based weather information. Hence, we utilized the average temperature of each hour from this dataset.

B. ANALYZING THE EXISTING FEATURES
After checking the correlation between features, we chose the features with higher correlation from the acquired dataset. The year and month information was retrieved from the dates column of the San Francisco dataset and the datetime column of the Chicago dataset. We visualized the monthly distribution of crimes during the 14 years using these extracted data, as shown in Fig. 3. From Fig. 3a, the crime data distribution can be observed: for San Francisco, it is high in January, March, May, and August-October over the 14 years. In May, July, August, and October, the crime rate is high in Chicago, as shown in Fig. 3b. Generally, in San Francisco, the temperature is high from August to October due to the summer season.
For the same reason, the temperature of Chicago is high in July and August. Hence, there is a possibility that temperature plays a role in crime occurrence. We can also see that the crime rate of San Francisco is high in March and May, and the crime rate of Chicago is high in May and October, even though the temperatures in these months are lower than in the summer season.
Analyzing the yearly number of crimes per police district for San Francisco and Chicago in Fig. 4, we can find the districts where the most crimes occur. Throughout the 14 years, the maximum number of crimes occurred in the Southern District of San Francisco and districts 8 and 11 of Chicago. These were the crime hot spots in those years. From these facts, we deduce that the environment of a district or area holds importance in crime occurrence. Thus, while predicting crime occurrence, it is necessary to consider the districts as one of the critical features of the model. However, many of the existing models did not consider spatial information among the features.

IV. TECHNICAL MOTIVATION
After reviewing the data, several flaws were discovered in the current models. These flaws inspired us to establish a generalized model that performs better than the current state of the art. To do so, we investigated the shortcomings of these models and present several improvements.

A. CHUNKING OFF UNNECESSARY INSTANCES AND EMPHASIZING RELATED INSTANCES
Feng et al. [1], [2] trained their model using the whole city as a location. However, designating the region of a city plays an essential function in controlling crime incidence. In our work, the input features include the information of the locations utilizing the police department district for each city. We needed to implement Bi-LSTM and ATTN-LSTM on the model to deal more efficiently with the area-based information.
Our model has a St-Bi-LSTM layer to train on the information of the police departments of the cities. It also has an Attention-based sub-model to learn the temporal aspects of the cities. A Bidirectional RNN has two layers side-by-side; the second layer is a replica of the network's first recurrent layer. In the first layer, the input sequence is the provided input, while the input sequence of the second layer is the reverse of the provided input. Even if the input sequence is very long, the chances of losing any information from the whole input become pretty minimal here. For an RNN, the depth of the network is more important than the number of memory cells in a layer. The depth of our model is boosted by stacking two Bi-LSTM layers. By chunking off some unnecessary observed instances of the first layer in the second layer, this model helps the network predict better than a network having one Bi-LSTM layer with the same number of memory cells. Bi-LSTM also solves the vanishing and exploding gradient problems that vanilla RNNs have. Furthermore, it provides considerably cleaner backpropagation compared to a vanilla RNN.
The attention model overcomes the limitation of encoder-decoder-based approaches. This model examines the relationships between the nodes and maintains the nodes that play a significant role in creating the output. As a result, the model selects the appropriate nodes for training and minimizes the input size of the next layer [29]. It is important to keep only the relevant instances in sequential data training and remove unnecessary observed instances to produce better output.

B. ACCELERATING CONVERGENCE IN LEARNING PROCESS USING TRANSFER LEARNING
Rayhan et al. [9], Feng et al. [1], [2], and Wang et al. [37], [38] made remarkable progress in crime analysis and forecasting. However, their models require a longer training time for large datasets. Moreover, these models cannot perform well if the dataset contains data spanning more than a decade. Each city in our collection had a massive quantity of data. To address the training time issue, employing the transfer learning technique to converge the learning process is a smart option. In the case of solving a new problem, a model can use the knowledge of a previously trained model if the two problems are comparable. This action is called transfer learning. If there is not a decent amount of data to train on, or the training time is excessive, the transfer learning technique can help solve these problems. It makes the learning process faster and increases accuracy [49]. For example, $D_{Source}$ and $D_{Target}$ are two different domains having learning tasks $T_{Source}$ and $T_{Target}$, respectively, where $T_{Source} \neq T_{Target}$. However, the knowledge of the source model is similar to that needed by the target model. Hence, one can use the knowledge of the first model for the second model by transferring it (Fig. 5). Here, both of the models produce different outputs. In this work, the temporal features of the Chicago dataset are learned by applying the knowledge from the ATTN-LSTM model trained on the temporal features of San Francisco. This reduced the training time.
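As a concrete illustration, the sketch below shows how such a transfer could be wired up in Keras. This is an assumption on our part: the exact framework, the layer slicing, and the function name transfer_temporal_model are illustrative, not the authors' code.

```python
import tensorflow as tf

# Hypothetical sketch: reuse the hidden layers of the source-city model
# (e.g., the temporal ATTN-LSTM trained on San Francisco) and train only
# a fresh output head on the target city (e.g., Chicago).
def transfer_temporal_model(source_model: tf.keras.Model,
                            window: int = 24,
                            n_features: int = 17) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(window, n_features))
    x = inputs
    for layer in source_model.layers[1:-1]:  # skip source input and output head
        layer.trainable = False              # freeze the transferred knowledge
        x = layer(x)
    outputs = tf.keras.layers.Dense(n_features)(x)  # new target-city head
    return tf.keras.Model(inputs, outputs)
```

Freezing the transferred layers is what shortens training: only the small output head is optimized on the target city.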

C. GENERALIZING THE MODEL
Some developed models used FLF to obtain generalized performance. However, to get a better result, DLF is also required [2], [9], [38]. Our model employs two levels of fusion. The fusion approach can transform weak learners that are marginally better than a random guess into strong aggregated learners that can make correct predictions. Furthermore, the generalization power is far greater than that of the base learners [50]. In our work, the proposed model is a generalized model that can predict crime for any location without losing its efficacy. Different cities have different values for a single feature, and our model deals with these various values by using FLF. It also has a DLF module to obtain a better prediction result.

V. PROPOSED ARCHITECTURE
DL approaches work phenomenally for time-series-based forecasting. Taking inspiration from our study, we developed a DL-based architecture for predicting crime. Fig. 6 depicts the proposed architecture. The whole architecture is divided into four sub-models: the proposed work uses the St-Bi-LSTM model, the ATTN-LSTM model, and two levels of Fusion models. This novel architecture can overcome the issues of the current state of the art.
In our work, the LSTM cells of the Bi-LSTM layers and ATTN-LSTM layers are designed using the Swish activation function to avoid particular concerns with our dataset.

1) LSTM Cell:
An LSTM cell regulates the information flow through its forget, input, and output gates. The equations of the LSTM cell are the following:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \odot \tanh(c_t)$

Where, $x_t$ is the input, $h_t$ is the hidden state, $c_t$ is the cell state, $\sigma$ is the sigmoid function, and $f_t$, $i_t$, and $o_t$ are the forget, input, and output gate activations at timestep $t$.

2) Swish Activation:
Swish is a smooth, non-monotonic function that matches or outperforms ReLU in different machine learning problems. It is derived from the SiLU activation. The equation for Swish is the following:

$\mathrm{swish}(v) = v \cdot \sigma(\beta v)$

Where, $\beta$ is a trainable parameter and $\sigma$ is the sigmoid function. This function does not have a vanishing gradient problem. Also, ReLU produces 0 output for negative inputs, which therefore cannot be back-propagated, whereas Swish can partially handle this problem. The description of the sub-models is given in the following.
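A minimal sketch of this trainable-β Swish as a Keras layer is given below (our illustration; the class name and its use as an activation are assumptions, not the authors' code):

```python
import tensorflow as tf

class Swish(tf.keras.layers.Layer):
    """Swish activation with a trainable beta: swish(v) = v * sigmoid(beta * v)."""
    def build(self, input_shape):
        self.beta = self.add_weight(name="beta", shape=(),
                                    initializer="ones", trainable=True)

    def call(self, v):
        return v * tf.sigmoid(self.beta * v)

# Usage sketch: pass it as the activation of a layer, e.g.
# x = Swish()(tf.keras.layers.Dense(64)(inputs))
```

With β = 1 this reduces to SiLU; training β lets the network interpolate between a nearly linear function (β → 0) and ReLU-like behaviour (β → ∞).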

A. STACKED BIDIRECTIONAL LONG SHORT-TERM MEMORY (ST-BI-LSTM)
The St-Bi-LSTM model is designed with two TimeDistributed-wrapped Dense layers for training on the geographical, temporal, and derived features of the cities.
A TimeDistributed wrapper predicts one value per timestep for the whole input sequence. It allows applying the mentioned layer to each part of a sequence. So, this requires that the Bi-LSTM hidden layer returns a sequence of values (one per timestep) rather than a single value for the whole input sequence. Two Dense layers with TimeDistributed Wrapper minimize the size of the input of those layers and consider each part of the sequence while doing so. Finally, we got the desired sequence of output for this model.
The architecture of this model can be seen in Fig. 8. Some dominant parameters for this model are given in Table 3.
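A minimal Keras sketch of this sub-model is given below (the layer sizes are our assumptions for illustration; the parameters actually used are listed in Table 3):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch: two stacked Bi-LSTM layers that return one value per timestep,
# followed by two TimeDistributed Dense layers, as described above.
def build_st_bi_lstm(window: int = 24, n_features: int = 19) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(window, n_features))
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = layers.TimeDistributed(layers.Dense(32, activation="relu"))(x)
    outputs = layers.TimeDistributed(layers.Dense(n_features))(x)
    return tf.keras.Model(inputs, outputs)
```

The `return_sequences=True` flags are what make the TimeDistributed wrappers receive one vector per timestep rather than a single summary vector.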

B. ATTENTION BASED LONG SHORT-TERM MEMORY (ATTN-LSTM)
In Fig. 9, the architecture of an encoder-decoder model with ATTN-LSTM is shown. Here, nf indicates the number of features. The value of nf is 19 for Spatio-temporal features and 17 for the architecture with only the categorical-temporal features. At first, the features are given as input to an LSTM layer, which serves as an encoder. Then the encoded output is passed to the multiplicative self-attention LSTM layer to decode the data sequence. A multiplicative self-sequential-attention layer performs better for sequential data input than a vanilla attention layer. First, the input sequence is taken and matched as the rows and columns of a matrix. Then the hidden states are calculated. The hidden state vectors (h) are the sequence of specific features of the input. After that, the context vector (l) is computed using the weighted sum of the h vectors. The attention vector (e) gives the output score of the feed-forward neural network. Applying Softmax to calculate the weights (a), the scores of the attention vector are distributed fairly. The equations for the multiplicative self-attention layer are the following:

$e_{t,t'} = h_t^{\top} W h_{t'}$
$a_{t,t'} = \mathrm{softmax}_{t'}(e_{t,t'})$
$l_t = \sum_{t'} a_{t,t'} h_{t'}$

Where, a holds the attention weights, e is the attention vector of scores, l is the context vector, and h holds the hidden states.
The decoded output is passed to a dense layer with a TimeDistributed wrapper so that each part of the sequence is processed separately. This layer is followed by another dense layer to downsize the output sequence. The main parameters for this model are given in Table 4.
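The multiplicative self-attention step can be sketched as a small Keras layer (our illustrative reconstruction of the equations above, not the authors' code):

```python
import tensorflow as tf

class MultiplicativeSelfAttention(tf.keras.layers.Layer):
    """Sketch: e = h W h^T (scores), a = softmax(e), l = a h (context)."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(d, d),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, h):                                    # h: (batch, T, d)
        e = tf.einsum("btd,de,bse->bts", h, self.W, h)    # scores e[t, t']
        a = tf.nn.softmax(e, axis=-1)                     # attention weights
        return tf.einsum("bts,bsd->btd", a, h)            # context vectors l_t
```

Because every position attends to every other position of the same sequence, the layer keeps long-range dependencies without growing the path length.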

D. DECISION LEVEL FUSION (DLF) MODULE
To avoid the problems due to anomalies in the data and get more accurate results, the DLF module is introduced.
Here, the ATTN-LSTM (for Spatio-temporal features), the St-Bi-LSTM, and the FLF module: all three of these models are taken to get the final prediction. This sub-model is shown in Fig. 6 as the DLF module. DLF chooses the most appropriate outcome among the three outputs. For the DLF module, majority voting and the all-pairs shortest distance are implemented to find our proposed model's final crime prediction. At first, the module checks whether the predictions for each crime category support the majority voting system. Otherwise, the all-pairs shortest distance is applied for selecting the output. This module uses a weighted average system to measure the loss. The following is a short description of the prediction and loss calculation techniques in the DLF module.

1) Majority Vote
In the case of a majority vote, if an element occurs more frequently than the rest of the inputs, it is the majority element. In our work, if the predictions of a category from any two models are the same, that value is the final output. The equation of the majority vote is:

$C^* = \mathrm{mode}\left(\hat{y}_1^c, \hat{y}_2^c, \ldots, \hat{y}_m^c\right)$ (7)

Where, m denotes the number of models, $\hat{y}_i^c$ denotes the prediction of model i for a specific hour, c is the crime category, and $C^*$ is the majority vote for crime category c.
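In code, the rule is a few lines (a hedged sketch; the function name and the None fallback convention are ours):

```python
from collections import Counter

# Sketch of the majority vote over the three model outputs for one category.
# Predictions are rounded crime counts; with m = 3 models, any value that at
# least two models agree on wins. None signals "no majority", in which case
# the all-pairs shortest distance fallback (next subsection) is used.
def majority_vote(preds):                    # e.g. preds = [4, 4, 6] -> 4
    value, count = Counter(preds).most_common(1)[0]
    return value if count > len(preds) // 2 else None
```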

2) All-Pairs Shortest Distance
Finding the minimum distance between any two elements is the intention of the all-pairs shortest distance technique. Sometimes, to calculate the shortest distance, the distance from the mean of the elements is used [51]. Our DLF model uses the deviation from the mean to find the final prediction for each category among the 3 outcomes. The equation is:

$C^* = \operatorname*{arg\,min}_{\hat{y}_i^c} \left| \hat{y}_i^c - \frac{1}{m} \sum_{j=1}^{m} \hat{y}_j^c \right|$ (8)
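The fallback rule therefore reduces to picking the prediction closest to the mean of the three outcomes (a sketch with an illustrative function name):

```python
# Sketch of Eq. (8): among the three model outputs, return the prediction
# with the smallest deviation from their mean.
def closest_to_mean(preds):                  # e.g. preds = [2, 5, 9] -> 5
    mean = sum(preds) / len(preds)
    return min(preds, key=lambda p: abs(p - mean))
```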

3) Weighted Average Loss
This technique prioritizes some values by assigning weights. In this work, the DLF module calculates the average loss for each city using the weighted average loss for each district of that city. It calculates the weight for each model by checking the proximity of the prediction of each category.
The equations for calculating the loss are:

$Loss_L = \frac{\sum_{i=1}^{m} w_i \cdot loss_i}{\sum_{i=1}^{m} w_i}$ (9)

Where, L is the PdDistrict, m is the number of models in the DLF module, $w_i$ is the weight assigned to model i, and $loss_i$ is the loss of model i.

$Loss_{city} = \frac{1}{p} \sum_{L=1}^{p} Loss_L$ (10)

Where, p denotes the total number of PdDistricts for a city.

VI. WORKING PROCEDURE

A. EXPERIMENTAL SETUP
For execution purposes, we used the Google Colab platform for this work. The specifications of the Google Colab platform are:
• 1x Tesla K80 (2496 CUDA cores)
• 1x single-core hyperthreaded Xeon processor @ 2.3 GHz
• 13 GB RAM
• 108 GB runtime HDD
• OS: Linux kernel
Short experiments, like validating the code's functionality, were performed on a desktop computer.

B. FEATURE SELECTION METHOD
Features play a vital role in predictive models. The model predicts more accurately when the correlation between the features and the target value is strong. In this study, the R-value is used to measure this correlation. The equation for the R-value is given below:

$r = \frac{\sum_{i=1}^{ns} (f_i - \bar{f})(t_i - \bar{t})}{\sqrt{\sum_{i=1}^{ns} (f_i - \bar{f})^2} \sqrt{\sum_{i=1}^{ns} (t_i - \bar{t})^2}}$ (11)

Where, r is the correlation coefficient, $f_i$ is the value of the feature variable in a sample, $\bar{f}$ is the mean of the values of the feature variable, $t_i$ is the value of the target variable in a sample, $\bar{t}$ is the mean of the values of the target variable, and ns is the total number of samples in the dataset.
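With pandas, this correlation-based selection is a one-liner per feature (a sketch; the threshold of 0.1 is an assumption for illustration, not the paper's value):

```python
import pandas as pd

# Sketch: keep features whose Pearson r (Eq. (11)) with the target exceeds
# a chosen threshold in absolute value. Assumes all columns are numeric.
def select_features(df: pd.DataFrame, target: str, threshold: float = 0.1):
    r = df.corr()[target].drop(target)    # pandas computes Pearson r by default
    return r[r.abs() >= threshold].index.tolist()
```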

C. DATA PRE-PROCESSING
The data were not in the form we expected. Hence, we processed the data to get the required formation. The steps of pre-processing are given in the following.
A sorted hourly-based DateTime column was made for all the data by merging the Date and Time columns. Information about the day, week, and year was extracted from the date. Eqs. (12) and (13) encoded this information into a signal. The equations for the cyclic features are the following:

$x_{\sin} = \sin\left(\frac{2\pi \cdot ts}{s}\right)$ (12)

$x_{\cos} = \cos\left(\frac{2\pi \cdot ts}{s}\right)$ (13)

Where, ts is the timestamp in seconds, and s = 86400 seconds for a day, s = 604800 seconds for a week, and s = 220898664 seconds for a year. These signals help to correlate the periodic nature with the data. After this step, the features of missing timestamps within the chosen 14-year range were masked using those of the first timestamp of the dataset.
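The encoding can be reproduced with a few lines of pandas/NumPy (a sketch; using a POSIX timestamp as `ts` is our assumption about the reference clock):

```python
import numpy as np
import pandas as pd

# Sketch of Eqs. (12)-(13): map a timestamp onto a sine/cosine pair so that
# the end of each period (day, week, year) sits next to its beginning.
def cyclic_encode(ts: pd.Series, s: float, name: str) -> pd.DataFrame:
    return pd.DataFrame({
        f"{name}_sin": np.sin(2 * np.pi * ts / s),
        f"{name}_cos": np.cos(2 * np.pi * ts / s),
    })

# Usage sketch, with the period lengths quoted above:
# ts = df["DateTime"].astype("int64") // 10**9   # seconds since epoch
# day_signal = cyclic_encode(ts, 86400, "day")
```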
From the weather data, we took the average temperature (Kelvin) per hour. It was divided into 3 categories according to (14), where 0 indicates low, 1 indicates medium, and 2 indicates high. This categorized temperature information is the derived feature for our hourly timestamp-based crime data.

There are different unique values for the two features, Category and District. For these two cities, some of the data pre-processing steps are different. These steps are described below for each city:

1) San Francisco Crime Data
The total number of crimes per category was analyzed to find the top 10 out of 38 crime categories, which are shown in Fig. 10a. The results showed that "Larceny / Theft" occurred more than other crimes. The second-highest crime occurrence was "Other offenses", and the third-highest was "Non-criminal". During the analysis, it was found that the lowest total number of crimes for a category was 14, and many categories had fewer than 100,000 incidents, except the top 7. For balancing the dataset, these low-count categories were merged. There were 38 categories in total, and each category was mapped to a unique number from 0 to 37. Leaving out the top 7 categories, we divided the categories with fewer data into 3 groups for merging. Clustering approaches were used to make these groups: GRP0, GRP1, and GRP2. GRP0 consists of warrants and burglary; GRP1 consists of suspicious occ, robbery, missing person, and fraud; and GRP2 consists of the rest of the categories. After merging, there were 10 types of crime categories. These categories' names and mapped ids are: larceny/theft (1), other offenses (2), non-criminal (3), assault (4), vehicle theft (5), drug/narcotic (6), vandalism (7), GRP0, GRP1, and GRP2. After that, the unique PdDistrict names were converted to unique numerical values as PdId, ranging from 1 to 10. Then, we counted each category's crimes based on the hourly timestamp and police department district id.

2) Chicago Crime Data
After analyzing the total number of crimes per category, the top 10 out of 31 crime categories were acquired. These crimes are shown in Fig. 10b. The results showed that "Larceny / Theft" occurred more than other crimes. The second-highest crime occurrence was "Battery", and the third-highest was "Criminal damage". In the case of the Chicago dataset, the lowest total number of crimes for a category is 11, and many categories have fewer than 300,000 incidents, except the top 7. We merged these categories to have balanced data. The 31 categories in this dataset were mapped to unique numbers from 0 to 30. The smaller categories were merged into 3 groups, the same as for the San Francisco data. GRP0 consists of deceptive practices and vehicle theft, GRP1 consists of robbery and criminal trespass, and GRP2 consists of the rest of the categories. Therefore, the 10 categories' names and their mapped ids are larceny/theft (1), battery (2), criminal damage (3), drug/narcotic (4), assault (5), other offenses (6), burglary (7), GRP0, GRP1, and GRP2.

So, the acquired data from both datasets are the following:
1) PdDistrict id - Unique id given for unique police department districts.
2) Cyclic encoded day, week, and year information.
3) Count of crimes per category for the hourly timestamp.
4) Categorized temperature for the hourly timestamp.
Among these four features, PdDistrict id and temperature differ from city to city. Hence, we excluded these two features in the transfer learning technique to create a generalized model. Since these two features are among the main factors of crime, these features are processed using another model. The correlation between acquired features and the crime category is shown in Table 5 using (11).

D. WINDOW GENERATION
The acquired data was split into train, test, and validation sets in a ratio of 7:1:2. Since our dataset is time-series-based, the first 70% of the data is for training, the last 10% is for testing, and the remaining 20% is for validation. Then windows of data having 24 × 17 inputs and 24 × 19 inputs per window were generated. The windows mainly represent the input sequences. The 24 denotes the data of 24 hours of a day: we mapped the data of 24 sequential hours per sequence for input. The model predicted a data sequence of the same length after moving one hour ahead from the start of the input sequence. Thus, the model takes 24 hours of data as an input data point and predicts the data of the next hour together with the previous 23 hours, as shown in Fig. 11. In 24 × 19, the 19 denotes the 19 features to predict in the next time step. These features are the encoded signals of day, week, and year, the police department district id, the temperature category, the previous hour's count of the 10 categories, and the total number of crimes in the last hour. In 24 × 17, the 17 denotes the number of features excluding the police department district id and temperature category.
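The window construction can be sketched in a few lines of NumPy (our illustration; the array layout is an assumption):

```python
import numpy as np

# Sketch: each input window holds 24 consecutive hourly feature rows, and the
# target is the same-length window shifted one hour ahead, matching Fig. 11.
def make_windows(data: np.ndarray, window: int = 24):
    X, y = [], []
    for start in range(len(data) - window):
        X.append(data[start:start + window])           # hours t .. t+23
        y.append(data[start + 1:start + window + 1])   # hours t+1 .. t+24
    return np.array(X), np.array(y)

# Usage sketch: X, y = make_windows(city_features)  # city_features: (T, 19)
```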

E. TRAINING METHOD
Our main goal was to predict the number of crimes of each category using Spatio-temporal information. To do so, we trained the models according to Algorithm 1 using some evaluation metrics. The generated windows are applied to the inputs of the models according to the requirements. Our model takes 24 sequential hours of data and predicts the crimes in the next hour (the 25th hour). After going through this algorithm, we obtained the trained models. The early stopping monitored the validation loss and stopped the training if the same loss occurred 10 times consecutively within the 100 epochs. It also restored the model weights from the best value of the validation loss. The MAE loss of Bayview for the training and validation datasets in the FLF module is given in Fig. 12. The amount of loss decreases with increasing epochs, and at the end of the training, the losses are near 0.02 for both data samples. Moreover, the MAE values for the other areas are very low for the training data. It indicates that our proposed model is well trained and performs consistently on the test and validation data.
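The early-stopping behaviour described above maps directly onto the standard Keras callback (a sketch; the model and data names are placeholders):

```python
import tensorflow as tf

# Stop when val_loss has not improved for 10 consecutive epochs (within the
# 100-epoch budget) and restore the weights of the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# model.fit(train_X, train_y, validation_data=(val_X, val_y),
#           epochs=100, callbacks=[early_stop])
```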

F. EVALUATION METRICS

1) Mean Absolute Error (MAE)
MAE is an estimator of the mean absolute deviation of the predicted values from the actual values. Suppose there are n data points in a sample, and ŷ and y represent the vector of predicted values and the vector of true values, respectively. Then the equation for MAE is:

$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$ (15)

2) Mean Squared Error (MSE)
It estimates the mean of the squared deviation of the estimated values from the true values. Suppose there are n data points in a sample and we have generated the prediction vector for all the data in this sample. For this case, MSE is computed as the equation given below:

$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$ (16)

3) Mean Squared Logarithmic Error (MSLE)
MSLE estimates the mean squared logarithmic difference, i.e., the relative ratio, between the estimated values and the true values. In this case, the equation for MSLE is given below:

$MSLE = \frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2$ (17)

4) Coefficient of Determination (R²)
R² is the proportion of the variation in the predicted variable that the model explains; it indicates the closeness of the actual and predicted values. If the ratio of the sum of squared residuals (SSR) to the total sum of squares (SST) tends towards zero, the fitness of the model increases. For n data points in a sample, ȳ is the mean of all true values. In this case, the equation for R² is:

$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ (18)

5) Symmetric Mean Absolute Percentage Error (SMAPE)
SMAPE is a percentage (or relative) error-based accuracy measure. It helps to analyze the sensitivity of seasonal time-series-based forecasting [52]. The model predicts better when the value of SMAPE is closer to zero. The equation of SMAPE is given below:

$SMAPE = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| \hat{y}_i - y_i \right|}{\left( \left| y_i \right| + \left| \hat{y}_i \right| \right) / 2}$ (19)
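For reference, all five metrics of Eqs. (15)-(19) can be computed with NumPy as follows (a sketch; the small epsilon in SMAPE is our addition to guard hours where both counts are zero):

```python
import numpy as np

# Sketches of Eqs. (15)-(19); y and y_hat are NumPy arrays of equal length.
def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                                # Eq. (15)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                                 # Eq. (16)

def msle(y, y_hat):
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)             # Eq. (17)

def r2(y, y_hat):
    ssr = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1 - ssr / sst                                             # Eq. (18)

def smape(y, y_hat, eps=1e-8):
    denom = (np.abs(y) + np.abs(y_hat)) / 2 + eps                    # zero guard
    return 100 / len(y) * np.sum(np.abs(y_hat - y) / denom)          # Eq. (19)
```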

G. FINAL OUTPUT GENERATION
The outputs of the ATTN-LSTM (where nf = 19), the FLF module, and stream B are fused such that the DLF model has all the necessary information. This resulted in our model predicting each crime category's count better than the current state of the art. Algorithm 2 describes the way the outputs of the 3 different models are merged in the DLF module. The model generates the final result by applying (7) and (8) and calculates the final loss using (9). The value of m is 3 for these equations, as there are 3 models feeding the DLF. At first, this module checks whether the inputs support the majority vote; otherwise, it applies the all-pairs shortest distance to get the final output. For each input window, the final prediction of each crime category is thus obtained using the majority vote given in (7) or the all-pairs shortest distance given in (8). Then, the model calculates the loss of the final output using our weighted system in (9) and the loss for each city using (10). After completing all the steps of this algorithm, the number of crimes per category for a specific hour and area is forecast.

VII. RESULT ANALYSIS

A. RESULT OF PROPOSED ARCHITECTURE
The idea of time-series forecasting with deep learning models was amalgamated to predict crime. All models were evaluated using the metrics in (15)-(19). The evaluation metrics of each city for our proposed method, computed using (9) and (10), are shown in Tables 6 and 7. Temperature is the derived feature in our model. The evaluation metrics for the sub-models are listed in the appendix in Tables 9 and 10. In Table 9, the MAE loss ranges from 0.0101 to 0.0299 and the R² value ranges from 0.9643 to 0.9765 for those places, which also have the lowest SMAPE values. For this dataset, the temporal-based ATTN-LSTM model had the highest R², MAE, and SMAPE values with respect to the Spatio-temporal based models. The values of R², MAE, and SMAPE are 0.988, 0.12, and around 9%, respectively. Here, we took the whole city as one location. However, the crime patterns are different for the areas under a city.
The evaluation results of our proposed model for each location of San Francisco are given in Table 6. After observing the outcomes of both tables, we concluded that the loss of the proposed model for this dataset is considerably less than that of the rest of the sub-models. The MAE loss of the test and validation datasets in crime prediction for each area is less than 0.02 here. In the case of MSLE and MSE, the values of the proposed model for each area are less than 0.003 and 0.027, respectively. These values are far lower than those of the sub-models. The SMAPE value for each location is also less than that of the sub-models; this value is between 0.61% and 2.42%. Our model fused the categorical, temporal, and spatial information by using 3 different sub-models in the DLF. This module took the best outcome from those sub-models using the majority vote (7) and the all-pairs shortest distance (8). We calculated the weighted average loss (9) of this module based on the inputs, so the loss is less than that of the rest of the models. The value of R² for each location of this city is between 92% and 97%. So, our model worked as intended for the San Francisco dataset.
The losses of the proposed model and the sub-models for the Chicago dataset are given in Tables 7 and 10. In the case of this dataset, ATTN-LSTM performed better for some places, while the St-Bi-LSTM using the transfer learning technique fared better for the remaining ones among the sub-models. After comparing the values of MAE, MSLE, MSE, SMAPE, and R², it is clear that our proposed model worked better than most of the sub-models for this dataset. The losses and the R² value are better here for the same reason as for the San Francisco dataset. For the proposed model, the value of MAE is less than 0.035, MSLE is less than 0.015, MSE is less than 0.55, SMAPE is between 0.4 and 1.36, and R² is between 84% and 98% for each location.
The average loss for each city is calculated by combining the losses of each region using (10). These values are listed in Table 8. For the San Francisco dataset, the cumulative average MAE loss is 0.0082, the MSLE loss is 0.002, the MSE losses are 0.0132 (test data) and 0.0123 (validation data), the SMAPE is 1.03, and R² is 0.955. The cumulative average MAE loss is 0.02, the MSLE loss is 0.008, the MSE loss is 0.027, the SMAPE is 0.57, and R² is 0.94 for the Chicago dataset. From these values, we can state that our model achieved satisfactorily low losses and a high R². Thus, introducing the DLF module into the model was a good decision.
A transfer learning technique is implemented to train some features of the Chicago dataset. In this case, the source model is the ATTN-LSTM model of the San Francisco dataset. By applying this technique, the training time is decreased by 30 minutes compared with the other sub-models.
If any of the sub-models were not added to our model, the model would not correlate the different types of features, avoid unnecessary instances and emphasize related ones, or reduce the training time through learning convergence. Furthermore, our goal of making a generalized model would not be fulfilled.
Some predictions of our model are given in Fig. 13, where the predictions for an area are plotted using a pie chart. Fig. 13a is the prediction of crime for Bayview in San Francisco, and Fig. 13b is the visualization of the crime prediction for Area-2.0 of Chicago. In these figures, one can observe the predicted percentage of each crime category for an hour for a specific area of both cities. Here, the "Other offenses" percentage is highest for Bayview of San Francisco, and "Drug/Narcotic" is highest for Area-2.0 of Chicago. From these percentages, law enforcers can know the crime rate of each category for an hour and take the necessary precautions to reduce crime occurrence. Our model can also predict the number of crimes per category for these areas.

B. COMPARATIVE STUDY
The predicted numbers of crimes for a specific hour for all 6 models can be distinguished in Fig. 14. The blue bar denotes the true/actual value. We took Area-6.0 of Chicago for this visualization. It shows that our model predicted the number of crimes closest to the true value. The actual value is 2; our model predicted a little higher than the actual value, whereas the model of Wang et al. [38] predicted a little lower. The attention-based model of Rayhan et al. [9] predicted close to 3, and the SARIMAX model predicted 5 crimes for the specific hour. The models of Feng et al. [1] and FBProphet predicted the highest values among these 6 models; the predicted total number of crimes is approximately 9 for these models. Hence, these models do not perform well. Based on this prediction, we can conclude that our model outperformed the other 5 models.
The mentioned state-of-the-art models have limitations regarding the length of the data [1], [9], [38]. Our model can manage data from over a decade without losing any plausibility; we employed the ATTN-LSTM model and the transfer learning technique in the St-Bi-LSTM model to deal with this problem.
A few models [1] did not consider the Spatio features of a city. We utilized the knowledge of a city's districts or regions along with categorical-temporal information to forecast crime. So, our model has no limitations regarding Spatio features.
All four models have FLF in their architecture. However, our model has two levels of Fusion modules: FLF and DLF. The FLF module merges the Spatio-temporal features along with the categorical information. In the DLF module, the model learns more about the different outcomes from the Spatio-temporal based sub-models. Hence, with the help of this learning, the model can predict the number of crimes for each category and each location more accurately. Finally, we can state that our proposed model is an effective method for forecasting data.

VIII. CONCLUSIONS AND FUTURE WORK
In this work, the fusion technique is applied to predict crime on an hourly timescale for two cities of the USA. Introducing the DLF module in our architecture helped to predict the best result. Moreover, the use of the transfer learning technique reduced the training time by a certain amount. Our model can predict crime from Spatio-temporal based categorical data and has overcome all the limitations of the current state-of-the-art.
Although our model performs well, there are some disadvantages, such as the training time of the whole system being more than moderate. Due to the small amount of data in some categories, we needed to re-categorize them into 3 groups.
In the future, to overcome these problems, we plan to develop a model which requires less time to train and can also work with a small amount of data in a category.

ACKNOWLEDGMENT
This work is supported in part by Khulna University of Engineering & Technology (KUET), Bangladesh, and in part by CSE, KUET Alumni. The authors would like to thank all the people who have given their suggestions to achieve a great result. They also want to thank the reviewers for their precious opinions on this work.

CONFLICTS OF INTEREST
The authors declare that there are no conflicts of interest in relation to this publication.