A Lightweight Multitask Learning Model With Adaptive Loss Balance for Tropical Cyclone Intensity and Size Estimation

Accurate estimation of tropical cyclone (TC) intensity and size is key to disaster management and prevention. While great breakthroughs have been made in TC intensity estimation, there is currently a lack of research on TC size, which reflects the TC influence radius. Therefore, we propose a lightweight multi-task learning model (TC-MTLNet) with adaptive loss balance to simultaneously estimate TC intensity and size. Adaptive loss balance is used to address the inconsistent convergence speeds of the TC intensity and size estimation tasks. The model, built from four 2-D convolutions, four 3-D convolutions and three fully connected layers, requires less computation and storage and improves the accuracy of TC intensity and size estimation by sharing knowledge among multiple tasks. In addition, the distribution of TC samples is imbalanced, with significantly fewer low-intensity and high-intensity TC satellite data, which poses a great challenge to TC intensity and size estimation. We therefore use the influence of nearby samples to calibrate the sample density and weight the loss function accordingly, so that the model generalizes to all samples. The results show that the root-mean-square error (RMSE) of TC intensity estimation is 8.40 kts, which is 33.5% lower than that of the Advanced Dvorak Technique (ADT) and 11.4% lower than that of the deep learning method 3DAttentionTCNet. The mean absolute error (MAE) of TC size estimation is 20.89 nmi, a 16% reduction compared with the Multi-Platform Tropical Cyclone Surface Winds Analysis (MTCSWA).

eventually dissipating at sea or after moving over land. Landfalling TCs are accompanied by severe weather such as strong wind, rainstorms, and storm surges, which can cause significant loss of life and property [1], [2]. TC intensity is defined as the maximum average wind speed near the TC center. It is an important parameter that measures the destructive power of TC and is used in TC warning, prevention, and management. Accurate TC intensity estimation also helps to predict the rapid intensification of TC intensity. TC size indicates the radius of TC influence. TC size is usually measured by several wind radii provided by the forecast centers, including the gale-radius (35 kts, R35), storm-radius (50 kts, R50), hurricane-radius (64 kts, R64) and the radius of maximum wind (RMW) [3], [4], [5], [6], [7]. R35 represents the potential impact area of a TC and is one of the most widely used parameters to predict TC influence and mitigate TC impact. Thus, the present study conducts TC size estimations based on R35.
Obtaining TC observations is difficult because TCs spend most of their lifetime over the ocean, where deploying observation equipment is challenging. Therefore, aircraft and ships are used to obtain TC observations at sea. However, these observational methods are very expensive. With the development of artificial intelligence, satellite imagery has become the main source of information for TC intensity estimation. Although satellite imagery cannot directly measure TC intensity, it can be estimated indirectly through the captured cloud structure [8], [9]. For example, infrared satellite imagery provides the temperature distribution of radiation surfaces, water vapor satellite imagery provides information on the water vapor in the clouds, and microwave satellite imagery provides information such as the TC eye, eyewall and spiral rainbands. TC intensity estimation using satellite imagery is based on the fact that TCs of similar intensities have similar cloud structures. In meteorology, the main TC intensity estimation methods include the Dvorak technique [10], the advanced Dvorak technique (ADT) [11], the deviation angle variance technique (DAV) [12], and the satellite consensus technique (SATCON) [13], [14], [15]. These methods rely on human experience or various algorithms to obtain features related to TC intensity and then use regression models to obtain TC intensity. However, the cloud structure features related to TC intensity determined by humans are subjective. In addition, the design of feature extraction algorithms requires expertise, which greatly limits TC intensity estimation. With the development of deep learning technology, intensity estimation technology has gradually improved.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, there are still challenges in TC intensity estimation based on convolutional neural networks (CNNs) [16], [17], [18], [19], [20], [21], [22]. For example, the dependence of spatial information and the correlation of channel information are not taken into account. Due to the lack of low-intensity and high-intensity TC observation data, low-intensity TCs are overestimated and high-intensity TCs are underestimated. At present, TC intensity estimation methods focus more on improving accuracy and neglect the study of model interpretability.
Obtaining in-situ measurements during TC events is difficult. Therefore, other observation methods are used for TC size estimation, such as satellite [23], [24], [25], [26], [27], scatterometer [28] and microwave sounder [29]. There is no technology as widely used as the Dvorak method for TC size estimation because the physical environment information involved in TC size estimation is very complex. Recently, more and more researchers have started applying deep learning to estimate TC size. However, obtaining accurate estimations is still a challenge due to the complexity of the physical relationship between convection and wind field.
Currently, there is a lack of studies that simultaneously estimate TC intensity and size. We design a lightweight multitask learning model (TC-MTLNet) with parallel dual attention to estimate TC intensity and size simultaneously. We employ 2-D convolution and 3-D convolution to extract features from the infrared channel and the multichannel satellite imagery, respectively. Before the fusion of spatial features and channel features, the position attention module (PAM) and channel attention module (CAM) are applied to the spatial dimension and channel dimension, respectively. This dual attention considers the dependency between the feature map pixels and the correlation between adjacent channels, and allows local features and global dependencies to be effectively integrated to improve the accuracy of TC intensity and size estimation. Then, after the spatial and channel features are fused and flattened into one dimension, we incorporate environmental factors into the fully connected layer, which provides additional features related to TC intensity and size. Subsequently, we use two branches to learn TC intensity and size separately, while using minimum sea-level pressure (MSLP) as an auxiliary task to provide additional feature information for TC intensity and size estimation. Our main contributions are as follows.
1) Before the two branches are combined, we apply dual attention to improve feature representation, thus improving the model's performance. The PAM represents the interdependence between spatial features on the feature maps, and the CAM focuses on the correlation of channels.
2) An adaptive loss balance method is used to address the inconsistent convergence speeds of TC intensity and size estimation in multitask learning. The method automatically balances the training speed of multiple tasks by dynamically adjusting the gradients, so that the tasks learn at a similar speed and multitask learning is not dominated by one task.
3) To address the overestimation and underestimation of TC intensity and size caused by unevenly distributed TC samples, we employ the label distribution smoothing method, which convolves a symmetric kernel with the empirical density distribution to obtain an effective label density distribution that reflects the real sample imbalance. The loss function is then designed based on the effective label density to alleviate overestimation and underestimation.
4) We explore model interpretability using deep learning visualization techniques. First, by visualizing the feature maps, we can understand the features learned by the convolution kernels. Then, we use the deconvolution technique to visualize the feature maps generated by the filters to understand the role of each filter. Finally, by visualizing the high-dimensional features with Grad-CAM++, we can understand the contribution of each part of the satellite imagery to the intensity and size estimation.
The rest of this paper is organized as follows. The next section presents the research status of TC intensity and size estimation based on traditional and deep learning methods. Section III introduces the data, data preprocessing and the architectural details of the model used. The experimental results are discussed in Section IV, and Section V summarizes the research.

A. TC Intensity Estimation Based on Traditional Meteorological Methods
The Dvorak technique estimates TC intensity from infrared satellite imagery. This approach correlates TC intensity with its inner core, vertical motion patterns and outer cloud structures, assuming that TCs with similar cloud structures have the same intensity. However, the Dvorak technique is highly subjective, as meteorologists estimate TC intensity by manually analyzing the TC cloud structure in satellite imagery. The ADT was developed to overcome the limitations of the Dvorak technique. This method eliminates subjectivity in intensity estimation by using an objective storm center determination scheme and cloud pattern determination logic. The DAV technique takes the organizational level of infrared TC cloud features as an indirect measurement of TC intensity. It uses the deviation angle variance to quantify the degree of symmetry of cloud clusters, on the premise that TCs of similar intensity have a similar degree of cloud cluster symmetry. However, this method requires a prior TC center, and errors in the center location may lead to inaccurate TC intensity estimation. The Advanced Microwave Sounding Unit (AMSU) method is based on the relationship between the brightness temperature of each channel and TC intensity. The SATCON technique is a weighted estimation scheme that combines the ADT and several other TC intensity estimation methods to minimize the weaknesses of each method.

B. TC Intensity Estimation Based on Deep Learning Methods
With the development of artificial intelligence, deep learning has been increasingly applied to TC intensity estimation. Pradhan et al. [30] were the first to apply deep learning to estimate TC intensity. However, they manually deleted satellite imagery of poor quality, so the designed network lacked applicability. Chen et al. [31] designed a novel CNN model and proposed and verified the rotational invariance of TCs. Chen et al. [32] proposed a TC intensity estimation model based on multichannel satellite imagery and integrated basin information and location information into the network. Lee et al. [33] designed a multidimensional CNN to estimate TC intensity from geostationary satellite data and analyzed the intensity features extracted from multispectral infrared imagery through heat maps. Zhuo et al. [34] proposed a network for TC intensity and size estimation based on physical enhancement. With the augmentation of auxiliary physical information and tasks, the network achieved good performance. Tian et al. [35] designed a 3-D CNN to extract TC features from multichannel satellite imagery and used CBAM to improve the model's focus on features related to TC intensity. Sun et al. [36], [37] proposed novel image retrieval methods, which can mine similar TCs to further improve the accuracy of TC intensity estimation. Duan et al. [38], [39] proposed a self-supervised learning method to learn a deep neural network from unlabelled hyperspectral data and a novel hyperspectral image classification framework using the fusion of dual spatial information. These methods can enhance the extraction of TC intensity and size features, which can further improve the accuracy of TC intensity and size estimation. Tan et al. [40] designed a model incorporating a residual learning mechanism and an attention mechanism. However, these methods for TC intensity estimation still suffer from shortcomings.
For example, the dependence of spatial feature information and the correlation of channel feature information are not considered. Due to the lack of observation data for low- and high-intensity TCs, low-intensity TCs are overestimated and high-intensity TCs are underestimated. In addition, intensity estimation methods focus more on improving accuracy and neglect the study of model interpretability.

C. TC Size Estimation Based on Traditional Meteorological Methods
Demuth et al. [41] employed parameters derived from the AMSU data to measure TC intensity (maximum sustained winds and MSLP) and size (34 kts, 50 kts and 64 kts wind radii). However, the resolution of these data is too low to adequately reflect the TC structure. To overcome the limitations of this technique, many methods based on satellite imagery (e.g., [6], [42]) have been designed to estimate TC size. Knaff et al. [3] developed an automated, objective, Multiple-Satellite-Platform Tropical Cyclone Surface Wind Analysis (MTCSWA) method for TC size estimation that allows variable data weights to be applied to the input data. The combination of overall quality control and weighted variational analysis yielded smaller errors in TC size estimation.

D. TC Size Estimation Based on Deep Learning Methods
At present, deep learning has entered many research fields. Motivated by the positive progress of deep learning in TC intensity estimation [30], [31], [32], [33], [34], [35], [40], recent studies have begun to explore the application of deep learning to TC size estimation. Zhuo et al. [34] were the first to estimate TC size using deep learning. They designed a physics-augmented multi-task learning model to estimate TC size and found that learning multiple wind radius tasks and auxiliary intensity estimation tasks simultaneously yielded more accurate TC size estimation. Baek et al. [43] developed a novel multi-task learning model (tc-sem) for TC size estimation in the western North Pacific, which improved the accuracy of TC size estimation through knowledge sharing among multiple related tasks.
TABLE I: DETAILS OF THE IR, WV, PMW CHANNEL SATELLITE IMAGERY

III. DATA AND METHODOLOGY
A. Data and Data Preprocessing
1) Data: The TCIR [31] dataset from 2003 to 2017 is used in our study. This dataset provides infrared (IR), water vapor (WV), visible (VIS) and passive microwave rain rate (PMW) channel satellite imagery. Since the visible channel is very noisy at night, this study is based on the IR, WV and PMW channels. The details of the satellite imagery are shown in Table I. The spatial resolution of the IR and WV channels is 0.07° × 0.07°, and that of the PMW channel is 0.25° × 0.25°. Therefore, the PMW channel is enlarged by about 4 times using linear interpolation so that all channels have a uniform size. All images are 201 × 201 pixels, and the distance between adjacent data points is 4 km. In addition, the TCIR dataset provides the TC center location, intensity, size and minimum sea-level pressure (MSLP). The data from 2003 to 2016 are used for training, testing, and validation, and the data from 2017 are used to evaluate the performance of the best-optimized model. Owing to the rotational characteristics of TCs, we rotate each satellite image 10 times (0°, 36°, 72°, 108°, etc.) during testing and validation and take the average of the 10 estimates as the TC estimate, which makes the model's performance more stable.
2) Data Preprocessing: To improve generalization, we apply three data preprocessing and augmentation steps.
1) To retain the TC eye and the TC cloud structure and omit irrelevant features in the outer region, we crop the central area of the satellite imagery from 201 × 201 to 141 × 141, which has been shown to be the most effective area for TC intensity and size estimation.
2) We apply z-score standardization to the TC data. The data are converted to the same magnitude by subtracting the mean and dividing by the standard deviation, which improves the comparability of the data.
3) Given the rotation invariance of TCs, the satellite imagery is randomly rotated, which helps to improve the generalization ability of our model.
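The cropping and standardization steps can be illustrated with a short NumPy sketch. It assumes a channels-first array and per-channel statistics computed in advance; the function names `center_crop` and `zscore` are illustrative and not taken from the paper, and the random rotation step is omitted.

```python
import numpy as np

def center_crop(img, out_size=141):
    """Crop the central out_size x out_size window from a (..., H, W) array."""
    h, w = img.shape[-2:]
    top = (h - out_size) // 2
    left = (w - out_size) // 2
    return img[..., top:top + out_size, left:left + out_size]

def zscore(x, mean, std):
    """Standardize by subtracting the mean and dividing by the standard deviation."""
    return (x - mean) / std

# Example with a synthetic 3-channel (IR, WV, PMW) image of 201 x 201 pixels.
image = np.random.default_rng(0).normal(size=(3, 201, 201))
cropped = center_crop(image)  # shape (3, 141, 141)

# Assumption: per-channel statistics; here they come from the sample itself,
# whereas in practice they would be computed over the training set.
mean = cropped.mean(axis=(1, 2), keepdims=True)
std = cropped.std(axis=(1, 2), keepdims=True)
standardized = zscore(cropped, mean, std)
```

For a 201 × 201 input the crop starts at pixel index 30 in each direction, so the TC center stays at the center of the 141 × 141 window.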

B. Methodology
In this section, we first give an overview of the model and then give a detailed introduction to each module. We design a parallel dual-attention TC intensity and size estimation network (TC-MTLNet) based on multitask learning. The overall network architecture is shown in Fig. 1. The IR channel and the IR-WV-PMW three-channel combination are fed into the spatial and channel feature extraction modules, respectively. The spatial feature extraction module uses four 2-D convolutions, while the channel feature extraction module uses four 3-D convolutions. The features obtained from the spatial feature extraction module are then fed into the PAM, and the features from the channel feature extraction module are fed into the CAM. After the extracted spatial and channel features are fused and flattened, environmental factors are introduced into the fully connected layer. Subsequently, we employ two branches to learn features related to TC intensity and size, and an auxiliary task branch to provide additional useful information for TC intensity and size estimation.
1) Spatial Feature Extraction: The spatial feature extraction module uses four 2-D convolutions. The size of the input IR channel imagery is 141 × 141, the convolution kernel of the first convolution block (2conv1) is 4 × 4, and the convolution kernels of 2conv2, 2conv3, and 2conv4 are 3 × 3. The stride is 2 and the padding is 0. A rectified linear unit (ReLU) activation function is applied after each convolution. The detailed network structure is shown in Fig. 1.
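As a quick check of these settings, the spatial size after each 2-D convolution can be computed with the standard output-size formula for a convolution without dilation. The helper names below are illustrative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

def branch_sizes(size, kernels, stride=2, padding=0):
    """Spatial size after each convolution of a stack with a shared stride."""
    sizes = []
    for k in kernels:
        size = conv_out(size, k, stride, padding)
        sizes.append(size)
    return sizes

# 2conv1 uses a 4 x 4 kernel; 2conv2 to 2conv4 use 3 x 3 kernels.
print(branch_sizes(141, [4, 3, 3, 3]))  # [69, 34, 16, 7]
```

Under these settings the 141 × 141 input is reduced to a 7 × 7 feature map after the fourth convolution.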
2) Channel Feature Extraction: 2-D convolution slides only spatially, and its output is a cube made up of many faces, where the channel information completely overlaps. Unlike 2-D convolution, 3-D convolution slides in both the spatial and channel dimensions. It outputs a large cube stacked from many cubes and retains channel information well. Therefore, for the channel feature extraction module, we employ four 3-D convolutions with the following convolution kernel sizes: 1 × 4 × 4 for 3conv1, 1 × 3 × 3 for 3conv2, and 2 × 3 × 3 for 3conv3 and 3conv4. The numbers of convolution kernels are 16, 32, 64, and 128, respectively. The strides are set to 1 × 2 × 2, 1 × 3 × 3, 2 × 2 × 2 and 2 × 2 × 2, respectively, and the padding is 0.
3) Position Attention Module: In the spatial feature extraction module, we adopt 3 × 3 and 4 × 4 convolution kernels, which only consider local receptive fields, whereas TC intensity and size estimation should take more global features into account. To capture the global dependence between different locations in the feature maps, we introduce the PAM. Next, we describe how the PAM learns global features; the detailed structure is shown in Fig. 2. From the extracted spatial features $F_p \in \mathbb{R}^{C \times H \times W}$, three feature maps $F_1, F_2 \in \mathbb{R}^{C_1 \times H \times W}$ and $F_3 \in \mathbb{R}^{C \times H \times W}$ are generated. Next, $F_1$ and $F_2$ are reshaped to $f_1 \in \mathbb{R}^{C_1 \times N}$ and $f_2 \in \mathbb{R}^{C_1 \times N}$, where $N = H \times W$. Then, we feed the product of the transpose of $f_1$ and $f_2$ into the softmax function, which normalizes the result to obtain the position attention map $A_{pam} \in \mathbb{R}^{(H \times W) \times (H \times W)}$. The element $a_{ij}$ of $A_{pam}$ therefore represents the relationship between the $i$th and $j$th locations. $F_3$ is reshaped to $f_3 \in \mathbb{R}^{C \times N}$, which is multiplied by the position attention map $A_{pam}$ to obtain $F_4 \in \mathbb{R}^{C \times N}$. The element $f_{11}$ of $F_4$ represents the relationship between position 1 and all positions in channel 1. Finally, $F_4$ is reshaped to $f_4 \in \mathbb{R}^{C \times H \times W}$, and the final output $F_{pam} \in \mathbb{R}^{C \times H \times W}$ is obtained by adding $f_4$ and the original feature $F_p$:
$$A_{pam} = \mathrm{softmax}\left(f_1^{\top} f_2\right), \qquad F_4 = f_3 A_{pam}, \qquad F_{pam} = f_4 + F_p. \tag{1}$$
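The position attention computation can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three feature maps are produced here by hypothetical per-pixel linear maps `P1`, `P2`, `P3` (equivalent to 1 × 1 convolutions), and the softmax is taken over the first index so that the weights contributing to each output position sum to 1.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(F_p, P1, P2, P3):
    """Position attention over a (C, H, W) feature map.

    P1, P2 have shape (C1, C) and P3 has shape (C, C); they are hypothetical
    stand-ins for the layers that generate F1, F2 and F3.
    """
    C, H, W = F_p.shape
    x = F_p.reshape(C, H * W)        # (C, N) with N = H * W
    f1 = P1 @ x                      # (C1, N)
    f2 = P2 @ x                      # (C1, N)
    f3 = P3 @ x                      # (C, N)
    A_pam = softmax(f1.T @ f2, axis=0)  # (N, N); each column sums to 1 (assumption)
    F4 = f3 @ A_pam                  # (C, N): every position aggregates all positions
    return F4.reshape(C, H, W) + F_p, A_pam

rng = np.random.default_rng(0)
F_p = rng.normal(size=(8, 7, 7))
P1, P2 = rng.normal(size=(2, 8)), rng.normal(size=(2, 8))
P3 = rng.normal(size=(8, 8))
F_pam, A_pam = position_attention(F_p, P1, P2, P3)
print(F_pam.shape, A_pam.shape)  # (8, 7, 7) (49, 49)
```

Because the attention map has N × N entries, its memory cost grows quadratically with the number of positions, which is one reason to apply it to the small feature map at the end of the convolution stack rather than to the input image.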

4) Channel Attention Module:
The IR channel provides information about water vapor content in the upper atmosphere, the WV channel provides information about water vapor composition in the middle atmosphere, and the PMW channel can penetrate clouds to detect rain. The three channels are correlated. For example, previous studies have shown that features extracted from the IR and WV channels are similar, and features extracted from the IR and PMW channels are complementary. To extract the correlation between the channels more fully, we directly process the feature maps $F_c \in \mathbb{R}^{C \times H \times W}$ extracted by the channel feature extraction module to obtain the channel attention map $A_{cam} \in \mathbb{R}^{C \times C}$. As shown in Fig. 3, we first multiply the reshaped $F_c$ (of size $C \times N$, where $N = H \times W$) by its transpose, and then normalize the result with the softmax function to obtain $A_{cam}$. The element $a_{ij}$ of $A_{cam}$ therefore represents the relationship between the $i$th and $j$th channels. We then multiply $A_{cam}$ by the reshaped $F_c$ to obtain $F_4 \in \mathbb{R}^{C \times N}$. The element $f_{11}$ of $F_4$ represents the relationship between position 1 on channel 1 and position 1 on the other channels. Finally, $F_4$ is reshaped to $f_4 \in \mathbb{R}^{C \times H \times W}$, and $f_4$ is added to the original feature map $F_c$ to obtain the final output $F_{cam} \in \mathbb{R}^{C \times H \times W}$.
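A minimal NumPy sketch of this channel attention computation is given below. It assumes that the softmax is applied along each row of the C × C similarity matrix, so that every output channel is a weighted combination of all channels with weights summing to 1; the paper does not state the normalization axis.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(F_c):
    """Channel attention over a (C, H, W) feature map."""
    C, H, W = F_c.shape
    f = F_c.reshape(C, H * W)            # (C, N)
    A_cam = softmax(f @ f.T, axis=1)     # (C, C); each row sums to 1 (assumption)
    F4 = A_cam @ f                       # (C, N): every channel mixes all channels
    return F4.reshape(C, H, W) + F_c, A_cam

F_c = np.random.default_rng(0).normal(size=(8, 7, 7))
F_cam, A_cam = channel_attention(F_c)
print(F_cam.shape, A_cam.shape)  # (8, 7, 7) (8, 8)
```

Unlike the position attention map, the channel attention map has only C × C entries and no additional parameters, since it is computed directly from the feature maps.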

5) Multitask Learning:
To simultaneously estimate TC intensity and size, we design a multitask learning model (see Fig. 1). The convolution layer parameters of the model are shared, and the parameters of the fully connected layers are learned independently by the three tasks. In the fully connected layers, we use two independent branches to learn features associated with TC intensity and size from satellite imagery. In addition, an auxiliary task branch helps to learn features related to TC intensity and size by providing additional useful information. Sharing the underlying parameters in multitask learning has an effect similar to data augmentation. For example, when the TC intensity estimation task encounters features that are difficult to learn, the TC size task or the auxiliary task may learn these features more easily. Because the underlying parameters are shared among the tasks, the TC intensity estimation task then also benefits from these hard-to-learn features.

IV. MODEL TRAINING
Our model is trained on a GeForce RTX 2080 Ti GPU (11 GB) using the PyTorch framework. The CPU is an Intel Core i9-9900K. During training, we adopt the Adam optimizer with a learning rate of 0.0005. We define the multitask loss function $L(t)$ as a weighted sum of the losses of the different tasks, as given in (2), where $L_i(t)$ is the loss function of task $i$ and $w_i(t)$ is the weight of $L_i(t)$. Both $w_i(t)$ and $L_i(t)$ are dynamically updated with the iteration step $t$.
$$L(t) = \sum_i w_i(t) L_i(t) \tag{2}$$
Since the gradient magnitude and convergence speed differ between tasks, it is necessary to prevent multitask learning from being dominated by one task, which would negatively affect the other tasks. We therefore update the weight parameters $w_i(t)$ of the task loss functions by minimizing $L_{grad}(t; w_i(t))$, which is a function of the weights $w_i(t)$ and is shown in (3).
$$L_{grad}(t; w_i(t)) = \sum_i \left| G_W^{(i)}(t) - \bar{G}_W(t)\, r_i(t) \right| \tag{3}$$
The following formulas are the core of $L_{grad}(t; w_i(t))$. First, we define a formula to measure the loss magnitude of a task:
$$G_W^{(i)}(t) = \left\| \nabla_W \, w_i(t) L_i(t) \right\|_2$$
Here, $W$ denotes the parameters of the last shared layer of the multitask learning network. $G_W^{(i)}(t)$ is the L2 norm of the gradient, with respect to $W$, of the loss $L_i(t)$ of task $i$ multiplied by its weight $w_i(t)$. $G_W^{(i)}(t)$ measures the magnitude of the loss of a task: the larger $G_W^{(i)}(t)$, the larger the magnitude of the loss. $\bar{G}_W(t)$ is the average of $G_W^{(i)}(t)$ over all tasks:
$$\bar{G}_W(t) = E_{task}\left[ G_W^{(i)}(t) \right]$$
Then, we adopt two formulas to measure the learning speed of the tasks:
$$\tilde{L}_i(t) = \frac{L_i(t)}{L_i(0)}, \qquad r_i(t) = \frac{\tilde{L}_i(t)}{E_{task}\left[ \tilde{L}_i(t) \right]}$$
$L_i(t)$ and $L_i(0)$ are the losses of task $i$ at step $t+1$ and step 1, respectively. $\tilde{L}_i(t)$ reflects the reverse training speed of task $i$: the larger $\tilde{L}_i(t)$, the more slowly the loss decreases. $E_{task}[\tilde{L}_i(t)]$ is the average of $\tilde{L}_i(t)$ over all tasks, and $r_i(t)$ represents the relative reverse training speed of task $i$. It can be seen from $L_{grad}(t; w_i(t))$ that when the loss of a task is too large or too small, $|G_W^{(i)}(t) - \bar{G}_W(t)|$ becomes large, so $L_{grad}(t; w_i(t))$ increases. When the training speed of a task is too slow, $r_i(t)$ becomes larger, so $L_{grad}(t; w_i(t))$ also increases. Optimizing $L_{grad}(t; w_i(t))$ forces the model to select appropriate loss function weights $w_i(t)$ so that the gradient magnitude and convergence speed remain roughly the same across tasks.
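The weight update driven by the gradient-based balancing loss can be sketched in NumPy as follows. This is a simplified illustration under several assumptions that are not stated in the text: the gradient norms of the unweighted task losses are supplied by the caller (so that the weighted norm is the weight times that value), the target term is treated as a constant when differentiating, a plain gradient step with a hypothetical learning rate `lr` is used, and the weights are renormalized to sum to the number of tasks.

```python
import numpy as np

def gradnorm_step(w, grad_norms, losses, init_losses, lr=0.025):
    """One update of the loss weights w_i(t).

    w           : current loss weights, shape (T,)
    grad_norms  : L2 norms of the gradient of each unweighted task loss
                  with respect to the last shared layer, shape (T,)
    losses      : current task losses L_i(t), shape (T,)
    init_losses : initial task losses L_i(0), shape (T,)
    """
    w = np.asarray(w, dtype=float)
    grad_norms = np.asarray(grad_norms, dtype=float)
    G = w * grad_norms                      # weighted gradient norm per task
    G_bar = G.mean()                        # average over tasks
    L_tilde = np.asarray(losses, float) / np.asarray(init_losses, float)
    r = L_tilde / L_tilde.mean()            # relative reverse training speed
    target = G_bar * r                      # held constant for the update (assumption)
    L_grad = np.abs(G - target).sum()
    # d|w_i * g_i - target_i| / dw_i = sign(G_i - target_i) * g_i
    dw = np.sign(G - target) * grad_norms
    w_new = np.clip(w - lr * dw, 1e-6, None)
    w_new *= len(w) / w_new.sum()           # renormalize the weights (assumption)
    return w_new, L_grad, r

# Three tasks (intensity, size, MSLP); the first has a much larger gradient.
w_new, L_grad, r = gradnorm_step(
    w=[1.0, 1.0, 1.0],
    grad_norms=[4.0, 1.0, 1.0],
    losses=[0.5, 0.5, 0.5],
    init_losses=[1.0, 1.0, 1.0],
)
print(L_grad)  # 4.0
print(w_new)   # the first weight decreases, the other two increase
```

In this toy case all tasks train at the same relative speed, so the update only equalizes gradient magnitudes: the task with the dominant gradient is down-weighted and the others are up-weighted.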

B. Design L i (t) Based on Label Distribution Smoothing
The TC observation data display a long-tailed distribution with significantly fewer low-intensity and high-intensity TC data. Most current methods rely on the empirical label density to address overestimation and underestimation (where density is the number of samples with a given label). However, for continuous labels, the empirical label density fails to reflect the real imbalance seen by the network. For example, labels t1 and t2 both have a small amount of data [see Fig. 4(a)]. t1 is in the neighborhood of high-density samples (i.e., there are many samples in the range [t1 − Δ, t1 + Δ]), while t2 is in the neighborhood of low-density samples (i.e., there are few samples in the range [t2 − Δ, t2 + Δ]). In this case, t1 does not have the same degree of imbalance as t2 for continuous labels because of the dependence between data samples with nearby labels. To alleviate the overestimation and underestimation caused by the unevenly distributed data, we adopt a label distribution smoothing approach. The idea of label distribution smoothing is to convolve a symmetric kernel with the empirical label density to obtain the effective label density distribution, which reflects the real imbalance by exploiting the similarity between nearby labels. Fig. 4(b) shows the effective label density distribution obtained by smoothing the label distribution, which effectively reflects the imbalance observed by the neural network. Let $l_k$ be the effective density of samples labeled $k$, and let $E_{den}$ be the average of the reciprocal of the effective label density over all $K$ labels:
$$E_{den} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{l_k}$$
We define the weight parameter of the loss of each sample $j$ with label $k$ as $p_j$, which is obtained by dividing $1/l_k$ by $E_{den}$:
$$p_j = \frac{1/l_k}{E_{den}}$$
We obtain $L_i(t)$ by multiplying the loss $(\hat{y}_j - y_j)^2$ of each sample by the weight parameter $p_j$ and averaging over the $n$ samples, which addresses the problem caused by imbalanced TC samples [see (9)]:
$$L_i(t) = \frac{1}{n} \sum_{j=1}^{n} p_j \left( \hat{y}_j - y_j \right)^2 \tag{9}$$
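The label distribution smoothing and the resulting sample weights can be sketched in NumPy as follows. The sketch assumes integer-binned labels and a Gaussian kernel as the symmetric kernel (the kernel type, `sigma` and `half_width` are illustrative choices, not values from the paper), and it averages the reciprocal density only over labels that actually occur in the data.

```python
import numpy as np

def lds_weights(labels, n_bins, sigma=2.0, half_width=5):
    """Per-sample loss weights from label distribution smoothing.

    labels : integer label bin of each sample, values in [0, n_bins)
    n_bins : number of label bins (should be at least 2 * half_width + 1)
    """
    labels = np.asarray(labels, dtype=int)
    empirical = np.bincount(labels, minlength=n_bins).astype(float)
    # Symmetric Gaussian kernel (assumption), normalized to sum to 1.
    offsets = np.arange(-half_width, half_width + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    effective = np.convolve(empirical, kernel, mode="same")  # effective density
    present = empirical > 0                   # labels that occur in the data
    inv = 1.0 / effective[present]
    e_den = inv.mean()                        # average reciprocal density
    per_label = np.zeros(n_bins)
    per_label[present] = inv / e_den
    return per_label[labels], effective

def weighted_mse(pred, target, weights):
    """Mean of the per-sample squared errors scaled by the sample weights."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean(weights * (pred - target) ** 2))

# Labels 12 and 30 are equally rare, but 12 sits next to two dense labels.
labels = [10] * 50 + [11] * 50 + [12] * 5 + [30] * 5
weights, effective = lds_weights(labels, n_bins=40)
print(weights[100] < weights[105])  # True
```

The example mirrors the t1 and t2 case in the text: both rare labels have the same empirical count, but the one surrounded by dense neighbors receives a higher effective density and therefore a smaller loss weight than the isolated one.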

V. EXPERIMENT AND RESULTS
In this section, we first conduct module ablation experiments to verify the effectiveness of our model. Second, we verify the superiority of multitask learning over single-task learning, and the superiority of the automatic loss weighting method based on gradient normalization in multitask learning. Third, we verify the effect of the label distribution smoothing method on the overestimation and underestimation of TC intensity. We then apply visualization techniques to understand the feature learning process and the features that are important for TC intensity and size estimation. Finally, we validate the effectiveness of our optimal model using scatter diagrams and lifecycle sequence diagrams, and compare TC-MTLNet with traditional and deep learning methods.

A. Validation of Model Structure Design
In order to verify the validity of the model structure, we design 5 model experiments (M1, M2, M3, M4, M5) by gradually increasing the structure of the model. The evaluation and analysis of the experimental results are shown in Table II and Fig. 5. Considering that IR satellite imagery provides the most useful features for TC intensity estimation, we design a 2-D convolution model (M1) based on IR satellite imagery to extract spatial features. The details of the model parameters are shown in Table II. Then, we design a 3-D convolution model (M2) to extract the 3-D features of TC from the IR, WV, and PMW channels. M3 is a fusion of M1 and M2. It designs two branches to extract features from the IR channel and the multichannel satellite imagery using 2-D and 3-D convolutions, respectively. The root-mean-square error (RMSE) of M1, M2, and M3 are 9.33 kts, 8.88 kts, and 8.76 kts, respectively. The experimental results show that the combination of the two branches produces a lower error, which may be because the combination of spatial and channel features extracted by M3 obtains richer TC features. Next, M4 adds PAM and CAM to the spatial feature extraction module and the channel feature extraction module based on M3. The RMSE generated by this model is 8.69 kts. The experimental results show that the dual attention module pays better attention to the dependence of spatial features and the correlation of channel features. The last model M5 introduces location environment information (longitude, latitude) before the fully connected layer. The intensity estimation error is further reduced by model M5, which proves the validity of environmental information. Therefore, M5 is selected as the backbone of the TC-MTLNet.

B. Single-Task and Multitask Comparative Experiments
Multitask learning can learn useful information from related tasks, which can help improve the performance of each individual task. Fig. 6(a) shows that there is a significant positive correlation between TC intensity and size (r = 0.66): as TC intensity increases, TC size increases. Fig. 6(b) shows a significant negative correlation between TC intensity and MSLP (r = −0.95), which means that MSLP decreases as TC intensity increases. Fig. 6(c) shows a negative correlation between TC size and MSLP (r = −0.72). Therefore, we design a multitask learning model that shares parameters in the convolution layers. We use two branches to independently learn TC intensity and size features in the fully connected layers and adopt an auxiliary task (MSLP) branch to learn other useful information for TC intensity and size estimation.
In order to verify the effectiveness of the multitask learning model, we conduct two single-task experiments and four    Table III and Fig. 7. The TC intensity estimation RMSE, mean absolute error (MAE) and bias in the single-task experiment are 8.57 kts, 7.04 kts, and −1.01 kts, respectively. The single-task experiment for TC size estimation has an RMSE of 23.93 nmi, a MAE of 21.83 nmi, and a bias of −4.7 nmi. In multitask learning, the multitask1 directly adds three tasks and averages them. The experimental results are shown in Table III. The intensity estimation RMSE, MAE, and bias are 8.85 kts, 7.41 kts, and −1.3 kts, respectively. The size estimation RMSE, MAE, and bias of the size estimation are 23.94 nmi, 21.58 nmi and −4.53 nmi, respectively. It is obvious that the loss of TC intensity is higher than that of single-task learning. Because the magnitude of loss for different tasks may be different. The way in which the losses are directly added may lead to the learning of multiple tasks being dominated by one task. In other words, the multitask1 model tends to fit the TC size estimation task, which leads to a negative impact on the performance of the TC intensity estimation task. Therefore, the effect of the TC intensity estimation is worse. Compared with single-task learning, the accuracy of TC intensity and size estimation is improved in multitask2 and multitask3. However, fixed weights do not change during training. If the task of size estimation converges, but the task of intensity estimation does not converge. If the training continues until the intensity estimation task converges, the size estimation task will be overfitted. If the size estimation task is not trained at this time, the task of intensity estimation will not converge. Finally, we design a model multitask4 based on gradient normalization to dynamically update loss function weight. The experimental results show that multitask4 yields minimal TC intensity and size estimation errors. 
As shown in Fig. 7, multitask4 achieves the smallest RMSE and MAE for both intensity and size estimation (RMSE: 8.40 kts and 23.01 nmi; MAE: 6.90 kts and 20.89 nmi, respectively).
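The contrast between averaging the task losses and updating the loss weights dynamically can be illustrated with a small sketch. It assumes that "gradient normalization" refers to a GradNorm-style rule, in which each task's weighted gradient norm is pushed toward a common target scaled by how slowly that task is training; the exact rule used in multitask4 is not spelled out in this section. The function names, the hyperparameters alpha and lr, and all numeric values are hypothetical.

```python
def combined_loss(losses, weights):
    """Weighted sum of per-task losses."""
    return sum(w * l for w, l in zip(weights, losses))

def gradnorm_step(weights, grad_norms, losses, initial_losses, alpha=1.5, lr=0.025):
    """One GradNorm-style update of the task weights (illustrative only).

    weights        : current loss weights w_i (positive)
    grad_norms     : norm of each unweighted task gradient on the shared layers
    losses         : current task losses L_i(t)
    initial_losses : task losses at the start of training, L_i(0)
    """
    n = len(weights)
    # Weighted gradient norms G_i = w_i * g_i and their mean.
    weighted = [w * g for w, g in zip(weights, grad_norms)]
    mean_norm = sum(weighted) / n
    # Relative inverse training rate: above 1 for tasks that learn slowly.
    ratios = [l / l0 for l, l0 in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / n
    rel_rates = [r / mean_ratio for r in ratios]
    # Target gradient norm per task, treated as a constant.
    targets = [mean_norm * (r ** alpha) for r in rel_rates]
    new_weights = []
    for w, g, big_g, t in zip(weights, grad_norms, weighted, targets):
        # d/dw_i of |G_i - target_i| is sign(G_i - target_i) * g_i.
        sign = (big_g > t) - (big_g < t)
        new_weights.append(max(w - lr * sign * g, 1e-6))
    # Renormalize so that the weights sum to the number of tasks.
    total = sum(new_weights)
    return [w * n / total for w in new_weights]

# Hypothetical state for three tasks: intensity, size, MSLP.
losses = [0.9, 0.4, 0.6]
initial_losses = [1.0, 1.0, 1.0]
grad_norms = [0.5, 2.0, 1.0]          # the size task dominates the gradient
averaged = combined_loss(losses, [1.0 / 3] * 3)
updated = gradnorm_step([1.0, 1.0, 1.0], grad_norms, losses, initial_losses)
```

In this toy state the size task has both the largest gradient and the fastest loss decrease, so its weight is reduced while the intensity weight grows, which is the behavior that fixed weights cannot provide.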

C. Optimized Versions of Overestimated and Underestimated Model
In this study, TCs are classified into eight categories: the five hurricane categories of the Saffir-Simpson Hurricane Wind Scale (SSHWS) (H1-H5) [30], Tropical Storm (TS), Tropical Depression (TD), and No Category (NC).
TC observation data present a long-tailed distribution, in which TD and TS samples account for 29.3% and 37.7%, respectively, whereas NC, H1, H2, H3, H4, and H5 samples account for 8.8%, 9.9%, 5.1%, 3.9%, 4%, and 0.8%, respectively. We validate the model's performance using independent data from 2017, based on the optimal multitask4 configuration introduced in the previous section. As shown in Fig. 8, for the NC category (intensity < 20 kts), the bias of most samples is greater than 0; that is, these samples are significantly overestimated. For the H1, H2, H3, H4, and H5 categories (intensity > 63 kts), approximately three-quarters of the samples are underestimated.
We utilize the label distribution smoothing method, based on the similarity of nearby objects, to deal with the overestimation and underestimation problems. The detailed experimental results are shown in Table IV. For the NC category, the RMSE of TC intensity estimation without data balancing is 7.12 kts and the bias is 7.11 kts; with data balancing, the RMSE is 5.3 kts and the bias is 4.6 kts. The label distribution smoothing method therefore effectively alleviates the overestimation of TC intensity. For the H5 category, the RMSE of TC intensity estimation decreases from 10.34 kts to 8.0 kts after the balancing operation and the bias changes from −10.22 kts to −6.8 kts, so the bias between the estimated and best-track intensity is reduced. For TC size estimation, the RMSE and bias without data balancing are 7.21 nmi and 7.1 nmi for the NC category; with data balancing, the RMSE is 6.1 nmi and the bias is 4.4 nmi. For the H4 category, the RMSE of TC size changes from 34.6 nmi to 33.1 nmi after the balancing operation, and the bias changes from −25.8 nmi to −24.0 nmi. These results show that the label distribution smoothing method can effectively alleviate the overestimation and underestimation of TC intensity and size caused by unevenly distributed data.
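The idea behind label distribution smoothing can be sketched as follows, under stated assumptions: labels are binned, the empirical label histogram is smoothed with a symmetric Gaussian kernel so that each bin borrows density from its neighbors, and each sample's loss is weighted by the inverse of the smoothed density. The kernel shape, bin width, kernel size, and inverse-density weighting are assumptions for illustration; the function name and the sample labels are hypothetical.

```python
import math

def lds_weights(labels, bin_width=5.0, kernel_size=5, sigma=2.0):
    """Per-sample loss weights from a kernel-smoothed label histogram."""
    lowest = min(labels)
    bins = [int((y - lowest) // bin_width) for y in labels]
    counts = [0.0] * (max(bins) + 1)
    for b in bins:
        counts[b] += 1.0
    # Symmetric Gaussian kernel, normalized to sum to 1.
    half = kernel_size // 2
    offsets = list(range(-half, half + 1))
    kernel = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]
    # Smooth the empirical density using neighboring bins.
    smoothed = []
    for i in range(len(counts)):
        total = 0.0
        for offset, k in zip(offsets, kernel):
            j = i + offset
            if 0 <= j < len(counts):
                total += k * counts[j]
        smoothed.append(total)
    # Inverse-density weights, rescaled to have mean 1 over the samples.
    raw = [1.0 / smoothed[b] for b in bins]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

# Hypothetical intensity labels (kts): many mid-range samples, few extremes.
labels = [18, 30, 32, 35, 38, 40, 42, 45, 47, 50, 55, 60, 70, 95, 140]
weights = lds_weights(labels)
```

With these toy labels, the isolated 140 kts sample receives a larger weight than the densely populated 40 kts range, so rare high-intensity samples contribute more to the training loss.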

D. Visualization
The TC-MTLNet model achieves excellent performance. However, the CNN model's internal knowledge and the underlying reasons that drive it to make specific decisions are not directly interpretable. We therefore visualize the output feature maps of our optimal model to understand the features learned by the network through different convolutional layers. In addition, we examine the role of the convolution kernels using the deconvolution visualization technique. Finally, we use heatmaps to understand the contribution of each part of the imagery to TC intensity and size estimation.
1) The Visualization of Feature Maps: By visualizing the feature maps generated by each convolution layer, we can understand the learning process of the TC-MTLNet. Based on the optimal model, the fused output feature maps of the 2D conv1, 2D conv2, 2D conv3, 2D conv4, 3D conv1, 3D conv2, 3D conv3, and 3D conv4 convolutional layers are visualized in Fig. 9. Fig. 9(a) and (e) show the fused feature maps of the first 2D and 3D convolutional layers. We find that the layers close to the bottom of the network extract easy-to-understand, general visual features such as TC contour edge features. Fig. 9(d) and (h) show the fused feature maps of the fourth 2D and 3D convolutional layers, where the extracted features are mainly TC structural features. These features synthesize the underlying visual features and are more conducive to making TC intensity and size estimation decisions. As the network deepens, the learned features become more abstract and detailed. Abstract features represent high-level visual features, which are more helpful for accurate TC intensity and size estimation. Compared with the fused feature maps of 2D conv1, 2D conv2, 2D conv3, and 2D conv4, the fused feature maps of 3D conv1, 3D conv2, 3D conv3, and 3D conv4 contain richer features related to TC intensity and size. This demonstrates that 3-D convolution extracts the 3-D features of the TC by sliding along the channel dimension.
2) Understanding the Role of the Convolution Kernel by Deconvolution Visualization: Because different filters extract different features, we can understand the role of filters by visualizing the convolution kernels. However, directly visualized convolution kernels are abstract and do not convey much information. To understand the role of each filter, we use the deconvolution technique to visualize the feature maps generated by the filters. Specifically, the feature maps learned by our model are multiplied by the transpose matrix of the convolution kernel corresponding to these feature maps to find the pixels activated by specific feature maps. The visualized imageries obtained by the deconvolution technique are shown in Fig. 10. Fig. 10(a) shows simple spiral edge features learned by the 16 convolution kernels of 2D conv2. For example, the 11th convolution kernel of 2D conv2 mainly extracts edge features. Fig. 10(b) shows abstract detail features learned by the 64 convolution kernels of 2D conv4. For example, the 30th convolution kernel of 2D conv4 extracts the TC eye features.
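The transpose operation described above can be illustrated on a 1-D toy case. The sketch below writes a stride-1 convolution as a matrix, applies it to a signal, and then multiplies the activations by the transposed matrix to project them back to input space. The 1-D setting, the ReLU step before projection, the helper names, and the toy signal and kernel are all assumptions for illustration; the model itself operates on 2-D and 3-D feature maps.

```python
def conv_matrix(kernel, input_len):
    """Matrix form of a 1-D 'valid' convolution (cross-correlation), stride 1."""
    k = len(kernel)
    out_len = input_len - k + 1
    return [[kernel[c - r] if 0 <= c - r < k else 0.0 for c in range(input_len)]
            for r in range(out_len)]

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def transpose(matrix):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*matrix)]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]   # toy input with two edges
kernel = [-1.0, 0.0, 1.0]                       # simple edge detector
C = conv_matrix(kernel, len(signal))

feature = matvec(C, signal)                     # forward convolution
activated = [max(f, 0.0) for f in feature]      # keep positive activations only
projected = matvec(transpose(C), activated)     # back to input space via C^T
```

The projected vector has the same length as the input and is nonzero only around the rising edge that activated the filter, which is the sense in which the transpose reveals the input pixels responsible for a given feature map.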
3) Visualization of High Dimensional Features: There is a spatial correspondence between the feature maps output by a convolutional layer and the original imagery. We employ the Grad-CAM++ method to visualize the high-dimensional features. The method utilizes the feature maps of the last convolutional layer to generate a heatmap. Using the heatmaps, we judge the contribution of each part of the satellite imagery to the TC intensity and size estimation results and reveal the critical factors that affect these results. The heatmaps and original imageries for the eight categories of satellite imagery are shown in Fig. 11. In the heatmap, the red areas represent the most important areas. The higher the TC intensity, the higher the importance of the central region. For the H1, H2, H3, H4, and H5 categories, the model pays more attention to the inner core area; that is, for high-intensity TCs it focuses on the TC eye area. Weak TCs do not concentrate convection at the cloud center, so for the NC, TD, and TS categories the model pays more attention to the outer rainbands and cloud areas.
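As a rough illustration of how such a heatmap is formed from the last convolutional layer, the sketch below implements plain Grad-CAM, not Grad-CAM++: each feature map is weighted by the spatial average of the gradient of the output with respect to that map, the weighted maps are summed, and a ReLU keeps only positive evidence. Grad-CAM++ replaces the uniform spatial average with pixel-wise weights on the positive gradients, which is omitted here for brevity. The function name and the toy feature maps and gradients are hypothetical.

```python
def grad_cam(feature_maps, gradients):
    """Plain Grad-CAM heatmap from K feature maps and their gradients.

    feature_maps, gradients: lists of K maps, each an H x W list of lists.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight = spatial mean of the output gradient for that map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += w * fmap[i][j]
    # ReLU keeps regions that increase the output, then scale to [0, 1].
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap

# Toy example: one map responds to the center, one to the corners.
center_map = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
corner_map = [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
center_grad = [[1.0] * 3 for _ in range(3)]     # raises the output
corner_grad = [[-0.5] * 3 for _ in range(3)]    # lowers the output
heat = grad_cam([center_map, corner_map], [center_grad, corner_grad])
```

In this toy case the heatmap peaks at the center pixel and is zero at the corners, mirroring the pattern in which the central region dominates for intense TCs.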

E. Performance Evaluation
In this section, we select independent data from 2017 to evaluate the performance of our best-optimized model.
In Fig. 13, the red dotted line represents the best-track size and the blue dotted line represents the model-estimated TC size. We observe that model performance degrades as TC size increases, possibly because more complex meteorological factors are involved in the outer wind radius; this is a great challenge for our model and is worth studying in the future. To verify the validity of our model, we display scatter plots to examine the model's ability to fit the best-track intensity and best-track size (see Fig. 14). In Fig. 14(a), the abscissa represents the best-track intensity, and the ordinate represents the estimated TC intensity. The scatter plot shows a significant positive correlation between the best-track intensity and the estimated intensity (r = 0.935), reflecting our model's excellent performance. In Fig. 14(b), the abscissa represents the best-track size, and the ordinate represents the estimated TC size. Although there are a few outliers, the estimated size is significantly correlated with the best-track size. In general, the model fits the best-track size well.
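For reference, the three error measures used throughout this evaluation can be sketched as follows. The sign convention for bias (estimate minus best-track, so that positive values indicate overestimation) follows the usage in the text; the function name and the sample values are hypothetical.

```python
import math

def error_metrics(estimated, best_track):
    """Return (RMSE, MAE, bias) of estimates against best-track values."""
    if len(estimated) != len(best_track) or not estimated:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [e - b for e, b in zip(estimated, best_track)]
    n = len(errors)
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mae = sum(abs(d) for d in errors) / n
    bias = sum(errors) / n          # positive means overestimation on average
    return rmse, mae, bias

# Hypothetical intensity values in kts (not from the paper's data).
estimated = [30.0, 50.0, 70.0]
best_track = [35.0, 45.0, 80.0]
rmse, mae, bias = error_metrics(estimated, best_track)
```

Because RMSE squares the errors, it penalizes large misses more than MAE does, which is why the two are reported together with the signed bias.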
A strictly like-for-like quantitative comparison is not possible because previous methods employed satellite data from different regions and channels (see Table V). However, our study is based on satellite data from three channels around the globe, so our model has wider applicability than other methods. In Table V, we compare the results of our model with existing methods. The TC intensity RMSE of our model is 8.40 kts, which is lower than that of previous studies. Thus, the model we designed has higher accuracy and wider applicability.
In Table VI, we compare the TC-MTLNet model with existing deep learning models and traditional methods. It can be found that the performance of the TC-MTLNet model exceeds that of MTCSWA and Meng et al.
The parameters, space consumption, and time consumption of model training are shown in Table VII, which clearly shows the number of parameters in each layer. To reduce the number of parameters, we mainly use 3 × 3 and 1 × 1 convolution. We can find that a total of 1 971 394 parameters are used in the process

VI. CONCLUSION
In this study, we proposed a parallel dual-attention TC intensity and size estimation model (TC-MTLNet) based on multitask learning. The model takes three measures to improve the ability of TC intensity and size estimation. First, the model employs two parallel branches to extract spatial and channel features. Before the fusion of the two branches, the spatial and the CAMs are applied to the spatial and channel dimensions to obtain the interdependence of features on the feature maps and the correlation between channels. This approach improves the capability of feature representation, thereby improving the performance of our model. Compared with single-task learning, multitask learning can learn additional useful information from related tasks, positively affecting TC intensity and size estimation. However, the learning speed and gradient magnitude of the multiple tasks differ. Therefore, we adopt the gradient normalization method to prevent the learning of multiple tasks from being dominated by one task. In addition, we propose a label distribution smoothing method based on kernel distribution, which takes advantage of the similarity between nearby targets to address the overestimation and underestimation caused by the uneven distribution of TC observation data. This method reduces the bias in TC intensity and size estimation and further improves the accuracy of model estimation. Finally, we visualize feature maps, convolution kernels, and high-dimensional features to understand how the TC-MTLNet model learns intensity and size features and to identify the essential factors for intensity and size estimation from TC satellite imagery. We evaluated the performance of the best-optimized model using independent data from 2017. The experiments show that our model generates an intensity estimation RMSE of 8.40 kts, which is lower by 33.5% compared to ADT and 11.4% compared to 3DAttentionTCNet.
Furthermore, a size estimation MAE of 23.76 nmi shows that the performance of our model surpasses MTCSWA by 10%. In this article, we design a model that alleviates the overestimation and underestimation problems caused by unevenly distributed TC data; addressing these problems further remains a research direction for future work. In addition, given the lack of deep learning research on TC size, future work will focus on systematically studying TC size estimation to better support TC disaster prevention.