MorseNet: A Unified Neural Network for Morse Detection and Recognition in Spectrogram

Short-wave radio is an indispensable means of long-distance communication, in which Morse signals, valued for their simplicity and efficiency, play an important role in military and civilian applications. Automatic Morse detection and recognition have been researched for many years, but several thorny problems in real-world communication continue to restrict the performance of existing methods. In this article, by introducing deep learning technology, we propose a network named MorseNet that can simultaneously locate and decode Morse signals in the spectrogram. MorseNet uses shared convolutions to extract shared features for both the detection and recognition branches. The detection branch regresses bounding boxes based on signal centerlines, and the recognition branch decodes Morse fragments cropped from the feature maps by a convolutional recurrent neural network (CRNN). The losses of the two branches are combined to enable end-to-end training. Experimental results on four “simulated Morse + real background” datasets demonstrate that the proposed method achieves state-of-the-art performance in both detection and recognition, and that it effectively alleviates four problems that have long hampered these tasks. Furthermore, the joint training strategy and architecture give MorseNet advantages over its two-stage deployment in terms of accuracy, speed, and model size.


I. INTRODUCTION
A Morse signal is a type of continuous wave (CW) with a steady frequency and intermittent timing. It consists of five types of code elements: dot, dash, intra-code interval, inter-code interval, and code group interval, whose permutation order can represent different characters. Due to the simple coding scheme, narrow frequency band, and strong anti-jamming capability, Morse signals are widely applied in aviation, maritime, and military communications [1]. At present, the copying of Morse signals, especially those sent manually, is still mainly performed by human operators, which places pressure on the operators and yields unstable accuracy. Therefore, automatic Morse detection and recognition have been researched for many years, but some tricky problems in actual communications make them quite difficult. In recent years, in view of the excellent performance of deep learning (DL) technology on images, speech, and natural language processing, it has also been introduced to different aspects of communications and networks [2], [3] and has shown great potential.
Because Morse signals have a steady frequency and intermittent timing, time-frequency analysis methods led by the short-time Fourier transform (STFT) [4] dominate the preprocessing. The spectrogram obtained by the STFT can clearly visualize the time and frequency information of the signals. Morse detection is the prerequisite of recognition; its aim is to detect the presence and time-frequency location of Morse in the received wireless data. In the spectrogram, traditional methods usually first extract the fragments that contain signals by energy detection, and then, they design classifiers, including machine learning or deep learning models, to classify the signal type [5]-[7]. Energy detection performs well in a spectrogram with scattered signals, but it can make mistakes when signals are densely distributed. Recently, some researchers have exploited the single shot multibox detector (SSD) network to detect multi-type signals in spectrograms [8], [9]. SSD is a common DL-based object detection method that is capable of locating signals by a bounding box (BBox) and identifying their type. However, it uses the center point of the object to predict the BBox size, whose receptive field is limited, especially for horizontally long signals. Thus, it usually fails to predict the complete BBox for such signals, which is unacceptable for the subsequent recognition task. In addition, it generates too many candidate anchors to regress, which is time-consuming, and the anchor size is difficult to determine because of the dramatic change in the length of the signals. Inheriting from SSD, we targeted the characteristics of the signals and proposed an improved detector in our earlier work [10]. The detector first finds the centerline of the signal in a heat map, whose receptive field can cover the whole signal, and then, it predicts the BBox size directly at the centerline points, thus abandoning anchors; this yields more intact BBox proposals and simplifies the model, greatly increasing its speed. In view of the excellent performance of DL in computer vision, it has great potential for spectrogram-based signal detection.

For the recognition of detected Morse, a common method is to identify the code types (dot, dash, and three intervals) in time sequence, and then to look up the code-to-character table to obtain the final text. The code types are classified by the code lengths. In the spectrogram, traditional methods [5], [6], [11] apply image processing, including contrast enhancement, binarization, and morphological denoising, to highlight the Morse regions, where the lengths of the bright strips and their intervals are recorded as a feature set. Then, a clustering method, such as k-means or c-means, is introduced to classify the code types. Those methods divide the recognition task into multiple stages, which increases the complexity, and they depend heavily on the image processing effect. In [12], we utilized a convolutional recurrent neural network (CRNN) to accomplish end-to-end image-to-character level recognition. The CRNN makes use of a convolutional neural network (CNN) to extract an image's deep features and a recurrent neural network (RNN) to capture the context information, which greatly improves accuracy, simplifies the processing, and requires no table look-up.
It can be seen that DL-based methods have achieved state-of-the-art performance in both the detection and the recognition of Morse [10], [12]. However, the two neural networks are trained separately, which means that the detection model ignores the character-level information from recognition that could help improve the detection effect. In addition, recognition is conducted on Morse regions cropped from the original image, one by one, which costs a substantial amount of time, especially for spectrograms that contain many Morse signals. Moreover, duplicate CNN-based feature extractors in the detection and recognition networks introduce operational redundancy. Inspired by the FOTS method [13], a typical text spotting network that combines text detection and recognition networks to obtain better and faster performance, we propose to combine the Morse detection and recognition networks into a unified network named MorseNet and implement end-to-end training. Based on the detection model in [10], we add the CRNN model [12] after the feature extraction CNNs as a recognition branch. Thus, the feature extraction CNNs become a shared convolution, which supplies shared features to both the detection and recognition branches. The detection branch is a multi-channel convolutional network that locates signals at their centerlines and regresses the BBoxes. The recognition branch consists of CNNs, a bidirectional long short-term memory (BLSTM) encoder, and a connectionist temporal classification (CTC) decoder. Through joint supervision, the visual and context information can be shared between the two tasks, which are thus expected to improve each other's performance. In addition, the shared CNNs avoid duplicated computation. Experimental results show that our MorseNet outperforms traditional methods and its two-stage counterpart in both accuracy and speed.
To summarize, the contributions of this article are as follows:
• We propose a unified neural network named MorseNet for the detection and recognition of Morse signals in spectrograms. To the best of our knowledge, this is the first DL-based architecture for simultaneously detecting Morse signals and recognizing Morse codes.
• We introduce a shared convolution to extract shared features for the detection and recognition branches, and we combine the two branch losses to implement end-to-end training, which improves accuracy and saves time.
• To make the experiments persuasive, we simulate Morse signals and add them to real-world backgrounds in the time domain. Experimental results show that MorseNet obtains state-of-the-art performance in both detection and recognition on four datasets.
For the remainder of this article, Section II reviews the related work on Morse detection and recognition, while Section III introduces data collection and some common problems in the task. Section IV describes the details of our methodology, and Section V evaluates the performance of MorseNet in comparison with baselines. The conclusions are drawn in Section VI.

II. RELATED WORK
Automatic Morse detection and recognition are two problems with a long history; since the invention of Morse code in 1837, many researchers have studied them. In this section, we give a brief introduction to the related work on these two tasks, which is summarized in Table 1.

A. MORSE DETECTION
Existing Morse detection methods can be categorized into traditional methods and DL-based methods.
Traditional methods focus mainly on the time domain, the frequency domain, or both. Envelope detection [14] is the earliest method; it is fast but has weak noise resistance and poor practicability in today's complex electromagnetic environment. A phase-locked loop [15] can track the signal frequency, provided the frequency is accurately estimated in advance, but it is sensitive to interference. Filtering methods, including Kalman filtering [16] and adaptive filtering [17], have also been introduced. By elaborately designing a filter, the signal can be effectively denoised, but this also requires a frequency estimate in advance and cannot handle an unstable frequency. Some signal transformations, such as the Fourier transform [18], the complex variance spectrum [19], and the wavelet transform [20], can find the Morse frequency in the spectrum, but they only obtain the frequency distribution of the signals, without the time information, and they cannot effectively distinguish Morse from other signals. The methods above mainly operate in the time domain, under the assumption that the processed data contain only one channel of Morse. Time-frequency analysis methods, which take advantage of the typical characteristics of Morse in both the time domain and the frequency domain, have become the mainstream. Yue et al. [21] utilized a discrete Gabor transform to obtain time-frequency information of Morse, but they lacked a related algorithm to distinguish Morse from interference. As the main idea in most of the literature, Wei et al. [5], Sun et al. [6], and Yuan et al. [7] employed energy detection on the spectrogram and introduced a classifier, such as a machine learning or DL model, to classify the signal type. Among them, the CNN-based model in [7] obtained the best classification result. Nevertheless, since energy detection is quite sensitive to noise, especially in short-wave communication, those methods suffer from low detection accuracy for the signal types of interest.
For the DL-based methods, in recent years, Zha et al. [8] and Singh [9] utilized the DL-based object detector SSD and converted the task of multi-type signal detection in a spectrogram into object detection in an image, which is a clever idea capable of locating signals of different types. However, SSD and other object detectors usually raise many candidate anchors in advance, whose size is difficult to determine because of the dramatic change in the length of signals, and their regression is time-consuming. Moreover, their center point-based detection is not suitable for horizontally long signals, which leads to incomplete BBox proposals. To make up for the above shortcomings, in [10], we proposed a centerline-based neural network that models the signal by its centerline and corresponding properties, rather than by candidate anchors, and we achieved state-of-the-art performance for multi-type signal detection in spectrograms.

B. MORSE RECOGNITION
Traditional recognition steps are to first obtain the code lengths, then classify the code types, and finally, look up the code-to-character table.
To obtain the code lengths, researchers in [23]-[25] directly tracked signal waveforms in the time domain. To get rid of interference and highlight the electric levels, they usually conducted filtering and binarization on the original data as preprocessing. Those methods worked under the condition that there is only one channel of Morse in the data, and they depended greatly on the preprocessing effect. Wei et al. [5], Sun et al. [6], and Wang et al. [11] adopted spectrograms obtained by the STFT and combined them with image processing tools to highlight the Morse regions, where the lengths of the bright strips and their intervals were counted. However, their performance was also limited by the image processing effect.
Classifying the code types depends on the time lengths of the codes. Theoretically, the length ratio of dot, dash, intra-code interval, inter-code interval, and code group interval is 1:3:1:3:5. Xiao and Gao [20] modified the Gunther algorithm, an earlier but relatively powerless decoding algorithm. Some researchers constructed traditional machine learning models, such as the support vector machine (SVM) [16], [25], k-means clustering [6], [11], [22], [24], and c-means clustering [6], to classify the codes. Traditional machine learning models use only the code lengths as features, without context information, and thus are not robust to sharp code length deviations. The above methods accomplish code-level recognition, and additional post-processing is inevitable, including code-to-character table look-up and error correction. Researchers in [6], [26] designed algorithms to speed up the table look-up, and those in [11], [24] made error correction rules to further improve the recognition results. To summarize, the bottleneck of traditional methods lies mainly in obtaining the code lengths, which places a large amount of pressure on the preprocessing in the recognition task. In addition, the above methods are all multi-stage, which can cause error accumulation, and the table look-up and error correction are time-consuming.
Recently, Wang et al. [27] utilized the hidden Markov model (HMM) + deep neural network (DNN) framework, a classical speech recognition approach, to accomplish character-level recognition, but its performance was not sufficiently high. Building on this, in [12], we used a deep neural network, the CRNN, to recognize Morse in a spectrogram at the character level and obtained state-of-the-art performance, which was also the first DL-based attempt on this task. In recent years, DL technology has shown a strong ability in image perception and sequence modeling, and thus, it is very suitable for spectrogram-based Morse recognition.
As can be seen, applications of DL in Morse detection and recognition are relatively rare, let alone a unified network that implements the end-to-end task. Compared to the two-stage processing of detection + recognition, end-to-end processing lets the two tasks share learned features and saves time by merging redundant structures. Thus, in this article, we construct a neural network with an elegant and complementary architecture to accomplish this task.

III. PRELIMINARY
Considering the real-time ability of our system, the input is a narrowband spectrogram that contains multi-channel Morse signals. In this section, we introduce our data collection method and some long-standing problems faced by Morse detection and recognition tasks.

A. DATA COLLECTION
Our dataset consists of synthetic Morse signals and real-world wireless signals, the latter serving as background noise. To simulate operators of various skill levels, we designed formulations for tuning the code speed deviation, frequency drift, frequency jitter, and other common distortions when generating Morse signals, the same as in [12]. Then, real-world wireless signals are added at various signal-to-noise ratios (SNR) to constitute the final synthetic datasets. The background signals are collected by a short-wave radio station, a WiNRADiO G39DDC [28], which receives wideband data containing various types of signals. We implement digital down-conversion (DDC) to obtain narrowband backgrounds and add multi-channel Morse signals in the time domain. Background data are collected at different times of the year and converted from different frequency bands.
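As a concrete illustration of this mixing step, the sketch below scales a simulated Morse waveform against a recorded background so that their power ratio matches a target SNR; the function name mix_at_snr and its interface are our own illustration, not the code used in our experiments.

```python
import numpy as np

def mix_at_snr(morse, background, snr_db):
    """Add a simulated Morse waveform to a real-world background in the time
    domain so that the Morse-to-background power ratio equals snr_db.
    Both inputs are 1-D float arrays sampled at the same rate."""
    n = min(len(morse), len(background))
    morse, background = morse[:n], background[:n]
    p_morse = np.mean(morse ** 2)
    p_bg = np.mean(background ** 2)
    # Gain on the Morse signal that yields the requested power ratio.
    gain = np.sqrt(p_bg / p_morse * 10.0 ** (snr_db / 10.0))
    return gain * morse + background
```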
After combining the simulated Morse and the real background, we transform the data into a narrowband spectrogram by the STFT, the calculation of which is as follows:

X_n(\omega) = \sum_{m} s(m)\, w(m - nl)\, e^{-j\omega m},    (1)
P_n(\omega) = |X_n(\omega)|^2,    (2)

where s(m) denotes the sampled signal, w(m) denotes the Hanning window function, n is the frame index, and P_n(\omega) is the time-frequency energy matrix. The resolution of the spectrogram is determined by the step time l of the Hanning window and the number of FFT points n_fft. Although decreasing l or increasing n_fft makes the spectrogram display more detailed information in the time or frequency domain, it enlarges the image size, which reduces the real-time performance. Based on our engineering experience, we set l = 0.02 s and n_fft = 1024 for data with a 15 s duration and a 9000 Hz sampling frequency. Fig. 1 shows an instance of our input spectrogram.
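For reference, the spectrogram computation with the stated parameters can be sketched with SciPy as follows; the window length is not given in the text, so we assume it equals n_fft.

```python
import numpy as np
from scipy.signal import stft

FS = 9000             # sampling frequency (Hz)
HOP = int(0.02 * FS)  # step time l = 0.02 s -> 180 samples
N_FFT = 1024

def spectrogram(x):
    """Time-frequency energy matrix P_n(omega) = |STFT{x}|^2 with a Hanning window."""
    _, _, Z = stft(x, fs=FS, window='hann', nperseg=N_FFT,
                   noverlap=N_FFT - HOP, nfft=N_FFT)
    return np.abs(Z) ** 2
```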

B. MAIN PROBLEMS IN DETECTION AND RECOGNITION
Automatic detection and recognition of Morse have been researched for many years, but with few major breakthroughs. This is mainly because several thorny problems have not been adequately addressed.

1) FADING AND FREQUENCY DRIFT IN THE SHORTWAVE CHANNEL
A shortwave channel is a typical randomly parametric channel that transmits signals by ionospheric reflection and therefore suffers from multipath effects. A change in the ionosphere or the weather destabilizes the channel, which is accompanied by energy fluctuations and frequency drift of the received signal. When burst interference or fast fading occurs, the SNR declines sharply, which requires the algorithm to be strongly robust at low SNR.

2) ADJACENT CHANNEL INTERFERENCE
Adjacent channel interference refers to the case in which a radio station receives more than one channel of Morse or other signals at its working frequency. In this case, a detection algorithm must distinguish signals of not only different channels but also different types, which demands high frequency resolution and strong classification ability.

3) CODE SPEED DEVIATION AND FREQUENCY JITTER
A mechanical transmitter sends Morse code with a standard time ratio of dot, dash, and intervals. However, Morse code sent manually usually has a code speed deviation. In addition, many telegraph operators use the telegraph key to send Morse, which could cause frequency jitter at the code start or end. The code speed deviation and frequency jitter place a large amount of pressure on the recognition algorithm.

IV. METHODOLOGY
MorseNet is an end-to-end trainable neural network that detects and recognizes all Morse signals in a spectrogram. It consists of four main modules: shared convolution, detection branch, region extraction, and recognition branch.
A. OVERALL ARCHITECTURE
Fig. 2 illustrates the overall architecture of MorseNet. The shared convolution is used to extract shared features for the subsequent detection and recognition branches. The backbone of the shared convolution is the same as in [10], a ResNet-18 network [29] combined with three up-convolutions. Fig. 3 shows the general structure of the shared convolution. The input first passes through a series of forward convolutions that decrease the feature map size and increase the number of channels, and then, three up-convolutions are implemented to enlarge the feature map. The level of the extracted features increases with the number of convolutions, and we connect low-level and high-level feature maps of the same size. In this way, features of different levels can be effectively combined to take account of both the detailed and the overall information. The resolution of the final feature map is 1/4 of the original spectrogram. The detection branch is a multi-channel convolutional network that utilizes the shared features to locate the centerlines of the Morse signals and regress the BBoxes. Then, the region extraction module crops the Morse regions from the feature map and converts them to a fixed height. Finally, the recognition branch translates the codes to text with CNNs, a BLSTM encoder, and a CTC decoder.
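To make the resolution schedule concrete, the following Keras sketch mirrors the shared convolution with a simplified backbone standing in for ResNet-18 [29]; the layer widths, kernel sizes, and padded input size are assumptions for illustration only.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, stride=1):
    """Plain conv-BN-ReLU block, used here as a stand-in for ResNet-18 residual blocks."""
    x = layers.Conv2D(filters, 3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def shared_convolution(input_shape=(512, 768, 1)):
    """Down-sampling backbone followed by three up-convolutions with skip connections,
    ending at 1/4 of the input resolution. The 749-pixel-wide spectrogram is assumed
    to be padded to a multiple of 32 (here 768) before being fed in."""
    inp = layers.Input(shape=input_shape)
    c1 = conv_block(conv_block(inp, 64, stride=2), 64, stride=2)         # 1/4
    c2 = conv_block(c1, 128, stride=2)                                    # 1/8
    c3 = conv_block(c2, 256, stride=2)                                    # 1/16
    c4 = conv_block(c3, 512, stride=2)                                    # 1/32
    u3 = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(c4)    # back to 1/16
    u3 = layers.Concatenate()([u3, c3])                                   # low/high-level fusion
    u2 = layers.Conv2DTranspose(128, 4, strides=2, padding='same')(u3)    # 1/8
    u2 = layers.Concatenate()([u2, c2])
    u1 = layers.Conv2DTranspose(64, 4, strides=2, padding='same')(u2)     # 1/4
    u1 = layers.Concatenate()([u1, c1])
    return tf.keras.Model(inp, u1, name='shared_convolution')
```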

B. DETECTION BRANCH
Since a Morse signal has a fixed frequency and a very narrow bandwidth, the centerline-based method in [10] is very suitable for its detection. Inspired by this, we construct a fully convolutional network as the detection branch, whose schematic diagram is plotted in Fig. 4. Using the shared features, it predicts three attributes of the Morse region: the centerline, the local offset, and the border offsets. The centerline refers to the horizontal centerline of the Morse region; its heat map has one channel and represents the pixel-wise probability of belonging to the centerline. The local offset is predicted to compensate for the positional deviation of the centerline caused by the down-sampling and up-sampling of the shared convolutions; its map has one channel, with valid values only on the centerline. The border offsets represent the offsets between the centerline and the up/down border lines of the Morse region; their map has two channels, with valid values only on the centerline.
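A minimal sketch of these three output heads, assuming 1×1 prediction convolutions on top of one 32-channel intermediate layer (the channel number suggested by the hyper-parameter study in Section V-D); the kernel sizes are our assumption.

```python
from tensorflow.keras import layers

def detection_branch(shared_features):
    """Predict the three Morse-region attributes from the shared feature map:
    a 1-channel centerline probability map, a 1-channel local offset map, and
    a 2-channel up/down border offset map."""
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(shared_features)
    centerline = layers.Conv2D(1, 1, activation='sigmoid', name='centerline')(x)
    local_offset = layers.Conv2D(1, 1, name='local_offset')(x)
    border_offsets = layers.Conv2D(2, 1, name='border_offsets')(x)
    return centerline, local_offset, border_offsets
```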
The loss of the detection branch is composed of three parts, which correspond to the above three attributes. For the centerline loss, during the ground truth map production, we apply smooth probabilities to the adjacent points of centerlines by a Gaussian kernel. The training objective is a pixel-wise focal loss [30]:

L_{cl} = -\frac{1}{|N|} \sum_{p \in N} \begin{cases} (1 - P_p)^{\alpha} \log(P_p), & \hat{P}_p = 1 \\ (1 - \hat{P}_p)^{\beta} (P_p)^{\alpha} \log(1 - P_p), & \text{otherwise,} \end{cases}    (3)

where p is a point in the map, \hat{P}_p is the ground truth label at p, P_p is the corresponding prediction, N is the set of all points in the heat map, and \alpha and \beta are the hyper-parameters of the focal loss. In (3), (1 - P_p)^{\alpha} and (P_p)^{\alpha} reduce the weights of the easy-to-classify samples and increase those of the difficult-to-classify samples. (1 - \hat{P}_p)^{\beta} reduces the weights of points with 0 < \hat{P}_p < 1 (i.e., the adjacent points of centerlines), especially those close to 1, to reduce their impact on training. We empirically set \alpha = 2 and \beta = 4 in our experiments. For the local offset and border offsets losses, we directly calculate the average difference between the ground truth and the predicted values at the centerline points:

L_{off} = \frac{1}{|C|} \sum_{p \in C} \left| O_p - \left( \frac{Y_{\tilde{p}}}{R} - Y_p \right) \right|,    (4)
L_{border} = \frac{1}{|C|} \sum_{p \in C} \left( |U_p - \hat{U}_p| + |D_p - \hat{D}_p| \right),    (5)

where \tilde{p} is the mapping point of p in the original spectrogram, C is the centerline point set in the heat map, R is the shrunken scale of the feature map (here 4), and |\cdot| denotes the number of elements. Y denotes the vertical coordinate of a point, and O_p is the predicted local offset. U_p and \hat{U}_p are the predicted and ground truth up border offsets, and D_p and \hat{D}_p are the predicted and ground truth down border offsets.
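A TensorFlow sketch of the loss terms in (3)-(5); the tensor layout (batch, height, width, channels) and the masking mechanics are our assumptions.

```python
import tensorflow as tf

def centerline_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise focal loss of (3): gt == 1 on centerline points, gt in (0, 1)
    on their Gaussian-smoothed neighbours, and 0 elsewhere."""
    pred = tf.clip_by_value(pred, eps, 1.0 - eps)
    pos = tf.cast(tf.equal(gt, 1.0), tf.float32)
    pos_term = pos * tf.pow(1.0 - pred, alpha) * tf.math.log(pred)
    neg_term = (1.0 - pos) * tf.pow(1.0 - gt, beta) * tf.pow(pred, alpha) * tf.math.log(1.0 - pred)
    n = tf.cast(tf.size(gt), tf.float32)
    return -(tf.reduce_sum(pos_term) + tf.reduce_sum(neg_term)) / n

def offset_losses(pred_off, gt_off, pred_borders, gt_borders, centerline_mask):
    """Average absolute errors of (4)-(5), evaluated only on centerline points.
    centerline_mask has shape (batch, h, w, 1) with 1 on centerline pixels."""
    mask = tf.cast(centerline_mask, tf.float32)
    n = tf.maximum(tf.reduce_sum(mask), 1.0)
    l_off = tf.reduce_sum(mask * tf.abs(pred_off - gt_off)) / n
    l_border = tf.reduce_sum(mask * tf.abs(pred_borders - gt_borders)) / n
    return l_off, l_border
```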
The detection loss is the weighted sum of the three attribute losses:

L_{detect} = \lambda_1 L_{cl} + \lambda_2 L_{off} + \lambda_3 L_{border},    (6)

where \lambda_1, \lambda_2, and \lambda_3 are empirically set to 1.0, 0.5, and 0.5 in our experiments. We assume that x_{min} and x_{max} are the starting and ending abscissas of a centerline whose ordinate is y. Thus, the lower left and upper right coordinates of a predicted BBox can be calculated as

(x_{ll}, y_{ll}) = R \cdot (x_{min},\ y + O_p - D_p), \quad (x_{ur}, y_{ur}) = R \cdot (x_{max},\ y + O_p + U_p).    (7)
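The mapping back to the original spectrogram, as we read (7), can be written as the following helper; the sign convention for the up/down borders and the units of the offsets are assumptions.

```python
def decode_bbox(x_min, x_max, y, local_offset, up_off, down_off, R=4):
    """Map a detected centerline segment (heat-map coordinates) to a BBox in the
    original spectrogram: x-range from the segment ends, y-range from the refined
    centerline ordinate plus the up/down border offsets, all scaled by R."""
    yc = y + local_offset                      # refined ordinate in the heat map
    return (R * x_min, R * (yc - down_off),    # lower-left corner
            R * x_max, R * (yc + up_off))      # upper-right corner
```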

C. REGION EXTRACTION
Region extraction is aimed at extracting the Morse regions from the shared feature map by using the output BBoxes of the detection branch. To match the convolution processing of the recognition branch, we rescale the region fragments to a fixed height of 8 with an unchanged aspect ratio. Thus, the length of the regions is kept variable, which avoids misalignment between the features and the original image and preserves the semantic information as much as possible. In practice, we pad each of the region fragments to the longest length in a batch and ignore the padded parts during recognition. When training MorseNet, the detection branch may provide nonstandard region proposals, especially at the beginning of training, which could mislead the learning of the recognition branch. Thus, we feed the ground truth Morse regions to the recognition branch during training. When testing, a confidence threshold and non-maximum suppression (NMS) are introduced to filter the region proposals. The selected Morse regions are then fed into the recognition branch for character-level translation.
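A sketch of the cropping and height normalisation step; the 1/4 feature-map scale and the BBox layout follow the text, while the padding and masking details are assumptions.

```python
import tensorflow as tf

def extract_region(feature_map, bbox, R=4, target_h=8):
    """Crop one Morse region from the shared feature map (1/4 resolution) and
    rescale it to a fixed height of 8 while keeping the aspect ratio.
    `bbox` = (x_min, y_min, x_max, y_max) in original-spectrogram coordinates,
    `feature_map` is a single (h, w, c) tensor."""
    x0, y0, x1, y1 = [int(round(v / R)) for v in bbox]
    crop = feature_map[y0:y1, x0:x1, :]
    h, w = y1 - y0, x1 - x0
    new_w = max(1, int(round(w * target_h / max(h, 1))))  # preserve aspect ratio
    return tf.image.resize(crop, (target_h, new_w))

def pad_to_batch(regions):
    """Right-pad variable-length regions to the longest width in the batch;
    the padded columns are masked out during recognition."""
    max_w = max(int(r.shape[1]) for r in regions)
    return tf.stack([tf.pad(r, [[0, 0], [0, max_w - int(r.shape[1])], [0, 0]])
                     for r in regions])
```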

D. RECOGNITION BRANCH
The recognition branch utilizes the shared features in each Morse region to predict the text labels. It consists of sequential CNNs, a BLSTM encoder [31], and a CTC decoder [32]. The specific structure of the recognition branch is shown in Table 2.
Sequential CNNs are first built to further extract the image semantic information. Since the size of the feature map has already been reduced twice, by the shared convolution and by region extraction, we pool it only along its height axis to avoid losing text content, especially characters with a short code length. Through the CNNs, the heights of the input feature fragments are compressed to 1 while the channels are increased to 256. In [12], we performed related experiments confirming that convolution processing before the RNN encoder effectively improves the model's performance.
The CNNs' outputs are permuted into feature sequences along the time axis and fed into the RNN layer for encoding. Here, we choose an LSTM with 256 units to effectively capture the contextual information. Since the front and back frames in the feature sequence both help in modeling the current frame, we use a BLSTM, which consists of a forward LSTM and a backward LSTM. The hidden states computed in the two directions are summed and fed into a fully-connected (FC) network. The FC transforms the hidden states into a frame-to-character probability matrix. To avoid overfitting, a dropout operation is added before the FC.
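The structure described above (and in Table 2) can be sketched as follows; the input channel count and kernel sizes are placeholders, while the channel schedule [64, 128, 256] and the 256-unit single-layer BLSTM follow the hyper-parameter study in Section V-D.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recognition_branch(num_classes, channels=(64, 128, 256)):
    """Height-only pooling CNNs squeeze the 8-pixel-high region to a feature
    sequence, a 256-unit BLSTM (forward and backward states summed) encodes
    context, and a dropout + dense layer produces the per-frame character
    probabilities consumed by CTC."""
    inp = layers.Input(shape=(8, None, 64))              # (height, width, shared-feature channels)
    x = inp
    for c in channels:
        x = layers.Conv2D(c, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(pool_size=(2, 1))(x)     # pool along the height axis only
    x = layers.Lambda(lambda t: tf.squeeze(t, axis=1))(x)  # (batch, width, 256) feature sequence
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True), merge_mode='sum')(x)
    x = layers.Dropout(0.3)(x)
    logits = layers.Dense(num_classes + 1)(x)            # +1 for the CTC 'blank'
    return tf.keras.Model(inp, logits, name='recognition_branch')
```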
The CTC layer is used to transcribe the probability matrix to the final text. The length of the text is usually much shorter than that of the feature sequence, since one character is usually mapped to multiple frames; as a result, what CTC does is flexibly merge repetitive frame predictions. CTC introduces a prediction path π = (π_1, π_2, ..., π_L) as the frame-wise predictions for the feature sequence x, and a 'blank' character to separate adjacent identical labels. By merging identical characters between two 'blank's and then deleting the 'blank's, the prediction path is transcribed to the final text; for example, ''-aa-p-p-ll-e-'' becomes ''apple'' (''-'' refers to ''blank''). The text probability is the sum of the probabilities of all prediction paths that can be transcribed to the text:

p(\pi|x) = \prod_{l=1}^{L} q_{\pi_l}^{l},    (8)
p(y|x) = \sum_{\pi \in B^{-1}(y)} p(\pi|x),    (9)

where p(\pi|x) is the probability of a prediction path, q_{\pi_l}^{l} is the softmax probability of label \pi_l at frame l, B^{-1}(y) refers to all of the CTC prediction paths that can be transcribed to the text y, and p(y|x) is the final text probability. The recognition loss can be calculated as follows:

L_{recog} = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n | x_n),    (10)

where N is the number of Morse regions in a spectrogram. The end-to-end loss function of MorseNet is a combination of the detection and recognition losses:

L = L_{detect} + \lambda_{recog} L_{recog},    (11)

where \lambda_{recog} is a hyper-parameter that trades off the detection and recognition branches and is set to 1 in our experiments.
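A sketch of the transcription rule and the loss combination in (8)-(11), using TensorFlow's built-in CTC loss; the dense-label format and padding conventions are assumptions.

```python
import tensorflow as tf

def greedy_ctc_decode(path, blank='-'):
    """Collapse a frame-wise prediction path: merge repeated labels, then drop
    blanks, e.g. '-aa-p-p-ll-e-' -> 'apple'."""
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)

def recognition_loss(logits, labels, logit_len, label_len):
    """CTC negative log-likelihood averaged over the N Morse regions of a
    spectrogram, corresponding to (10)."""
    loss = tf.nn.ctc_loss(labels=labels, logits=logits,
                          label_length=label_len, logit_length=logit_len,
                          logits_time_major=False, blank_index=-1)
    return tf.reduce_mean(loss)

def morsenet_loss(det_loss, recog_loss, lambda_recog=1.0):
    """End-to-end objective of (11): detection loss plus weighted recognition loss."""
    return det_loss + lambda_recog * recog_loss
```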

V. EXPERIMENTS AND DISCUSSION
In this section, we conduct experiments on four datasets. We first introduce the baseline methods and implementation details, and then, we show the detection and recognition performance of the methods. In particular, the results of MorseNet in the harsh situations mentioned in III-B are visualized, and several sensitivity tests on the SNR, code speed, and hyper-parameters are conducted. Finally, we illustrate our advantages over the two-stage version method in terms of speed and model size.

A. BASELINES
MorseNet is a fully DL-based neural network, and its detection branch and recognition branch have each obtained state-of-the-art performance [10], [12]. Before our method, Sun et al. [6] implemented multi-channel Morse detection and recognition in a spectrogram by energy detection + decision tree for detection and k-means clustering for recognition. These ideas were later extended by replacing the decision tree with CNNs, which obtained the best classification effect [7], and by introducing image processing before k-means to highlight the codes [5], [11]. We combine the above methods into a relatively advanced traditional method to compare with MorseNet. The DL-based SSD method is also compared in terms of detection performance.
In addition, a two-stage system based on MorseNet is built to demonstrate the accuracy and speed advantages of our end-to-end system.

Energy detection + CNNs + image processing + k-means (ECIK) [5]-[7], [11]: ECIK is actually a four-stage system that combines various traditional methods. Energy detection extracts fragments with strong energy from a spectrogram, and then, CNNs are built to classify them and select those that contain Morse. Image processing, including contrast enhancement, binarization, and morphological denoising, is implemented to highlight the codes and suppress noise. Finally, the code lengths are counted and fed to a k-means model to classify the code types, and the text is translated by code-to-character table look-up.
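For clarity, the code-type classification step of ECIK can be approximated by the toy sketch below (our own illustration, not the baseline authors' code), which clusters measured durations with k-means exactly as the description implies.

```python
import numpy as np
from sklearn.cluster import KMeans

def classify_code_lengths(mark_lengths, gap_lengths):
    """Cluster bright-strip durations into dot/dash (2 clusters) and gap durations
    into intra-code / inter-code / group intervals (3 clusters), using only the
    measured lengths as features."""
    marks = KMeans(n_clusters=2, n_init=10).fit(np.asarray(mark_lengths, float).reshape(-1, 1))
    gaps = KMeans(n_clusters=3, n_init=10).fit(np.asarray(gap_lengths, float).reshape(-1, 1))
    return marks.labels_, gaps.labels_
```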
SSD [8]: SSD is a representative DL-based object detector. The rough idea of SSD is to first raise candidate anchors at each pixel of the extracted feature map and then predict the positive probability and size regression for each anchor with several CNNs. The architecture of the SSD used in our experiments is the same as that in [8], where it is exploited to detect multi-type signals in wideband spectrograms. Since SSD targets the detection task, we compare it only in terms of detection performance.
Our Two-Stage: We propose a joint training strategy and architecture that lets the network be supervised by both the detection and recognition tasks, with the expectation of improving accuracy and speed. To verify this approach, a two-stage system is built in which the detection model and the recognition model are split from MorseNet and trained separately. The Morse fragments cropped from the original spectrogram by the detection model are input to the recognition model.

B. IMPLEMENTATION DETAILS
1) EXPERIMENTAL DATASET
Simulated signals combined with real-world backgrounds are used to evaluate the performance. The backgrounds are narrowband data down-converted from wideband recordings. As in [7], we divide the experimental data into four datasets based on the frequency bands from which the backgrounds are converted. The dataset and spectrogram information are described in Table 3.

2) TRAINING SETTING
We implement the proposed MorseNet model using TensorFlow [33]. An Adam optimizer [34] with a learning rate of 2 × 10^-4 is used to optimize the network. We set a dropout rate of 0.3, a momentum of 0.95, and a weight decay of 1 × 10^-5 to inhibit overfitting, and we exploit data augmentation, including random cropping, scaling, and Gaussian noise, to improve the learning effect. All of the models are trained to convergence with a batch size of 50, and the experiments are performed on a Tesla P40 GPU.
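These optimization settings map roughly to the following TensorFlow configuration; how the momentum and weight-decay values are wired into Adam is not specified in the text, so only the unambiguous parts are shown.

```python
import tensorflow as tf

BATCH_SIZE = 50
DROPOUT_RATE = 0.3
WEIGHT_DECAY = 1e-5   # applied as an L2 penalty on the convolution kernels (our reading)

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4)
```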

3) METRICS
We evaluate the detection and recognition performance in the end-to-end setting, where the input of the recognition branch is the Morse regions proposed by the detection branch. The detection metrics are the precision, recall, and F1-score, where the intersection-over-union (IoU) threshold is set to 0.5. What must be emphasized is that the detection module of the ECIK model can only propose the frequency location of Morse, without the start/end time, so it extracts fragments spanning the whole time duration of the spectrogram as Morse regions. To fairly evaluate the detection performance of MorseNet and ECIK, we adjust the denominator of the IoU function to the minimum size of the predicted (P) and ground truth (G) BBoxes, as in (12):

IoU = \frac{area(P \cap G)}{\min(area(P), area(G))}.    (12)

Since the annotated boxes in MorseNet and the extracted fragments in ECIK both have a fixed frequency band of 400 Hz, it is not possible to propose a very large region to obtain a high IoU score. The recognition metrics are the character error rate (CER) and the word error rate (WER). The CER is calculated from the edit distance between the predicted text T_P and the ground truth text T_G, as in (13), and the WER refers to the proportion of mistranslated texts among all texts, as in (14):

CER = \frac{EditDistance(T_P, T_G)}{|T_G|},    (13)
WER = \frac{N_{wrong}}{N_{all}}.    (14)
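A plain-Python rendering of (12)-(14); the box layout (x0, y0, x1, y1) and the CER normalisation by the ground-truth length are assumptions consistent with common practice.

```python
import numpy as np

def modified_iou(p_box, g_box):
    """IoU with the denominator replaced by the smaller box area (12),
    so ECIK's full-duration proposals are compared fairly."""
    ix0, iy0 = max(p_box[0], g_box[0]), max(p_box[1], g_box[1])
    ix1, iy1 = min(p_box[2], g_box[2]), min(p_box[3], g_box[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / min(area(p_box), area(g_box))

def edit_distance(a, b):
    """Levenshtein distance used by the character error rate."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a), len(b)]

def cer(pred, truth):
    """Character error rate: edit distance normalised by the ground-truth length (13)."""
    return edit_distance(pred, truth) / max(len(truth), 1)

def wer(preds, truths):
    """Word error rate: fraction of Morse regions whose text is not translated exactly (14)."""
    return sum(p != t for p, t in zip(preds, truths)) / len(truths)
```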

C. DETECTION AND RECOGNITION PERFORMANCE
We compare MorseNet with the baseline methods in detection and recognition, and we give the quantitative results in Table 4. As can be seen, MorseNet significantly outperforms the traditional ECIK method, and it also surpasses the SSD and Our Two-Stage methods, in both detection and recognition.
For the detection part of ECIK, the classification ability of the CNNs is strong enough, and thus, the performance depends greatly on the energy detection. However, the energy threshold is difficult to determine, since the energy distribution fluctuates rapidly, and it is sensitive to interference. When the energy threshold is too high, some Morse signals may be omitted, or only part of a signal is detected, which results in the relatively low detection scores of ECIK. The MorseNet detection branch utilizes CNNs to classify the objects in the whole spectrogram, rather than only the fragments selected by energy detection. The CNNs can learn multi-dimensional features, unlike energy detection, which exploits only energy amplitudes, thus effectively distinguishing objects and improving the detection performance. At the same time, MorseNet regresses a BBox that tightly surrounds the signal, removing the needless background for a better subsequent recognition effect. Compared to SSD, MorseNet is more suitable for the signal characteristics: it utilizes the points on the centerline to make predictions, instead of only the center point, which ensures that the receptive fields of the CNNs cover the whole signal, hence leading to better performance.
For the ECIK recognition part, similar to the detection part, its performance is heavily determined by image processing.
Image processing tools can only remove interference with weak energy or a scattered distribution, so their effect is limited in real-world communications. Moreover, the k-means algorithm uses only the code lengths in the entire Morse region for clustering, which easily produces errors when there are codes with a large length deviation. For the MorseNet recognition branch, the feature sequence extracted by the CNNs can clearly reflect the distribution of the Morse and interference signals. Additionally, the BLSTM possesses excellent sequence modeling ability and learns various code length deviation cases during training, hence showing better recognition performance. In addition, the outstanding detection performance of MorseNet lays a good foundation for recognition.
The comparative results of MorseNet and Our Two-Stage show that our unified architecture contributes to better convergence compared with the separate models. The joint training strategy lets the feature extraction module be optimized by the two tasks simultaneously, where the character-level features learned from the recognition branch help the detection branch to distinguish the Morse signal from the background, and an enhanced detection module in turn improves the recognition effect.

To reflect the universality of the methods, we test them in cross-data mode, which refers to training on the 5 M dataset and testing on the 11 M dataset. The ''Cross-data'' results in Table 4 show that the performance of the methods drops slightly, most likely because some background noise in the test dataset has not been seen during training. However, MorseNet can still perform at a high level.

To further visualize the MorseNet performance in an actual environment, we plot some results in the spectrograms. As shown in Fig. 5, MorseNet greatly improves the detection and recognition effect under the four common harsh circumstances mentioned in III-B. For the low SNR case, Fig. 5(a) shows two spectrograms at an SNR of -10 dB, where the Morse signals are covered by strong noise, and it is hard even for the human eye to recognize the codes, while MorseNet completely locates the signals and correctly decodes them. For the adjacent channel interference case, benefiting from the strong image recognition ability of the CNN, MorseNet is not affected by the single-frequency noise in the top spectrogram or the speech signal in the bottom spectrogram. For the code speed deviation case, in the top spectrogram, the interval lengths vary substantially between the codes, and in the bottom spectrogram, the dot lengths of ''d'', ''v'', and ''w'' are close to the dash lengths, which could cause errors for clustering algorithms. MorseNet achieves accurate recognition, which is mainly due to the context-based modeling of the BLSTM. For the frequency drift and jitter case, the top spectrogram has a frequency drift instance, and the bottom spectrogram has frequency jitter instances. MorseNet still detects the complete signals, although a bad code length deviation in the top picture causes a wrong recognition. Although the frequency is unstable, during shared convolution the original image is shrunk to a smaller feature map, which means that signals with limited frequency drift or jitter still roughly appear as a horizontal line in the feature map.

D. SENSITIVITY ANALYSIS
In this subsection, we evaluate the influence of the Morse signal properties and the model hyper-parameters. The signal properties include the SNR and the code speed. Specifically, we plot the F1-score curve for detection evaluation and the CER curve for recognition evaluation versus different parameters on the 5 M dataset. We train the models with the basic configurations in V-B and test them while varying the parameter of interest and keeping the others fixed.

1) SNR
The SNR in Fig. 6 specifically refers to the power ratio between the simulated Morse and the real background. For the detection performance, a moderate decrease in the SNR has little impact on MorseNet, Our Two-Stage, and SSD, while ECIK degrades noticeably at low SNR. From our inspection, the general outlines of the Morse signals can still be observed at a relatively low SNR, and due to the strong image recognition ability of the CNN, MorseNet, Our Two-Stage, and SSD find the location of the Morse. Although the ECIK detection part also contains a CNN model, the effect of the preceding energy detection is greatly weakened at low SNR, which leads to missed signals or distorted proposals. For recognition, the performances of all three methods tend to decrease as the SNR goes down. The likely reason is that the Morse codes are drowned out by strong noise and become too vague to recognize.

2) CODE SPEED
The performance tendency in Fig. 7 is similar to that in Fig. 6. For the detection performance, none of the four methods is influenced much by the code speed. The reason could be that although the speed changes, the Morse signal is still relatively easy to distinguish in the spectrogram. For recognition, a rise in the code speed leads to a performance decrease for the three methods, which we attribute to the time resolution of the spectrogram. In particular, at 40 words per minute (wpm), the time length of a dot is approximately 0.03 s, while the time resolution of the experimental spectrogram is 0.02 s, which means that a dot occupies only one to two pixels or even vanishes; such poorly rendered codes are naturally misrecognized.
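The rough arithmetic behind this observation, assuming the common convention that one word corresponds to 50 dot units (so the dot length is 60/(50·wpm) seconds):

```latex
t_{\mathrm{dot}} \approx \frac{60\ \mathrm{s}}{50 \times 40} = 0.03\ \mathrm{s},
\qquad
\frac{t_{\mathrm{dot}}}{\Delta t} = \frac{0.03\ \mathrm{s}}{0.02\ \mathrm{s}} = 1.5\ \text{pixels per dot}.
```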

3) HYPER-PARAMETERS
We tune several model hyper-parameters to determine the specific configuration of MorseNet. The results are plotted in Figs. 8-10: (1) Fig. 8 shows the F1-score under different channel numbers of the first CNN layer in the detection branch (the first CNN layer in Fig. 4); (2) Fig. 9 shows the CER under different channel numbers of the three CNN layers in the recognition branch; (3) Fig. 10 shows the CER under different layer and cell numbers (Ncell) of the BLSTM in the recognition branch. Following the principle of ensuring accuracy while keeping the model as small as possible, we finally chose ''Channel: 32'' in (1), ''Channel: [64, 128, 256]'' in (2), and ''Layer: 1, Ncell: 256'' in (3).
Although the method performances fluctuate with the SNR or code speed variation, the DL-based methods MorseNet, Our Two-Stage, and SSD always surpass the ECIK method, thus showing stronger robustness. In addition, the better performance of MorseNet compared with Our Two-Stage also demonstrates the improvements obtained from our joint training strategy.

E. SPEED AND MODEL SIZE
In Table 5, we evaluate the speed of the four methods with and without a GPU, and the model sizes of MorseNet and Our Two-Stage. The speed metric is FPS, which refers to the number of processed images per second. The model size is measured by the size of the model parameters. The results illustrate that the non-DL method ECIK (the neural network is only a small part of ECIK) has an advantage in speed, especially in recognition. For the MorseNet and Our Two-Stage methods, benefiting from the concise centerline-based detection structure, they detect Morse faster than SSD. ''D + R'' refers to the detection + recognition task. The model size of ECIK is not provided since the neural network is only a small part of its overall architecture. The D + R speed and model size of SSD are not provided since it is used only for the detection task.
However, their recognition part consumes most of the time, because the BLSTM is a sequential processing model that cannot take advantage of the GPU's parallel computing capability. Without the GPU, the detection speeds of all of the methods decrease noticeably, since the CNNs in the models lose parallel computing. However, the processing speed of MorseNet remains acceptable, and it still has obvious advantages in terms of speed and model size compared to Our Two-Stage. Since MorseNet uses a shared convolution to extract shared features, and it inputs fragments cropped from the shrunken feature maps instead of from the original image to the recognition branch, it effectively saves computation and storage. As a consequence, MorseNet achieves state-of-the-art performance while maintaining real-time capability.
We calculated the average processing speed on all of the testing data described in Table 3. The input spectrograms are 749 × 512 images spanning 15 s in time and 4.5 kHz in frequency, where the time resolution (15/749 ≈ 0.02 s) and frequency resolution (4500/512 ≈ 8.79 Hz) are sufficient to present the signal while not making the image too large. The speed results of MorseNet in Table 5 show that it can process 109.5 s (7.3 × 15 = 109.5 s) of signal per second with a GPU and 83.55 s (5.57 × 15 = 83.55 s) of signal per second without a GPU. The experiments were run on TensorFlow, and the GPU used was a Tesla P40.

VI. CONCLUSION
In this work, we present a unified neural network named MorseNet for simultaneous Morse detection and recognition in spectrograms. The application scenario is narrowband data containing multi-channel Morse signals. MorseNet combines two networks that perform well in signal detection and recognition, and it implements end-to-end training. For evaluation, simulated Morse signals combined with real-world backgrounds are collected and divided into four datasets. The experimental results show that our method significantly outperforms previous methods, effectively alleviates four long-standing problems in the task, and is more robust across different SNRs and code speeds. In addition, compared to the two-stage version, our unified architecture improves the performance in both detection and recognition while speeding up the computation and reducing the model size.
In our future work, as the proposed MorseNet is task-oriented, it can be easily adjusted to apply to other signals with similar tasks instead of only Morse. In addition, for separate signal detection or recognition tasks, the corresponding branch divided from MorseNet could also be a good choice. Moreover, since more and more research studies use the multi-task approach [35], [36], our unified network could also provide a new scheme for the multi-task architecture, which can be generalized to other problems consisting of multiple complementary subtasks.