WEDA: A Weak Emission-Line Detection Algorithm Based on the Weighted Ranking

The <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line in the rest wavelength frame of optical spectra is a valuable characteristic for nebula detection. Searching for and recognizing spectra with the <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line in massive data sets is necessary for further study, but most existing methods are not well suited to such spectral data, especially spectra with a weak <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line. To address this issue, this paper presents a new algorithm, named WEDA, for detecting spectra with the <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line. First, the difference factor <inline-formula> <tex-math notation="LaTeX">$\mu $ </tex-math></inline-formula> between the line characteristics of the specific data is defined as its weight in recognizing the whole line table. Second, a tuning function <inline-formula> <tex-math notation="LaTeX">$\mathit {f(\tau, \delta)}$ </tex-math></inline-formula> based on the momentum formula is defined to update the weights during the process. In this step, the spectra with the <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line are analysed and classified into three different situations. Because the amount of spectra with the <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line differs among the three situations, the weights are updated at a different speed in each. The weight updates help us detect data containing a weak <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line in all three situations. 
Based on this, a new integrated algorithm especially for the detection of spectra with the <inline-formula> <tex-math notation="LaTeX">$\text{H}\alpha $ </tex-math></inline-formula> emission line is provided. Finally, experimental results on several spectral datasets from DR5 of the LAMOST survey indicate that WEDA achieves higher accuracy than other similar algorithms and is largely unaffected by the dataset size and the signal-to-noise ratio (SNR).


I. INTRODUCTION
With the development of technology, increasing amounts of spectral data have been obtained by astronomical telescopes. The challenge we face today is how to find the spectral data we need among these massive amounts of data. There is also a great deal of work in analyzing the spectral data [1] and even the facilities that produce them [2]. Because of the large amount of work, much of it cannot be finished manually, and some of it can be treated as classification tasks in machine learning. Therefore, the work seen as classification tasks can be finished automatically by computers; for example, the detection of the Hα emission line can be seen as a binary classification task that attempts to classify spectral data as {1, −1}, where 1 represents that the spectral data contain the Hα emission line and −1 represents that they do not. However, the results of many classification methods do not meet the requirements of many complex data situations. Furthermore, data that meet our needs are often rare, which makes model training harder. The proposed ranking algorithm is used to solve these problems. To adapt to a complex data environment, we weight the data to distinguish it. The ranking algorithm and weight algorithm are combined to highlight the data we need. In this paper, WEDA is used to find data with the Hα emission line in the LAMOST survey dataset.

A. MOTIVATIONS
The motivations of this paper can be summarized as follows: 1. The Hα emission line in rest wavelength frame data is a very valuable material for the further study of nebulae in our Galaxy. Therefore, searching out and recognizing such data among the massive spectral data would be of great significance for astronomers.
2. The Hα emission line often shows complex characteristics such as weak features, noise interference and various profiles, which largely increase the difficulty of searching for and identifying it. Designing a specific method for this problem is necessary.
3. The ranking algorithm is a useful method in such areas; however, it is not suitable for solving the above problems directly.
Motivation 1: The Hα emission line can be used to detect nebulae, which is vital to work such as studying nebulae [3], star-forming galaxies [4] and late L dwarfs and T dwarfs [5]. The first step in these studies is to detect spectral data with the Hα emission line. However, the data with the Hα emission line in many studies are obtained by manual identification [6], which has high precision but also high cost. When massive amounts of data need to be processed, such a large amount of work cannot be finished by manual identification, so a new algorithm is needed to finish this work automatically. In this paper, the ranking algorithm and weight algorithm are combined to find spectral data with the Hα emission line.
Motivation 2: In recent years, an increasing number of data mining algorithms have been improved and have begun to be applied to astronomy research [7], for example, classification of star spectral data [8], finding young stellar populations [12], analyzing the variation characteristics of the sky background [9], star-galaxy classification [13] and detection of faint γ-ray sources [14]. These data mining algorithms have obtained good results in many astronomical studies. Hence, data mining algorithms are considered for use in detecting the Hα emission line. Although some work, such as detecting emission lines, can be seen as a classification problem, many traditional classification models, such as SVM [15] and ANN [18], cannot be directly used for this work because of the complex characteristics involved. These algorithms need to be optimized before being applied to astronomical research, as has been done for SVM [16] and RS [17]. Deep neural networks are powerful, but training them requires a considerable amount of data rather than small-scale data sets [19]. This should be considered when the corresponding algorithm is proposed. Some clustering algorithms [10] have also been applied to spectral data [11].
Motivation 3: The ranking algorithm has been applied to fields such as search engines [20] and recommendation systems [21]. In data mining areas such as information retrieval, there are many classic and excellent algorithms, for example, PageRank [22] and HITS [23]. Simple ranking algorithms cannot obtain all data with the Hα emission line. If a ranking function of good quality is learned [24], the task is likely to be finished. However, if all data are used directly in the ranking algorithm without data preprocessing, the ranking algorithm will have a large time complexity. Before being applied to big data sets, the ranking algorithm needs to be improved.

B. CONTRIBUTIONS
A novel integrated algorithm, called WEDA, is proposed in this paper based on the above motivations. The main idea of WEDA is that the ranking algorithm and changeable weights are combined to find data with the Hα emission line. First, the weights in WEDA are initialized according to differences between the specific data. Afterward, the weight update function is proposed based on a momentum function [25]. The weights will update according to changes in the data. The data are sorted by continuously selecting data during the weight update process, which forms an ordered sequence. When an ordered sequence is obtained, all data are classified only by a cutoff threshold. This algorithm can be used to find spectral data with the Hα emission line.
The contributions of this paper are as follows: 1. The difference factor µ between the line characteristics of the specific data is defined to initialize the weights of the primary information and the secondary information.
2. The tuning function f(τ, δ) based on a momentum function is designed as the basis for the weight updates.
3. A new integrated algorithm named WEDA, especially for the detection of spectra with Hα, is proposed based on the difference factor µ and the tuning function f(τ, δ). Meanwhile, this algorithm is evaluated using spectra from DR5 of the LAMOST survey.

C. ROADMAP
The rest of the paper is organized as follows. Section II introduces work related to WEDA. Section III first gives the main idea of WEDA, then presents WEDA in depth and performs a theoretical analysis of it. Finally, Section IV discusses all experimental results and compares the quality of the algorithm with that of the compared algorithms.

II. RELATED WORK
Traditional detection methods for nebulae depend on observations from infrared telescopes and radio telescopes. However, the detection of lower-density nebulae is not ideal due to factors such as resolution. The Hα emission line in the rest wavelength frame can avoid being absorbed by the stellar population, so it can be discovered and measured. The SNR of LAMOST spectral data is higher in the r band; therefore, the detection of the Hα emission line has a low dependence on the overall quality of the spectral data. The Hα emission line in the rest wavelength frame is thus used to detect nebulae. With the development of observational means, increasing amounts of spectral data are obtained by the LAMOST telescope, and a considerable amount of meaningful information can be mined from these massive data. Spectral data with the Hα emission line in LAMOST [26] can be chosen for use in WEDA. Previous work has found data with the Hα emission line and used these spectral data, for example, to study peculiar A-type stars [27] and to search for classical Be stars [28]. Among spectral data with the Hα emission line, some data have only a weak Hα emission line, which must be judged by experts; many algorithms cannot find such data. From an astronomical background, we can make use of other related information to judge the existence of the Hα emission line. Although the Hα emission line is disturbed by noise, some other emission lines chosen from the frame can verify its existence, such as Hβ, NII and OII. If data show high confidence in these emission lines, it is likely that these spectral data contain the Hα emission line. Furthermore, neighbor data can also be used to demonstrate the existence of the Hα emission line in general, but there are some outliers for which the neighbor data do not contain the Hα emission line while the data themselves do. 
Compared with neighbor data, the confidence of other emission lines chosen from the frame is more persuasive.
The detection of the Hα emission line can be transformed into a binary classification problem on imbalanced data. There are many applications that can be transformed into this kind of problem: in the diagnosis of disease, the detection of breast cancer [29], [30] and the choice of cardiac care [31] can be seen as binary classification problems. Among financial problems, credit-card fraud detection [32] and bankruptcy forecasting [33] can also be solved in this way. In information security, the same is true of intrusion detection [34] and spam detection [35]. The most serious problem of this type of binary classification is that the imbalance between the numbers of the two classes leads to a difference in the sensitivities of predictions [36]. The predictions are more inclined towards the majority class [37], and the minority class is often ignored. However, we are often interested in the minority class, which also prompts us to not just look at the overall precision of the algorithm. For imbalanced classification, there are four aspects of interest: the training set size, the class prior, the cost matrix and the placement of the decision boundary [38]. The training-set-size idea is to alter the size of the training set by increasing the number of minority-class samples or decreasing the number of majority-class samples [39]. Undersampling or oversampling makes the numbers of the two classes the same, which transforms the imbalance into balance. This method is simple to implement and can solve the imbalance problem [40], [41]. Representative algorithms that use this strategy are SMOTE [42] and EasyEnsemble [41]. The cost-matrix strategy [43] increases the weight of misclassified data so that the model can fit these misclassified data; examples include AdaBoost [44]. The theory of class priors is based on the assumption that the distribution of the positive class is known [45]. 
The placement of the decision boundary moves the classified threshold instead of increasing and decreasing the amount of data [46]. The above four methods are usually used to solve imbalanced classification. This data skewness problem also has adverse impacts on parallel operations [47]. For parallel operations on imbalanced data, algorithms also need to be designed according to the specific situation.
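As a concrete illustration of the training-set-size strategy discussed above, the sketch below balances a binary {1, −1} dataset by randomly duplicating minority-class samples. This is a simplification of SMOTE-style resampling; the function name and interface are illustrative and not taken from any cited work.

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples until both classes have the same
    size, illustrating the training-set-size strategy. Labels are assumed
    to be in {-1, 1}, as in the H-alpha detection task."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    if len(pos) < len(neg):
        minority, min_label, majority, maj_label = pos, 1, neg, -1
    else:
        minority, min_label, majority, maj_label = neg, -1, pos, 1
    # Replicate randomly chosen minority samples until the classes balance.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    X_bal = majority + minority + extra
    y_bal = [maj_label] * len(majority) + [min_label] * (len(minority) + len(extra))
    return X_bal, y_bal
```

SMOTE additionally interpolates between minority neighbors instead of copying them verbatim, but the effect on the class ratio is the same.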
Ranking algorithms are often used in information retrieval and recommendation systems [24] to mine the data we need from massive amounts of data. The machine learning field has considered bipartite ranking algorithms [48]. These algorithms can assign a score to each data point, and all data are sorted by this score. The scores of positive instances are higher than the scores of negative instances [49], which helps us to classify all the data. To achieve this goal, the bipartite ranking algorithm needs to determine a function in which the rank of positive instances is higher than the rank of negative instances [50]. Currently, there are many studies on how this bipartite ranking algorithm can be applied to data mining, such as CBR [51] and the Bayesian multiple kernel bipartite ranking model [52]. Therefore, the bipartite ranking algorithm can be used to find data with the Hα emission line among massive amounts of data, and the key aspect of this algorithm is how to design the score function.
To date, there are many studies on binary classification for imbalanced data and on bipartite ranking algorithms. However, there is no suitable approach for detecting data with a weak Hα emission line in spectral data sets. In this paper, a newly designed bipartite ranking algorithm is used to detect data with a weak Hα emission line.

III. SEARCH METHOD
In this section, the WEDA method is introduced in detail. An overview of the whole algorithm is given in Section A; Section B provides more in-depth information on the contents of Section A, and the theoretical analysis is presented in Section C.

A. THE MAIN IDEA
The initial spectral data are denoted by A, and they need to be preprocessed to extract meaningful information. In this paper, the entire dataset is divided into three parts based on the SNR, whose ranges are 0-10, 10-50 and above 50, because different SNRs lead to different amounts of data with the Hα emission line and different qualities of the performance of WEDA. The three parts of the data are processed separately and are denoted by {A_1, A_2, A_3}. Then, we preprocess each part of the data in A. Preprocessing helps to exclude some of the data that clearly lack the Hα emission line. The data that cannot be classified by data preprocessing are denoted by D_i ∈ A_i, and for these data it is necessary to utilize other useful information around the Hα emission line. With this information, we can obtain D_i = {(a_n, b^1_n, b^2_n, . . . , b^j_n, y_n)}_{n=1}^N, where a_n is the sum of the confidences of the Hα emission line and the Hβ emission line. There are two b^j_n in this paper: one is the sum of the confidences of the two NII emission lines and the two SII emission lines, and the other is the sum of the confidences of six emission lines in the data formed by superposition of neighbor spectral data. y_n is the data classification, that is, whether the data contain the Hα emission line, where y_n ∈ {−1, 1}. N is the number of data points that are not determined by data preprocessing. Before WEDA is processed, we combine all the secondary information by the predefined weights {ϕ_1, ϕ_2, . . . , ϕ_j}. The formula is as follows:
b_n = ϕ_1 b^1_n + ϕ_2 b^2_n + · · · + ϕ_j b^j_n.
WEDA can utilize the information {(a_n, b_n, y_n)}_{n=1}^N to obtain the confidences {C_n}_{n=1}^N of all data through {ω_1, ω_2}. The purpose of the confidence C_n of the data is to assess the current possibility of the Hα emission line.
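The fusion of secondary information and the confidence computation can be sketched as follows. This is a minimal illustration assuming the secondary values are fused as a weighted sum with the predefined weights {ϕ_1, ϕ_2, . . .}; the function names are illustrative.

```python
def combine_secondary(b, phi):
    """Fuse the secondary information values b^1_n, b^2_n, ... into a single
    b_n using the predefined weights {phi_1, phi_2, ...} (weighted sum)."""
    return sum(p * v for p, v in zip(phi, b))

def confidence(a_n, b_n, w1, w2):
    """Confidence C_n of one spectrum under the current weights {w1, w2}:
    C_n = w1 * a_n + w2 * b_n."""
    return w1 * a_n + w2 * b_n
```

For example, with the paper's predefined weights {0.8, 0.2}, secondary values (1.0, 0.5) fuse to b_n = 0.9.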
The most likely sample χ*_n is selected from the undetermined data D_i as the one with the currently highest confidence C_n, that is, χ*_n = arg max_{n ∈ D_i} C_n; it is put into an ordered sequence of probabilities χ and removed from D_i.
Finally, the confidence measures the probability of each data point having the Hα emission line. An ordered sequence of probabilities χ = {χ*_1, χ*_2, . . . , χ*_N} is obtained, which is in descending order.
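The selection loop above can be sketched as a greedy ranking procedure. This simplified version holds the weights {ω_1, ω_2} fixed; in WEDA itself the weights are updated between selections.

```python
def rank_by_confidence(data, w1, w2):
    """Repeatedly move the sample with the highest current confidence
    C_n = w1*a_n + w2*b_n from the undetermined set into the ordered
    sequence chi (descending order). Weight updates are omitted here
    for brevity; data items are (a_n, b_n) pairs."""
    remaining = list(data)
    chi = []
    while remaining:
        best = max(remaining, key=lambda ab: w1 * ab[0] + w2 * ab[1])
        chi.append(best)
        remaining.remove(best)
    return chi
```

With the resulting ordered sequence, classification reduces to picking a single cutoff threshold, as described in Section C.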

B. THE WEIGHT UPDATE ALGORITHM (WEDA)
Section A gave an overview of the whole algorithm; this section describes the entire algorithm process in detail. The algorithm contains two steps: extracting information and calculating confidence. Data preprocessing is used to extract information and to exclude data without an apparent Hα emission line from the entire spectral data set. Then, WEDA is used to calculate the confidence based on the information extracted by data preprocessing.

1) EXTRACTING INFORMATION
From an astronomical background, the other emission lines b^1_n chosen from the frame and the neighbor data b^2_n can demonstrate the possibility of the existence of the Hα emission line. At this stage, three kinds of information need to be extracted: the quality of the Hα emission line and the Hβ emission line a_n, the quality of the four other emission lines b^1_n chosen from the frame (two NIIs and two SIIs), and the superposition of neighbor spectral data b^2_n. In this paper, we choose four emission lines from the frame (NII: 6548, 6584; SII: 6717, 6731) to obtain b^1_n, and we calculate the confidences of six emission lines (Hβ: 4862, Hα: 6564, NII: 6548, 6584, SII: 6717, 6731) in the data formed by superposition of neighbor spectral data to obtain b^2_n. The process of extracting information is described in Algorithm 1.

a: THE EVALUATION OF EMISSION LINES
To evaluate an emission line, we give an appropriate detection wavelength range for the emission line that avoids covering other emission lines. After fixing the specified wavelength range, we extract all peak values to check for the existence of this emission line. If there are no peak values around the emission line location, we conclude that this emission line does not exist, and its confidence is recorded as 0. The wavelength of the ideal peak value should be close to the wavelength of the specific emission line, and the ideal peak value should be symmetrical. The peak value closest to the emission line is selected, and the wavelength distance between the peak value and the emission line is calculated. The smaller the wavelength distance is, the higher the confidence of this emission line, so the inverse of the wavelength distance is used. To assess the symmetry of the peak value, the shape of the peak also needs to be evaluated. The left and right sides of the ideal emission line should be similar, so the heights and widths of the left and right sides are compared. The widths of the two sides are recorded as w_l, w_r according to the change in slope, and the heights of the two sides are recorded as h_l, h_r. 

Algorithm 1 Extracting Information
Input: dataset A_i; distance threshold θ_1
Output: unclassified data D_i
for the Hα emission line and the Hβ emission line do
    extract the information of each emission line
    if the value of the Hα emission line is 0 then
        a_n = 0
        y_n = −1
    end if
end for
a_n = sum(Hα emission line and Hβ emission line)
for the four related emission lines chosen from the frame do
    extract the information of each emission line
end for
b^1_n = sum(four related emission lines)
The difference d can be obtained from the widths and heights of the two sides, and it is defined by the following formula. The smaller the difference between the two sides is, the higher the confidence; therefore, the inverse of the difference is also used. The inverse of the difference and the inverse of the wavelength distance together form the value of this emission line. The primary information a_n includes two emission lines, Hα and Hβ. The precondition for spectral data having Hα is that there is a peak value at the Hα emission line location. Therefore, if the confidence of the Hα emission line is 0, the classification of the data is marked as −1. Only if the confidence of the Hα emission line is not 0 are the confidences of the other emission lines calculated. For the primary information, the confidence of the Hβ emission line also needs to be calculated; the primary information a_n is the sum of the confidences of these two emission lines. When the primary information a_n has been obtained, the four emission lines in the frame, two NII emission lines and two SII emission lines, need to be calculated to obtain the value b^1_n. Their confidences are calculated by the same method as above, and the sum of the four confidences is the value b^1_n.
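The evaluation of a single emission line can be sketched as follows. The exact difference formula d is not reproduced in the text, so an absolute width/height difference is assumed here for illustration; the peak representation and the small epsilon guard are also assumptions of this sketch.

```python
def line_confidence(peaks, line_wavelength, window):
    """Score one emission line from detected peaks.

    peaks: list of (wavelength, w_l, w_r, h_l, h_r) tuples.
    Returns 0 when no peak lies within the detection window; otherwise
    combines the inverse wavelength distance of the closest peak with the
    inverse of its left/right shape difference d (assumed here to be
    |w_l - w_r| + |h_l - h_r|)."""
    in_window = [p for p in peaks if abs(p[0] - line_wavelength) <= window]
    if not in_window:
        return 0.0  # no peak -> the emission line is absent
    wl, w_l, w_r, h_l, h_r = min(in_window,
                                 key=lambda p: abs(p[0] - line_wavelength))
    dist = abs(wl - line_wavelength)
    d = abs(w_l - w_r) + abs(h_l - h_r)  # assumed symmetry difference
    eps = 1e-6                           # guard against division by zero
    return 1.0 / (dist + eps) + 1.0 / (d + eps)
```

A perfectly symmetric peak sitting exactly on the line wavelength yields a very large score, while a missing peak yields 0, matching the decision rule for marking y_n = −1.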

b: SUPERPOSITION OF NEIGHBOR SPECTRAL DATA
First, a distance threshold θ 1 needs to be given to determine the neighbor spectral data of all spectral data. After extracting the latitude and longitude from the spectral data, the distance between objects is calculated by the Euclidean distance.
If we want to obtain the neighboring data of a spectral data point, this data point needs to be compared with all spectral data to obtain all distances. If a distance is smaller than θ_1, the spectral data corresponding to this distance are considered neighboring data of this data point and are added to the current data point's distance set ds_n. When all data have been assigned, each data point has a distance set ds_n, from which we obtain b^2_n. The idea is that if most neighboring data have a high probability of having the Hα emission line, the probability that the data point has the Hα emission line will also be high. To obtain the probability that all neighboring data have the Hα emission line, all neighboring data are superposed. We only need the superposition of the six emission lines' specified wavelength ranges instead of all wavelengths. Before the superposition of the specified wavelength ranges, the flux in each specified wavelength range needs to be normalized to eliminate differences between the neighboring spectral data. After completing the superposition, the data formed by the superposition of neighbor spectral data are used to obtain b^2_n: the confidences of the six emission lines are evaluated by the above evaluation method, and the mean of the six confidences is b^2_n. Finally, b^2_n is obtained for each data point to support the existence of the Hα emission line.
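The neighbor search and superposition steps can be sketched as below, assuming coordinates are compared with a plain Euclidean distance as the text states and fluxes are min-max normalized before being summed; both helper names are illustrative.

```python
import math

def neighbors(points, idx, theta1):
    """Indices of all spectra whose (lon, lat) Euclidean distance to
    spectrum idx is below the threshold theta1."""
    x0, y0 = points[idx]
    return [j for j, (x, y) in enumerate(points)
            if j != idx and math.hypot(x - x0, y - y0) < theta1]

def superpose(fluxes):
    """Normalize each neighbor's flux in the specified wavelength range
    to [0, 1] and sum them, giving the superposed flux from which the
    six emission-line confidences (and hence b^2_n) are evaluated."""
    stacked = [0.0] * len(fluxes[0])
    for flux in fluxes:
        lo, hi = min(flux), max(flux)
        span = (hi - lo) or 1.0  # avoid dividing by zero on flat flux
        for i, v in enumerate(flux):
            stacked[i] += (v - lo) / span
    return stacked
```

Normalizing each spectrum before stacking prevents one bright neighbor from dominating the superposed profile.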
All information has now been extracted by data preprocessing. Furthermore, some data have already been determined by data preprocessing. For the unclassified data D_i = {(a_n, b^1_n, b^2_n, y_n)}_{n=1}^N, the last two attributes b^1_n and b^2_n are used to show whether the unclassified data contain the Hα emission line.

2) DATA RANKING
For the data D_i that cannot be classified by data preprocessing, WEDA is proposed to improve the weight algorithm. The weights in the weight algorithm are fixed and unchanged, which makes it unsuitable for complex environments; in WEDA, the weights update constantly to adapt to complex situations. We first need to combine b^1_n and b^2_n to obtain b_n with the predefined weights {ϕ_1, ϕ_2}; b_n is shown below. The confidence computation step of Algorithm 2 is:

while len(C) < number of unclassified data points do
    for a_n, b_n in D_i do
        C_n = a_n × ω_1 + b_n × ω_2
    end for
The whole process of data ranking is shown in Algorithm 2. In real data, there are many data situations; to adapt to them, the initial weights need to be determined by the data. Based on the primary information a_n and the secondary information b_n, all unclassified data D_i are sorted. The top K data can be chosen to calculate the initial weights {ω_1, ω_2} from the difference between the primary information a_n and the secondary information b_n; this difference is the weight difference. There are many ways to calculate the difference, and the difference should be related to specified values rather than the total difference. In view of the above, we need to determine the maximum difference in the top K data. The weight ω_1 is related to the primary information plus the maximum difference, while the weight ω_2 is related to the secondary information minus the maximum difference. All weights {ω_1, ω_2} are limited to values between 0 and 1. We choose the top K data to represent the data that can be classified without the secondary information b_n; the selection method is as follows: 1. First, the primary information of all data is normalized to [0, 2]. All data with primary information a_n larger than 1 can be separated by the following formula.
where κ is an integer that represents which level the data should be assigned to, and the range of κ is [0, 9]. The values of all levels λ_κ are initialized to 0, and λ_κ is incremented by 1 when κ equals that level. After all data have been processed, we use the levels to choose K. 2. Second, if the value of the previous level λ_κ is at least two times the value of the next level λ_{κ−1} and the value of the next level λ_{κ−1} is larger than the minimum threshold of 4, the next level λ_{κ−1} is called the cutoff level. The reason the minimum threshold is set to 4 is that K should be limited to prevent too few data points from being used, which would make the factor of two meaningless. All data in the levels {λ_k}_{k=κ}^9 above the cutoff level are chosen as the top K data, and half of the data in the cutoff level λ_{κ−1} are also chosen as top K data.
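The top-K selection rule can be sketched as follows. The level-assignment formula for κ is not reproduced in the text, so a uniform binning of the normalized primary information a_n ∈ (1, 2] into ten 0.1-wide levels is assumed here; the cutoff condition follows the description above.

```python
def choose_top_k(a_values, min_count=4):
    """Bin normalized primary information a_n in (1, 2] into levels
    kappa = 0..9 (uniform 0.1-wide binning is an assumption), then scan
    from the highest level down: the first level lambda_{kappa-1} that is
    outnumbered at least 2:1 by the level above it, while still holding
    more than min_count samples, is the cutoff level. K is the number of
    samples above the cutoff plus half of the cutoff level."""
    levels = [0] * 10
    for a in a_values:
        if a > 1:
            levels[min(int((a - 1) * 10), 9)] += 1
    total = 0
    for kappa in range(9, 0, -1):
        total += levels[kappa]
        if levels[kappa] >= 2 * levels[kappa - 1] and levels[kappa - 1] > min_count:
            return total + levels[kappa - 1] // 2
    return sum(levels)  # no cutoff found: keep all data above 1
```

The min_count guard mirrors the paper's minimum threshold of 4, which keeps the 2:1 ratio test meaningful on small level counts.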
Before calculating the difference of the top K data, the primary information a_n and the secondary information b_n must be put on a uniform standard. In this paper, a sorted ranking is applied to create this uniform standard: the higher the ranking is, the larger the value. The formula is defined as follows.
where the total number of data points and the ranking γ_n of the data determine V_n. Both the primary information and the secondary information have their own V_n, and the two V_n are subtracted to obtain the difference d_n; d is the maximum of all differences d_n.
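The rank-based uniform standard and the maximum difference d can be sketched as below. The exact formula for V_n is not reproduced in the text; V_n = (M − γ_n)/M with γ_n the 0-based descending rank and M the total number of data points is an assumption of this sketch.

```python
def rank_values(values):
    """Map raw values to a rank-based uniform standard V_n: the higher
    the ranking, the larger V_n. Assumed form: V_n = (M - gamma_n) / M,
    where gamma_n is the 0-based descending rank and M = len(values)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    m = len(values)
    v = [0.0] * m
    for gamma, i in enumerate(order):
        v[i] = (m - gamma) / m
    return v

def max_difference(a_vals, b_vals):
    """d: the maximum over the top-K data of (V_n for primary information
    minus V_n for secondary information)."""
    va, vb = rank_values(a_vals), rank_values(b_vals)
    return max(x - y for x, y in zip(va, vb))
```

Working on ranks rather than raw values removes the scale mismatch between a_n and b_n before the two are compared.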
Based on the initial weights {ω_1, ω_2} obtained previously, our purpose is to have the weight ω_1 related to the primary information slowly decrease while the weight ω_2 related to the secondary information slowly increases. We divide the process of changing the weights into three stages. Because we assume that data with the Hα emission line must have a high primary information value a_n or a high secondary information value b_n, the two weights {ω_1, ω_2} should be quite different rather than close. In the first stage, the data that depend only on the primary information a_n are found; the weights then remain in the second stage for a short time. In this period, the two weights {ω_1, ω_2} are very close, so we can detect data whose primary information a_n and secondary information b_n are both slightly higher than those of other data. In the third stage, the weight ω_2 related to the secondary information is greater than the weight ω_1 related to the primary information, and the data chosen depend only on the secondary information b_n. When the proportion of chosen data points is greater than the threshold of 0.7 or the secondary information b_n of the data is very small, the algorithm goes to the next iteration. In the second iteration, the weight ω_1 is obtained by subtracting the iteration threshold 0.2 from the weight ω_1 of the first iteration, because the previous iteration has already detected most of the data that depend on the primary information a_n, so ω_1 does not need to be set as large as before. However, the weight ω_1 in all iterations is greater than or equal to 0.6, which keeps the primary information dominant in the first stage of processing. The weight ω_1 is calculated by the following formula.
Finally, the above three-stage process is repeated until all data are processed. The data become increasingly dependent on secondary information, so the speed of descent should increase. The spacing distance can be calculated by a formula defined as follows.
where τ is the learning rate, δ is the dissipation coefficient, a_{x*_n} represents the primary information value of the chosen data, a_{x*_{n−1}} represents the primary information value of the next data, and W is the sum of all previous weights. The weights {ω_1, ω_2} are updated by the following formula.
{ω_1, ω_2} can be used to calculate the confidences {C_n}_{n=1}^N of all data at the current stage. In practical applications, the first iteration needs special consideration: in it, we set a minimum gap in the first two stages and a maximum gap in the second stage to prevent the weights from decreasing too quickly or too slowly. The maximum gap and minimum gap are also set according to the number of data points in each stage. All data can be sorted by confidence {C_n}_{n=1}^N, and we choose the data corresponding to the highest confidence C_n; the data with the highest confidence C_n are considered to have the Hα emission line. When the proportion of chosen data points is greater than the threshold 0.7 or the secondary information b_n of the data is very small, the weight-changing process goes to the next iteration. In all subsequent iterations, the complete iterative processing does not include three stages, but we still need to set the gap restrictions. After multiple iterations, all data have been processed and sorted.
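One weight update step can be sketched qualitatively as below. The paper's exact tuning function f(τ, δ) is not reproduced in the text; this sketch only shows the behaviour described above: ω_1 decreases and ω_2 increases by a step driven by the gap between consecutive primary-information values, scaled by the learning rate τ and damped by the dissipation coefficient δ through the accumulated weight sum W. The step formula itself is an assumption.

```python
def update_weights(w1, w2, a_curr, a_next, W, tau=0.1, delta=0.1):
    """One momentum-style weight update step (illustrative form, not the
    paper's exact f(tau, delta)): the step grows with the gap between the
    primary information of the chosen data a_curr and the next data
    a_next, is scaled by tau, and is damped as the accumulated weight
    sum W grows."""
    step = tau * abs(a_curr - a_next) / (1.0 + delta * W)
    w1 = max(0.0, w1 - step)  # primary-information weight decreases
    w2 = min(1.0, w2 + step)  # secondary-information weight increases
    return w1, w2
```

Larger τ or δ settings change how quickly ω_1 falls, which is exactly the bias towards primary or secondary information discussed in the theoretical analysis.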

C. THEORETICAL ANALYSIS
In this algorithm, data preprocessing is indispensable to the entire algorithm because it greatly reduces the amount of data that the algorithm operates on, which significantly reduces the time complexity and space complexity. After removing most of the easily determined data, only a small portion of undetermined data is used in WEDA. The time complexity of WEDA is O(N^2), where N is the number of data points that fail to be determined by data preprocessing. The learning rate τ and the dissipation coefficient δ have a great impact on the algorithm: the two parameters affect which data the algorithm prefers. The larger the values of the two parameters are, the faster the weight ω_1 decreases; the algorithm then selects most of the data with more secondary information, which indicates that the algorithm is biased towards data that depend only on secondary information. The smaller the values of the two parameters are, the more slowly the weight ω_1 decreases; the data with more primary information are then chosen by the algorithm, which means that it prefers data that depend only on primary information. In this algorithm, the speed at which the weights decrease affects the data selected by the algorithm; therefore, we need to limit the size of the weight decrease.
The three subsets have their own ordered sequences. The higher the rank in a sequence is, the higher the probability that the Hα emission line appears. We then only need a threshold as the boundary between the two types of data; this boundary value is called the cutoff threshold.
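The cutoff-threshold split can be sketched in a few lines; the cutoff is expressed here as a fraction of the ordered sequence, as in the experiments below.

```python
def split_by_cutoff(ranked_ids, cutoff):
    """Split an ordered sequence (most likely H-alpha candidates first)
    at the cutoff threshold, given as a fraction of the sequence length.
    Returns (with H-alpha, without H-alpha)."""
    k = int(round(cutoff * len(ranked_ids)))
    return ranked_ids[:k], ranked_ids[k:]
```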

IV. EXPERIMENTAL RESULTS
In this section, we rigorously evaluate our method from two perspectives on an Intel(R) Core(TM) i7-6700HQ with 8.0 GB of memory running Windows 10. In addition, to verify whether our method works efficiently on spectral data, we choose five data sets of different sizes as classification data and six different classification algorithms for comparison. The update of the weights is analyzed based on figure 6.
We implemented our method and all the compared classification algorithms in Python. Spectral data from LAMOST DR5 are used in all algorithms. To describe the characteristics of the emission lines during data preprocessing, the wavelength range used for each line must be determined; Table 1 shows the wavelength ranges of the six emission lines, which are used to extract the emission-line information. The distance threshold θ_1 is initialized to 1, and the predefined weights {ϕ_1, ϕ_2} are set to {0.8, 0.2} for all tasks. The learning rate τ and the dissipation coefficient δ are set to 0.1 when the number of data points is fewer than 20000. In practical applications, we need to constrain the gap to control how quickly the weights decrease and increase. In the first stage of the weight update, there is a minimum gap to prevent the primary weight from decreasing too slowly. The weight related to the primary information must remain greater than 0.8, and when the total number of data points is smaller than 20000, the number of chosen data points with a level greater than or equal to 3 should be 0.9 of the total number of data points with a level greater than or equal to 3. The minimum gap is based on the ratio of the weight range related to primary information to the number of data points for which κ is greater than or equal to 3. A minimum gap and a maximum gap also need to be set in the second stage, where the weight related to primary information is limited to the range 0.5 to 0.8. The difference between the minimum gap and the maximum gap comes from the fractions of the data points for which κ is greater than or equal to 1 and smaller than 3: the fraction used for the minimum gap is set to 0.5, and that for the maximum gap is set to 0.3. In subsequent iterations, the minimum gap and the maximum gap of the complete iterative processing are set with fractions 0.2 and 0.25, respectively.
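The gap rules above are compact but dense, so the following sketch spells out one possible reading of them; the exact formulas (dividing the available weight range by the stated fractions of the point counts) are our interpretation, not the paper's explicit equations.

```python
def stage_gaps(n_kappa_ge3, n_kappa_1_to_3):
    """Gap limits for the first two stages (a sketch of the quoted rules).

    Stage 1: the primary weight stays above 0.8, and the minimum gap is the
    ratio of the available weight range to the number of points with
    kappa >= 3.  Stage 2: the weight moves inside [0.5, 0.8]; the minimum and
    maximum gaps divide that range by the 0.5 and 0.3 fractions of the points
    with 1 <= kappa < 3."""
    min_gap_1 = (1.0 - 0.8) / max(n_kappa_ge3, 1)
    stage2_range = 0.8 - 0.5
    min_gap_2 = stage2_range / max(int(0.5 * n_kappa_1_to_3), 1)
    max_gap_2 = stage2_range / max(int(0.3 * n_kappa_1_to_3), 1)
    return min_gap_1, min_gap_2, max_gap_2
```

Note that dividing by the larger fraction (0.5) yields the smaller per-step gap, so `min_gap_2 < max_gap_2` holds by construction.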
For each classification method, we repeat the same experiments multiple times with different amounts of spectral data and obtain the execution time, recall rate and precision. For each experiment, our method and the compared methods share the same data for a fair comparison. The process of updating the weights in 19411 data points is also shown and analyzed. The data sets used in this paper are described in detail in section A. The other implementation details and experimental results for each task and algorithm are shown in sections B and C.

A. DATA DESCRIPTION
In this paper, the dataset for all algorithms is obtained from LAMOST DR5 V3. LAMOST, also called the Guo Shou Jing Telescope, can acquire 4000 spectra in a single exposure. The LAMOST DR5 spectral data were obtained during a six-year sky survey from October 2011 to June 2017 and include 4154 astrometric fields and 9026365 spectra in total. The number of high-quality spectra with an SNR greater than 10 reaches 7775981. To obtain the different amounts of data the experiments require, we only need to set different position constraints and choose star and galaxy data. The resulting data set sizes are 617, 3611, 8202, 12872, and 19411.
Standard data with the Hα emission line are shown in figure 1, where the horizontal coordinate represents wavelength and the vertical coordinate represents flux. The emission lines at the specified wavelengths are obvious and have good characteristics. We find that these data with the Hα emission line also contain two NII emission lines, two SII emission lines, and an Hβ emission line, because these six emission lines are related. The correlations fall into two categories: one is the correlation between the Hα emission line and the Hβ emission line, and the other is the correlation between the Hα emission line and the four other emission lines. Based on this idea, data with the Hα emission line can be divided into two categories. Such data are easy to detect because their characteristics are very obvious and all six emission lines are present. In reality, however, many data are disturbed by noise, which weakens the characteristics so that it is difficult to distinguish whether the data contain the Hα emission line. These data with weak characteristics require the other emission lines to judge whether they contain the Hα emission line.
In general, spectral data with the Hα emission line also contain other related emission lines, which means that if there are only obvious characteristics in the Hα emission line location but no characteristics in other emission lines, it cannot be demonstrated that the spectral data contain the Hα emission line. The spectral data with the Hα emission line are bound to contain other emission lines, such as the Hβ emission line. Based on the above two correlations, we can choose spectral data with the Hα emission line.
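A minimal measurement of the six lines can be sketched as follows. The rest-frame line centres are standard values (Hα 6562.8 Å, Hβ 4861.3 Å, [NII] 6548.1/6583.5 Å, [SII] 6716.4/6730.8 Å); the ±5 Å windows are only an illustrative stand-in for the wavelength ranges the paper lists in Table 1, and the peak-minus-continuum measure is our simplification.

```python
import numpy as np

# Rest-frame centres of the six lines (Angstrom).  The +/-5 A windows are an
# assumption standing in for the paper's Table 1 ranges.
LINE_CENTRES = {"Halpha": 6562.8, "Hbeta": 4861.3,
                "NII_1": 6548.1, "NII_2": 6583.5,
                "SII_1": 6716.4, "SII_2": 6730.8}

def line_strengths(wavelength, flux, half_width=5.0):
    """Peak flux inside each line window minus the local continuum
    (median flux just outside the window) -- a crude emission measure."""
    wavelength = np.asarray(wavelength, dtype=float)
    flux = np.asarray(flux, dtype=float)
    strengths = {}
    for name, centre in LINE_CENTRES.items():
        inside = np.abs(wavelength - centre) < half_width
        nearby = (np.abs(wavelength - centre) < 4 * half_width) & ~inside
        if not inside.any() or not nearby.any():
            strengths[name] = 0.0          # line falls outside the spectrum
            continue
        strengths[name] = float(flux[inside].max() - np.median(flux[nearby]))
    return strengths
```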
The first correlation is between the Hα emission line and the Hβ emission line. When the two SII and two NII emission lines are disturbed by noise, this kind of data still contains an Hβ emission line, and the Hα and Hβ emission lines are strongly related. Figure 2 shows some examples with this connection. As seen from the figure, these data contain at least the Hα and Hβ emission lines, but each data point has its own situation. In general, data with the Hα emission line have other emission lines in the frame, such as NII, SII, and Hβ. In these figures, each data point must contain an Hβ emission line, but different spectral data show different situations; for example, the spectral data in figure 2-a have an obvious Hβ emission line and two weak NII emission lines, while the spectral data in figure 2-b have an NII emission line and an SII emission line. Figure 2-c and figure 2-d also show that spectral data with the Hα emission line have at least an Hβ emission line. Based on these spectral data, the qualities of the Hα and Hβ emission lines are used as primary information. Therefore, this type of data can be judged directly by the Hα and Hβ emission lines without additional information such as the frame and neighbor-data classification. This type of data needs only the weight related to primary information, because some data of this type have no other emission lines chosen from the frame or are surrounded by data without the Hα emission line. If the weight related to the secondary information were dominant, or greater than the weight related to the primary information, this type of data would be difficult to detect. Moreover, there are some special cases; for example, spectral data may visibly contain the Hα emission line and other emission lines but no Hβ emission line. This type of data can be detected in the second or even third iteration of the algorithm.
The second correlation, between the Hα emission line and the four other emission lines, is shown in figure 3. In figure 3, the spectral data have SII or NII emission lines but no Hβ emission line. In contrast to the data above, these spectral data do not require an Hβ emission line. Figure 3-a and figure 3-b have two NII emission lines and an SII emission line; these three emission lines can demonstrate the existence of the Hα emission line. The two spectral data in figure 3-c and figure 3-d have two SII emission lines and two NII emission lines. The more emission lines there are, the higher the confidence. Based on these data situations, the qualities of the NII and SII emission lines are used as secondary information. For this type of data, the primary information given by the qualities of the Hα and Hβ emission lines is useless and even interferes with detection; the detection of this type of data depends only on the four emission lines in the frame and on neighbor-data classification. Therefore, only the weight related to the secondary information needs to be used to detect these data with the Hα emission line. In all four images in figure 3, the data are ultimately found to have three emission lines in the frame. Most data with a weak Hα emission line have SII emission lines or NII emission lines, or even both. The neighboring data are also used to demonstrate the existence of the Hα emission line: if most neighboring data contain the Hα emission line, the data point is more likely to contain it as well. Between the frame and neighbor-data classification, the frame is more useful, because the data may be outliers, and data with the Hα emission line generally have other emission lines. For the frame, there are ultimately two emission lines, and the frame can be used to demonstrate the existence of the Hα emission line.
If the frame only has one emission line, the frame information becomes useless. The secondary information can be dominant when data do not contain an Hβ emission line. The weights in the algorithm need to be adjusted to adapt to this situation, so that the algorithm can detect data containing other emission lines in the frame.
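The two correlations can be combined into a simple decision sketch. It assumes a dictionary of per-line emission strengths (for example, measured in a preprocessing step); the detection threshold and the requirement of at least two NII/SII companions for the secondary route are our assumptions, chosen only to mirror the reasoning above.

```python
def has_halpha(strengths, thresh=1.0):
    """Decision sketch for the two correlations described in the text:
    (1) primary  -- both H-alpha and H-beta are clearly present, or
    (2) secondary -- H-alpha is present together with at least two of the
        NII/SII companion lines (the exact count is an assumption)."""
    ha = strengths["Halpha"] > thresh
    hb = strengths["Hbeta"] > thresh
    companions = sum(strengths[k] > thresh
                     for k in ("NII_1", "NII_2", "SII_1", "SII_2"))
    return ha and (hb or companions >= 2)
```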

B. ANALYSIS OF THE QUALITY OF WEDA
The objective of the experiments presented in this subsection is to analyze the quality of WEDA when data sets of different sizes are used as test data sets. Five data sets are used, which efficiently tests the quality of the algorithm over various amounts of data.
Two perspectives are used to analyze the quality of our algorithm. The first perspective is different SNRs. Before preprocessing, the entire dataset is divided into three parts based on the SNR: 0-10, 10-50, and above 50. Experiments with different SNRs help us determine which SNR range contains data that are difficult to classify and that reduce the overall quality of this algorithm, revealing a drawback of the algorithm. The second perspective is the amount of data. Each part is divided based on its own situation and cutoff threshold, and we choose a cutoff threshold such that the recall rate is 1 to compare the quality of our algorithm under different amounts of data.
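The SNR partition used throughout the experiments is just a three-way binning, sketched here for concreteness:

```python
def split_by_snr(snr_values):
    """Partition data indices into the three SNR bins used in the
    experiments: 0-10, 10-50, and 50 or above."""
    low, mid, high = [], [], []
    for i, snr in enumerate(snr_values):
        if snr < 10:
            low.append(i)
        elif snr < 50:
            mid.append(i)
        else:
            high.append(i)
    return low, mid, high
```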

1) DIFFERENT SNRs AND DIFFERENT CUTOFF THRESHOLDS
In this subsection, the influence of different SNRs on quality is compared. As mentioned above, we use five data sets with different amounts of data and obtain the recall rate and precision for each; the results are shown in figure 4, and the descriptions of the five data sets are given in Table 2. In figure 4, each color represents a data set. There are three images in each row: the images in the first row show the recall rates on the five data sets, and the images in the second row show the precisions. The horizontal axis represents the cutoff threshold, defined as the ratio of the number of data points with the Hα emission line chosen from the data set D_i to the size of D_i, where D_i consists of all data points for which data preprocessing cannot determine whether they have the Hα emission line.
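Sweeping the cutoff threshold over a ranked list yields the recall and precision curves plotted in figure 4. A minimal sketch, assuming a list of ground-truth labels in ranked order:

```python
import numpy as np

def pr_curve(ranked_labels):
    """Recall and precision as the cutoff threshold sweeps the ranked list.

    ranked_labels[i] is 1 if the i-th ranked data point truly has the
    H-alpha emission line, else 0."""
    labels = np.asarray(ranked_labels, dtype=float)
    tp = np.cumsum(labels)                 # true positives above each cutoff
    k = np.arange(1, len(labels) + 1)      # points selected at each cutoff
    recall = tp / max(labels.sum(), 1.0)
    precision = tp / k
    cutoffs = k / len(labels)              # cutoff as a fraction of the list
    return cutoffs, recall, precision
```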
FIGURE 4. The influence of different SNRs on quality. Five data sets of sizes 617, 3611, 8202, 12872, and 19411 are used, and the recall rates and precisions are obtained. The first column represents data whose SNR is between 0 and 10: figure a (first row) shows the recall rate, and figure d (second row) shows the precision. The second column represents data whose SNR is between 10 and 50: figure b shows the recall rate, and figure e shows the precision. The third column represents data whose SNR is above 50: figure c shows the recall rate, and figure f shows the precision.
The first column of figure 4 demonstrates the quality of the algorithm on the five data sets when the SNR is between 0 and 10. The recall rates for the five data sets are shown in figure a. From figure a, we can find that as the amount of data increases, the cutoff threshold becomes stable and eventually converges to 0.65. Data preprocessing has excluded the data believed not to contain the Hα emission line, and these data are not used in the subsequent data ranking; data preprocessing by itself cannot identify data with the Hα emission line. When the cutoff threshold is 0, the recall rates are 0 because no data are believed to contain the Hα emission line at the data preprocessing stage. The whole data-ranking process only handles the data that cannot be determined by preprocessing. The cutoff threshold is approximately 0.7 when the amount of data is 617.
At this size, the amount of data used in the algorithm is small, which makes the data situation simple and the confidence of data with the Hα emission line higher in general, so a larger cutoff threshold can find all data with the Hα emission line. As the amount of data increases, the data situation gradually becomes comprehensive, and some special cases appear among the data. In the figure, when the amount of data reaches 19411, the cutoff threshold begins to remain stable; compared to the cutoff thresholds for small data sets, it is close to 0.65 and does not change much. The reason for this phenomenon is that although the number of data situations keeps increasing, the number of data points with the Hα emission line also increases.
The SNR of this data set is 0 to 10, and the SNR of all the data is relatively close, so when the amount of data is similar, their cutoff thresholds are also close. Because the SNR is relatively low, the confidence will be disturbed by noise. Some data that depend on primary information can only be determined during the second iteration, or even the third iteration, which makes the cutoff threshold larger.
Figure d shows all precision values in the five data sets in which the SNR is 0 to 10. The green line, representing 19411 data points, is the first to begin to slump. Because the SNR is low, the data that depend on primary information are disturbed by noise. The weight related to the primary information remains dominant at the stage that identifies data depending on primary information; therefore, some data that do not contain the Hα emission line also have high confidence. The orange line, representing 3611, has the largest drop, at a cutoff threshold of 0.15. Due to the small amount of data with the Hα emission line in small data sets, even a small amount of misclassified data makes the precision slump. When the cutoff threshold is 0.15, the recall rate has not reached 1, which shows that some spectral data without the Hα emission line have a high ranking. The pink line starts to drop only after the recall rate has reached 1; small data sets have simple data situations, so misclassified data are relatively less common. The other lines drop only slightly because the amount of data is large.
The number of data points for which the SNR is between 0 and 10 is the smallest of the three parts, and the amount of data for the Hα emission line is also smaller. In a small dataset, no lines have a steady trend, because the data situation is not comprehensive and the value generated by the data preprocessing may be inaccurate in classifying some data due to the lower SNR.
The second column of figure 4 shows the recall rate and precision of datasets for which the SNR is between 10 and 50, using different sizes of data sets. All recall rates for the five data sets are shown in figure b. When the amount of data is small, the cutoff threshold varies greatly; for example, the pink line reaches 1 when the cutoff threshold is 0.75. From this point of view, the cutoff threshold has no reference value at this time, and the true cutoff threshold cannot be determined. One reason for this result is that small data sets have unique situations that are only a small part of the overall situation. In figure b, the pink line, green line, and blue line reach 1 at cutoff thresholds with great differences between them. As the number of data points increases, the cutoff threshold slowly stabilizes. The green line, representing 19411, reaches 1 when the cutoff threshold is 0.55, where the precision falls to 0.94; the pink line reaches 1 at a cutoff threshold of 0.75; the blue line reaches 1 at 0.6. This is mainly because data with an SNR between 10 and 50 are complicated, and the SNR range is so large that the SNR differs greatly between data points. It is difficult for the algorithm to detect all data perfectly, as can be seen below in figure e: when the recall rate reaches 1, most precision values have slowly declined to 0.9.
The quality for data with an SNR between 10 and 50 is very steady, and the precision is better than that for data with an SNR between 0 and 10. As mentioned before, as the SNR increases, the characteristics of the emission lines in the spectral data become increasingly obvious. When the number of data situations begins to increase, the result does not change very much. However, the data with the Hα emission line become increasingly complicated as the amount of data increases. Some data are regarded as outliers because the values of both their primary and secondary information are small, which makes their confidence low and decreases their ranking. Therefore, when the recall rate reaches 1, the precision drops slightly and can eventually become low.
It can be seen that the precision values are not high when the recall rate is 1. This is because the confidence of some special data is low, so they are ranked lower in the sequence; it is often difficult to extract valid feature information from such data through data preprocessing.
The results for data with an SNR greater than or equal to 50 are shown in the third column of figure 4. The lines are not similar to the above two results and change greatly. In figure c, the recall trend of the pink line, representing 617, is not smooth. It is easy to detect data with the Hα emission line in datasets with a higher SNR: in this SNR range, data with the Hα emission line have high confidence, and many data can be determined by data preprocessing, leaving only a small amount of data, including some with the Hα emission line. The primary information of data with the Hα emission line has a high value because the characteristics of these spectral data are very obvious, so they are easy to detect. The green line, representing 19411, reaches 1 when the cutoff threshold is 0.6. In figure f, all lines except the blue line remain at 1 until the recall reaches 1, which shows that all those spectra are correctly classified. The blue line, representing 12872, drops to 0.5 when the cutoff threshold is 0.1 because some spectral data are misclassified initially; it then rises to 0.91 by the time the cutoff threshold reaches 0.7, which shows that the rest of the spectral data are correctly classified. On the whole, spectral data with an SNR above 50 are still easy to classify, since their Hα emission-line characteristics are all very obvious.
We analyze the quality of the three parts of the data from the above six figures. Data for which the SNR is 50 and above are easy to classify based on high confidence in data with the Hα emission line.

2) DIFFERENT AMOUNTS OF DATA
After analyzing the recall rate and precision, the impact of the amount of data on the algorithm is analyzed in this subsection. The algorithm can divide all data into three parts based on SNR, and each part of the data is processed separately. According to the above results, we choose a cutoff threshold based on the recall rate. The requirement of the experiment is to identify all data with the Hα emission line, so the recall rate should reach 1. Under the premise that the recall rate reaches 1, the cutoff threshold is chosen to calculate the overall precision. Five different sizes of datasets are used in the experiment. The result is shown in figure 5.
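The selection rule just described (the smallest cutoff at which recall reaches 1, reporting the precision achieved there) can be sketched as:

```python
def cutoff_at_full_recall(ranked_labels):
    """Smallest cutoff threshold at which every H-alpha point is recovered,
    plus the precision achieved there (the selection rule used in the text).

    ranked_labels[i] is 1 if the i-th ranked point truly has H-alpha."""
    total = sum(ranked_labels)
    seen = 0
    for i, lab in enumerate(ranked_labels):
        seen += lab
        if seen == total:                    # recall reaches 1 here
            cutoff = (i + 1) / len(ranked_labels)
            precision = total / (i + 1)
            return cutoff, precision
    return 1.0, 0.0
```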
In figure 5, the green bar represents the precision for data whose SNR is between 0 and 10, the blue bar represents the precision for data whose SNR is between 10 and 50, and the orange bar represents the precision for data whose SNR is above 50. The relationship between the SNR and the precision can be found by comparing the three bars. In general, the larger the SNR is, the higher the precision, which is also demonstrated by figure 5. However, data set A and data set D show an abnormal situation: the precision for data whose SNR is between 0 and 10 is the highest. In data set A, the amount of data is small, so even if the number of misclassified data points is small, the precision is lower. In data set D, both the subset whose SNR is between 10 and 50 and the subset whose SNR is above 50 have special situations, while the data whose SNR is between 0 and 10 only have a simple situation. The five average values, connected by the black dotted line, show that the precision falls from 617 to 12872 due to the more complicated data situations and the larger number of data with Hα emission lines. Therefore, the quality of the algorithm is still not stable. At the same time, the amount of data has a great influence on the algorithm: a small dataset only represents part of the situation, and as the amount of data increases, the precision changes. The overall trend of the average is downward, while the precision begins to stabilize at 0.91 in the dataset of size 19411. The larger the dataset is, the more data situations it contains; this stabilizes the algorithm, so the precision is stable when the data size increases from 12872 to 19411. For large data sets, there are too many data situations to accurately extract information from the spectral data. The height of the green bar starts at 0.96, which shows that the algorithm still misclassifies some data in small data sets; there should be less misclassified data with an SNR between 10 and 50. The height of the green bar eventually increases slightly, but the amplitude is small, which shows that the data situations remain stable.

3) WEIGHT ANALYSIS
From the above analysis, we find that the SNR affects this algorithm because data sets with different SNRs have their own data situations, which affect the weights in the algorithm. In this subsection, we analyze how the weights are updated during the algorithm. The data set containing 19411 pieces of data is put into WEDA to obtain figure 6, which shows the updates of the weights.
The three images in figure 6 represent the updates of the weights for the three SNR ranges. The horizontal coordinates of the three images represent different numbers of ranked data points: figure 6-a has the most because data with an SNR between 0 and 10 are difficult for data preprocessing to identify, and figure 6-c has the fewest because the high SNR makes data easier to identify at preprocessing. The different SNR values lead to different ranking numbers because of differing data complexity. In contrast to figures 6-a and 6-b, the initial primary weight in figure 6-c is only 0.75. As described in the introduction of this algorithm, the initial weights are determined by the differences among the specific data; an initial primary weight of only 0.75 shows that the difference among the data not determined by data preprocessing is small and that the data situation at high SNR is relatively simple.
The processes of the algorithm in figures 6-a and 6-b run for three iterations. Both have almost the same overall trend and differ only in parts. The update of the weights is determined by the difference in primary information: if the primary information of data with close rankings differs greatly, the weight related to the primary information tends to decrease faster. The data in figures 6-a and 6-b have low SNR and complex data situations, which require multiple iterations to find all data with the Hα emission line. In figure 6-a, the weight related to the primary information falls only slightly by the end of the first iteration, which indicates that most data with an SNR between 0 and 10 that are undetermined by data preprocessing depend on secondary information and that the difference in secondary information among these data is small. Figure 6-c shows that the detection of data with the Hα emission line requires only the first iteration; the weight related to primary information falls from 0.75 to 0.15 after the first ranking. One reason for this is that data with a high SNR have very obvious characteristics, which indicates that data with the Hα emission line have a large amount of primary information and are easy to distinguish.
For WEDA, data are classified when the recall rate is 1. We analyzed the quality of this algorithm from the perspective of data size and SNR, selecting the case where the recall rate is 1.

C. COMPARISON WITH OTHER ALGORITHMS

1) PRECISION AND RECALL RATE
The experiments are carried out on the five different data sets, and all algorithms' recall rates and precisions are obtained. The recall rate and precision of all algorithms are displayed in figure 7, in which each color represents an algorithm. In the figure, the dotted line represents the recall rate, and the solid line represents the precision.
From the results presented in the figure, it is clear that WEDA shows the best overall quality: under the condition that the recall rate is 1, the precision of WEDA outperforms that of all the other algorithms. Among the compared algorithms, the precision of BSVM (the yellow dotted line) reaches 1, but its recall rate is very low, which indicates that it finds only a small amount of data with the Hα emission line; the result for MSSVM is the same. These algorithms only detect particularly obvious data for which the confidence of the Hα emission line is high; for data that require secondary information, such algorithms ignore the data and are unable to judge them.
For other algorithms, their recall rate can reach a satisfactory level, but they have poor precision. These algorithms use features of minority data and oversample a small amount of the data. Although data recall rates increase, data precision decreases. For spectral data including stars and galaxies with the Hα emission line, these models are unsuitable.
As more and more data are input, the dataset becomes increasingly complicated. The recall rate becomes unable to reach 1 and the precision also remains low, which makes WEDA inevitably worse, but its recall rate and precision remain at a good level.
Algorithms such as multitrain have a very high time cost on a small dataset, and their running time does not change much with an increase in the amount of data; this is mainly related to the number of iterations of the algorithms themselves rather than the amount of data. The time complexity of the remaining three algorithms is affected by the amount of data, and their running time on all data sets is lower than that of the above four algorithms. The figure clearly shows that our algorithm has the lowest running time. The data preprocessing stage of our algorithm determines many data points and reduces the amount of data; although WEDA processes data over multiple iterations, the amount of data put into WEDA is small, so the time required decreases. Our algorithm's running time increases as the amount of data increases, but the magnitude of the increase is minimal.

V. DISCUSSION
In this paper, we proposed a novel ranking algorithm called WEDA. WEDA uses changeable weights instead of fixed weights to adapt to complicated data and to determine which data contain the Hα emission line in different situations. Both the weight update and the initialization are based on the data rather than on artificial settings, so the algorithm adapts to different data sets. We use this algorithm to detect the Hα emission line in star and galaxy data, and experiments confirm that it compares very well with other algorithms; it is more effective than other algorithms in detecting the Hα emission line. While maintaining a high recall rate, this algorithm's precision can also reach a satisfactory level, and even if the amount of data increases, its overall quality does not become very poor. The experimental results indicate that WEDA can be used to detect the Hα emission line in star and galaxy data. In the future, we plan to apply it to large-scale data to detect the Hα emission line and to make the algorithm applicable to other types of data.
HAIFENG YANG is currently a Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. His research concerns data mining and machine learning methods in specific backgrounds, especially astronomical big data. He is a long-term member of the Institute for Intelligent Information and Data Mining. He is a member of the China Computer Federation (CCF) and the Chinese Astronomical Society (CAS).
JIANGHUI CAI is currently the Chief Professor of computer application technology with the Taiyuan University of Science and Technology, Taiyuan, China. His research interests include data mining and machine learning methods in the specific backgrounds of astronomical informatics, seismology, and mechanical engineering. He is a long-term member of the Institute for Intelligent Information and Data Mining. He is a Senior Member of the China Computer Federation (CCF).
XUJUN ZHAO received the M.S. degree in computer science and technology from the Taiyuan University of Technology, China. He is currently pursuing the Ph.D. degree with the Taiyuan University of Science and Technology. His research interests include data mining and parallel computing.
YALING XUN received the B.S. degree in computer science and technology from the Harbin University of Science and Technology (HUST), and the M.S. and Ph.D. degrees from the Taiyuan University of Science and Technology (TYUST). She is currently an Associate Professor with the School of Computer Science and Technology, TYUST. Her research interests include data mining and parallel computing. She is a member of China Computer Federation (CCF).
CAIXIA QU is currently pursuing the master's degree with the Taiyuan University of Science and Technology (TYUST), Taiyuan, China. Her current research interests include data mining and parallel computing.