Hybrid Heuristic Algorithm Based On Improved Rules & Reinforcement Learning for 2D Strip Packing Problem

A hybrid heuristic algorithm based on improved rules and reinforcement learning is proposed to solve the 2D strip packing problem (2DSPP). Firstly, the scoring rules based on the skyline algorithm are improved by considering “two-step” successive items with a set of width relaxation factors. The improved scoring rules reduce space waste efficiently. Secondly, as a reinforcement learning approach, a Deep Q-Network (DQN) is established to generate the initial sequence of rectangular items and, at the same time, to serve as an essential supplement to the placement rules. It improves space utilization and helps prevent the algorithm from falling into a local optimum. Combining the new “two-step” placement rules and the DQN, a heuristic algorithm based on the simple random algorithm (SRA) is proposed, called the reinforcement learning based simple random algorithm (RSRA). Experiments on eight datasets with five algorithms have been conducted for comparison. Results show that the RSRA achieves the best performance on the eight datasets (C, N, CX, NT, 2sp, NP, ZDF, BWMV) and reduces Ave. Gap% by 45.86%, 45.16%, 30.89% and 20.56% relative to GRASP, SRA, IA and ISH, respectively. It can be concluded that the RSRA achieves better performance than the other four algorithms on the eight datasets, especially on the larger datasets.


I. INTRODUCTION
As a typical combinatorial optimization problem, the packing problem has been proven to be NP-hard [1]. Packing problems with different constraints and objectives arise widely in the manufacturing, transportation, and computer industries. This paper focuses on the 2DSPP.
A. 2DSPP MATHEMATICAL MODEL
The 2DSPP in this paper is described as follows. Given a set of rectangles of size $w_i \times h_i$, $i = 1, \dots, n$, and a strip of width $W$ and infinite height, let the lower-left corner of the strip be the origin of the two-dimensional coordinate system and place the rectangles into the strip. The goal is to minimize the used height of the strip while all rectangles meet the following conditions: (1) rectangles cannot overlap; (2) the rectangle edges must be parallel to the X- or Y-axis; (3) no rectangle may exceed the bounds of the strip.
$(x_{i1}, y_{i1})$ and $(x_{i2}, y_{i2})$ represent the lower-left corner and the upper-right corner of rectangle $i$, respectively. The mathematical description [41] of the 2DSPP is as follows:
$$\min H$$
s.t.
(1) $H = \max\{y_{i2},\ i = 1, 2, \dots, n\}$;
(2) $x_{i2} - x_{i1} = w_i$ and $y_{i2} - y_{i1} = h_i$, $i = 1, 2, \dots, n$;
(3) $\max\{x_{i1} - x_{j2},\ x_{j1} - x_{i2},\ y_{i1} - y_{j2},\ y_{j1} - y_{i2}\} \ge 0$, $i, j = 1, 2, \dots, n$, $i \ne j$;
(4) $0 \le x_{i1} \le W - w_i$ and $y_{i1} \ge 0$, $i = 1, 2, \dots, n$.
The first constraint defines the total used height of the strip, labeled $H$. The second constraint indicates that each rectangle keeps its given width and height along the axes, that is, it is placed without rotation. The third constraint ensures that there is no overlap between the rectangles. The fourth constraint means that each rectangle must lie inside the strip.
The associate editor coordinating the review of this manuscript and approving it for publication was Diego Oliva.
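The four constraints of the model can be checked mechanically for a candidate layout. The following Python code is a minimal sketch of such a check, not part of the proposed algorithm; the function names `strip_height` and `is_feasible` and the tuple layout are illustrative assumptions, and the overlap test assumes constraint (3) compares each rectangle's lower-left corner with the other rectangle's upper-right corner.

```python
from itertools import combinations
from typing import List, Tuple

# Hypothetical layout: (x1, y1, x2, y2) = lower-left and upper-right corners.
Placement = Tuple[int, int, int, int]
Size = Tuple[int, int]  # (w_i, h_i)


def strip_height(placements: List[Placement]) -> int:
    """Constraint (1): H is the largest upper edge y_i2."""
    return max(y2 for _, _, _, y2 in placements)


def is_feasible(placements: List[Placement], sizes: List[Size], W: int) -> bool:
    """Check constraints (2)-(4) of the 2DSPP model for a given layout."""
    for (x1, y1, x2, y2), (w, h) in zip(placements, sizes):
        # Constraint (2): the rectangle keeps its width and height (no rotation).
        # Exact comparison assumes integer dimensions.
        if x2 - x1 != w or y2 - y1 != h:
            return False
        # Constraint (4): the rectangle lies inside the strip.
        if x1 < 0 or x1 > W - w or y1 < 0:
            return False
    for a, b in combinations(placements, 2):
        # Constraint (3): at least one separating gap must be non-negative.
        gaps = (a[0] - b[2], b[0] - a[2], a[1] - b[3], b[1] - a[3])
        if max(gaps) < 0:
            return False
    return True
```

For instance, two items of size 2 × 3 and 3 × 1 placed side by side in a strip of width 5 pass the check with $H = 3$, while shifting the second item one unit to the left makes the pair overlap and the check fail.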

B. PREVIOUS WORK
Researchers have proposed several algorithms for the problem. Early studies focused on exact algorithms. Beasley [2] proposed a tree-search algorithm. Martello et al. [3] used a branch-and-bound algorithm. In 2019, Wei et al. [4] eliminated the conflict between two items by branching first, and then used dynamic programming to solve a two-constraint knapsack problem at the leaf nodes. Experiments showed that the algorithm was superior to the existing branch-and-bound method. Bezerra et al. [5] proposed two mixed integer linear programming models. The results show that the models can produce more optimal solutions. However, exact algorithms are only suitable for small problem instances.
Considering the running time of exact algorithms, heuristic and meta-heuristic algorithms have also been introduced for solving the 2DSPP. Meta-heuristic algorithms include genetic algorithms [6], [7], simulated annealing algorithms [8], [9], and particle swarm optimization (PSO) [10], [11]. Classic heuristic algorithms include the Bottom Left [12] algorithm, the Bottom Left [13] algorithm, and the Best Fit [14] algorithm. Alvarez-Valdés et al. [15] proposed the greedy randomized adaptive search procedure (GRASP), which proved to be one of the best algorithms for the 2DSPP. Studies have found that a single heuristic algorithm suffers from premature convergence and tends to fall into a local optimum. On this basis, some scholars have put forward hybrid heuristic algorithms, which combine the advantages of different heuristic algorithms and effectively avoid these problems. For example, Huang et al. [16] proposed a lowest-level optimal-fit algorithm with memory and combined it with the PSO algorithm. Their experiments show that the hybrid heuristic algorithm is better than the single heuristic algorithm. Rakotonirainy et al. [38] proposed a hybrid approach in which simulated annealing is combined with a heuristic construction algorithm. They also proposed a second algorithm, which applies simulated annealing directly in the space of completely defined packing layouts, without an encoding of solutions. Experiments on the 2DSPP show that these two algorithms are better than the existing meta-heuristic algorithms.
In addition, there are some excellent heuristic algorithms. In 2011, Leung et al. [17] proposed a two-stage intelligent search algorithm (ISA) composed of local search (LS()) and simulated annealing. LS() simply swaps the positions of two items in a given sequence in turn. Simulated annealing is then applied to obtain a better global solution. Yang et al. [18] improved the algorithm of Leung et al. by replacing simulated annealing with an SRA that requires no parameter setting. In 2016, Wei et al. [39] proposed an efficient improved algorithm (IA), which adds a greedy selection stage to the algorithms of Leung et al. [17] and Yang et al. [18] and preferentially selects a better initial solution to enter the local search phase. In 2017, Wei et al. [40] proposed an improved skyline-based heuristic algorithm (ISH). The ISH, with a complexity of $O(n \log n)$, has proven to be superior to most heuristic algorithms. Although a heuristic algorithm can quickly obtain an approximately optimal solution, it is less general and must be specially designed for different problems.
As an artificial intelligence approach, reinforcement learning (RL) can directly extract useful information from the data, so that it can potentially learn better heuristic algorithms. RL has been widely used on combinatorial optimization problems such as the travelling salesman problem [19], the vehicle routing problem [20], graph coloring [21], maximum independent set [22], and packing problems. Bello et al. [23] proposed a combinatorial optimization framework based on reinforcement learning, which has yielded good results on the travelling salesman problem and the knapsack problem. In view of the complexity of combinatorial optimization problems, the DQN [24], which combines reinforcement learning with an artificial neural network, has demonstrated strong performance, especially for Maximum Cut [25], Maximum Common Subgraph [26] and other combinatorial optimization problems. A deep reinforcement learning framework based on dual DQN has been proposed for an online two-dimensional packing study [27]. Although reinforcement learning can perform better than heuristic algorithms on combinatorial optimization problems, its performance is influenced by the amount of training data.

C. OUR WORK
In this paper, a hybrid heuristic algorithm based on DQN and SRA is proposed. The main contributions of this paper are as follows:
(1) Previous rules have only considered “single-step” item placement. In this paper, a “two-step” sequence of successive items is considered to obtain a better combination by introducing a set of width relaxation factors.
(3) Introduction of another “Scorer” based on reinforcement learning. DQN is used to establish the evaluation function that scores the placement sequence of rectangular items. Compared with simple sequences sorted by perimeter, width, etc., DQN can not only generate the initial sequence but also improve overall space utilization. Another advantage is that it helps prevent the heuristic algorithm from falling into a local optimum and reduces the number of iterations.
(4) Algorithm fusion. The RSRA, which combines DQN and SRA, is proposed without too many parameters; its scoring rule is determined by combining the improved scoring rules (Scorer I) and DQN (Scorer II).
The remainder of this paper is organized as follows. Part II presents the hybrid heuristic algorithm based on the improved scoring rules of the skyline algorithm and the evaluation function based on DQN combined with SRA. Part III presents the experiments and analysis. The last part gives the conclusion and an outlook.

II. HYBRID HEURISTIC ALGORITHM BASED ON REINFORCEMENT LEARNING
A. SKYLINE ALGORITHM
As shown in Fig. 1, the placement space on the lowest and leftmost candidate line segment $s_{lowy}$ (giving priority to “the lowest”) is labeled $S$ (the blue box in Fig. 1). Of the two height differences between $s_{lowy}$ and its adjacent horizontal segments, the larger is $h_1$ and the smaller is $h_2$. The width of $s_{lowy}$ is recorded as $s_{lowy}.w$. The height and width of a candidate rectangle are recorded as $r.h$ and $r.w$. The smallest rectangle in the sequence of unplaced items is recorded as $r_{min}$ and its width as $r_{min}.w$. The space remaining after placing a rectangular item in $S$ is recorded as $A$ (the dotted box in Fig. 1) and its width is labeled $A.w$.
VOLUME 8, 2020
The skyline algorithm [28] is used as the basic framework to determine the arrangement rules of rectangular items. The algorithm steps are as follows. For a given sequence of rectangular items, repeat the following four steps until all rectangles are placed: 1) find $s_{lowy}$ as the position at which to place an item; 2) calculate the fitness value of each item in turn, following the original sequence; 3) place the rectangular item with the highest fitness value in $S$; 4) update the skyline.
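The four steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the paper: the skyline is stored as a list of `(x, y, width)` segments, the item is always placed at the left end of $s_{lowy}$, a segment on which no item fits is raised to the height of its lower neighbour, and the `fitness` callback is a placeholder for the scoring rules. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[int, int, int]  # (x, y, width) of one horizontal skyline segment
Item = Tuple[int, int]          # (width, height)


def merge(skyline: List[Segment]) -> List[Segment]:
    """Merge neighbouring skyline segments that have the same height."""
    out = [skyline[0]]
    for x, y, w in skyline[1:]:
        px, py, pw = out[-1]
        if py == y:
            out[-1] = (px, py, pw + w)
        else:
            out.append((x, y, w))
    return out


def skyline_pack(items: List[Item], W: int,
                 fitness: Callable[[Item, int], object]):
    """Pack items into a strip of width W; return (height, placements)."""
    if any(w > W for w, _ in items):
        raise ValueError("an item is wider than the strip")
    skyline: List[Segment] = [(0, 0, W)]
    remaining = list(items)
    placed = []  # (x, y, width, height) of each placed item
    while remaining:
        # Step 1: find the lowest (then leftmost) segment s_lowy.
        i = min(range(len(skyline)), key=lambda k: (skyline[k][1], skyline[k][0]))
        x, y, w = skyline[i]
        fitting = [r for r in remaining if r[0] <= w]
        if not fitting:
            # No item fits: the space is wasted, raise it to the lower neighbour.
            neighbours = [skyline[k][1] for k in (i - 1, i + 1)
                          if 0 <= k < len(skyline)]
            skyline[i] = (x, min(neighbours), w)
            skyline = merge(skyline)
            continue
        # Steps 2-3: score every fitting item, place the best one (first wins ties).
        best = max(fitting, key=lambda r: fitness(r, w))
        remaining.remove(best)
        placed.append((x, y, best[0], best[1]))
        # Step 4: update the skyline.
        new = [(x, y + best[1], best[0])]
        if best[0] < w:
            new.append((x + best[0], y, w - best[0]))
        skyline[i:i + 1] = new
        skyline = merge(skyline)
    return max(py + ph for _, py, _, ph in placed), placed
```

As a usage example, a stand-in fitness that prefers an exact width match and then a larger area can be passed as `lambda r, seg_w: (r[0] == seg_w, r[0] * r[1])`; it is only a placeholder and does not reproduce the scoring rules of Table 1.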

B. NEW SKYLINE SCORING RULES
The disadvantages of Yang's rules [18] and Gao's rules [29] are as follows:
(1) Yang's rules place no restriction or control on the width of the item to be placed. Items with smaller widths may therefore be placed first, which wastes more space than placing items with larger widths. It is not reasonable that, in Yang's rules, items with different widths placed in the same position receive the same fitness value.
(2) Gao [29] argued that space waste in width is inevitable or even necessary. A relaxation factor $\alpha$ is introduced specifically to ensure that a wider item is placed first to minimize the width waste. However, there may be cases in which the space remaining after placing the wider item can still hold one more item of minimum width (see Fig. 2(a)-(b)). So the rules should be improved.
(3) For height control, Gao assigns low fitness values to items with large height by introducing a relaxation factor in the height dimension. A shorter item is therefore placed before a taller one, which results in a small $s_{lowy}$ (see Fig. 3(a)-(b)). However, the actual test results on public datasets are not ideal. The main reason is that if the next item to be placed has a larger height, the final $s_{lowy}$ resulting from the two items may be higher, which finally increases the overall height (see Fig. 3(c)-(d)).
In fact, previous rules have only considered “single-step” item placement. In this paper, two successive items are considered together to obtain a better combination by introducing a set of width relaxation factors. The main improvements are as follows:
(1) $A.w$ represents the width of the space remaining after placing one item, and $r_{min}.w$ is the width of the smallest of the remaining unplaced items. $r_{min}.w \le A.w$ means that $r_{min}$ can also be placed here to further reduce the space waste. As Fig. 2(a)-(b) show, placing $r_1$ or $r_2$ then results in the same score. Fig. 2(c)-(d) show that a waste of width is inevitable when $r_{min}.w > A.w$. In this case, item $r_4$ with a larger width is placed in preference to $r_3$ to reduce the width waste.
(2) Since space waste is unavoidable to some extent, the wider the item to be placed, the higher its fitness value should be. In order to minimize the waste, a set of relaxation factors ($\alpha$, $\beta$, with $0 < \alpha < 1$ and $0 < \beta < 1$) is designed to assign different fitness values in multiple width intervals. The values $\alpha = 0.2$ and $\beta = 0.5$ were found to work best in a number of experiments.
(3) The height relaxation factor is abandoned to avoid an excessive height after combined placement (Fig. 3). For the two situations in which $r.w = s_{lowy}.w$, shown in Fig. 4, the fitness value of the former is set higher because it makes the space flatter, which is more conducive to placing the next item. The overlapping rules in [29] are also considered in the new rules.
The details are shown in Table 1 and Figure 4(a)-(f).
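To make the role of $r_{min}.w$ and of the relaxation factors concrete, the following Python sketch shows one possible way to turn the width considerations above into a tiered score. It is an illustration only: the tier values and the use of $\alpha$ and $\beta$ as thresholds on the wasted-width fraction $A.w / s_{lowy}.w$ are assumptions made for this example and do not reproduce the rules of Table 1; the function name `width_score` is hypothetical.

```python
from typing import Optional


def width_score(r_w: int, seg_w: int, r_min_w: int,
                alpha: float = 0.2, beta: float = 0.5) -> Optional[int]:
    """Illustrative width-based fitness (higher is better), NOT Table 1.

    r_w     -- width of the candidate item (r.w)
    seg_w   -- width of the lowest segment (s_lowy.w)
    r_min_w -- width of the narrowest unplaced item (r_min.w)
    """
    if r_w > seg_w:
        return None                # the item does not fit at all
    a_w = seg_w - r_w              # A.w: width left over after placing r
    if a_w == 0:
        return 4                   # exact fit, no width is wasted
    if r_min_w <= a_w:
        return 3                   # r_min still fits: same score for any such r
    waste = a_w / seg_w            # fraction of the segment width that is lost
    if waste <= alpha:
        return 2                   # small unavoidable waste
    if waste <= beta:
        return 1                   # medium unavoidable waste
    return 0                       # large unavoidable waste
```

With a segment of width 10 and $r_{min}.w = 3$, items of width 5 and 6 both receive the same score because $r_{min}$ still fits beside either of them, whereas an item of width 9 leaves an unusable gap and is scored by the size of that gap.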

C. REINFORCEMENT LEARNING METHODS
Reinforcement learning is based on environmental feedback mechanisms. Through constant interaction with the environment by trial and error, reinforcement learning maximizes the overall benefit of its actions. The Q-learning algorithm is one of the value-based algorithms in reinforcement learning. $Q(s, a)$ is the reward value of executing action $a$ in state $s$. The Q function is used to select the action with the maximum value in state $s$. As a tabular function in the Q-learning algorithm, the Q-table is not suitable for the 2DSPP, so the DQN is used instead of the Q-table. The principle of the DQN is to use an artificial neural network to replace the action-value function. The neural network is much more powerful than manual feature design because it has strong expressive ability and can extract features automatically. The network structure used in this paper is shown in Fig. 5.
Data pre-processing approaches are shown in Table 2. For the input data, $h_{large}$ is calculated from the height difference between $r.h$ and $h_1$, and $h_{small}$ from the height difference between $r.h$ and $h_2$. $W$ is calculated from the difference between $r.w$ and the width $s_{lowy}.w$ of the space to be filled. $h_r$ is the ratio of $r.h$ to $r_{hmax}.h$, the maximum height of the remaining items, and $w_r$ is the ratio of $r.w$ to $r_{wmax}.w$, the maximum width of the remaining items.
The DQN is used to construct the evaluation function. The experimental results show that, on most datasets, this approach performs better than sequences of items sorted by width, length, area, or perimeter. Algorithm 1 shows the details:
(1) In the first line, the item sequence is sorted for preliminary selection by an algorithm similar to the greedy selection stage in IA [39]: the skyline algorithm is run on each of the initial sequences generated from FIVE indexes, namely perimeter, area, width, length (from Scorer I) and Scorer II. The solution with the smallest height among the FIVE sequences is saved for the next stage.
(2) In the second line, the local search algorithm (LS()) proposed by Leung et al. [17] is used to find a better solution. LS() swaps two items in a given sequence in turn and runs the skyline algorithm to get a “local best” solution.
(3) In lines 3-19, the SRA is used to improve the solution further. It should be noted that line 7 combines Scorer I and Scorer II to make the decision: Scorer I is used first, and Scorer II is used only if the scores from Scorer I are equal. If the scores from Scorer II are also equal, the items are placed in order.
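The decision rule of line 7 can be read as a lexicographic comparison: Scorer I decides, Scorer II breaks ties, and the original order breaks any remaining ties. The Python sketch below illustrates this reading; `select_item`, `scorer1` and `scorer2` are hypothetical names, and the two scorers are passed in as plain functions rather than being the Table 1 rules or the trained DQN.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_item(candidates: Sequence[T],
                scorer1: Callable[[T], float],
                scorer2: Callable[[T], float]) -> int:
    """Return the index of the item to place next.

    Scorer I is compared first; Scorer II is consulted only when the
    Scorer I values are equal; if both are equal, the item that comes
    first in the sequence is kept.
    """
    best_index = 0
    best_key = (scorer1(candidates[0]), scorer2(candidates[0]))
    for i in range(1, len(candidates)):
        key = (scorer1(candidates[i]), scorer2(candidates[i]))
        # Strict comparison: an earlier item wins when both scores tie.
        if key > best_key:
            best_index, best_key = i, key
    return best_index
```

Because Python compares tuples element by element, the second score only matters when the first scores are identical, which matches the tie-breaking order described above.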

III. EXPERIMENTAL ANALYSIS
A. NETWORK MODEL TRAINING
For the training of the neural network Scorer II, the C [30], NT [31], CX [32], BWMV [13], [33], N [14], ZDF [34] and NP [35] datasets are used, together with 500 groups of data generated by the algorithm proposed by Bortfeldt and Gehring [37]. 80% of the data are randomly selected as TrainSet and 20% as TestSet. The training process is as follows:
(1) At the beginning of each training cycle, randomly select $n$ sets of data from TrainSet to form TrainSet*, the dataset used in this cycle, and then sequentially pack each set of data in TrainSet*, each of which is a sequence $R$ of rectangular items.
(2) Randomly select a number $m$ from $(0, num)$, where $num$ is the number of items in $R$. Use the skyline algorithm under Scorer II to pack $m$ items from $R$. Mark the lowest line at this point as $s_{lowy}$. Let $k = num - m$. A very small $k$ would result in low credibility of the DQN, so a threshold $t$ is set for $k$: if $k < t$, skip to the next loop. Otherwise, the $k$ remaining items are processed according to the data pre-processing method of Table 2 and recorded as $x_j$.
(3) Try to place each $x_j$ ($j = 1, \dots, k$) at $s_{lowy}$, continue the skyline algorithm, and get the final height $h_j$. Record the maximum height $h_{max}$ and the minimum height $h_{min}$.
(4) Use the equation $y_j = \frac{h_{max} - h_j}{h_{max} - h_{min}}$ to calculate $y_j$ for each $x_j$. Save all $x_j$ and $y_j$.
(5) Finally, use the saved $x_j$ as input and $y_j$ as the target value to train Scorer II. The training ends when the set time is reached or the desired effect has been achieved. The details are described in Algorithm 2.
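The target value of step (4) is a min-max normalization of the final heights: the candidate that leads to the lowest final height receives $y_j = 1$ and the one that leads to the highest receives $y_j = 0$. A minimal Python sketch is given below; the function name `training_targets` is hypothetical, and the handling of the degenerate case $h_{max} = h_{min}$, which the text does not specify, is an assumption.

```python
from typing import List


def training_targets(heights: List[float]) -> List[float]:
    """Map final heights h_j to targets y_j = (h_max - h_j) / (h_max - h_min)."""
    h_max, h_min = max(heights), min(heights)
    if h_max == h_min:
        # Assumption: when every candidate yields the same height, all are
        # treated as equally good (the text does not cover this case).
        return [1.0] * len(heights)
    return [(h_max - h) / (h_max - h_min) for h in heights]
```

For final heights of 10, 12 and 14, the targets are 1.0, 0.5 and 0.0, so a lower final height always maps to a higher score for Scorer II to learn.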

B. EXPERIMENTAL DATA AND EXPERIMENTAL ENVIRONMENT
To verify its performance, RSRA has been compared with current excellent algorithms including GRASP [15], SRA [18], IA [39] and ISH [40]. Following the method used by Leung et al. [17], the problem is subdivided into two classes: the zero-waste problem and the non-zero-waste problem. The optimal solution of the zero-waste problem is already known, while the optimal solution of the non-zero-waste problem, which generally contains some wasted space, is mostly unknown. The C, N, CX and NT datasets are used for the zero-waste problem, and the 2sp [2], [32], [36], NP, ZDF, and BWMV datasets are used for the non-zero-waste problem.

C. EXPERIMENTAL RESULT
All instances of the eight datasets and the best solutions from the RSRA can be found on the website (https://github.com/Gitlixiangdong/BestSolutios). Tables 3-10 in the Appendix report the results of the five algorithms. BestH is the best height found in one run. Ave. Gap% and Best. Gap% denote the average Gap and the best (smallest) Gap over 10 runs for each instance, respectively [18]. For the Ave. Gap%, the RSRA algorithm has the largest number of optimal solutions on 6 of the 8 datasets (C, CX, NT, 2sp, NP, BWMV), and has the largest number of optimal solutions in SUM. For the Best. Gap%, the RSRA algorithm has the largest number of optimal solutions on 5 of the 8 datasets (C, N, NP, ZDF, BWMV), and has the largest SUM number of optimal solutions among the five algorithms.

Algorithm 2 Network Model Training
Input: TrainSet
Output: Scorer II
  Randomly initialize the neural network Scorer II
  while the required run time has not elapsed and the desired result has not been reached do
    Randomly select n sets of data from TrainSet as TrainSet*
    for i ← 1 to n do
      num ← number of items in R, the i-th sequence of TrainSet*
      m ← randomint(0, num)
      Use Scorer II as the scoring rule to pack m rectangular items of R by the skyline algorithm
      k ← num − m
      if k < t then
        continue
      end if
      for j ← 1 to k do
        Put the j-th of the k remaining items into s_lowy
        Pre-process item j and the current position data as x_j
        Continue with the skyline algorithm until all rectangular items are packed and get h_j
      end for
      for j ← 1 to k do
        y_j ← (h_max − h_j) / (h_max − h_min), where h_max and h_min are the maximum and minimum of h_1, ..., h_k
        Save x_j and y_j to x, y
      end for
    end for
    Train Scorer II with the obtained x as input and y as output
  end while
  return Scorer II
The results can be analyzed further according to the size of the dataset. As shown in Fig. 6, even though the results of RSRA are much better than those of the other four algorithms, they are very close to those of GRASP and ISH. This is probably because the instances of the 2sp dataset are so small (most of them have $n \le 50$). The CX and ZDF datasets contain instances with $n \ge 10{,}000$, and on them the performance of the RSRA seems much better than that of the other four algorithms. The instances in the other five datasets (C, N, NT, NP, BWMV) mostly have $50 \le n \le 200$, and the RSRA algorithm is also effective on these datasets of medium problem size. Therefore, it can be concluded that the RSRA algorithm achieves better performance than the other four algorithms on the eight datasets, especially on the relatively large datasets.
The RSRA algorithm has also been compared with the SPSAL and IAm algorithms proposed by Rakotonirainy et al. [38] in 2020. All the instances used in these experiments are from the benchmark clusters proposed by Van Vuuren and Rakotonirainy [42]. Cluster1 consists of instances with predominantly narrow items of elongated rectangular shape, which vary widely in size. Cluster2 mainly contains large items that are predominantly homogeneous. Cluster3 mainly consists of approximately square items of different sizes, which are much smaller than the width of the strip. Cluster4 mainly contains approximately square items of the same size. Table 11 in the Appendix shows the results of the RSRA compared with the SPSAL, IAm, GRASP, SRA, IA and ISH algorithms. Since Rakotonirainy et al. [38] did not provide the source code of their algorithms, we apply the RSRA algorithm to the data set from [38], and the results of the other algorithms are taken from [38]. The table lists the mean performance ratios achieved by the various algorithms, with their ranks at a 5% level of significance shown in parentheses. A rank of 1 indicates that the algorithm achieved the smallest mean packing height over the instances in a cluster of benchmark data. Table 11 shows that the RSRA achieves the best results on all four data clusters. Even on the “complex” Cluster1, RSRA achieves a very good result. On the other three clusters, the performance of RSRA shows no statistical difference from IAm at the significance level of 5%.
Compared to IAm, the advantages of RSRA can be summarized as follows: (1) The scoring rules (Scorer I) used by the RSRA are more refined and reasonable than those of IAm. This makes it more effective at generating the best sequence of rectangular items for placement. (2) DQN is used in RSRA to obtain the evaluation function as Scorer II. The perimeter, area, width, height and the score from Scorer II are each used to run the skyline algorithm, and the solution with the smallest height is saved for the next stage. This ensures that the heuristic algorithm obtains a better initial solution, which improves the overall performance.

IV. CONCLUSION
This paper has proposed a hybrid heuristic algorithm for the 2DSPP. Firstly, the scoring rules based on the skyline algorithm have been improved to reduce space waste; the improved rules are called Scorer I. Secondly, a reinforcement learning method, called Scorer II, has been used to enhance the local search ability and reduce the number of iterations. Scorer I and Scorer II have been combined with an SRA to form the RSRA. Experiments show that, compared with the other four excellent heuristic algorithms (GRASP, SRA, IA, ISH), the RSRA achieves the best performance on eight datasets (C, N, CX, NT, 2sp, NP, ZDF, BWMV) and reduces the Ave. Gap% by 45.86%, 45.16%, 30.89% and 20.56% relative to GRASP, SRA, IA and ISH, respectively. It can be roughly concluded that the RSRA achieves better performance than the other four algorithms on the eight datasets, especially on the relatively large datasets.
Researchers have found that reinforcement learning can achieve excellent performance on combinatorial optimization problems [24], even though it also has some limitations. In terms of running time, RL seems to be more time consuming. Another problem is how to improve the generalization ability, that is, how to train on one dataset and generalize to other similar datasets. It has also been found that the performance of RL depends mainly on the structure of the neural network. In the future, we will therefore focus on the neural network structure and the training algorithm to enhance the learning ability of RL and improve the generalization of the algorithm.

See Tables 3-11.

KAI ZHU received the bachelor's degree in information and computing science from the China University of Petroleum (East China), in 2008, and the master's and Ph.D. degrees, in 2011 and 2014, respectively. He is currently a Teacher with the Qingdao University of Technology. His research interests include reinforcement learning, big data analysis, machine learning algorithms, and applications.
NAIHUA JI received the bachelor's degree in computer science and technology and the master's degree in software and theory from Yanshan University, in 1997 and 2004, respectively. He worked as an Intern with Beijing BiXi Radio and Television Technology Company Ltd., for one and a half years. He has been teaching with the Qingdao University of Technology since 2004. His research interests include reinforcement learning, integrated manufacturing, and enterprise informatization.
XIANG DONG LI is currently pursuing the master's degree with the Qingdao University of Technology. His current research interests include machine learning and combinatorial optimization algorithm.