A Text-Granulation Clustering Approach With Semantics for E-Commerce Intelligent Storage Allocation

With the rapid rise and growing popularity of e-commerce and online shopping, an exponentially growing number of orders must be picked in a timely manner. In the order-based warehouse picking process, more intelligent optimization methods are urgently needed to match the maturity of robot technology. Because warehouses hold a wide variety of SKUs (stock-keeping units), direct order-based analysis can hardly reveal strong correlations among SKUs. Since orders and SKUs are usually recorded as text, this paper proposes a clustering optimization algorithm based on text information granules that makes full use of context information and semantics. Storage assignment is determined by first clustering SKUs and then performing correlation analysis on the clustering results according to orders. The algorithm gives a semantic description of each cluster, so the clustering process is interpretable and transparent. Extensive experiments on a distribution center of a clothing production enterprise indicate a 40.5% to 57.9% improvement in the average order picking distance compared with the ABC classification.


I. INTRODUCTION
A warehouse management system is a key element of intelligent electronic commerce. The ability of robots to collect a large number of orders from a large warehouse efficiently and quickly will be one of the core competencies of an e-commerce company. The objectives of warehouse optimization include minimizing the order picking distance and maximizing the space utilization of the warehouse [1]. A common form of order content is shown in table 1; each order contains several purchased items (e.g., goods). A recent survey of warehouses indicated that more than 70% of the time was spent picking orders [2]. To reduce the distance required for order picking, researchers have proposed optimization methods based on four ideas: carefully planned picking paths, order batching, warehouse zoning, and allocation of products to appropriate storage locations [3].
In recent years, many order-based warehousing algorithms have been proposed. Chen et al. [4] proposed the ARBM method, which extracts the relationship between products by using association rules. The ARBM method can shorten the travel distance by improving the batch processing efficiency of orders. However, Li et al. [5] pointed out that when the order quantity is small, or the warehouse is small, the optimization efficiency of ARBM is reduced. In the same year, Chen and Wu [6] proposed the ARIP approach, which depends heavily on whether there is an association between the orders. Chiang et al. [7] proposed DMSA to reduce the number of stops for picking one order. DMSA is only applicable to reallocating a portion of the products [5]. Dijkstra and Roodbergen [8] proposed an optimization-based dynamic programming (DP) storage strategy to solve the multi-aisle and multi-item problem. Azadeh et al. [9] presented an algorithm based on the genetic algorithm to optimize the allocation of operators in a multi-product assembly shop. However, the full complement of robots has largely replaced human labor. Fontana and Nepomuceno [10] considered a 3D storage location-allocation model based on the ELECTRE TRI method, which realizes classified storage by defining the shelf level and shelf position of each item. Zhou et al. [11] studied a warehouse layout with V-shaped and fishbone channels, which removes the original constraint that a channel must be a straight line. Other studies that conduct correlation analysis among products for storage location assignment can be found in [12] and [13]. The above algorithms optimize the intelligent warehouse pickup process from many angles. However, they do not make further use of the semantic information of goods, so the whole process is not interpretable.
The associate editor coordinating the review of this manuscript and approving it for publication was Biju Issac.
To address this gap, this paper proposes a strategy for placing SKUs in large warehouses that carries semantic meaning.
Most of the existing algorithms perform correlation analysis on SKUs directly based on orders. As the variety of SKUs increases, the support of several SKUs appearing together in one order becomes vanishingly small, and correlation analysis cannot yield sufficient results. Meanwhile, the existing methods do not explore the semantic information of the text involved in the warehousing and picking process. Aiming at the order-based pickup problem of very large warehouses, this paper proposes a text-granulation clustering algorithm (TGCA) that determines storage locations by clustering based on text information. The multilateral affinity among the clustering results is then calculated. Because of the variety of items in warehouses, order-based correlation analysis between single items becomes ineffective. Starting from a real-life warehouse, the TGCA algorithm uses basic text information such as commodity name, size, and color to cluster commodities, which not only improves the effectiveness of the correlation analysis but also provides cluster descriptions through EI algebra, facilitating human query and management. The goal of the TGCA model is not to optimize, but to provide a way to extend algorithms limited to small warehouses to large-scale problems that are closer to actual needs.
The innovations of this paper are summarized as follows:
• A text-granulation clustering approach for intelligent storage allocation in large warehouses is proposed.
• The algorithm proposed in this paper is based on semantic learning, so the whole clustering process is interpretable.
• The use of EI algebra provides a description for each cluster of SKUs.
Based on the above motivations and ideas, the structure of the paper is as follows:
• A literature review of storage allocation approaches (section II).
• Definition of the storage assignment problem and the theories used in the TGCA algorithm, including granulation and EI algebra (section III).
• A detailed introduction of the TGCA algorithm (section IV).
• A case study of warehouse picking as well as experimental results (section V).
• Conclusions and future research plans (section VI).

II. RELATED WORK
Warehouse layout design, as an important part of the order pickup optimization process, has received wide attention from researchers. Zhang et al. [14] proposed a two-layer evolutionary algorithm to solve the problem of automatic warehouse layout design, which is suitable for large-scale warehouse logistics problems. Li et al. [5] proposed a new dynamic storage model that combines product affinity with the ABC classification. However, this study only considers the two-sided affinity among products, whereas the multilateral affinity needs further study. De Koster et al. [15] optimized the warehousing and picking problem from the perspectives of storage assignment policies and layout design to improve the overall picking speed. Yang et al. [16] proposed an algorithm for clustering commodities with constraints on storage conditions, in which principal component analysis is used to calculate the distance between clusters. Moshref-Javadi et al. [17] first clustered the goods in the warehouse and then discussed different ways to locate the goods. The clustering methods include principal component analysis, singular value decomposition, and two-step clustering. However, the turnover rate of goods is not taken into account when locating the goods in the warehouse. Van Gils et al. [18] analyzed the relationships among storage, batching, zoning, and routing and designed the warehouse as a whole to reduce order picking time. In [19], a warehouse picking algorithm combining storage allocation and travel distance estimation is proposed, which is effective for different routing strategies.
Bozer et al. [20] compared two well-known order picking systems, the mini-load system and the Kiva system, and gave the advantages and limitations of each system using a simulation model. Guo et al. [21] made a detailed comparison of four warehouse storage strategies: storage zoning, random, full turnover-based, and class-based storage. Jane et al. [22] proposed a method of measuring the similarity between commodities to construct a natural clustering model. Table 2 gives an overview of the algorithms for warehouse layout design. Different from the above algorithms, our TGCA algorithm is a clustering algorithm based on semantic learning for the placement of warehouse inventory, so that ''related'' goods can be placed closer together.

A. STORAGE PROBLEM DESCRIPTION
The core problem studied in this paper is the order-based pickup of an e-commerce platform. When the system receives customers' orders, the robots go from the warehouse entrance into the warehouse to pick up the corresponding goods according to the requirements of the orders. Robots can only pass through the existing horizontal and vertical channels in the warehouse. When a robot has collected all the goods on an order, it returns to the warehouse entrance. We suppose the warehouse has n rows of shelves, each with m storage locations, so that the whole warehouse can store m * n kinds of SKUs. The depot is located at the left of the front aisle. Figure 1 shows two typical warehouse picking diagrams for an order, with black squares representing items purchased in the order and white squares representing items not requested. Figure 1(a) shows the shelf arrangement with only vertical aisles, and figure 1(b) shows the multi-row shelf arrangement with both horizontal and vertical aisles. These are the two most common forms of shelf placement. The distance between two adjacent storage locations and the distance between two parallel aisles are denoted by r and c, respectively. We assume that the robot can pick up items from the shelves on both sides while staying in the middle of an aisle, without shifting to either side, which is a common assumption in the literature (e.g., [23], [24]). For the routing method, we consider S-shape routing, Return routing, and Midpoint routing [18], [25]-[27].
• S-shape routing: from left to right, the robot traverses the aisle where the shelves need to be picked up. If it enters the current aisle from the front, it enters the next aisle from the back, and vice versa. If the shelves on either side of the aisle do not contain the items in the order, the robot will not enter the aisle (see figure 1).
• Return routing: each time, the robot enters the aisle from the front of the shelf, picks up the farthest item in the current column, and immediately returns to the front of the shelf.
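• Midpoint routing: the shelves are divided in the middle into front and back sections. Only the first and last aisles are traversed completely, around the perimeter of the shelves; items in the other aisles are reached from the front or the back of the shelves, depending on the section in which they lie.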
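The S-shape and Return rules above can be sketched as distance calculations for the single-block layout of figure 1(a). This is only a minimal illustration under several assumptions that are not stated in the paper: aisles are indexed 0, 1, 2, ... from the depot, a pick at position p lies at depth p * r from the front aisle, a full aisle traversal has length (m + 1) * r, and an odd number of visited aisles is handled by entering and leaving the last aisle from the front. The function names are hypothetical.

```python
from collections import defaultdict


def farthest_pick_per_aisle(picks):
    """Map each visited aisle index to the deepest pick position in it."""
    deepest = defaultdict(int)
    for aisle, position in picks:
        deepest[aisle] = max(deepest[aisle], position)
    return dict(deepest)


def return_distance(picks, r=1.0, c=1.0):
    """Return routing: enter each visited aisle from the front, go to the
    farthest pick, and come back to the front aisle."""
    deepest = farthest_pick_per_aisle(picks)
    if not deepest:
        return 0.0
    horizontal = 2 * c * max(deepest)            # out and back along the front aisle
    vertical = sum(2 * r * p for p in deepest.values())
    return horizontal + vertical


def s_shape_distance(picks, m, r=1.0, c=1.0):
    """S-shape routing: traverse every visited aisle completely, alternating
    front and back entry. Assumption: with an odd number of visited aisles,
    the last one is entered and left from the front."""
    deepest = farthest_pick_per_aisle(picks)
    if not deepest:
        return 0.0
    aisles = sorted(deepest)
    aisle_length = (m + 1) * r                   # assumed length of one full traversal
    horizontal = 2 * c * aisles[-1]
    if len(aisles) % 2 == 0:
        vertical = len(aisles) * aisle_length
    else:
        vertical = (len(aisles) - 1) * aisle_length + 2 * r * deepest[aisles[-1]]
    return horizontal + vertical
```

For an order with picks at (aisle 0, position 3) and (aisle 2, position 5) in a warehouse with m = 10, the sketch gives 20 for Return routing and 26 for S-shape routing, which is consistent with Return routing being favorable when picks lie near the front aisle.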
The data used in this paper come from the historical orders and SKU warehouse of a real Chinese e-commerce company (for English orders and SKUs, the problem reduces to one without the word-cutting step). In this practical problem, each SKU has three attributes, namely the commodity name, size, and color, which together represent the ith SKU. Figure 2 illustrates an example of some SKUs. The proposed algorithm can be widely used with other e-commerce storage data, such as those of Amazon and Alibaba, and the attributes can also include brand, model, configuration, grade, pattern, packaging capacity, production date, shelf life, price, origin, etc. For different warehouses with different commodity types, our algorithm follows similar principles in selecting commodity attributes. First, attributes that are common to most goods are prioritized, since such attributes are the basis of clustering. Second, attributes with semantic information should also be given priority, such as the product name, brand, and scope of application. In this way, the semantic richness of the cluster description can be guaranteed.
Each order lists the purchased items, as shown in table 1. An order may contain a single item or several items. The purpose of the proposed algorithm is to determine the position of each SKU on the shelves so that the path for picking up the items of a batch of orders is as short as possible.

B. RELEVANT THEORIES
In this part, we give a specific description of the definition and mathematical formulation of relevant theories.

1) GRANULAR COMPUTING
Granular computing (GrC) is a newly proposed research field that covers the world view and methodology of viewing the objective world with different granularity [29]. A granule is a block formed by some entities through indistinct, similar, adjacent, or functional relations. This process of processing information is called information granulation [30]. Information granules can be defined and analyzed in many formal frameworks such as fuzzy sets [31], rough sets [32], and concept lattice [33].
The granular layer refers to all granules with identical granularity. In this paper, text granularity is used to dynamically adjust the number of items contained in the granular layer for clustering. In the proposed algorithm, the coarsest granules are the cutting results of the SKU name, color, and size attributes, such as w_1 = casual, w_2 = T-shirt, w_3 = printed, w_4 = red, w_5 = XL, w_6 = dress, etc. Finer information granules are, for example, $A_1 = w_1 w_4 w_2$ and $A_2 = w_3 w_5 w_6$, representing ''casual red t-shirts'' and ''printed XL size dresses'', respectively. The specific process will be described in detail in subsection IV-A.

2) EI ALGEBRA
The most prominent advantage of a text-based clustering algorithm is that the text information evolves during the clustering process. However, it is unwise to obtain the cluster description by simply combining all information granules of the SKUs belonging to the same cluster, because this generates a large amount of redundancy and makes the description extremely long. To obtain the most concise description of each cluster, we introduce the relevant definitions and properties of EI algebra. EI algebra was first proposed by Liu [34] to deal with fuzzy set relations. It has also achieved good results in combination with granular computing. The definitions of EI algebra are illustrated by the following example:

$$\gamma = A_1 + A_2 = w_1 w_4 w_2 + w_3 w_5 w_6,$$

where ''+'' denotes the disjunction of granules, $A_i$ is the same as in section III-B1, and γ represents ''casual red t-shirts'' or ''printed XL size dresses''. Our text granules contain clear semantic information, and the coarse granules are combined by EI algebra to obtain finer granules with richer semantic information.
EI algebra also has the following axioms (natural language axioms) [35], [36]:
• Commutative law: $w_1 w_2 = w_2 w_1$ and $A_1 + A_2 = A_2 + A_1$.
• Distributive law: $w_1 (w_2 + w_3) = w_1 w_2 + w_1 w_3$.
The strict mathematical definition of EI algebra is given as follows. Let W be a set of granules [37] and

$$EW^* = \Big\{ \sum_{i \in I} \Big( \prod_{w \in A_i} w \Big) \;\Big|\; A_i \subseteq W,\ i \in I,\ I \text{ is any non-empty indexing set} \Big\}.$$

For any $\sum_{i \in I} \big( \prod_{w \in A_i} w \big),\ \sum_{j \in J} \big( \prod_{w \in B_j} w \big) \in EW^*$,

$$\sum_{i \in I} \Big( \prod_{w \in A_i} w \Big) \vee \sum_{j \in J} \Big( \prod_{w \in B_j} w \Big) = \sum_{k \in I \sqcup J} \Big( \prod_{w \in C_k} w \Big), \tag{5}$$

$$\sum_{i \in I} \Big( \prod_{w \in A_i} w \Big) \wedge \sum_{j \in J} \Big( \prod_{w \in B_j} w \Big) = \sum_{i \in I,\ j \in J} \Big( \prod_{w \in A_i \cup B_j} w \Big), \tag{6}$$

where $I \sqcup J$ is the disjoint union of I and J, and $C_k = A_k$ if $k \in I$, $C_k = B_k$ if $k \in J$. The negation of the concept $w \in W$ is represented by $\bar{w}$, and $w \wedge \bar{w} = \varnothing$.
Relevant proofs and further derivations can be found in Kukkurainen and Paavo [39] and Liu et al. [40]. We will give concrete examples of the above two formulas in subsection IV-B.
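The disjunction and conjunction of EI algebra described above can be sketched with plain sets. This is a minimal illustration, not the authors' implementation: an element is modeled as a set of terms, each term being a set of words, so repeated terms collapse automatically. The helper names `term`, `ei`, `ei_or`, `ei_and`, and `simplify` are hypothetical, and the absorption-based `simplify` (a term is dropped when another term is a proper subset of it) is the usual way redundant terms are removed, assumed here to match the simplification used in the paper.

```python
def term(*words):
    """A conjunction of words, e.g. term("casual", "red", "T-shirt")."""
    return frozenset(words)


def ei(*terms):
    """A disjunction ("+") of conjunction terms."""
    return frozenset(terms)


def ei_or(x, y):
    """Disjunction: collect the terms of both elements."""
    return frozenset(x | y)


def ei_and(x, y):
    """Conjunction: merge every term of x with every term of y."""
    return frozenset(a | b for a in x for b in y)


def simplify(x):
    """Absorption: drop any term that strictly contains another term,
    since the shorter term already covers it."""
    return frozenset(a for a in x if not any(b < a for b in x))


A1 = term("casual", "red", "T-shirt")
A2 = term("printed", "XL", "dress")
gamma = ei_or(ei(A1), ei(A2))            # "casual red T-shirt" or "printed XL dress"
red_only = ei_and(gamma, ei(term("red")))  # both alternatives restricted to red items
```

Here `gamma` contains the two terms A1 and A2, and `red_only` contains A1 unchanged (it already includes "red") together with A2 extended by "red".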

IV. THE TEXT-GRANULATION CLUSTERING ALGORITHM FOR STORAGE ASSIGNMENT
In the following part, the order-based clustering optimization algorithm is introduced in detail. Figure 3 illustrates the main steps of the clustering method. The names, colors, and sizes of all SKUs are given as input, and the coarse-level information granules are obtained through word cutting. These coarse granules are used as the basic units of EI algebra. The lower and upper limits of the number of SKUs contained in each cluster are set to α and β, respectively, according to the size of the warehouse. The details of the above operations and the clustering process are described in subsection IV-A. The use of EI algebra to obtain the simplest description of each cluster is described in subsection IV-B. The rule for placing clusters in a warehouse is that each cluster is placed in a single row of shelves. Starting from the first row, the cluster with the highest score according to the scoring formula given in subsection IV-C is put into the corresponding row of shelves.

A. TEXT PREPROCESS AND SKUs CLUSTERING BASED ON INFORMATION GRANULES
Due to the large quantity of SKUs and the dispersion of the order content, it is difficult to extract effective commodity association rules directly between individual SKUs based on the orders. Therefore, a method that divides SKUs into heaps of roughly uniform size for correlation analysis is proposed. Clustering SKUs with the same attributes or varieties into the same category achieves this heap separation and yields comprehensible and manageable results. Thus, text information such as the name, size, and color of an SKU serves as an important basis for the clustering rules.
Define S as the set of all SKUs and let N be the total number of SKUs. For any given interval [α, β], where α and β ∈ [1, N] are positive integers, S is divided into K (K ≤ N) subsets {s_1, s_2, ..., s_K} by clustering. These sets satisfy (8), (9), and (10), where N(s_i) represents the number of SKUs in the subset s_i.
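One plausible reading of the three conditions on the clusters, assuming that they describe a partition of S whose parts respect the size interval, is sketched below. The exact form and order of (8), (9), and (10) are an assumption; the later handling of clusters smaller than α suggests that the size bound may be relaxed for leftover SKUs.

```latex
\bigcup_{i=1}^{K} s_i = S, \qquad
s_i \cap s_j = \emptyset \quad (i \neq j), \qquad
\alpha \le N(s_i) \le \beta \quad (i = 1, \dots, K).
```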
First, text preprocessing cuts the names of the SKUs with Jieba to obtain the basic granules for clustering. At this point, each $sku_{ij}$ denotes the jth word in the cutting result of the name of the ith SKU. We also rank the granules from highest to lowest according to their frequency of appearance in all SKU names. Then we cluster the SKUs with the same name-cutting result into a cluster s_i. Next, a judgment is made according to N(s_i), the number of SKUs in the new cluster; in particular, when the number is larger than β, the extra SKUs are placed at the end of the shelves in the other few clusters. Algorithm 1 gives the first-round clustering process based on name granules. The next part describes the process of obtaining the cluster description.
After the clustering results are obtained, figure 5 shows the SKUs contained in each cluster and the common words of these SKUs. Each entry in figure 5 consists of two lines, with the actual Chinese text on the top and the English translation in brackets on the bottom. This also illustrates that the method proposed in this paper is not limited by language. Next, according to the common words of each cluster, a unique description of each cluster is given through EI algebra.
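The first-round grouping by identical name-cutting results can be sketched as follows. This is a simplified illustration rather than Algorithm 1 itself: the SKU names are assumed to be already tokenized (for Chinese text, for example with Jieba), and the function names and the three-way split into accepted, too-small, and too-large groups are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def rank_words(sku_tokens):
    """Rank words from most to least frequent over all SKU names."""
    counts = Counter(w for tokens in sku_tokens.values() for w in tokens)
    return [word for word, _ in counts.most_common()]


def first_round_clusters(sku_tokens, alpha, beta):
    """Group SKUs whose names cut into the same words, then sort the
    groups by whether their size falls inside [alpha, beta]."""
    groups = defaultdict(list)
    for sku_id, tokens in sku_tokens.items():
        groups[tuple(tokens)].append(sku_id)

    accepted, too_small, too_large = {}, {}, {}
    for key, members in groups.items():
        if len(members) < alpha:
            too_small[key] = members      # candidates for leftover shelf positions
        elif len(members) > beta:
            too_large[key] = members      # needs further splitting or overflow handling
        else:
            accepted[key] = members
    return accepted, too_small, too_large


# Hypothetical, already tokenized SKU names.
sku_tokens = {
    "a1": ["printed", "T-shirt"], "a2": ["printed", "T-shirt"],
    "b1": ["dress"],
    "c1": ["casual", "pants"], "c2": ["casual", "pants"],
    "c3": ["casual", "pants"], "c4": ["casual", "pants"],
}
accepted, too_small, too_large = first_round_clusters(sku_tokens, alpha=2, beta=3)
```

With α = 2 and β = 3, the ''printed T-shirt'' group is accepted, the single ''dress'' SKU is too small, and the four ''casual pants'' SKUs exceed β.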
There are two clusters in figure 5. The common words of each cluster can be seen as a description of that cluster. However, sku_1 matches the descriptions of both s_1 and s_2: when the administrator retrieves s_2 using the description ''printed T-shirt : red'', sku_1 will also be retrieved. To solve this problem, we first denote the description of s_1 as d_1 (all SKUs retrieved by d_1 belong to s_1). Then we use d_2 to retrieve SKUs in s_1, and if it retrieves an item, the description of s_2 is updated according to (11). The update is repeated until d_i retrieves only SKUs in s_i; d_i is then simplified by the natural language axioms. In this way, every cluster gets a unique and simplified description.
Take the two clusters in figure 5 as an example. SKUs in s_2 cannot be retrieved by the description of the first cluster s_1; otherwise, they would already have been extracted into s_1 during the clustering process. The description of the first cluster s_1 defaults to d_1 = [printed personalized T-shirt]. For cluster s_2, similar to (11), d_2 is obtained by (5) and (6), where () denotes the negation of its contents and word * (word) = ∅; d_2 means ''printed red T-shirts but not personalized''.
The above procedure ensures that each cluster description retrieves only SKUs in the corresponding cluster, which provides semantic assistance to the warehousing and picking process.
The cluster descriptions obtained by applying the logical operations of EI algebra to information granules attach a textual description to the warehousing, picking, and delivery processes of e-commerce, making it more convenient for people to manage and inspect the warehouse, SKUs, and orders.
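The refinement of a cluster description against an earlier cluster can be sketched as follows. This is an illustrative stand-in for the update rule, not the paper's exact formula: a description is modeled as a pair of required words and excluded word groups, and when d_2 still retrieves an SKU of s_1, the words that d_1 requires but d_2 does not are added as one negated group. The function names and the sample SKUs are hypothetical.

```python
def retrieves(desc, words):
    """True if the SKU's word set satisfies the description: all required
    words are present and no excluded group is fully present."""
    required, excluded_groups = desc
    return required <= words and not any(g <= words for g in excluded_groups)


def refine(desc, earlier_desc, earlier_skus):
    """If desc still retrieves an SKU of an earlier cluster, negate the
    words that only the earlier description requires."""
    required, excluded_groups = desc
    distinguishing = earlier_desc[0] - required
    if distinguishing and any(retrieves(desc, w) for w in earlier_skus):
        excluded_groups = excluded_groups | {frozenset(distinguishing)}
    return (required, frozenset(excluded_groups))


# d1 = [printed personalized T-shirt]; d2 starts as [printed red T-shirt].
d1 = (frozenset({"printed", "personalized", "T-shirt"}), frozenset())
d2 = (frozenset({"printed", "red", "T-shirt"}), frozenset())

sku1 = frozenset({"printed", "personalized", "T-shirt", "red"})  # belongs to s1
sku2 = frozenset({"printed", "T-shirt", "red"})                  # belongs to s2

d2_refined = refine(d2, d1, [sku1])
```

After the update, `d2_refined` reads as ''printed red T-shirts but not personalized'': it no longer retrieves `sku1` but still retrieves `sku2`.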

C. RELEVANCE ANALYSIS AND LOCATION CRITERIA
As a simple and effective data analysis technique, correlation analysis can mine the correlations existing in large data sets and describe the regularity with which certain attributes appear together [41], [42]. Here we use the classical Apriori algorithm to extract the correlation between clusters [43]. Table 3 shows the relationship between the order requirements and the clusters s_i. After correlation analysis, the parameters f_k and f_kl are obtained; they represent the frequency with which SKUs in s_k appear in all orders and the frequency with which SKUs in s_k and s_l appear together in the same order, respectively.
The values of α and β are determined according to the length of a row of shelves in the warehouse, and all SKUs are clustered with quantities in [α, β] to ensure that each column can store one cluster of SKUs. For the type of shelves shown in figure 1(a), the cluster purchased most frequently across all orders is placed first, in the first column on the left. For the type in figure 1(b), the two clusters with the highest purchase frequency are placed in the front and back first columns on the far left. The degree G_i^k to which cluster s_k belongs to the ith column is calculated by formula 12, where N_i represents the index set of clusters already placed in the warehouse. We assume that cluster s_l has been stored in the jth column; if s_k is to be stored in the ith column, the distance d_kl between the two clusters is determined by the positions of the two columns. After the score of each cluster for column i has been calculated, the s_k with the highest score G_i^k is selected and put in the ith column (in the case of figure 1(b), the scores of the front and back shelves are calculated separately). Figure 6 shows the warehouse schematic diagram of correlation analysis according to the clustering results. Clusters whose size is less than α are placed in the remaining positions at the end of each shelf.
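The quantities f_k and f_kl can be illustrated by direct counting over orders. This sketch deliberately replaces Apriori with plain co-occurrence counting, which yields the same single and pairwise supports without any pruning; expressing the frequencies as fractions of all orders, and the names `cluster_frequencies` and `sku_to_cluster`, are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def cluster_frequencies(orders, sku_to_cluster):
    """Return (f_single, f_pair): the share of orders that touch each
    cluster and the share that touch each pair of clusters."""
    single, pair = Counter(), Counter()
    for order in orders:
        # Each cluster counts once per order, however many of its SKUs appear.
        clusters = sorted({sku_to_cluster[s] for s in order if s in sku_to_cluster})
        single.update(clusters)
        pair.update(combinations(clusters, 2))   # keys are (smaller, larger)
    n = len(orders) or 1                          # avoid division by zero
    f_single = {k: v / n for k, v in single.items()}
    f_pair = {p: v / n for p, v in pair.items()}
    return f_single, f_pair


# Hypothetical orders and cluster assignment.
orders = [["a", "b"], ["a"], ["c", "a"]]
sku_to_cluster = {"a": 1, "b": 2, "c": 2}
f_single, f_pair = cluster_frequencies(orders, sku_to_cluster)
```

In this toy data, cluster 1 appears in every order, while cluster 2 and the pair (1, 2) each appear in two of the three orders.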
Formula 12 is composed of two parts. In the first part, a larger value of f_k indicates that the items of the cluster are purchased more often and should be placed on shelves closer to the depot. The second part measures the relevance of the current cluster to the SKUs already stored in the warehouse and takes into account the effect of shelf distance. Cluster-based commodity storage methods also include within-aisle, across-aisle, and diagonal storage, which are introduced in [21], [28]. We will further analyze the advantages and disadvantages of each method in combination with the experimental results in the comparative experiment.
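The greedy, column-by-column placement described above can be sketched as follows. The score used here is only a hypothetical stand-in with the same two ingredients as formula 12: the purchase frequency f_k plus the pairwise frequencies f_kl with clusters already placed, each divided by an assumed column distance c * |i - j|. The exact form of formula 12 and of d_kl may differ, and the function name is illustrative.

```python
def greedy_placement(f_single, f_pair, n_columns, c=1.0):
    """Fill columns 0, 1, 2, ... by repeatedly choosing the unplaced
    cluster with the highest score for the current column."""

    def pair_freq(k, l):
        return f_pair.get((min(k, l), max(k, l)), 0.0)

    remaining = set(f_single)
    placed = {}                                   # cluster -> column index
    for column in range(n_columns):
        if not remaining:
            break

        def score(k):
            # Assumed score: frequency plus distance-discounted affinity
            # to every cluster that already has a column.
            affinity = sum(pair_freq(k, l) / (c * (column - j))
                           for l, j in placed.items())
            return f_single[k] + affinity

        best = max(sorted(remaining), key=score)  # sorted for stable tie-breaks
        placed[best] = column
        remaining.remove(best)
    return placed


# Hypothetical frequencies: cluster 3 is bought slightly more often than
# cluster 2, but cluster 2 co-occurs strongly with cluster 1.
f_single = {1: 0.5, 2: 0.3, 3: 0.35}
f_pair = {(1, 2): 0.25, (1, 3): 0.05}
layout = greedy_placement(f_single, f_pair, n_columns=3)
```

Cluster 1 takes the first column because it has the highest frequency; cluster 2 then overtakes cluster 3 for the second column because of its strong affinity with cluster 1, which is the effect the second part of the score is meant to capture.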

V. EXPERIMENTAL SETUP

A. COMPARATIVE EXPERIMENTAL RESULTS
The actual data used in this paper comprise 11,768 SKUs and 26,900 orders. Each order contains an average of 2 items, together with the name, color, and size of each item purchased. In the experiment, we set r = 1 and c = 1, indicating that the spacings between storage locations and between aisles are 1 m. We adopt the more common S-shape routing method as the basic routing method for order pickup. We also combine our TGCA algorithm with Return routing and Midpoint routing for comparison experiments. For cluster-based storage placement, in addition to using formula 12, we also compare it with the Across-aisle and Within-aisle methods.
To test the effectiveness of the proposed method on warehouses of different scales, the following experiments were conducted. We designed four warehouses with sizes ranging from small to large, listed in table 4, to evaluate the universality of the algorithm from multiple perspectives. Warehouses 1 and 2 are smaller warehouses, while warehouses 3 and 4 are medium and large warehouses. Each robot completes 10 orders at a time and then returns to the depot. Table 5 reports the results of three different algorithms and of our TGCA algorithm with three different routing methods under different parameters. For the TGCA algorithm, the values of α and β are given in parentheses following the warehouse name. Values in Table 5 represent the average distance the robot needs to travel for each order. Experimental results show that, compared with the ABC classification algorithm, ARBM, and PCA-cluster, the TGCA algorithm has the shortest pickup distance for warehouses of all scales. Meanwhile, different routing methods also influence the experimental results. In the four warehouses of different sizes, Return routing achieves the shortest distances. Table 6 shows the results of the proposed TGCA algorithm with different cluster assignment methods. All of these experiments adopt S-shape routing. Experimental results show that the Within-aisle method has the best effect.

B. PARAMETER SELECTION
The parameters discussed experimentally are α and β, two important clustering parameters that determine the structure of the warehouse. Table 7 illustrates the influence of the values of α and β on the effectiveness of the proposed TGCA algorithm (this experiment is based on the shelf placement in Figure 1(b)). The experiment is based on warehouse 4, with a total of 23,420 orders containing 10,000 SKUs. The bold number represents the minimum average distance traveled to pick an order. Experimental results show that selecting appropriate values of α and β can effectively reduce the picking distance. This paper only discusses values of α and β that are multiples of 10; a finer-grained choice may give a better result. For the actual data used in this paper, the best results are obtained when α = 40 and β = 60, which means that the size of each cluster lies in [40, 60]. In the experimental results, the minimum distance of 120.65 is nearly 10 meters less than the maximum distance of 130.26, which means that the robot can save nearly 10 meters for each order. This also indicates that adjusting the values of α and β can effectively improve the results of the algorithm. When the number of orders is very large, the cost savings are considerable. On the other hand, even the least favorable result of the TGCA algorithm reduces the distance by 36.15% compared with the ABC classification algorithm, which shows the validity of the TGCA algorithm.
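As a rough check of the scale of these savings, the per-order difference can be multiplied by the number of orders in this experiment. The total below assumes that every one of the 23,420 orders saves the average difference between the worst and best parameter settings, which is a simplification.

```latex
\Delta = 130.26 - 120.65 = 9.61\ \text{m per order}, \qquad
9.61\ \text{m} \times 23{,}420 \approx 225{,}066\ \text{m} \approx 225\ \text{km}.
```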

VI. CONCLUSION
With big data now ubiquitous, this paper proposes a strategy based on text analysis to reduce the cost of picking orders. The proposed TGCA method clusters SKUs with the same information granules into the same cluster for further order-based correlation analysis. Meanwhile, EI algebra is used to give a unique description of each cluster. We define formula 12 to determine the placement of each cluster. It further extends association analysis in that the location of each cluster is determined by the locations of the other clusters.
The context-based clustering method can be widely used in various industrial activities. Attaching text descriptions to an industrial process gives the process semantic information, which makes the whole activity easier for people to understand and manage and thereby improves efficiency.