Consolidating Industrial Small Files Using Robust Graph Clustering

Small file management is widely encountered in industrial settings. Consolidating small files can improve the performance of the data management system. Many existing consolidation solutions fail to realize the importance of a proper consolidation schema and therefore rely on very primitive and ineffective schemas. In this article, we focus on proposing an effective and robust consolidation schema. Unlike most existing solutions, which only consider the historical workload, we address the issue of workload uncertainty and propose a graph-clustering-based solution that is more robust to future workload uncertainty. To do this, we introduce robust optimization, a mathematical model that provides theoretical support for handling uncertainty. We then demonstrate that the robustness of the consolidation schema can be achieved using a graph clustering algorithm with a duplication mechanism. Since duplication leads to data redundancy, we propose a parameter to control the redundancy and the robustness of the schema, together with two algorithms that estimate this parameter automatically. Experimental results on both synthetic and real-life data sets show the effectiveness of our algorithm.


I. INTRODUCTION
Nowadays, the management of a large number of small files is widely encountered in industry. For example, a wind turbine generates millions of sensor records every day [1]. These data are stored as thousands of small log files whose sizes vary between 10 KB and 50 MB. These files need to be stored in a storage platform so that data scientists can analyze them to keep track of the condition of the turbines.
Most storage platforms are not designed for small file storage. Take HDFS for example: storing these small files directly into the system will cause problems such as high namenode memory consumption and low file transmission rates [2], [3]. To properly manage the small files, a widely used solution is to consolidate multiple small files into a large data block [4]-[7]. However, once the small files are consolidated, accessing one of the small files within a data block requires accessing the whole data block. This means that the consolidation schema of the small files will have a long-term effect on the performance of data access in the future.

A. Example
Fig. 1 demonstrates the scenario of managing six small log files. The metadata of these files is displayed in the table. Analysts can locate the files they want using these metadata records. In industrial applications, there are mainly two types of analytical workloads: the certain workload and the uncertain workload. The certain workload refers to queries that will be executed routinely, such as going through a batch of files to monitor the wind condition within a certain time period (Q1 and Q2 in Fig. 1). This workload is certain since we know that it will definitely be executed in the future. The uncertain workload refers to queries that are not routinely executed. For example, when the analysts find a potentially abnormal record, they need to find out whether this abnormality is caused by an accidental sensor bug or by a malfunction of the turbine components. To do this, the analysts need to find other data records showing similar symptoms using various query predicates. For example, if the turbine shows abnormal readings on power output, then the analysts may need to find similar log files using predicates on the turbine's power output (Q4). Other abnormalities may involve queries on wind speed (Q3) or queries using multiple predicates (Q5). This workload is uncertain since a) the analysts cannot predict what kind of abnormality will appear in the future and b) there are too many potential query patterns to enumerate. Nevertheless, the importance of the uncertain workload should not be ignored, since failing to discover the cause of a malfunction in time could lead to catastrophic results such as burning out the motor or even breaking the whole turbine.
Generating a consolidation schema for these files is not easy. In a small file management scenario, the files are consolidated before the actual workload is executed (we explain this process in detail in Section II). Therefore, apart from the certain workload, we do not know which uncertain queries will be executed in the future. Meanwhile, the performance of the future workload depends on the consolidation schema. For example, suppose we want to consolidate the files in Fig. 1 using data blocks that can hold at most three small files. The certain workload clearly indicates that we should consolidate files 001 to 003 and 004 to 006 accordingly. However, this schema will impact the performance of some of the uncertain queries (Q4 and Q5). On the other hand, if we take all the possible uncertain queries into consideration, we will find that there is no simple schema that can satisfy both the certain and the uncertain workload.

B. Motivation
The example above shows the dilemma of the small file management problem caused by the uncertainty of workloads. If we only consider the certain workload, the schema will overfit the certain workload and impact the performance of the uncertain workload. If we try to enumerate all the possible queries within the uncertain workload, the final workload will be so complicated that we may never find an optimal consolidation schema for it.
To deal with the uncertainty issue, a commonly used assumption is that the future workload is very similar to the historical workload. Based on this assumption, many machine-learning-based algorithms have been proposed [8]-[10]. The problem with this assumption is that the generated schema is likely to overfit the historical workload, thus impacting the performance of the uncertain queries, especially those that are not included in the historical workload.
In this article, we aim to design a robust consolidation schema that provides decent performance for both the certain and the uncertain workloads. To achieve this goal, we introduce robust optimization (denoted as RO henceforth), an optimization model that aims to improve the robustness of an optimization plan. The objective of RO is to find a schema whose worst-case performance within a given uncertainty range is optimized, rather than optimizing towards a fixed input. Therefore, a consolidation schema that meets the RO model will provide decent performance for the uncertain queries, even if they are not included in the historical workload.
The RO model is merely theoretical guidance; the real challenge is to design an algorithm based on this theory. From the example, we can conclude that a consolidation schema should depend on the co-access pattern of the small files. A graph model is an ideal choice for describing this kind of pattern. In this way, the traditional small file consolidation problem can be converted into a clustering problem on a graph, with clusters representing consolidated data blocks. In order to generate a robust consolidation schema, we further introduce a duplication mechanism. By duplicating a few nodes during the clustering process, the robustness of the schema can be greatly improved.

C. Our Contributions
The main contributions of this paper can be summarized as follows.
a) We study the industrial small file management problem. This problem is rarely studied, yet very important for industrial data management. In Sections I and II we point out that most of the existing solutions fail to consider the uncertainty of the workload, which limits their effectiveness.
b) We propose an algorithm capable of generating a consolidation schema that is robust to uncertain workloads. This objective is achieved by introducing robust optimization, a mathematical model designed for generating more robust solutions to optimization problems. In Sections III and IV we introduce the principle of our algorithm and the features that enhance its robustness.
c) We demonstrate the usability and effectiveness of our algorithm. In Section V we introduce several implementation-related techniques and the architecture of an exemplar system. In Section VI we verify the effectiveness of our algorithm using various scenarios that are motivated by real-life applications.

II. PRELIMINARIES
Since the solution to the small file management problem involves knowledge across various domains, in this section we provide the necessary background information and summarize the related work. Notation is provided in Table I.

A. Small File Management
A typical small file management scenario is demonstrated in Fig. 2. When uploading a batch of small files, the first step is metadata extraction (step 1). A typical turbine log file is composed of two parts (demonstrated in Fig. 3): the metadata and the sensor records. The metadata (in the red rectangle) is located at the head of the file and contains the descriptive information of the file. The system extracts the metadata records and stores them in a relational database such as MySQL. The consolidation algorithms use the metadata information to generate consolidation schemas (step 2). A consolidation schema is a list of key-value structures, with the key representing the name of a small file and the value representing the name of the data block containing the small file. After the consolidation schema is generated, the small files will be consolidated into data blocks using the schema (step 3). The data blocks will be uploaded to the storage platform, such as HDFS (step 4). At this point, the uploading process of the small files is complete.
When an analyst performs a query, the query will first be sent to the metadata database (step 5). The database will return a list of small files whose metadata information satisfies the query predicates. Since the small files are consolidated, the system will refer to the consolidation schema (step 6) to locate all the data blocks that contain the requested small files (step 7). Then, these data blocks will be downloaded from the platform (step 8) and the requested small files can be extracted (step 9).
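To make the lookup path concrete, the following minimal Python sketch (with hypothetical file and block names) shows how a consolidation schema, stored as a file-to-block mapping, determines which data blocks a query must download in steps 6-8; it is an illustration of the data structures involved, not the actual system code.

```python
# Minimal sketch of the lookup path (steps 5-9); names are illustrative.
from typing import Dict, List, Set

# A consolidation schema maps each small-file name to the data block holding it.
schema: Dict[str, str] = {
    "file_001": "block_A", "file_002": "block_A", "file_003": "block_A",
    "file_004": "block_B", "file_005": "block_B", "file_006": "block_B",
}

def blocks_for_query(requested_files: List[str], schema: Dict[str, str]) -> Set[str]:
    """Return the set of data blocks that must be downloaded for one query."""
    return {schema[f] for f in requested_files if f in schema}

# A query returned by the metadata database (step 5) is just a list of file names.
print(blocks_for_query(["file_002", "file_005"], schema))  # {'block_A', 'block_B'}
```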
From the scenario above we can see the importance of the consolidation schema: it directly influences the time consumption of the downloading process (step 8). In practice, querying the database and scanning the consolidation schema only take a few seconds, while downloading the requested data blocks might take several minutes or even hours (this phenomenon can be observed in Section VI).
Unfortunately, most of the existing small file management approaches fail to realize the importance of the uncertain workload, and therefore they use very primitive consolidation schemas such as a sequential merge [11]. More advanced solutions consolidate small files using the distinctive features of the files, such as the geographical location for geo-data [12] or the type of illness for smart health systems [13], [14]. These consolidation schemas can support only a few dominant query patterns. The most promising solutions use machine-learning algorithms and train the schema by analyzing the historical workload [8]-[10], [15], [16]. These approaches perform relatively well if the future workload remains unchanged, yet they are vulnerable to uncertain queries.

B. Graph Clustering
The graph model is well known for representing the connections between entities, and it is widely used to solve data management problems such as data partitioning [17] and load balancing [18]. Inspired by these articles, we adopt the graph model to solve the small file management problem. By turning small files into nodes and data access patterns into edges, the consolidation problem naturally becomes a graph clustering problem.
Graph clustering is a classical machine-learning problem that has been studied for decades. We roughly summarize the existing graph clustering approaches as follows.
Hierarchical clustering is one of the most intuitive solutions to the graph clustering problem. The principle of this solution is that it recognizes clusters by constructing a dendrogram where the graph nodes are represented using leaves and tree nodes. The related approaches can be further categorized into two types: the divisive approaches and the agglomerative approaches. The divisive approaches start from the whole graph and gradually split it into smaller clusters using edge cuts or hyperplanes until the clusters are formed [19]-[22]. The agglomerative approaches build clusters by iteratively merging the most correlated nodes or subgraphs until the clusters are formed [23]-[25]. A common challenge these approaches face is the efficiency issue, since analyzing a large-scale graph node by node is a time-consuming process. Therefore, there are also many approaches that focus on accelerating the hierarchical clustering algorithms [26], [27].
Other clustering approaches accelerate the clustering process using embedding. An embedding-based approach projects the original graph into a low-dimensional space. During this process, closely connected graph nodes will also be located closer to each other, forming distinctive clusters. This feature makes the embedding-based approaches more efficient and sometimes more effective than the hierarchical clustering approaches. The representative approaches include spectral clustering [28]-[31] and random walk clustering [32]-[34].
Modern graph clustering algorithms need to process very complicated graph models such as attributed graphs or social networks. The complexity of these graphs makes traditional graph clustering approaches ineffective. To solve this problem, more advanced learning models have been proposed. The commonly used models include deep learning [35], [36], inductive learning [37]-[39] and evolutionary computing [40]-[42]. These approaches are more effective than the traditional graph clustering approaches, especially on complicated graph models.

C. Robust Optimization
Most of the existing machine-learning algorithms train their models using a static input and try to optimize an objective function based on this input. This methodology could lead to the trained model overfitting the input data, especially the dominant classes within the data. To overcome this drawback, a widely used technique is to adjust the weights of either the data records or the mathematical model of the algorithm [43]-[45]. This technique is effective for enhancing the robustness of the algorithm with respect to the input data.
However, in some scenarios, the data to be managed in the future may contain records whose features are not present in the training data (such as the uncertain queries in Fig. 1). In this case, the trained model can hardly remain robust, especially when facing these records. To solve this problem, robust optimization is introduced [46]. Robust optimization is a mathematical model that deals with uncertainties in the field of optimization theory. We use an example to illustrate how to use RO to solve the uncertainty issue in the field of small file management. Fig. 4 demonstrates a highly abstract example of three different consolidation schemas and their costs. The x-axis represents different workloads and the y-axis represents the cost of the consolidation schemas. Given a workload, say $L$, the traditional approaches will try to find a schema that minimizes the cost at workload $L$; in this case, schema 1 will be selected. However, the problem with schema 1 is that it only performs well when the workload is exactly at $L$ due to overfitting, and its performance degrades significantly when the workload shifts. When the workload shifts to $L \pm R$, the cost of schema 1 becomes the worst among all the schemas while schema 2 becomes the best schema. If the workload shifts further to $L \pm R'$, then schema 3 becomes the best schema. These observations indicate that schemas 2 and 3 are more robust compared to schema 1, since their costs remain reasonably low when the workload shifts. In order to generate a more robust schema, RO proposes a different optimization target: instead of finding an optimal solution towards a fixed input, it tries to find a solution whose worst performance within an uncertainty range is minimized. Clearly, the choice of the schema depends on the uncertainty range. For example, if the range is set to $L \pm R$, then the schema that satisfies the objective of RO is schema 2; if the range expands to $L \pm R'$, then the chosen schema should be schema 3.
Robust optimization has drawn much attention recently. Many articles use RO to solve uncertainty issues in different areas, including data management [46], intelligent control [47], distributed communication protocols [48], energy planning [49], etc. A common feature these articles share is that they all adopt the RO model to increase the robustness of their solutions against uncertainties. However, since these approaches cover a wide range of research fields, it is very hard to find a universal algorithm suitable for all the aforementioned problems. Instead, different articles propose different algorithms based on the RO model. Therefore, in this paper, we also propose an RO-based algorithm for the small file management problem.

III. PRIMITIVE SOLUTION
Since our algorithm is based on a graph model, in this section we explain how to construct a graph for the small file management problem. In addition, we also propose an algorithm that generates a consolidation schema without considering the RO model. We denote this solution as the primitive solution. A primitive solution is essential since a) it can be used for demonstrating the principles of the graph model and the hierarchical clustering algorithm and b) the implementation of the RO-based clustering depends on a primitive solution.

A. Problem Definition
We consider a workload $L$ as a group of queries $L = \{Q_1, Q_2, \ldots\}$. We denote a collection of small files as $C$. A query is a request for a list of small files: $Q = \{f_1, f_2, f_3, \ldots, f_i\}$, where $f_i \in C$ stands for the name or path of the $i$th requested file. Notice that the query $Q$ defined here and the query in Fig. 1 are the same query in different forms: the queries displayed in Fig. 1 are SQL-like commands and $Q$ is the result of executing such a command on the metadata database (step 5 in Fig. 2).
Given a consolidation schema $S$, the cost of a query $Q$ can be calculated by counting the number of data blocks accessed when executing $Q$. If all the files requested by $Q$ are consolidated within a single data block, then the cost of the query is 1. If every file is located in a different data block, then the cost of a query $Q$ requesting $i$ files will be $i$. Since the cost of $Q$ depends on the consolidation schema $S$, we denote it as $Cost_q(Q, S)$. In this way, we can calculate the cost of a workload $L$ via

$$Cost_l(L, S) = \sum_{Q \in L} Cost_q(Q, S).$$

With the cost function defined, we can formally define the small file consolidation problem.
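The sketch below implements these two cost functions over the same dictionary-based schema used in the earlier sketch; it is a simplified illustration under the stated assumptions (every requested file appears in the schema), not the paper's implementation.

```python
# Minimal sketch of Cost_q and Cost_l over a dict-based schema.
from typing import Dict, List

def cost_q(query: List[str], schema: Dict[str, str]) -> int:
    """Cost_q(Q, S): number of distinct data blocks touched by one query."""
    return len({schema[f] for f in query})

def cost_l(workload: List[List[str]], schema: Dict[str, str]) -> int:
    """Cost_l(L, S): sum of per-query costs over the whole workload."""
    return sum(cost_q(q, schema) for q in workload)

# Example: with files 001-003 in one block and 004-006 in another,
# a query spanning both blocks costs 2.
schema = {f"00{i}": ("block_1" if i <= 3 else "block_2") for i in range(1, 7)}
print(cost_q(["001", "004"], schema))                            # 2
print(cost_l([["001", "002"], ["004", "005", "006"]], schema))   # 1 + 1 = 2
```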
Definition 1 (Small file consolidation): Given a workload $L$, a small file collection $C = \{f_1, f_2, \ldots, f_n\}$ and a block size threshold $T$, the small file consolidation problem aims to find a consolidation schema $S = \{b_1, b_2, \ldots, b_k\}$, where $b_i \subseteq C$ represents a data block, such that when $L$ is executed, the cost of $L$ is minimized. The block size threshold is the maximum size of each data block. Take HDFS for example: the maximum size of each data block can be 128 MB, 256 MB, or even higher depending on the user configuration.
The cost function is valid only when all the data blocks are fully utilized. How to fully utilize the space of each data block is also one of the challenges in these data storage systems [50], [51]. However, in the small file consolidation problem, the utilization of each data block is not a major concern, since we can use the small files to fill up the unused space within the data blocks.
Without any constraint, the small file consolidation problem is unsolvable, since the workload $L$ represents the future workload $L_f$, which is uncertain. To make the problem solvable, a widely used assumption is that the future workload will be identical to the historical workload, denoted as $L_h$. In this case, the problem can be redefined as

$$\min_{S} \; Cost_l(L_h, S). \qquad (1)$$

Equation (1) is the optimization target of the primitive small file consolidation problem. In this article, we solve this problem by converting it into a graph clustering problem.

B. Graph Model
We start building the graph model by creating an empty graph $G$. Then, for each file $f_i$ in a new query $Q$, we try to add a new node $v_i$ corresponding to $f_i$ to the graph $G$. Each new node has an initial weight of 1. If graph $G$ already contains a node with the same label $v_i$, we increase the node's weight by 1 instead of creating a new node. When all the files in the query have been added to the graph, we start building edges. For each pair of files in the query, we try to add an undirected edge between the corresponding nodes in the graph. The weight of a new edge is also set to 1. Similarly, if the edge already exists, we increase its weight by 1 instead of creating a new edge. Fig. 5(a) demonstrates the query graph created using the workload in Fig. 1. The numbers on the edges and in the nodes represent their weights accordingly.
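The following sketch builds the query graph just described from a workload, using plain dictionaries for node and edge weights; the paper does not prescribe a particular data structure, so this representation is an assumption made for illustration.

```python
# Sketch of query-graph construction from a workload (list of queries,
# each query being a list of file names).
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

def build_query_graph(workload: List[List[str]]):
    node_w: Dict[str, int] = defaultdict(int)               # node weights
    edge_w: Dict[Tuple[str, str], int] = defaultdict(int)   # undirected edge weights
    for query in workload:
        for f in query:
            node_w[f] += 1                     # new node starts at 1, else +1
        for a, b in combinations(sorted(set(query)), 2):
            edge_w[(a, b)] += 1                # new edge starts at 1, else +1
    return node_w, edge_w
```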
In the graph model, a query involving $n$ files results in $\sum_{i=1}^{n-1} i$ edges being added to the graph. A workload involving thousands of files could result in millions of edges and nodes. Still, a graph of this scale is small enough to fit into main memory [18].
By converting a workload into a graph model, the small file consolidation problem naturally becomes a graph clustering problem. Each cluster represents a data block and the nodes within the cluster represent the small files within the block. To solve a clustering problem, we must formalize its cost function. However, the calculation of the cost function involves an important merge operation, which will be introduced in the following subsection. Therefore, we define the cost function first and explain it in detail later.
Definition 2 (Cost function on graph): Given a workload $L$ on a collection of objects $C$, let $G_L$ be the query graph generated by $L$ on $C$ and $S$ be a consolidation schema on $G_L$. The cost function of such a consolidation schema is denoted as $Cost_g(G_L, S)$.

C. Node Merging
The node merging operation merges two nodes into a new node. The new node has a new weight and new edges connecting it to the other nodes. This operation simulates the result of consolidating two small files into a single file. The weight of the new node represents how many times the new file would be accessed if the same workload were executed. The weights of the edges represent the co-access relationship between the new file and the other files.
We use an example to demonstrate this operation. Fig. 5 demonstrates a typical node merging process. When merging two nodes $v_1$ and $v_2$, the weight of the new node should be $w(v') = w(v_1) + w(v_2) - w(v_1, v_2)$, where $w(v_1, v_2)$ stands for the weight of the edge between them. The reason is that $w(v_1, v_2)$ represents how many times these two nodes are co-accessed. If we consolidate these two nodes together, each co-access of both nodes can be completed with only one access. Therefore, we need to eliminate the extra access count, which is recorded in the edge weight.
When two nodes are merged, we need to rewire the adjacent edges and recalculate their weights. Specifically, when $v_1$ and $v_2$ are merged, any node $v$ that was connected to either $v_1$ or $v_2$ must have a new edge connecting it to the new node $v'$. The weight of node $v$ remains unchanged, but the weight of the new edge $w(v, v')$ needs to be recalculated. If $v$ was connected to only one of the two nodes, the new edge weight remains unchanged. If $v$ was connected to both nodes, then the calculation of the new edge weight is more complicated, since the old edges could have been built by either independent queries (like $\{001, 003\}$ and $\{002, 003\}$) or a combined query (like $\{001, 002, 003\}$). In the former case, both of the old edge weights need to be added to the new edge. In the latter case, we only need to count the edge weight once. In a normal query graph, the old edge weights are a hybrid of both cases. To simplify the calculation, we estimate the new edge weight using the dominant edge weight of the old edges: $w(v, v') = \max(w(v, v_1), w(v, v_2))$. With the merge operation defined, the cost of a clustering schema on a graph becomes clear. Given a clustering schema, we can iteratively merge the nodes within each cluster until the whole cluster is merged into a single node. This node represents the data block that all the nodes within the partition are consolidated into. In this case, we can calculate $Cost_g(G_L, S)$ by adding up the weights of the remaining nodes after the merging process is completed. Fig. 5(c) demonstrates the result of merging the original graph into two nodes. In this case, the overall cost of the final graph is 22.
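A compact sketch of the merge operation follows, using the dictionaries from the previous sketch. The merged node weight removes the double-counted co-access, and a rewired edge takes the dominant (maximum) of the two old edge weights as estimated above; the function name and node naming are illustrative assumptions.

```python
def merge_nodes(node_w, edge_w, v1, v2, new_name):
    """Merge v1 and v2 into new_name, updating node and edge weights in place."""
    key = tuple(sorted((v1, v2)))
    # w(v') = w(v1) + w(v2) - w(v1, v2): the co-access count is paid only once
    node_w[new_name] = node_w.pop(v1) + node_w.pop(v2) - edge_w.pop(key, 0)
    rewired = {}
    for (a, b), w in list(edge_w.items()):
        if v1 in (a, b) or v2 in (a, b):
            other = b if a in (v1, v2) else a
            k = tuple(sorted((other, new_name)))
            rewired[k] = max(rewired.get(k, 0), w)   # dominant old edge weight
            del edge_w[(a, b)]
    edge_w.update(rewired)
```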

D. Hierarchical Clustering
While there are many alternatives for solving a graph clustering problem, we believe that a hierarchical clustering algorithm is more suitable for the small file management problem, for the following reasons. a) In traditional clustering problems, the quality of a cluster is guaranteed if most of the nodes are assigned correctly. However, in the small file management problem, the quality of a data block is also closely related to the edge nodes of the cluster (we will discuss this property thoroughly in Section IV). This feature limits the utility of the embedding-based algorithms, since the properties of the edge nodes may not be well preserved by the embedding model. b) While a deep-learning model might be able to generate a more effective consolidation schema, it also introduces new challenges such as the choice of a suitable neural network model and the number of layers and neurons. Besides, the efficiency of training a deep-learning model is also debatable. c) Compared with the previous two alternatives, a hierarchical clustering model is more likely to generate a high-quality consolidation schema by carefully analyzing the correlations between all nodes. As for the efficiency issue, we will introduce the techniques we use for processing very large data sets in Section V.
Definition 2 leads to a natural hierarchical solution. We can iteratively merge the two nodes connected by the heaviest edge so that the reduction of $Cost_g(G_L, S)$ is maximized. Notice that each cluster has a natural upper bound: the block size threshold $T$. The size of a node is different from the weight of a node: the size of a node is the sum of the sizes of all the files this node represents. So, if merging two nodes would result in a node whose size exceeds the block size threshold $T$, we remove the edge between these two nodes instead of merging them. The node merging process is repeated until no more merge operations can be performed. Eventually, the original graph becomes a few isolated nodes with no edges, each representing a data block. At this point, the consolidation schema can be generated according to the clustering result. The framework of this algorithm is displayed in Algorithm 1.
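The greedy loop below is a condensed Python sketch of this procedure (whose full framework is given in Algorithm 1). It reuses build_query_graph and merge_nodes from the earlier sketches; file_size maps each file to its size in the same unit as the threshold T, and all names are illustrative.

```python
def hierarchical_consolidation(workload, file_size, T):
    """Primitive solution: greedily merge the heaviest edge until none remain."""
    node_w, edge_w = build_query_graph(workload)
    members = {f: [f] for f in node_w}            # files represented by each node
    size = {f: file_size[f] for f in node_w}      # current size of each node
    while edge_w:
        (a, b), _ = max(edge_w.items(), key=lambda kv: kv[1])  # heaviest edge
        if size[a] + size[b] > T:
            del edge_w[(a, b)]                    # would exceed block size: cut edge
            continue
        new = a + "+" + b
        merge_nodes(node_w, edge_w, a, b, new)
        members[new] = members.pop(a) + members.pop(b)
        size[new] = size.pop(a) + size.pop(b)
    # each remaining node is one data block of the consolidation schema
    return [files for files in members.values()]
```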

E. Time Complexity Analysis
Consider a graph $G = (V, E)$ with $|V|$ nodes and $|E|$ edges. In the graph clustering algorithm, we need to iteratively find the edge with the heaviest weight. Besides, for each pair of nodes merged, we need to recalculate the adjacent edges. We assume that each time two nodes are merged, $k$ edges need to be recalculated on average. Then, the time complexity of our hierarchical clustering algorithm is $O((|E| + k) \cdot |V|)$.

IV. ROBUST OPTIMIZATION
In this section, we introduce how to deal with the workload uncertainty issue using robust optimization. First, we formulate the RO problem on graphs; then, we use an example to demonstrate that introducing duplication can improve the robustness of a consolidation schema; finally, we propose an efficient solution to the robust clustering problem.

A. Problem Formulation
In Section II-C we introduced the basic principle of RO. The solution to an RO problem is decided by a key factor, the uncertainty degree $R$. It is a variable within the range $[0, 1]$ indicating the proportion of the workload that may change in the future. $R = 0$ indicates that the future workload is identical to the historical workload, while $R = 1$ indicates that the future workload is completely untraceable. Notice that $R = 1$ does not necessarily mean that all future queries must be different from the historical ones; it only means that we do not know what the future workload will be like.
In real life, users may set the value of $R$ based on their intuition. For example, if a user is certain that the workload is always the same, they can set $R$ to 0. If their jobs involve some random queries that do not occur very often, they may set $R$ to a small value. However, setting $R$ in this way is not always practical, since the users may not have a full picture of the difference between the future workload and the historical one. For this case, we also propose two algorithms that automatically select a proper value of $R$ in Section V. In this section, we presume that $R$ has been set.
Given an uncertainty degree $R$ and a workload $L$, our next task is to define what an uncertainty range $L \pm R$ stands for. In our method, the edges in the graph represent the queries; therefore, we can simulate the change of workload by changing the edges in the query graph. To do this, we must first introduce a new concept, the edge operation on a query graph.
Definition 3 (Edge Operation): An edge operation applies one of the following operations to a random edge in a graph: REMOVE or ADD. The REMOVE operation either reduces an edge's weight by 1 or removes an edge whose weight is 1. The ADD operation either increases an edge's weight by 1 or adds a new edge with weight 1 to the graph.
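On the dictionary-based query graph used earlier, the two edge operations can be sketched as follows; the choice of which edge to touch is made elsewhere (e.g., uniformly at random), and the function names are illustrative.

```python
def remove_op(edge_w, a, b):
    """REMOVE: decrement an edge's weight, deleting it when the weight is 1."""
    key = tuple(sorted((a, b)))
    if edge_w.get(key, 0) <= 1:
        edge_w.pop(key, None)
    else:
        edge_w[key] -= 1

def add_op(edge_w, a, b):
    """ADD: increment an edge's weight, creating it with weight 1 if absent."""
    key = tuple(sorted((a, b)))
    edge_w[key] = edge_w.get(key, 0) + 1
```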
Fig. 6 demonstrates a simple example of REMOVE and ADD operations. Specifically, two REMOVEs transform the first subgraph into the second and one ADD transforms the second into the third.
One may argue that here we only consider changes to the edge weights but omit changes to the node weights. The reason is that in our graph model, the correlation between two data nodes is mainly dominated by the weight of the edge between them. Therefore, changing the edge weights is sufficient to simulate the changes in the correlations between data nodes.
As discussed previously, the uncertainty degree indicates the proportion of the workload that may change in the future. In our graph model, the workload change can be represented using edge operations, which leads to the following definition.
Definition 4 (Uncertainty Degree): Given a workload $L$, let $G_L$ be the query graph generated by $L$. The uncertainty degree $R$ is the ratio between the number of edge operations and the sum of all the edge weights in the graph: $R = |O| / P(W)$, where $|O|$ represents the number of edge operations and $P(W)$ represents the sum of edge weights within graph $G_L$.
Notice that each edge operation can be randomly applied to any edge within the graph, which means more edge operations lead to higher uncertainty. This property reflects the randomness of the future workload.
When we apply multiple edge operations to the graph, the result could be a collection of possible query graphs instead of one fixed result. The collection of these query graphs forms the uncertainty range.
Definition 5 (Uncertainty Range): Given a workload $L$ and an uncertainty degree $R$, let $G_L$ be the query graph generated by $L$. The uncertainty range generated by $R$ on $G_L$, denoted as $L \pm R$, is the collection of query graphs that represents all the possible outcomes of applying $R \cdot P(W)$ edge operations to $G_L$.
Finally, we can formalize the robust optimization problem to be solved.
Definition 6 (Robust Optimization on Graph): Given a workload $L$ on a collection of files $C$ and an uncertainty degree $R$, let $G_L$ be the query graph generated by $L$ on $C$ and $L \pm R$ be the uncertainty range generated by $R$ on $G_L$. Find a consolidation schema $S$ for $C$ so that the following objective is satisfied:

$$S^{*} = \arg\min_{S} \; \max_{G' \in L \pm R} Cost_g(G', S). \qquad (2)$$

The number of possible query graphs within the uncertainty range could be in the thousands or more. If we want to find a consolidation schema that is robust to all of these query graphs, we would need to repeat the primitive graph clustering algorithm thousands of times. The time consumption of this process is intolerable. What we are looking for is an algorithm that can find the robust consolidation schema by executing the clustering process only once.

B. Intuition
We use an example to demonstrate how to solve the RO-based consolidation problem in a more efficient way. Fig. 7 demonstrates a simple consolidation scenario. For the sake of simplicity, in this example we assume that there are only two query patterns: $Q1: \{a, b\}$ and $Q2: \{b, c\}$. The certain workload contains $m$ instances of Q1 and $n$ instances of Q2. The uncertain workload involves $k$ edge operations, where $p \cdot k$ of them are Q1 and $(1 - p) \cdot k$ are Q2 $(p \in [0, 1])$. Suppose the size threshold of the data blocks only allows consolidating two nodes; the primitive consolidation solutions will then generate two possible schemas: schema 1 and schema 2. The corresponding costs of these two schemas are demonstrated in Table II. From the results, we can conclude that the actual cost of the future workload is uncertain due to the uncertainty of $p$. Meanwhile, if we strictly follow the definition of (2), we find that the only factor that influences the worst-case cost of the schemas is the certain workload, since $p$ is eliminated in the worst-case scenario. However, in the example, the worst-case scenario is highly improbable. To reach the worst case of schema 2, all the uncertain queries must be Q1 $(p = 1)$. Suppose the probabilities of the two queries are equal; then the probability of the worst case is $1/2^k$. Therefore, from a practical point of view, neither schema 1 nor schema 2 is robust enough for the future workload, even if we choose one based on the definition of RO.
In this article, we propose a consolidation schema that remains robust in terms of both the real cost and the worst-case cost. Our solution is to introduce duplication. By duplicating node $b$, we obtain consolidation schema 3. Calculation shows that both the real cost and the worst-case cost of schema 3 are always smaller than those of the previous two schemas. Therefore, according to Definition 6, schema 3 is more robust than schemas 1 and 2. This conclusion indicates that introducing duplication can make the consolidation schema more robust.
However, the size threshold of the data blocks limits the number of small files that can be duplicated. Therefore, we need a metric to decide whether we should duplicate a node or not. To do this, we introduce a new metric.
Definition 7 (Duplication Benefit): Given a consolidation schema $S$ on a query graph $G_L$, the benefit $B(o)$ of duplicating a node $o$ is the reduction of $Cost_g(G_L, S)$ obtained by duplicating $o$.
According to Definition 7, a large $B(o)$ indicates that $o$ is worth duplicating, since duplicating it reduces the total cost by a large magnitude. Again, we use Fig. 7 to demonstrate the characteristics of a node worth duplicating. For the sake of simplicity, we assume that $m + n = C$ and $m \le n$. In this case, we know that schema 2 should be the primitive schema. As a result, $B(b)$ can be calculated via (3). Since $B(b)$ is always positive, maximizing $B(b)$ is equivalent to minimizing $1/B(b)$, which leads to the objective in (4). If we consider $p$ as a constant, to achieve (4) we need to maximize $m$. However, we have defined previously that $m + n = C$ and $m \le n$. Therefore, the maximum value of $m$ is $m = C/2$, reached when $m = n$. This conclusion indicates that when a node is more evenly correlated to its adjacent clusters, it is more worth duplicating. In theory, we can find all the nodes worth duplicating by calculating $B(v)$ for each node $v \in G$. However, this process is very time-consuming, since calculating $B(v)$ involves calculating the graph-level cost function. Therefore, in the following sections, we propose an alternative solution that efficiently identifies the nodes that are worth duplicating.

C. Unstable Node
The main challenge of the RO-based consolidation problem is that the uncertain queries will influence the choice of the future consolidation schema. However, not all schemas suffer from this problem. Suppose we have the original query graph displayed in Fig. 8(a), and two edge operations are to be performed in the future (for the sake of simplicity we assume that these two operations are ADD operations). Then, no matter how these operations are performed, the consolidation schema will always be identical, since the operations can never change the priorities of merging the edges. This conclusion indicates that $b$ is not worth duplicating, since its connections to the adjacent clusters differ significantly. Things are different in Fig. 8(b), where $b$ is evenly connected to its adjacent clusters. If two ADD operations are applied to edge $(a, b)$, then $a$ and $b$ should be merged since their connection is heavier. If two ADD operations are applied to edge $(b, c)$, then $b$ and $c$ should be merged. In this case, we cannot determine which two of the nodes should be merged, which makes $b$ a node worth duplicating. In this article, we refer to node $b$ in subfigure (b) as an unstable node, since it can be merged with either of its neighbouring nodes due to workload uncertainty.

D. Discovery of the Unstable Nodes
The discovery of the unstable nodes is a crucial task. Intuitively, a node is unstable if the possible edge operations in the future could influence its current merge schema. For example, in Fig. 8(b), the original merge schema requires merging $b$ and $c$ together. However, two ADD operations could lead to $a$ and $b$ becoming more closely connected. In this case, node $b$ is unstable.
This intuition shows that the discovery of the unstable nodes depends on an existing schema, which can be generated using the clustering algorithm we proposed in Section III. The intuition also implies that an unstable node has to be a marginal node, i.e., a node connecting two or more clusters. If a node is in the center of a cluster, then it can only be assigned to this cluster, no matter how the priority of its neighbouring edge weights changes. Based on these conclusions, we propose our solution for judging whether a node is unstable.
Let $k$ be the number of edge operations to be performed. For each marginal node, we calculate the sum of the edge weights connecting it to each cluster as the connection between this node and that cluster. For simplicity, we first consider a node connecting two clusters: we denote the connection between this node and the other nodes within the same cluster as $con(in)$, and the connection between this node and the other cluster as $con(out)$. The unstable nodes can be identified using the following definition.
Definition 8 (Unstable Node): A node is unstable if $con(in) - con(out) \le k$.
Using the definition above, we can filter out the initial collection of unstable nodes by examining all the marginal nodes. However, the algorithm does not stop here. After we have traversed all the marginal nodes and discovered the unstable nodes, the unstable nodes form a new cluster. As a result, some of the nodes that were not marginal may become marginal, since they may be connected to the unstable nodes. These nodes should also be examined for stability. Thus, the discovery of unstable nodes becomes an iterative process. First, we discover the unstable nodes from the initial consolidation schema; this makes more nodes marginal. Then we check these new marginal nodes to see whether they are stable. This process is repeated until all the marginal nodes are stable.
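A sketch of this iterative discovery is given below. It assumes the dictionary-based graph from Section III, a label map derived from the primitive schema, and the number of edge operations k; for a node touching several clusters, the strongest outside connection is used as con(out), which is one reasonable reading of Definition 8.

```python
UNSTABLE = "__unstable__"

def discover_unstable(nodes, edge_w, label, k):
    """Iteratively mark unstable nodes (Definition 8) until a fixed point."""
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if label[v] == UNSTABLE:
                continue
            con = {}                              # connection to each cluster
            for (a, b), w in edge_w.items():
                if v in (a, b):
                    other = b if a == v else a
                    con[label[other]] = con.get(label[other], 0) + w
            con_in = con.get(label[v], 0)
            con_out = max((w for c, w in con.items() if c != label[v]), default=None)
            if con_out is None:
                continue                          # not a marginal node
            if con_in - con_out <= k:             # Definition 8
                label[v] = UNSTABLE
                changed = True
    return [v for v in nodes if label[v] == UNSTABLE]
```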
Fig. 9 demonstrates this process. Subfigure (a) demonstrates the original consolidation schema with two clusters marked in red and blue. In the first round of the discovery of unstable nodes, we find an unstable node, marked in grey in subfigure (b). After marking this node as unstable, we need to check its neighbourhood to see if any neighbours become unstable too. As a result, we discover two more unstable nodes, as shown in subfigure (c). The unstable node discovery terminates here since all the remaining nodes are stable.
There is still one problem that remains unsolved: the number of edge operations $k$. From Definition 5 we know that we need to perform $P(W) \cdot R$ edge operations in total. In a query graph with thousands of edges, even if $R$ is set to a small value such as 10%, the number of edge operations will be in the hundreds, which is already larger than most of the edge weights in the graph. If we simply choose $P(W) \cdot R$ as the number of edge operations, most of the nodes will become unstable. In reality, this result is inappropriate, since the probability of all these edge operations being applied to the same edge is extremely small, making this case more like an outlier than a worst case. Therefore, we propose an alternative solution.
We solve the problem by considering it as a random walk problem. We know that we are going to perform $P(W) \cdot R$ edge operations in total and that there are $|E|$ edges within our graph. For a single edge within the graph, each operation has a $1/|E|$ probability of landing on that edge and a $1 - 1/|E|$ probability of leaving it unchanged. After $P(W) \cdot R$ rounds, the expected number $k$ of edge operations received by a single edge is

$$k = \frac{P(W) \cdot R}{|E|}. \qquad (5)$$

The calculation above is not exactly precise. For example, it fails to consider the case in which some edges might be erased by the edge operations. However, the goal here is to estimate the number of worst-case edge operations on a single edge. Thus, in our algorithm, a $k$ generated via (5) suffices.

E. Robust Clustering
We now formally introduce the robust clustering algorithm. The first step is to execute the primitive clustering algorithm proposed in Section III and generate an initial consolidation schema. Then, we project the clustering result back onto the original query graph and label each node with its cluster information. Next, we perform unstable node discovery on the labeled query graph. This process turns some of the nodes into unstable nodes. We repeat the unstable node discovery until no more unstable nodes can be found. Finally, each remaining cluster iteratively merges the most closely connected unstable node. This process is repeated until the cluster reaches its size threshold. Notice that a merged unstable node is not removed from the graph; other clusters can also merge it. This process continues until all the clusters reach their size thresholds. After that, there may still exist some unstable nodes that are not merged by any of the clusters. These nodes represent files that are unsuitable for being assigned to any of the blocks. We can use these nodes to fill up the remaining space of the existing clusters or merge them into a new cluster. Fig. 9(d) demonstrates the result of the robust clustering; as we can see, unstable nodes exist in both of the clusters. The framework of the robust clustering algorithm is displayed in Algorithm 2.
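The pipeline below sketches the whole procedure end to end (the framework given in Algorithm 2), reusing the earlier sketches for primitive clustering and unstable-node discovery together with a simple greedy duplication step; merging leftover unstable nodes into one extra block is a simplification of the fill-up strategy described above, and all names are illustrative.

```python
def robust_consolidation(workload, file_size, T, k):
    """Robust clustering sketch: primitive schema plus duplicated unstable nodes."""
    blocks = hierarchical_consolidation(workload, file_size, T)   # primitive schema
    _, edge_w = build_query_graph(workload)
    label = {f: i for i, cluster in enumerate(blocks) for f in cluster}
    unstable = set(discover_unstable(list(label), edge_w, label, k))

    # pull unstable nodes out of their original clusters; they may be duplicated
    for cluster in blocks:
        cluster[:] = [f for f in cluster if f not in unstable]

    def connection(cluster, v):
        in_cluster = set(cluster)
        return sum(w for (a, b), w in edge_w.items()
                   if (a == v and b in in_cluster) or (b == v and a in in_cluster))

    merged = set()
    for cluster in blocks:
        size = sum(file_size[f] for f in cluster)
        for v in sorted(unstable, key=lambda v: connection(cluster, v), reverse=True):
            if size + file_size[v] > T:
                break                       # cluster has reached its size threshold
            cluster.append(v)               # duplicate: v remains available to others
            merged.add(v)
            size += file_size[v]

    leftovers = [v for v in unstable if v not in merged]
    if leftovers:
        blocks.append(leftovers)            # simplified handling of unmerged nodes
    return blocks
```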

F. Time Complexity Analysis
Consider a graph $G = (V, E)$ with $|V|$ nodes and $|E|$ edges. The hierarchical clustering step of the robust clustering algorithm has the same time complexity as the previous clustering algorithm, which is $O((|E| + k) \cdot |V|)$. The best case of unstable node discovery is $O(|E|)$ and the worst case is $O(|E|^2)$. On average, the time complexity of the unstable node discovery phase can be estimated as $O(|E| \log |E|)$. In the robust clustering phase, we need to compare the connection between each unstable node and all the clusters. Suppose there are $c$ clusters and $\log |V|$ unstable nodes; then the time complexity of this phase is $O(c \log |V|)$. Overall, the time complexity of robust clustering is $O((|E| + k) \cdot |V| + |E| \log |E| + c \log |V|)$.

V. IMPLEMENTATION
In this section, we introduce some techniques that can be used to enhance the usability of our algorithm. We also demonstrate an exemplar architecture of a small file management system based on our algorithm.

A. Choosing Parameters
The RO-based consolidation method involves an important parameter, the uncertainty degree $R$. This parameter controls the robustness of the schema as well as the redundancy introduced by the unstable nodes. A poor choice of $R$ could lead either to insufficient robustness or to too much redundancy. Therefore, choosing $R$ is a crucial task [52].
In real life, users may not always have strong confidence in the proper choice of $R$. Therefore, in this section, we propose two solutions that help estimate the parameter $R$. We also demonstrate how to apply these solutions to real-life scenarios to further enhance the usability of our algorithm.
Solution I. We start by considering a simple scenario as demonstrated in Fig. 1. Suppose we have a collection of turbine log files with two attributes, Power and Time, as shown in Fig. 10(a). The routine analysis involves accessing data using the time attribute. Naturally, the optimal choice would be to organize the data by their distribution on the time dimension, which means the time dimension is the principal component of the routine analysis workload. If the uncertain analysis involves queries on the power dimension, then we should also take the data distribution on the power dimension into consideration when organizing the data. In this case, the principal component of the hybrid workload should be the principal component over both the power dimension and the time dimension, as shown by the red dotted line in Fig. 10(a). The angle $\theta$ between the component of the routine analysis and the component of the hybrid workload is an indicator of how much the combined workload differs from the routine workload. Therefore, we can use this angle to estimate the value of $R$. Specifically, $R$ can be set to $R = 2\theta/\pi$, since $\pi/2$ is the natural upper bound of the angle between two principal components.
This method can be further generalized. Specifically, we denote the collection of dimensions that the routine workload depends on as $D_h$, the dimensions that the uncertain workload depends on as $D_f$, and the eigenvector of the principal component on a collection of dimensions $D$ as $egv(D)$. Then the uncertainty degree $R$ can be calculated via

$$R = \frac{2}{\pi} \cdot \angle\big(egv(D_h),\; egv(D_h \cup D_f)\big).$$

If the uncertain workload involves all the dimensions, or the users do not have any prior information about the uncertain workload, then they may use the principal component of the whole data set as the combined workload, as shown in Fig. 10(b).
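A sketch of this estimate using NumPy follows: X is assumed to be a matrix with one row per file and one column per metadata attribute, D_h and D_hf are the column indices of the routine dimensions and of the routine-plus-uncertain dimensions, the leading principal directions of the two column subsets are compared, and R = 2θ/π is returned. All names are illustrative.

```python
import numpy as np

def leading_direction(X: np.ndarray) -> np.ndarray:
    """Direction of largest variance (leading principal component) of X."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def estimate_R_pca(X: np.ndarray, D_h, D_hf) -> float:
    """Solution I sketch: R = 2*theta/pi, theta = angle between the components."""
    v_h = np.zeros(X.shape[1]); v_h[D_h] = leading_direction(X[:, D_h])
    v_hf = np.zeros(X.shape[1]); v_hf[D_hf] = leading_direction(X[:, D_hf])
    cos = abs(v_h @ v_hf) / (np.linalg.norm(v_h) * np.linalg.norm(v_hf))
    theta = np.arccos(np.clip(cos, 0.0, 1.0))
    return 2.0 * theta / np.pi

# e.g. with columns [power, time]: routine workload uses time only,
# hybrid workload uses both -> estimate_R_pca(X, D_h=[1], D_hf=[0, 1])
```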
Solution II. In some use cases, the users may find that their workload cannot be represented using simple queries on different dimensions, or the users may simply want to generate a more robust schema for the existing files without knowing the exact uncertain workload. We therefore provide another solution that estimates $R$ based on the query graph. This solution is carried out as follows. Given a query graph, we partition the collection of edge weights into two clusters using the k-means algorithm, as shown in Fig. 10(c). We assume that the cluster containing the heavier edges, $P_1$, is the cluster of edges connecting stable nodes, and that the cluster containing the lighter edges, $P_2$, is the cluster of edges connected to the unstable nodes. In order to make these edges unstable, at least $k = \min(P_1) - \min(P_2)$ edge operations need to be performed, where $\min(P)$ represents the minimum edge weight within $P$. Using $k$ and (5), we can calculate the value of $R$ in reverse: $R = k \cdot |E| / P(W)$. Notice that this solution may tend to recommend an $R$ larger than the optimal choice. The reason is that by applying k-means clustering, the algorithm will always try to find some unstable edges, even if the original clustering result is perfectly stable. This phenomenon can be observed in the experimental results demonstrated in Fig. 10.
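Solution II can be sketched as below on the dictionary-based query graph: edge weights are split into a heavy and a light cluster with a tiny hand-rolled 1-D k-means (to stay self-contained), k = min(P1) − min(P2) is computed, and R is recovered by inverting (5). This is an illustrative sketch under those assumptions, not the exact implementation.

```python
def kmeans_two_groups(values, iters=50):
    """Split a list of edge weights into (heavy, light) groups with 1-D k-means."""
    c1, c2 = float(min(values)), float(max(values))
    g1, g2 = list(values), []
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1 = sum(g1) / len(g1) if g1 else c1
        c2 = sum(g2) / len(g2) if g2 else c2
    return (g2, g1) if c1 < c2 else (g1, g2)    # (P1 heavy, P2 light)

def estimate_R_from_graph(edge_w):
    """Solution II sketch: k = min(P1) - min(P2), then R = k * |E| / P(W)."""
    weights = list(edge_w.values())
    p1, p2 = kmeans_two_groups(weights)
    if not p1 or not p2:
        return 0.0                              # degenerate case: no light cluster
    k = min(p1) - min(p2)
    return k * len(weights) / sum(weights)      # invert Eq. (5)
```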

B. Acceleration
Parallel computing is a widely used technique for accelerating the hierarchical clustering process [53], [54]. A key challenge of parallelism is to split the original data set properly. Fortunately, in the small file management problem, we can take advantage of the features of the small files. As we have discussed previously, a large portion of the future workload is the certain workload, which is routinely executed. In our solution, we can first split the small files based on the query predicates of the certain workload. For example, if the certain workload accesses the small files based on time periods (as demonstrated in Fig. 1), then we can split the small files by their time feature into multiple sets. After that, we can execute the consolidation algorithm on these sets in parallel.
Although a parallel algorithm can significantly increase the efficiency of generating the consolidation schema, we do not recommend splitting the original data into too many subsets, since doing so will impact the quality of the consolidation schema. In a small file management problem, the quality of the consolidation schema matters more than the time consumption of generating the schema, because a consolidation schema has a long-term effect on the data access efficiency of the future workload. Therefore, it is acceptable to spend more time generating a high-quality consolidation schema.

C. Architecture
We demonstrate the architecture of the small file management system we implemented for industrial small file storage (Fig. 11). This demonstration can serve as an example for other small file management applications. Our system is composed of three parts: the pre-store pool, the schema management module and the storage platform. Their purposes are as follows.
The pre-store pool is basically a very large disk. In real life, the small files are transmitted to our system remotely from the energy companies. Therefore, to apply the small file consolidation algorithm, we need to collect these small files first and store them in the pre-store pool. The size of the pool is determined by the total quantity of data received every day. In our case, the system receives around 200,000 small files with a total size of 4 TB every day, so we use a 10 TB disk as the pre-store pool.
The schema management module is a relatively high-performance PC with 256 GB of RAM and a multi-core CPU. This module has two main functions: managing the consolidation schema and storing the metadata information of the small files. The consolidation manager is responsible for the transformation between the small files and the data blocks, including consolidating the small files, uploading/downloading the data blocks and extracting the requested small files from the data blocks. Both the consolidation schema and the metadata information are stored in a MySQL database. The consolidation schema and the metadata records of the most recent three days are stored as in-memory tables. To increase the efficiency of the small file consolidation algorithm, the small files are split into three different time periods, and the consolidation algorithm is executed on these batches in parallel. As for memory consumption, caching three days' data leads to storing 600,000 records in memory. Suppose each data record consumes 10 KB of memory space; the total consumption of the in-memory tables is then 6 GB. A graph model of 200,000 nodes consumes around 15 GB of memory space. When executing the consolidation algorithm, the graph model consumes another 20 GB of memory space. Therefore, in theory, the total memory consumption of the schema management module should be less than 50 GB. Considering other regular PC tasks such as buffering recently used data, in practice the peak memory usage of the module is around 100 GB.
The storage platform is an HDFS cluster. Since the schema management module has already consolidated the small files, we do not make any further changes to the storage platform. The total storage capacity is around 2 PB. The system is configured based on the HDFS documentation. Normally, the system only stores the turbine logs of the last three months. Older data are removed from the system and archived on physical disks.

VI. EXPERIMENTS
In this section, we demonstrate the experimental results to verify the effectiveness of our algorithm. Specifically, we conduct three experiments to evaluate different aspects of our algorithm. In the first experiment, we conduct an overall performance test on two data sets. This experiment aims to evaluate the performance of our consolidation algorithm in different use cases against other baseline methods. In the second experiment, we evaluate the importance of introducing the RO mechanism into the small file consolidation problem. In the third experiment, we evaluate the performance of the solutions we proposed for selecting the parameter $R$.

A. Experiment Setup
In a storage system, the small file consolidation schema is normally generated by the manager node. Therefore, the algorithms are all implemented on a single-node PC with an Intel Core i7 2.20 GHz CPU and 32 GB of RAM. The file blocks are stored in a remote HDFS cluster with 4 nodes, with the block size set to 128 MB. The bandwidth between the client and the HDFS cluster is 50 MB/s.
Data Sets. We use two data sets to verify the algorithms' performance. The first data set is the ImageNet [55] data set from Microsoft. The data set contains 9600 pictures of different sizes. Each picture has 25 labels such as black, furry, rectangle, etc. The experimental scenario is to simulate training an image classifier using pictures stored on systems such as HDFS. If the analyst wants to train a classifier for different shapes, then the certain analysis involves queries on dimensions such as rectangle, round and square.
The second data set we use consists of the actual log files generated by several turbines within one day. The metadata records of these files resemble the example we demonstrate in Fig. 1. The size of each log file is between 1 MB and 50 MB, the total number of files is 50,000 and their total size is 1 TB.
Criteria. We evaluate the performance of each algorithm using four criteria. The first criterion is the time consumption of generating the consolidation schema. In a data upload sequence, a consolidation schema must be generated in order to consolidate the small files accordingly. The second criterion is the number of data blocks to be accessed. This criterion reflects the cost of each consolidation schema. The third criterion is the time consumption of executing the workload. This process involves downloading the data blocks for each query. Notice that in our experiment we do not consider a local caching mechanism; therefore, if two queries request the same data block, the data block will be accessed twice. This criterion gives a more accurate estimation of the time efficiency of the consolidation problem. The fourth criterion is the total time consumption, which is composed of two parts: the time consumption of generating the consolidation schema and the time consumption of executing the workload. In our experiment, the workload is executed only once. In real-life scenarios, the workload may be executed repeatedly, so the time consumption of downloading the data would be magnified even more.
Algorithms.We use five different algorithms for the evaluation: a) The first algorithm is to consolidate small files in sequence.This solution is easy to implement and has been widely used by companies such as the wind energy company we introduced in Section I.
b) The second algorithm is the Amoeba algorithm [15].This algorithm generates a tree-structured schema that splits the original data files into small partitions based on the labels of the files.Unlike the previous algorithm, this algorithm chooses labels without considering the historical workload.Therefore, in theory, this algorithm is more robust to the ad-hoc queries by avoiding overfitting the historical workload.
c) The third algorithm is the algorithm proposed in the cliffguard [46], this work first introduces RO into the database optimization area.However, this approach is a generalized solution for any database optimizer using an iterative training process.Therefore it suffers from time efficiency issues, especially in the small file consolidation problem.
d) The fourth algorithm is the hierarchical clustering algorithm we proposed in Section III.This algorithm serves as a baseline that eliminates the RO mechanism in our solution.
e) The fifth algorithm is the robust clustering algorithm we introduced in Section IV. Notice that we do not implement the parallel version we mentioned in Section V, in order to demonstrate the best performance of this algorithm.

B. Overall Performance
1) On ImageNet Data Set: This experiment evaluates the performance of different consolidation algorithms on the ImageNet data set. We use several combinations of training workload and testing workload for the evaluation.
The background of this experiment is to simulate collecting data for training three different image classifiers: shapes, colours, and animals. To do this, the user needs to collect and store images with different labels.
There are mainly two types of workload: the certain workload and the uncertain workload. The certain workload represents queries that are closely related to the training task. For example, to train a shape classifier, the user needs to collect images with clean labels, such as rectangle = 1 or round = 1. These queries also form the majority of the workload.
The uncertain workload mainly represents queries that collect data for the validation set. This workload is uncertain for the following two reasons: a) it often involves queries that access files with complicated label relations, adding uncertainty to the data access pattern; b) the user may introduce randomness when choosing the files. For example, the user may need to test whether the classifier can discover shapes in controversial images. These tests require images with labels such as rectangle = 1 & round = 1. On top of that, the user may also want to observe how the classifier reacts to images containing irrelevant shapes, such as triangle = 1; this label could be randomly selected from a collection of label candidates. These are examples of uncertain queries. The uncertain queries normally occupy only a small portion of the workload (no more than 30%). To achieve this, only a portion of the files is randomly selected from the candidates that meet the query predicates of the uncertain queries.
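As a rough illustration, the sketch below (hypothetical field and function names, not the exact generator used in our experiments) shows one way such an uncertain validation query could be produced: a fixed predicate is combined with one randomly chosen extra label, and only a random subset of the matching files is kept.

```python
import random
from typing import Dict, List, Set

def make_uncertain_query(files: Dict[str, Set[str]],     # file id -> set of positive labels
                         required: Set[str],             # fixed part of the predicate
                         candidate_labels: List[str],    # pool for one randomly chosen label
                         keep_fraction: float = 0.3,
                         rng: random.Random = random.Random(0)) -> List[str]:
    """Combine a fixed predicate with one random label, then keep only a random
    portion of the files that satisfy the whole predicate."""
    extra = rng.choice(candidate_labels)                 # e.g. "triangle"
    predicate = required | {extra}
    matches = [f for f, labels in files.items() if predicate <= labels]
    if not matches:
        return []
    k = max(1, int(len(matches) * keep_fraction))
    return rng.sample(matches, k)

files = {
    "img1": {"rectangle", "round", "triangle"},
    "img2": {"rectangle", "round"},
    "img3": {"rectangle", "triangle"},
}
print(make_uncertain_query(files, {"rectangle"}, ["triangle", "round"]))
```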
The main difference between the certain workload and the uncertain workload is that the data access pattern of the certain workload has few intersections, while the data access pattern of the uncertain workload is unpredictable. Training a consolidation schema using the certain workload alone is a relatively simple task, while training a schema with the two workloads combined is a greater challenge.
To fully evaluate the performance of the algorithms, we further split the experiment into four scenarios:
Scenario I. We train each consolidation schema using the certain workload and use the certain workload for testing. This scenario evaluates the performance of each consolidation schema when handling simple and identical workloads.
Scenario II. We train the consolidation schema using the certain workload and test using the hybrid workload of both the certain and the uncertain workload. This scenario evaluates the impact of the uncertain workload on the file consolidation schemas. This is also the scenario that we focus on in this paper.
Scenario III. We train the consolidation schema using the combination of both the certain and the uncertain workload. However, we only use the certain workload for testing. This scenario demonstrates the impact of the uncertain workload on the performance of the certain workload.
Scenario IV. We train the consolidation schema using the combination of both the certain and the uncertain workload. We test the performance using the combined workload as well. This scenario demonstrates the performance of solving the query uncertainty issue in a primitive way.
The experimental results are shown in Fig. 12. The approaches used for generating the schema are marked in different colours. The R value we use for the robust clustering algorithm is the average of the values automatically generated by solution 1 and solution 2. From the experimental results we have the following observations:
a) All the algorithms except for the sequential algorithm achieve decent performance in scenario I. This result shows that using a proper consolidation schema can greatly improve the efficiency of accessing the small files. From the results, we can also see that the primitive solution we proposed (the blue bar) achieves the best performance among all the algorithms in this scenario. This phenomenon demonstrates the effectiveness of the graph model we utilize in the algorithm.
b) Among all the schemas, the RO schema we proposed (black bar) achieves the best performance in scenario II. This result proves the effectiveness of our algorithm. Notice that the Cliffguard algorithm (red bar) achieves the second-best performance since it also generates the consolidation schema using the RO mechanism. The reason it fails to outperform our algorithm is that it searches for a robust schema at random while our algorithm uses a heuristic algorithm. This difference makes our algorithm converge much faster than the Cliffguard algorithm. The other algorithms fail to achieve very good performance since they all overfit the historical workload.
c) Although, in scenario III, the test workload is the same historical workload as we used in scenario I, all the algorithms except for the robust ones suffer from a certain performance degradation. This phenomenon is caused by the algorithms overfitting the training workload, i.e., the combination of the historical and the uncertain workload. This result shows that the small file consolidation problem cannot be solved by simply considering the combination of both the certain and the uncertain workload, since doing so will affect the performance of the certain workload.
d) Although the training workload and the testing workload are identical in scenario IV, the primitive algorithms (cyan and blue) still fail to achieve good performance. The reason is that the combined workload is so complicated that these algorithms fail to converge to a relatively good local optimum. The Cliffguard algorithm suffers from the same problem. The complexity of the training workload introduces too many possible candidates within the uncertainty range; therefore, the possibility of discovering the optimal schema is also reduced. Meanwhile, the RO schema we proposed performs relatively well. Although its performance is worse than in scenario II, it still manages to achieve relatively good performance in scenario IV. This result proves the robustness of the algorithm.
e) From subfigure (a) we can see that the time consumption of schema generation varies drastically. However, the time consumption of generating the schema has very little influence on the overall time consumption (subfigure (d)). This phenomenon demonstrates that the time consumption of training the consolidation schema is not the major concern in the small file consolidation problem. Instead, a decent consolidation schema saves much more time when downloading the data files.
2) On Turbine Log Data Set: This experiment evaluates the performance of different consolidation schemas on a real-world wind turbine log data set. The certain workload consists of the routine analytical tasks the company performs each day. This workload aims to calculate statistical information such as average wind speed and average power output, and to examine the condition of each wind turbine. These queries access the data set according to the following pattern: first, the queries classify all the log files by the turbine ID; then, the queries split the log files of each turbine by the hour. After that, analytical scripts are executed on the split log files so that the hourly statistical information is retrieved. Finally, the logs of all the turbines that fail to achieve the average power output are collected; these logs will be analyzed by professional mathematical models or sampled by a human expert to check whether these turbines are still in good condition.
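For illustration, the sketch below mimics this routine access pattern on a few hypothetical metadata records (the field names and the threshold are placeholders, not the company's actual schema): group by turbine, split by hour, compute hourly statistics, and collect the underperforming turbines.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# hypothetical metadata records, one entry per log file
logs = [
    {"file": "t1_00.log", "turbine": "T1", "hour": 0, "avg_power": 480.0},
    {"file": "t1_01.log", "turbine": "T1", "hour": 1, "avg_power": 520.0},
    {"file": "t2_00.log", "turbine": "T2", "hour": 0, "avg_power": 310.0},
]

def routine_workload(records: List[Dict], power_threshold: float = 400.0):
    """Mimic the certain workload: group the log files by turbine and hour,
    compute hourly statistics, then flag turbines below the power threshold."""
    by_turbine_hour = defaultdict(list)
    for r in records:
        by_turbine_hour[(r["turbine"], r["hour"])].append(r)
    hourly_power = {k: mean(r["avg_power"] for r in v) for k, v in by_turbine_hour.items()}
    flagged = sorted({t for (t, _), p in hourly_power.items() if p < power_threshold})
    return hourly_power, flagged

print(routine_workload(logs))
```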
The uncertain workload is the workload that detects abnormalities. These abnormalities could be accidental equipment malfunctions, abnormal sensor readings, etc. These abnormalities can be detected using predicates such as status = error (equipment malfunction), windspeed > 5 & power output < 500 (low power output under good wind conditions), and windspeed > 100 (sensor data exceeding the theoretical threshold). These queries are uncertain since these problems may appear randomly at any time on any turbine.
We evaluate the performance of the consolidation schemas using the same scenario setup and the algorithms we have introduced in the previous section. In this section, the historical workload and the uncertain workload are the workloads collected from real-world applications.
The experimental results are shown in Fig. 13. The four subfigures demonstrate the time consumption of generating the consolidation schemas, the total number of accessed data blocks, the time consumption of downloading the data blocks, and the total time consumption of both generating the schema and fetching the data files. From the results, we have the following observations:
a) The sequential algorithm and the two graph-based consolidation algorithms show performance very similar to that on the ImageNet data set. The RO schema we proposed manages to achieve the best performance in all but the first scenario. This result shows the effectiveness of the RO schema.
b) The Amoeba algorithm performs worse than it does on the ImageNet data set, showing a larger gap to the RO schema. The reason is that the attributes of the ImageNet files are categorical, so splitting the files by their category is a natural and effective approach. In the second data set, most of the attributes are numeric and continuous. The Amoeba algorithm splits these attributes mainly based on their values. For example, given a value range [0, 100], the Amoeba algorithm might consolidate the files into 4 blocks with 4 value ranges: [0, 25), [25, 50), [50, 75), [75, 100]. However, this schema may not be optimal since the values could be accessed in a different pattern. This is the main reason for the performance degradation of the Amoeba algorithm.
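The following toy sketch illustrates the kind of fixed equal-width range splitting described above (a deliberate simplification of Amoeba's actual tree construction): the value domain [0, 100] is cut into four ranges regardless of how the values are actually queried.

```python
def range_partition(values, low=0.0, high=100.0, n_blocks=4):
    """Assign each value to one of n_blocks equal-width ranges over [low, high],
    e.g. [0, 25), [25, 50), [50, 75), [75, 100] for the example above."""
    width = (high - low) / n_blocks
    blocks = [[] for _ in range(n_blocks)]
    for v in values:
        idx = min(int((v - low) // width), n_blocks - 1)  # clamp the upper boundary
        blocks[idx].append(v)
    return blocks

print(range_partition([3.2, 24.9, 25.0, 99.7, 100.0]))
# -> [[3.2, 24.9], [25.0], [], [99.7, 100.0]]
```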
c) The Cliffguard algorithm also suffers from performance degradation. The main reason is that the complexity of the query graph leads to too many candidates within the uncertainty range, making the convergence process very inefficient. In the experiment, the Cliffguard algorithm fails to converge within 1 h. We terminate it at 1 h and use the best schema it has discovered as the consolidation schema.

C. Effectiveness of the RO Mechanism
From the previous experimental results, we can see that introducing the RO mechanism does have a positive influence on the final performance. In this section, we perform a detailed evaluation of the effectiveness of the RO mechanism. To do this, we focus on one specific query and gradually increase its complexity.
The experimental setup is as follows. The consolidation schemas we use for comparison are the hierarchical clustering schema and the RO clustering schema. Both schemas are trained using the certain workload on the ImageNet data set. We choose four types of query patterns for comparison:
a) Queries accessing files with a single positive label, such as rectangle = 1 & round = 0. These queries are examples of the certain workload.
b) Queries accessing files with multiple positive labels, such as rectangle = 1 & round = 1. These queries examine the performance of accessing data within the unstable regions of the consolidation schemas.
c) Queries accessing data on uncertain dimensions, such as triangle = 1. These queries examine the performance of accessing data files using patterns that the consolidation schemas are not designed for.
d) Queries randomly accessing any file within the data set. These queries examine the capability of handling a completely random workload.
We gradually increase the number of files within a query from 2 to 30. The reason we choose 30 as the upper bound of the query size is that the average size of each image file is around 4 MB; therefore, in theory, 30 image files can just fit into a 128 MB data block. To increase the probability of requesting files from the same data block, we use the query graph model generated during the training process to help select the candidate files. Specifically, we use the following rules to generate the queries: we randomly choose a file that meets the query predicate as the initial file. When we lengthen the query, we search for a valid file among the neighbours of the chosen files. If no valid file is found among the neighbours, we randomly select a new valid file from the rest of the data set.
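A minimal sketch of this query-generation rule is given below (hypothetical data structures; `valid` stands for the files matching the query predicate and `neighbors` for the adjacency of the query graph): start from one valid file, prefer valid neighbours of the files already chosen, and fall back to a random valid file when no neighbour qualifies.

```python
import random
from typing import Dict, List, Set

def grow_query(valid: Set[str],                   # files matching the query predicate
               neighbors: Dict[str, Set[str]],    # adjacency of the query graph
               size: int,
               rng: random.Random = random.Random(42)) -> List[str]:
    """Seed with a random valid file, then prefer valid neighbours of the
    already-chosen files; otherwise pick a random valid file from the rest."""
    chosen: List[str] = [rng.choice(sorted(valid))]
    while len(chosen) < size:
        frontier = set().union(*(neighbors.get(f, set()) for f in chosen))
        candidates = (frontier & valid) - set(chosen)
        if not candidates:
            candidates = valid - set(chosen)
            if not candidates:
                break  # no valid file left to add
        chosen.append(rng.choice(sorted(candidates)))
    return chosen

graph = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
print(grow_query({"a", "b", "d"}, graph, size=3))
```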
Note that our experimental setup is highly hypothetical; in a real application, most of these queries cannot be completed by accessing only one data block.
The experimental results are demonstrated in Fig. 14. The four subfigures correspond to the four types of queries, respectively. From this experiment, we have the following observations:
a) Both schemas achieve very good performance on the certain queries (subfigure (a)). One may notice that, as the number of requested files grows, the RO-based schema needs to access slightly more data blocks. The reason is that, by introducing the RO mechanism, the same number of files needs to be carried by more data blocks than in the hierarchical schema. Therefore, when executing queries that both consolidation schemas can handle very well, the RO-based schema inevitably accesses more data blocks. However, the experimental result also indicates that the difference in the number of accessed data blocks is small.
b) When executing queries with multiple positive labels (subfigure (b)), the RO-based schema achieves a much better performance than the hierarchical schema. Notice that, by the definition of the unstable region and the RO consolidation algorithm, files with multiple positive labels are exactly the files that are likely to fall into the unstable regions. In other words, these are the files that are likely to be split into multiple data blocks using the hierarchical approach, while being put together into the same data block due to the duplication mechanism. Therefore, the experimental result shows a big performance gap.
c) The RO-based schema manages to optimize non-routine queries while the hierarchical schema fails to do so (subfigure (c)). The reason is that the duplication mechanism increases the robustness of the consolidation schema. When queries on non-routine dimensions are performed, some of the requested files that are split into different data blocks by the hierarchical schema may be put into the same data block due to duplication, which leads to fewer data block accesses for the RO-based schema.
d) Both schemas fail to optimize the completely random queries. The difference between the random queries and the uncertain queries is that the uncertain queries are still loosely connected to the routine query patterns. In the ImageNet data set, a file has many labels covering shapes, animals, colours, etc. Therefore, even though the label triangle = 1 is not directly related to labels such as rectangle = 1 or round = 1, they still belong to the same broad category, Shape. This connection makes queries on non-routine dimensions more closely related to the routine schema than the completely random queries.
Combining all the previous observations, we have the following conclusions: First, the performance of both consolidation schemas is closely related to the connection between the query pattern and the training workload patterns. As the patterns become less connected, the performance of both consolidation schemas inevitably degrades.
Second, the RO-based schema shows promising results on query patterns of types b) and c). These results show the robustness of the RO-based schema.
Third, neither of the consolidation schemas can handle completely random queries. Dealing with completely random queries remains an unsolved challenge.

D. Parameter Evaluation
The choice of the parameter R is crucial to the performance of the robust clustering algorithm. In the previous sections, we proposed two solutions to choose R automatically. In this section, we evaluate the effectiveness of these solutions.
To evaluate the performance of the choice of R, we use various R values on both the ImageNet data set (subfigure (a)) and the turbine log data set (subfigure (b)). On both data sets, the training workload we use is the routine workload and the test workload is the combination of the routine workload and the uncertain workload.
The experimental results are shown in Fig. 15. We run the robust clustering algorithm on both data sets using a series of R values, and we test the performance of executing both the routine workload and the uncertain workload. From the results we have the following observations:
a) Although neither of the solutions proposes the exact optimal R value, the R values recommended by the two solutions are still very close to the optimal choice. This observation shows the effectiveness of both solutions.
b) A very large R value does not necessarily mean that the schema will be very robust. Both experimental results show that only a properly selected R can lead to a robust schema; beyond that point, increasing the R value leads to performance degradation. The reason is that a very large R means that a large portion of the data blocks is used for containing the duplicated data; therefore, the effective storage size of the data blocks is reduced.
c) From subfigure (a) we can see that when R gets too large, the increment in block accesses becomes insignificant. The reason is that, apart from the marginal nodes that can be assigned to multiple clusters, the rest of the nodes only have strong connections to their original clusters. Therefore, even if we mark these nodes as unstable, they will still be absorbed by the clusters they originally belong to.
d) In both experiments, solution 2 tends to recommend a larger R value. This phenomenon accords with the analysis in Section V.
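The parameter sweep behind Fig. 15 can be summarized by the small sketch below; `train` and `evaluate` are placeholders standing in for Algorithm 2 and the block-access criterion, so the lambdas in the toy usage are purely illustrative.

```python
def sweep_uncertainty_degree(candidates, train, evaluate):
    """Train one consolidation schema per candidate R and score it on the test
    workload, so the recommended R values can be compared against the full curve."""
    scores = {r: evaluate(train(r)) for r in candidates}
    best_r = min(scores, key=scores.get)
    return best_r, scores

# toy stand-ins: pretend the block-access cost is minimized around R = 0.3
best, curve = sweep_uncertainty_degree(
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    train=lambda r: r,                                    # stands in for Algorithm 2
    evaluate=lambda schema: (schema - 0.3) ** 2 + 1.0,    # stands in for counting block accesses
)
print(best, curve)
```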

VII. CONCLUSION
In this article, we study the industrial small file consolidation problem. Small file consolidation is a rarely studied problem, especially in the industrial area. As a result, most of the solutions for industrial small file consolidation are primitive and ineffective. We focus on designing an algorithm that remains robust when processing future uncertain queries. To do this, we introduce the RO model and, based on this model, propose a novel clustering algorithm with a duplication mechanism. Our solution can greatly improve the performance of uncertain queries in industrial applications and benefit tasks such as industrial data processing and accident prevention.
In the future, we will focus on improving both the effectiveness and efficiency of our algorithm by combining the advantages of other state-of-the-art approaches such as embedding, deep learning and parallel computing.

Fig. 1 .
Fig. 1. A simple example of turbine log files, their workloads, and a possible solution for consolidating these files.

Fig. 2 .
Fig. 2. Demonstration of a typical small file management scenario.

Fig. 3 .
Fig. 3. The data structure of a turbine log file. The red rectangle surrounds the metadata information and the green rectangle surrounds the sensor records. The actual file contains thousands of sensor records; we only display a very small part of them.

Fig. 5 .
Fig. 5. Demonstration of building a query graph based on the example in Fig. 1. Subfigure (a) shows the result of executing the certain and the uncertain queries. Subfigure (b) shows the result of merging two nodes connected by the heaviest edge. Subfigure (c) shows the result of merging the original graph into two nodes.

Algorithm 1: Hierarchical Clustering
Input: workload L; collection of objects C; block size threshold T.
Output: partitioning schema S.
Description:
1: Generate query graph G_L from L and C;
2: While there are edges in G_L:
3:   Try to merge the nodes connected by the heaviest edge e;
4:   If Size(merged node) > T:
5:
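A minimal executable sketch of this procedure is given below (the original listing is truncated after step 4; here we assume the natural completion of skipping a merge when it would exceed the block size, which may differ from the exact rule in Section III).

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

def hierarchical_clustering(workload: List[Set[str]],
                            sizes: Dict[str, int],
                            block_size: int) -> List[Set[str]]:
    """Greedy merging of the query graph: node = group of files, edge weight =
    number of queries requesting both endpoints. A merge is skipped when the
    resulting group would exceed the block size."""
    clusters: Dict[str, Set[str]] = {f: {f} for f in sizes}      # node id -> files it holds
    weights: Dict[Tuple[str, str], int] = {}
    for q in workload:                                           # build the query graph
        for a, b in combinations(sorted(q), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    while weights:
        (a, b), _ = max(weights.items(), key=lambda kv: kv[1])   # heaviest edge
        merged = clusters[a] | clusters[b]
        if sum(sizes[f] for f in merged) > block_size:           # merge would overflow the block
            del weights[(a, b)]
            continue
        clusters[a] = merged                                     # merge b into a
        del clusters[b]
        new_weights: Dict[Tuple[str, str], int] = {}
        for (u, v), w in weights.items():                        # redirect b's edges to a
            u, v = (a if u == b else u), (a if v == b else v)
            if u != v:
                key = tuple(sorted((u, v)))
                new_weights[key] = new_weights.get(key, 0) + w
        weights = new_weights
    return list(clusters.values())

sizes = {"f1": 40, "f2": 40, "f3": 60, "f4": 60}
workload = [{"f1", "f2"}, {"f1", "f2"}, {"f3", "f4"}]
print(hierarchical_clustering(workload, sizes, block_size=100))
# -> f1 and f2 share a block; f3 and f4 stay separate because 120 > 100
```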

Definition 7 (Benefit of duplication): Given a query graph G, a primitive consolidation schema S, and a node o within the graph, let S_D(o) be the consolidation schema generated by duplicating o. The benefit of duplicating o is

$$B(o) = \frac{\operatorname{Cost}_g(G, S) - \operatorname{Cost}_g(G, S_D(o))}{\operatorname{Cost}_g(G, S)}$$
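In code, the benefit is simply the relative cost reduction; the sketch below uses arbitrary toy cost values rather than the actual Cost_g computation.

```python
def duplication_benefit(cost_original: float, cost_duplicated: float) -> float:
    """B(o) = (Cost_g(G, S) - Cost_g(G, S_D(o))) / Cost_g(G, S)."""
    return (cost_original - cost_duplicated) / cost_original

# toy numbers: duplicating o reduces the workload cost from 10 to 7, so B(o) = 0.3
print(duplication_benefit(10.0, 7.0))
```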

Fig. 7 .
Fig. 7. An exemplar small file consolidation scenario with possible consolidation schemas and workloads.

Algorithm 2: Robust Clustering
Input: uncertainty degree R; query graph G.
Output: partitioning schema S.
Description:
1: Initialize a cluster for unstable nodes U = ∅;
2: C = Hierarchical Clustering(G);
3: For each cluster c in C:
4:   Find unstable nodes among the marginal nodes;
5:   Remove the unstable nodes from c and add them to U;
6: Repeat steps 3-5 until all unstable nodes are found;
7: For each cluster c:
8:   Find the most tightly connected unstable node u ∈ U;
9:   Absorb u into the cluster;
10: Repeat steps 8-9 until c reaches its maximum size;
11: Merge the remaining nodes in U into new clusters;
12: Generate the consolidation schema accordingly.
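The duplication step (lines 7-11) can be sketched as follows; this is a simplified illustration in which the unstable nodes and the clusters are assumed to be given, an unstable node may be absorbed by every cluster it is connected to (hence the duplication), and the leftover handling ignores block-size splitting.

```python
from typing import Dict, List, Set, Tuple

def absorb_unstable(clusters: List[Set[str]],
                    unstable: Set[str],
                    weights: Dict[Tuple[str, str], int],
                    sizes: Dict[str, int],
                    block_size: int) -> List[Set[str]]:
    """Simplified sketch of steps 7-11: each unstable node is absorbed (possibly
    duplicated) into every cluster it connects to while the block has room;
    any leftovers form a new cluster."""
    def w(a: str, b: str) -> int:
        return weights.get((a, b), 0) + weights.get((b, a), 0)

    placed: Set[str] = set()
    for c in clusters:
        # unstable nodes ranked by how tightly they connect to this cluster
        ranked = sorted(unstable, key=lambda u: sum(w(u, m) for m in c), reverse=True)
        for u in ranked:
            if sum(w(u, m) for m in c) == 0:
                break                                   # no connection to this cluster
            if sum(sizes[m] for m in c) + sizes[u] <= block_size:
                c.add(u)                                # duplication: u may join several clusters
                placed.add(u)
    leftovers = unstable - placed
    if leftovers:
        clusters.append(set(leftovers))                 # step 11, without size handling
    return clusters

clusters = [{"f1", "f2"}, {"f4", "f5"}]
weights = {("f2", "f3"): 2, ("f3", "f4"): 2}
sizes = {f: 30 for f in ["f1", "f2", "f3", "f4", "f5"]}
print(absorb_unstable(clusters, {"f3"}, weights, sizes, block_size=100))
# -> f3 appears in both blocks
```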

Fig. 9 .
Fig. 9. Demonstration of the unstable node discovery process and the robust clustering algorithm. Subfigure (a) shows the initial consolidation schema. Subfigure (b) shows the initial unstable node. Subfigure (c) shows the result of unstable node discovery. Subfigure (d) shows the result of robust clustering on this graph.

Fig. 10 .
Fig. 10. Demonstration of solutions for choosing the uncertainty degree R. Subfigure (a) demonstrates the solution using principal components on a simple data set. Subfigure (b) demonstrates a more generalized solution. Subfigure (c) demonstrates the solution using the edge weights of the query graph.

Fig. The architecture of an exemplar small file management system.

Fig. 12 .
Fig. 12. Experimental results on the ImageNet data set. Subfigure (a) shows the time consumption of generating the consolidation schemas. Subfigure (b) shows the number of the requested data blocks under different workloads. Subfigure (c) shows the time consumption of downloading the data blocks. Subfigure (d) shows the total time consumption of both generating the consolidation schema and transmitting the requested files.

Fig. 13 .
Fig. 13. Experimental results on the turbine logs. Subfigure (a) shows the time consumption of generating the consolidation schemas. Subfigure (b) shows the number of the requested data blocks under different workloads. Subfigure (c) shows the time consumption of downloading the data blocks. Subfigure (d) shows the total time consumption of both generating the consolidation schema and transmitting the requested files.

Fig. 15 .
Fig. 15. Experimental results for different R values. Subfigure (a) shows the result on the ImageNet data set and subfigure (b) shows the result on the turbine logs.

Fig. 14 .
Fig. 14. Experimental results of four different types of queries on the ImageNet data set. Subfigures (a) to (d) correspond to query types a) to d) introduced in Section VI-C, respectively.

TABLE II: THE COST OF THE SCHEMAS THAT ARE DEMONSTRATED IN FIG.