
## I. INTRODUCTION

Owing to many-core processor technologies and increased memory capacity, the hardware of commodity servers has become much more powerful than it was several years ago [1]. To fully exploit such hardware power, operating systems (OSs) have to provide concurrent processing mechanisms and resource management in accordance with the server workload. For example, Boyd-Wickizer et al. showed that application-oriented OS modifications are necessary to scale the Linux OS [2] performance up to 48 processor cores [3]. Furthermore, Blagodurov et al. [4], [5] showed that controlling the cores on which processes run is necessary to prevent access contention of the processor cache in multicore environments.

The demand to use low-cost commodity servers as enterprise file servers (namely, file servers used in enterprise environments) has been growing. To satisfy the performance requirements of enterprise file servers, it is necessary to develop an OS that suits the file server workload characteristics. This paper proposes modifications of the open-source Linux OS (hereafter, "Linux") to solve a performance bottleneck commonly found in enterprise file server environments.

Enterprise file servers have to handle a large quantity of file-processing requests from hosts. In addition, they have to execute batch processing, such as backup of a large number of stored files. As a result, the performance requirements of the OS running on enterprise file servers are high throughput and low latency. Because Linux processes network-file-system (NFS) and common-Internet-file-system (CIFS) requests concurrently, the throughput of a Linux file server can generally be scaled up by increasing the number of processor cores. With respect to latency, solid-state disks (SSDs) have become available and make it possible to reduce disk-seek time, which is the main source of the latency problem. On the other hand, the processing time of resource management (such as managing memory) tends to increase with the amount of resources on modern systems. In enterprise file server environments, the use of file-data and file-metadata caches increases in accordance with memory capacity, which also increases cache deallocation time. In particular, if conventional Linux deallocates a large amount of small file-metadata cache, latencies of several seconds to several tens of seconds may occur during file-operation processing. Such long latency degrades usability and causes problems with user applications. Furthermore, if the long latency occurs often, throughput also decreases.

The purpose of our study was to eliminate latency due to deallocation of the metadata cache (hereafter, “metadata cache deallocation latency”). First, the cause of metadata cache deallocation latency was analyzed. The analysis results show that metadata deallocation of conventional Linux results in a performance bottleneck on commodity servers with many processor cores and large memory. A method for metadata cache deallocation, called the “split-reclaim method,” which divides metadata cache deallocation from conventional Linux cache deallocation, was proposed. The effectiveness of the split-reclaim method in comparison with conventional Linux was confirmed experimentally.

## II. PROBLEM OF CONVENTIONAL LINUX CACHE MANAGEMENT

### A. Cache Management of Conventional Linux

Linux divides physical memory into 4 kB pages and assigns them to kernel and user processes. It uses the remaining pages as a cache of accessed data. In enterprise file server environments, most pages are used as a cache of file-data (page cache) and a cache of file-metadata (metadata cache). An overview of cache management by conventional Linux is shown in Fig. 1.

Fig. 1. Overview of conventional Linux cache management.

As for Linux, a page allocator manages pages and assigns them to processes. For small data of less than 4 kB, the slab allocator divides a page into several areas and stores the data as a slab object [6], [7]. The file system stores the page cache directly on pages and stores the metadata cache on the slab objects (metadata objects). In addition, the page deallocation module deallocates unused pages if the number of free pages becomes lower than pre-defined thresholds. There are two kinds of page deallocations, background deallocation and on-demand page deallocation. With background deallocation, kernel daemons (kswapd) deallocate pages as background processes when the number of free pages becomes smaller than the threshold $P_{b}$. If the number of pages decreases further and becomes smaller than $P_{d}$, Linux deallocates pages during page allocation (direct reclaim). With both kswapd and direct reclaim, metadata objects are also deallocated. When a metadata object is deallocated and there are no metadata objects on the same page, Linux deallocates the page.

### B. Problem Concerning Conventional Cache Management in Enterprise File Server Environments

In enterprise file server environments, most of the workload is file access, and both page and metadata caches are often used. Moreover, batch processing, such as differential backup, demands heavy metadata accesses to stored files, whose number ranges from tens of millions to hundreds of millions. Such batch processing uses a much larger amount of metadata cache than servers for other uses. Furthermore, cache use increases with the file operation throughput which increases with the number of processor cores. Because of this large cache use, page allocation speed tends to exceed the deallocation speed of kswapd. As a result, direct reclaim often occurs.

During direct reclaim, Linux delays the processes that are requesting page allocation. Latency due to direct reclaim increases in accordance with page deallocation time because Linux sustains the direct reclaim process until a fixed number of pages are deallocated. Furthermore, page deallocation time depends on the state of memory. In particular, if the metadata cache with relatively small object size increases, page deallocation time tends to increase.

The effect of increasing the amount of metadata cache on deallocation time was experimentally investigated through a case study. In this case study, page deallocation time was evaluated in terms of the accessed file number and memory state. A commodity x86 server with two Intel Xeon 5355 processors (four cores each) and 24-GB memory was used. The number of accessed files started from one hundred thousand and went up to 25 million. First, we measured the page deallocation time after file creation. Then we measured the page deallocation time after random metadata accesses to all created file metadata. The measured deallocation times are plotted as a logarithmic graph in Fig. 2.

Fig. 2. Page deallocation time vs. number of accessed files.

According to the measurements performed after file creation, page deallocation time did not increase substantially even when the number of accessed files increased. On the other hand, in the measurements performed after random metadata accesses, page deallocation time increased dramatically with the number of accessed files. In particular, when the number of accessed files exceeded ten million, the deallocation time became longer than 1 s. This result indicates that the page deallocation time of Linux increases with the number of accessed files and the number of accesses to the metadata cache. In enterprise file server environments, there are many files, and the metadata of these files are usually accessed repeatedly by users and batch processing. As a result, page deallocation time becomes long and causes large metadata cache deallocation latency.

Fig. 3 illustrates how the number of metadata objects that must be deallocated before a page can be freed from the metadata cache changes as files are accessed.

Fig. 3. Metadata cache deallocation behavior.
Fig. 4. Implementation of split-reclaim method.

When there are nine metadata objects stored as four objects per page, the number of metadata objects necessary to deallocate a page becomes four after file creation. However, if there are two metadata accesses, the number of required metadata objects becomes eight, namely, double that after file creation.

As aforementioned, when the number of metadata objects and metadata accesses increase, metadata cache deallocation latency becomes an issue. To eliminate this metadata cache deallocation latency, it is necessary to improve the mechanism of metadata cache deallocation during direct reclaim.

## III. SPLIT-RECLAIM METHOD

### A. Basic Design

To eliminate metadata cache deallocation latency, the “split-reclaim method,” which divides metadata cache deallocation from conventional direct reclaim, is proposed. The split-reclaim method has the following three design features.

#### Abolition of Metadata Cache Deallocation in Direct Reclaim

When metadata objects are dispersed in their LRU list, the metadata cache deallocation time increases and processing time of direct reclaim becomes longer. With the split-reclaim method, direct reclaim is accelerated by abolishing metadata cache deallocation. Furthermore, with the split-reclaim method, kswapd deallocates the metadata cache as before. This is because kswapd is required to adjust the balance of metadata cache and page cache quantities.

If metadata cache deallocation during direct reclaim is abolished, a problem arises when the page allocation speed of the metadata cache exceeds the page deallocation speed of kswapd. In this case, Linux deallocates only the page cache during direct reclaim, and the amount of metadata cache grows unchecked. When the metadata cache occupies the entire memory, Linux fails to perform page allocation during direct reclaim because there are no free pages to allocate. To avoid memory occupation by the metadata cache, an upper limit on the amount of metadata cache is added. When the amount of metadata cache increases beyond this upper limit, Linux deallocates metadata objects during metadata cache allocation. This processing of metadata is called "metadata reclaim." The upper limit of the metadata cache is set according to the metadata cache ratio $P$, which is the maximum ratio of the metadata cache to the entire memory. This ratio is a tuning parameter, and system administrators can set it according to their target workload. If a system administrator sets $P$ to an appropriate value, Linux can avoid page allocation failures in direct reclaim because the amount of metadata cache is held in check by $P$. However, decreasing $P$ also lowers the cache hit rate of the metadata cache, so the setting involves a trade-off.

### B. Implementation

In this implementation of the split-reclaim method, a metadata deallocation module, for monitoring and deallocating metadata cache, is added to the Linux virtual file system layer.

When the metadata deallocation module detects that the number of metadata objects exceeds the threshold $M_{d}$, it activates metadata reclaim. To eliminate the delay caused by metadata reclaim, a background metadata cache deallocation daemon, called "iswapd," is also added. When the number of metadata objects exceeds $M_{b}$, the metadata deallocation module wakes up iswapd in the background, and iswapd deallocates the metadata cache. The values of $M_{d}$ and $M_{b}$ are calculated on the basis of $P$, memory capacity $C$, and metadata object size $O$ with the following formulas:

$$M_{d} = C \times P \div O \tag{1}$$

$$M_{b} = M_{d} \times 0.8 \tag{2}$$

where 0.8 is a fixed coefficient.

Linux has two metadata management structures, called “dentry” and “inode.” Basically, these structures are used as a pair to cope with a single file, except files with hard links. To keep the implementation simple, the number of dentry objects is used as the number of metadata objects, and the sum of dentry size and inode size is used as $O$. Table I lists the cache deallocation processes used in conventional Linux and the split-reclaim method.

TABLE I CACHE DEALLOCATION PROCESSING USED IN CONVENTIONAL LINUX AND SPLIT-RECLAIM METHOD

The detailed implementations of the two new deallocation processes, iswapd and metadata reclaim, are explained below.

#### Background Metadata Cache Deallocation Daemon (iswapd)

iswapd is a kernel daemon and deallocates metadata objects. It does not delay other processes because it runs in the background. It deallocates only a fixed number of metadata objects, so page deallocation is not necessary. If the metadata object deallocation speed of iswapd exceeds the metadata object allocation speed, the number of metadata objects does not reach $M_{d}$. As a result, the metadata deallocation module does not activate metadata reclaim and metadata cache deallocation latency does not occur.

#### Metadata Reclaim

Metadata reclaim takes place during metadata object allocation. When the metadata object allocation speed exceeds the deallocation speed of iswapd and the number of metadata objects becomes higher than $M_{d}$, the metadata deallocation module activates metadata reclaim. If metadata reclaim is activated, Linux deallocates metadata objects when processes allocate new metadata objects. As with iswapd, metadata reclaim deallocates only a fixed number of metadata objects instead of pages. Even if metadata reclaim causes metadata cache deallocation latency, it is much shorter than that due to conventional direct reclaim. Furthermore, page allocation failures in direct reclaim can be avoided because metadata reclaim suppresses the usage of the metadata cache to $P$.

## IV. EVALUATION

### A. Environment

To demonstrate the effectiveness of the split-reclaim method, its performance was compared with that of conventional Linux. A metadata-intensive workload and a general NFS file server workload were used for the comparison.

The evaluation environment is summarized in Table II.

TABLE II EVALUATION ENVIRONMENT

The split-reclaim method was implemented on the latest Linux kernel for this evaluation. In the metadata-intensive workload evaluation, one 8-core server with 24-GB memory was used, and a batch program imitating a periodical differential backup process was executed. In a differential backup process, backup applications issue stat system calls to check the modification time of stored files. In our batch program, the worker threads issue stat system calls to check the modification time of 25.6 million stored files. To imitate the real environment of enterprise file servers, random metadata accesses were executed before the evaluation. In the general NFS workload evaluation, one 32-core server with 64-GB memory was used as the server, and two 8-core servers with 24-GB memory were used as load generators. The SPEC SFS3.0 benchmark [8], whose operation mix is based on traces of real-world NFS servers, was used to imitate the general NFS server workload.

### B. Results

#### 1) Metadata-Access Performance

Response times and throughputs of conventional Linux and the split-reclaim method under the metadata-intensive workload were evaluated. In the evaluation, the metadata cache ratio $P$ was set to 40%, 50%, and 60%, and each setting was evaluated.

The worst response times of the two methods for the stat system call are compared in Fig. 5(a). The worst response time with conventional Linux was longer than 10 s when the number of threads was greater than two. On the other hand, the worst response time with the split-reclaim method was kept within 1 s regardless of the value of $P$ and the number of threads. When 16 threads ran and $P$ was set to 50%, the split-reclaim method reduced latency by more than 95% in comparison with conventional Linux.

Fig. 5. Comparison of latency under metadata-intensive workload. (a) Worst response time comparison. (b) Average response time comparison.

The average response times of the two methods for a stat system call are compared in Fig. 5(b). The average response times of conventional Linux and the split-reclaim method with $P$ set to 60% increased with the number of threads. On the other hand, the average response times of the split-reclaim method with $P$ set to 40% and 50% increased slightly even when sixteen threads ran concurrently. This happened because the page allocation speed of conventional Linux and the split-reclaim method with $P$ set to 60% exceeded the page deallocation speed of kswapd. As a result, direct reclaim often occurred. On the other hand, the page deallocation speed of the split-reclaim method with $P$ set to 40% and 50% exceeded the page allocation speed because the metadata cache was deallocated by iswapd and metadata reclaim beforehand. This result indicates that the split-reclaim method also improves the average response time if $P$ is low enough.

The throughputs achieved with conventional Linux and the split-reclaim method are compared in Fig. 6.

The throughputs of conventional Linux and the split-reclaim method with $P$ set to 60% did not increase when the number of threads was greater than eight. On the other hand, the throughputs of the split-reclaim method with $P$ set to 40% and 50% improved up to sixteen threads. When $P$ was set to 50%, the throughput of the split-reclaim method became three times higher than that of conventional Linux for sixteen threads. This happened because conventional Linux and the split-reclaim method with $P$ set to 60% used direct reclaim often and the throughput decreased. The throughput of the split-reclaim method with $P$ set to 50% was higher than that with $P$ set to 40%. This happened because the amount of metadata cache was proportional to $P$ under a metadata-intensive workload, and the metadata cache hit rate for $P$ set to 50% was higher than that for $P$ set to 40%. This result indicates that the split-reclaim method can improve the metadata-access performance by setting $P$ to an appropriate value. For the environment and workload used in this evaluation, $P$ set to 50% was the most appropriate value.

#### 2) General NFS-Access Performance

Conventional Linux and the split-reclaim method under a general NFS workload were evaluated. Two load generators issued 60 000 file-operation requests per second for 5 min. The occurrence count and processing time of direct reclaim and metadata reclaim were measured. The value of $P$ was set to 50%. The results of the evaluation are listed in Table III.

TABLE III DIRECT RECLAIM OCCURRENCE AND PROCESSING TIME UNDER GENERAL FILE-SERVER WORKLOAD

The split-reclaim method reduced the frequency of direct reclaim by 13% and the average and worst processing times by more than 99%. In addition, because the ratio of page cache to memory capacity was high under the load of SPEC SFS3.0, the usage of metadata cache did not reach $P$, and metadata reclaim did not occur. It is therefore concluded that under a general NFS workload, the split-reclaim method can eliminate metadata cache deallocation latency.

## V. RELATED WORK

There have been many studies on application-oriented OS modifications for many-core environments. For example, Boyd-Wickizer et al. investigated the modifications for mail servers, cache servers, web servers, databases, file indexers, and distributed computing in many-core environments [2]. Salomie et al. also investigated modifications for databases [9]. In this paper, an enterprise file server was chosen as the target application.

As for memory management, OS modifications avoiding cache-access contention in many-core environments have been suggested [4], [5], [11], [12]. Inoue et al. further developed a memory-allocation method at the middleware layer to improve web-server performance [13]. However, as far as the authors know, there have been no studies before the present paper on handling metadata cache deallocation latency in environments with a large metadata cache.

In the Linux community, many improvements to the scalability of many-core environments and to memory management have been developed [14], [15], [16], [17], [18], [19]. These improvements are the premise of this study; the latest kernel containing them was used as conventional Linux in our experiments.

As for OSs other than Linux, FreeBSD [20] and Solaris [21] use a different cache management mechanism, which assigns a fixed amount of memory to the metadata cache. In these OS environments, delay in metadata cache deallocation does not become a serious problem. On the other hand, unlike Linux, FreeBSD and Solaris cannot dynamically increase the amount of metadata cache according to the workload. The proposed method thus combines the advantages of both approaches.

## VI. CONCLUSION

A method, called the "split-reclaim method," for eliminating metadata cache deallocation latency was proposed. It was found that the cache management mechanism of conventional Linux causes a large latency if the number of metadata objects is large and the metadata objects are dispersed in their LRU list. Accordingly, the split-reclaim method separates metadata cache deallocation from conventional Linux cache deallocation. It was experimentally found that the split-reclaim method can reduce the worst response time by more than 95% and achieve three times higher throughput in comparison with conventional Linux under a metadata-intensive workload. In addition, under a general NFS workload, the split-reclaim method can reduce the processing time of direct reclaim by more than 99%. These results indicate that the split-reclaim method can eliminate metadata cache deallocation latency and makes it possible to use commodity servers as enterprise file servers.

## Footnotes

Corresponding author: T. Fukatani (e-mail: takayuki.fukatani.re@hitachi.com).
