An Empirical Performance Evaluation of Transactional Solid-State Drives

Solid-state drives (SSDs) have accelerated the architectural evolution of storage systems through characteristics, such as out-of-place updates, that distinguish them from hard disk drives (HDDs). The out-of-place update of SSDs naturally supports the transaction mechanism that systems commonly use to provide crash consistency. Thus, transactional functionality has recently been implemented inside SSDs. However, this approach must be re-evaluated on enterprise storage with a standard interface to investigate its benefits in a more realistic and standard fashion. In this article, we explore the implications and challenges of transactional SSDs through a range of experiments. To evaluate the potential benefit of transactional SSDs, we design and implement transactional functionality in a Samsung enterprise-class, SATA-based SSD (i.e., SM843TN); we call the result TxSSD. We modify local file systems (i.e., ext4 and btrfs) and a distributed parallel file system (i.e., Lustre) to utilize TxSSDs. Our modified file systems with TxSSDs provide crash consistency without redundant writes. We evaluate our file systems using multiple micro and macro benchmarks. We analyze the performance results and demonstrate that supporting transactional functionality inside the SSD can incur an overhead.


I. INTRODUCTION
Flash memory is widely used for storage devices from single to large-scale high performance systems since it provides lower latency, lower power consumption, and higher throughput than hard disk drives (HDDs) [5], [43]. In addition to the advantages, as the cost per byte is falling while the storage capacity is increasing, large-capacity flash memory devices are more commonly employed for high-end desktops and enterprise storage servers. Large-scale datacenters host many simultaneously running applications, cater to many millions of active users, and service billions of transactions daily. Thus, the load on the storage systems in these datacenters has been enormously increasing. Flash-based solid-state drives (SSDs) are attractive solutions to meet these performance demands for large-scale datacenters [8], [21].
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
To match the required load, the storage systems must be designed to scale in performance, and clustering techniques are used to provide scalability. For example, distributed parallel file systems assemble the cluster elements into one large and seamless storage system. The file system ensures that all clients have a consistent view of the file system by handling the locations of the files and transmission of their data. The file system distributes blocks in a file to different storage locations by using a network protocol to provide high scalability. This approach enhances the scalability of the storage systems and provides higher performance to clients. However, scalability and performance can be negatively affected by providing crash consistency to ensure that client data is recovered consistently from a system crash. Thus, the file systems have considered the trade-off between performance and crash consistency to provide better-quality service. Generally, a higher level of consistency makes the file systems more consistent but negatively affects the performance and endurance of flash-based SSDs.
Most distributed parallel file systems, including Lustre [38], Ceph [46], Gluster [10], and HDFS [4], [34], [40], rely on local file systems to support crash consistency. Most local file systems provide crash consistency to applications by using journaling or copy-on-write (CoW) techniques. Journaling file systems, such as ext4 [26], XFS [44], ReiserFS [13], JFS [15], and NTFS [9], provide transaction processing for atomicity and durability by using write-ahead logging (WAL) [12]. These file systems write data and metadata to the journal area before writing them to their original locations. When a hardware or software failure occurs, the file systems recover the data and metadata by replaying the written journal. CoW file systems (e.g., btrfs [36], LFS [37], and ZFS [7]) also provide transaction processing by using out-of-place update techniques. When data and metadata are written, CoW file systems copy and modify them, and flush the new versions when a transaction commits. They leave older versions of the data and metadata on the storage medium until garbage collection removes them. However, although these techniques provide crash consistency, they reduce I/O performance since the data and metadata are written twice. Escaping this trade-off between crash consistency and performance is a challenge.
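The double write that journaling implies can be illustrated with a toy model. The sketch below is not ext4 or JBD2 code; the class `JournaledStore` and its method names are hypothetical, and it only counts block writes to show that each committed block reaches the medium twice (once in the journal, once at its original location).

```python
# Toy model of write-ahead journaling (hypothetical names, not JBD2 code).
class JournaledStore:
    def __init__(self):
        self.journal = []   # committed transactions, each a dict block -> value
        self.home = {}      # blocks at their original locations
        self.writes = 0     # block writes issued to the storage medium

    def commit(self, updates):
        """First write: log the updated blocks in the journal area."""
        self.journal.append(dict(updates))
        self.writes += len(updates)

    def checkpoint(self):
        """Second write: copy journaled blocks to their original locations."""
        for tx in self.journal:
            for block, value in tx.items():
                self.home[block] = value
                self.writes += 1
        self.journal.clear()

    def recover(self):
        """After a crash, replay every committed transaction in the journal."""
        self.checkpoint()


store = JournaledStore()
store.commit({"inode": "v2", "data": "hello"})
# Assume a crash here: nothing has reached the original locations yet.
store.recover()
print(store.home, store.writes)
```

In this model, two logical block updates cost four block writes, which is the redundancy that transactional SSDs aim to remove.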
To provide crash consistency without sacrificing performance, previous studies [18], [30], [33] support transaction functionality inside an SSD by using its characteristic (i.e., out-of-place update). They offload the burden of guaranteeing the transactional atomicity from a host system to flash-based SSDs. Since flash-based SSDs do not allow any page to be overwritten in place, a page update leaves the existing page intact and writes the new content into a clean page at another location [18]. This copy-on-write strategy is adopted by most flash-based SSDs. Since the flash-based SSDs internally perform out-of-place updates, transaction processing inside SSDs can alleviate the trade-off between performance and consistency. This article is in line with these previous studies [18], [30], [33] in terms of the study of transactional SSDs. In contrast, we perform extensive performance studies for a transactional SSD which is both enterprise-class and SATA-based for a more realistic and standard fashion.
In this article, we introduce TxSSD, an enterprise-class transactional SSD with a SATA interface, and evaluate it with diverse file systems and workloads. TxSSD supports transactional functionality inside the SSD by exploiting the nature of flash memory. To do this, we modify the flash translation layer (FTL) of a Samsung SM843TN SSD, which is widely used in datacenters due to its low latency and high throughput, and explain how we design and implement the transactional functionality in it. To make file systems exploit TxSSD, we modify existing local file systems (i.e., ext4 and btrfs) and evaluate them using file I/O and online transaction processing (OLTP) workloads. Furthermore, we modify a distributed parallel file system (i.e., Lustre) and evaluate it using HPC workloads in a cluster system. TxSSD-aware file systems preserve crash consistency and retain the transaction models of the existing file systems without redundant writes for metadata and data. With TxSSD-aware file systems, legacy and new applications benefit transparently, without any modification. Using the experimental results, we demonstrate that TxSSD can incur an overhead from supporting the transactional functionality, and we identify and analyze the reason for the overhead. In our previous work [41], we focused on local file systems on a single TxSSD. This article extends our scheme to distributed file systems in a cluster environment. To the best of our knowledge, this is the first study that provides a performance evaluation of a distributed parallel file system as well as local file systems on enterprise-level transactional SSDs with a SATA-based interface.
The main contributions from this study are as follows:
• We design and implement the transactional functionality in an enterprise-class and SATA-based SSD called TxSSD.
• We modify local file systems and a distributed parallel file system to utilize the benefit of TxSSD.
• We show an empirical evaluation from our comprehensive performance study using diverse workloads.
• We show that transaction support in enterprise-class and SATA-based SSDs can incur an overhead and analyze the reason for the overhead.
The rest of this article is organized as follows. Section II describes the background and motivation. Section III explains the design and implementation. Section IV shows the experimental results. Section V discusses related work. Section VI discusses a summary and implications of our study. Finally, Section VII concludes this article.

II. BACKGROUND AND MOTIVATION
A. CRASH CONSISTENCY IN LOCAL FILE SYSTEMS
Local file systems such as ext4 [26], XFS [44], f2fs [20], and btrfs [36] provide crash consistency by using a variant of write-ahead logging [12] or a copy-on-write scheme. Among these file systems, ext4 [26] is the most widely used in Linux and is more general-purpose than the others. Ext4 uses a fork of the journaling block device (JBD) called JBD2, which is a variant of write-ahead logging [12].
Ext4 supports three journaling modes: writeback, ordered, and data journaling [16], [45]. Each mode offers a different consistency level and, accordingly, different performance. The writeback mode supports transaction processing only for metadata writes and does not preserve the write order between the metadata and data. The ordered mode (default) also supports transaction processing only for metadata writes but provides stronger consistency than the writeback mode by preserving the write order between the metadata and data: data is written to its original location before the metadata is written to the journal location. The data journaling mode performs transaction processing for both metadata and data. It writes both metadata and data to the journal location before they are written to their original locations. This mode supports the strongest consistency with data integrity (crash consistency). However, it shows the lowest performance because of redundant data writes.
Btrfs [36] is a CoW file system. It allows atomic transactions without a separate journal. Btrfs maintains B-trees for both data and metadata and supports two modes: datacow and nodatacow [36]. The datacow mode (default) performs out-of-place updates for the data and metadata by creating a new version of an extent or a page at a different location, which prevents partial updates on power failure. This mode performs garbage collection for both data and metadata. Meanwhile, the nodatacow mode performs out-of-place updates and garbage collection only for metadata. This yields higher performance than the default mode, but it does not provide crash consistency because data integrity is not guaranteed.

B. CRASH CONSISTENCY IN DISTRIBUTED PARALLEL FILE SYSTEMS
Distributed parallel file systems, including Ceph, Gluster, and Lustre, distribute file data across multiple servers and support concurrent access by multiple tasks of a parallel application. Among the many distributed parallel file systems, Lustre is generally used for large-scale cluster computing, such as scientific supercomputing and industry. It can be part of clusters with tens of thousands of client nodes, tens of petabytes of storage on hundreds of servers, and a terabyte per second of aggregate I/O throughput. This makes Lustre a popular choice for large datacenters. Thus, the Lustre file system has attracted increasing attention from research and industry communities.
Traditionally, a Lustre file system has three major functional units: the metadata server (MDS), the object storage server (OSS), and the client. The Lustre file system includes one or more MDSs, each with one or more metadata targets (MDTs) that store namespace metadata, such as filenames, directories, access permissions, and file layout. An MDT also has a dedicated file system that controls file access and notifies clients of the layout of the objects that make up each file. The OSS provides file I/O service and network request handling for one or more object storage targets (OSTs) that store file data. The number of objects per file is configurable by the user and can be tuned to optimize the performance for a given workload. An OST has a dedicated file system that exports an interface for read/write operations. OSTs and MDTs use a variant of ext4 called ldiskfs, which has the same journaling mechanism as ext4.
Lustre presents all Lustre clients with a unified namespace for all of the files and data in the file system using standard POSIX semantics. Lustre allows concurrent and coherent read and write access to the files. When a client tries to read from or write to a file, it performs a filename lookup on the MDS and fetches the file layout from the MDT object for the file. The file layout is stored in an MDT, identified by the file identifier (FID), and contains information about where the file data is located on the OST(s). The client then uses this information to perform I/O on the file, directly interacting with the OSS nodes where the objects are stored.

III. DESIGN AND IMPLEMENTATION
A. TRANSACTIONAL SOLID-STATE DRIVE
In this section, we explain the design and implementation of transactional functionality inside an SSD (called TxSSD). We exploit the out-of-place updates of SSDs while considering the features of enterprise-class, SATA-based SSDs. We also use the supercapacitor-backed buffer memory inside the device and choose data structures that are efficient for an SSD. To support the transactional functionality, TxSSD leverages multiple table-based data structures in the FTL to store transaction information. TxSSD uses one table per transaction to avoid the overhead of searching a single table that holds the mapping information of all transactions. Figure 1 depicts the FTL mapping tables and transactions inside TxSSD. The FTL in TxSSD has 32 transaction tables (called tx tables), one per transaction ID (txID), as shown in Figure 1(a). The maximum number of concurrent transactions is limited to 32 due to the restricted size of the memory inside the SSD. Each table supports 60,000 entries, and each entry takes 20 bytes to hold a logical page number (LPN) and a physical page number (PPN). The tx tables therefore require 36.6 MiB in total. The number of tx tables (32) is sufficient to process the transactions of file systems, because most file systems, such as ext4, btrfs, and XFS, use a single compound transaction scheme that allows at most two concurrent transactions: a running transaction and a committing transaction [32], [45]. Thus, under existing file systems, the number of tx tables and txIDs is large enough that transaction overflow never occurs. If a file system supports multiple transactions, TxSSD currently supports up to 32 of them concurrently. If more than 32 concurrent transactions are required, the number of tx tables and txIDs can be increased. In addition, because the number of operations per transaction is restricted to 60,000, a file system on top of TxSSD has to commit a transaction before the number of operations reaches this limit.
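The 36.6 MiB figure follows directly from the table dimensions given above. A short check, using only the numbers stated in the text (the constant names are ours):

```python
# Memory needed for the tx tables, from the dimensions given in the text.
TX_TABLES = 32              # one tx table per txID
ENTRIES_PER_TABLE = 60_000  # maximum operations per transaction
ENTRY_BYTES = 20            # one LPN/PPN mapping entry

total_bytes = TX_TABLES * ENTRIES_PER_TABLE * ENTRY_BYTES
total_mib = total_bytes / 2**20

print(f"{total_bytes} bytes = {total_mib:.1f} MiB")
```

The product is 38,400,000 bytes, which rounds to 36.6 MiB.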
We design and implement TxSSD based on the SM843TN SSD. To enable transactional functionality, we do not add new SATA commands; instead, we employ the reserved count field (6 bits) and the NCQ tag field (5 bits) in the existing commands. Table 1 depicts the transactional operations supported by TxSSD. We use the first 5 bits of the reserved count field for the txID (called the txID field) and the last bit for the commit flag (called the commit field). For transactional read and write operations, we store the transaction ID of the file system in the txID field of the SATA read and write commands. For the commit operation, we use both the txID field and the commit field of the SATA write command. We extend the SATA flush command: we assign 1 and 2 to the NCQ tag field (see footnote 1) for the abort and recovery operations, respectively, and set the txID field of the flush command.
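The field layout described above can be sketched as simple bit packing. This is an illustration only: the text does not specify bit ordering, so the sketch assumes that the "first 5 bits" are the upper five bits of the 6-bit field and that the commit flag is the least significant bit. The function and constant names are hypothetical.

```python
# Sketch of the 6-bit reserved count field: 5-bit txID + 1-bit commit flag.
# ASSUMPTION: txID occupies the upper five bits, commit flag is bit 0.
TXID_BITS = 5
MAX_TXID = (1 << TXID_BITS) - 1   # 31, so 32 distinct txIDs

# Values carried in the NCQ tag field of the extended flush command (from the text).
NCQ_TAG_ABORT = 1
NCQ_TAG_RECOVERY = 2


def pack_reserved(txid: int, commit: bool) -> int:
    """Encode a txID and commit flag into the 6-bit reserved count field."""
    if not 0 <= txid <= MAX_TXID:
        raise ValueError(f"txID must be in 0..{MAX_TXID}")
    return (txid << 1) | int(bool(commit))


def unpack_reserved(field: int) -> tuple[int, bool]:
    """Decode the 6-bit field back into (txID, commit flag)."""
    return field >> 1, bool(field & 1)


# A plain transactional write for txID 3, then the commit write for txID 3.
plain = pack_reserved(3, commit=False)
commit = pack_reserved(3, commit=True)
print(plain, commit, unpack_reserved(commit))
```

Five bits give exactly 32 txIDs, which matches the number of tx tables in the FTL.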
When a write operation is performed with a new txID, TxSSD stores the mapping information between the LPN and PPN in the tx table. However, TxSSD delays propagating the mapping information to the FTL table. If a page requested by the write operation already exists, TxSSD does not perform garbage collection for the existing page until the commit request arrives. When updating the transaction entries, we choose an append operation instead of an in-place update. This is because an in-place update can incur an overhead due to the memory access time spent searching for the appropriate entries (see footnote 2). Thus, if there is more than one update for the identical LPN, TxSSD just appends the mapping information at the end of the tx table instead of searching for and updating the previous entry, as shown in the example of the tx1 table in Figure 1(a). For a read operation, when TxSSD receives a read request with a txID, it finds and returns the page according to the corresponding LPN in the tx table. If there are several entries for the LPN, TxSSD returns the most recently updated page for consistency. If a requested LPN cannot be found in the tx table, TxSSD returns the page in the FTL mapping table.
As shown in Figure 1(b), TxSSD propagates the entries in the tx table to the FTL mapping table after receiving a commit request. TxSSD performs garbage collection for the old pages associated with the tx table after the commit operation is completed. Unlike previous approaches [18], [27], in our scheme, all entries in the tx table can be remapped without checking whether the entries are active or committed, since TxSSD keeps a separate tx table per transaction. During the commit procedure, if there are multiple entries for an identical LPN in the tx table, the mapping information of the last written entry is applied for remapping.
Footnote 1: The NCQ tag field is not used for the flush operation of the file system.
Footnote 2: The memory access time for 4 bytes inside SSD is single-digit microseconds.
When a power outage occurs during the commit procedure, all information is preserved using the supercapacitor. TxSSD applies the mapping information in the tx table to the FTL mapping table when the commit request is received at TxSSD. In an abort operation, TxSSD discards all entries in the given tx table, enabling garbage collection for the pages associated with the transaction. The recovery operation discards all uncommitted mapping information in all tx tables after a power outage.
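The write, read, commit, abort, and recovery behavior described in this subsection can be summarized in a small in-memory model. This is a minimal sketch under stated assumptions, not the firmware: the class `TxFtl`, its method names, and the sequential PPN allocator are hypothetical, and garbage collection is reduced to returning the list of PPNs that become reclaimable.

```python
# Minimal model of per-transaction tx tables in an FTL (hypothetical names).
class TxFtl:
    MAX_TX = 32            # number of tx tables / txIDs (from the text)
    MAX_ENTRIES = 60_000   # entries per tx table (from the text)

    def __init__(self):
        self.ftl = {}       # committed mapping: LPN -> PPN
        self.tx = {}        # txID -> append-only list of (LPN, PPN)
        self.next_ppn = 0   # ASSUMPTION: clean pages are handed out sequentially

    def write(self, txid, lpn):
        """Out-of-place write: record the new mapping only in the tx table."""
        if not 0 <= txid < self.MAX_TX:
            raise ValueError("txID out of range")
        table = self.tx.setdefault(txid, [])
        if len(table) >= self.MAX_ENTRIES:
            raise RuntimeError("tx table full; the host must commit first")
        ppn = self.next_ppn
        self.next_ppn += 1
        table.append((lpn, ppn))   # append, never search-and-update in place
        return ppn

    def read(self, txid, lpn):
        """Return the newest mapping in the tx table, else the committed one."""
        for entry_lpn, ppn in reversed(self.tx.get(txid, [])):
            if entry_lpn == lpn:
                return ppn
        return self.ftl.get(lpn)

    def commit(self, txid):
        """Propagate the tx table to the FTL; the last entry per LPN wins."""
        reclaimable = []
        for lpn, ppn in self.tx.pop(txid, []):
            old = self.ftl.get(lpn)
            if old is not None:
                reclaimable.append(old)   # old page may now be garbage collected
            self.ftl[lpn] = ppn
        return reclaimable

    def abort(self, txid):
        """Discard the tx table; its pages become reclaimable."""
        return [ppn for _, ppn in self.tx.pop(txid, [])]

    def recover(self):
        """After a power outage, drop every uncommitted tx table."""
        reclaimable = []
        for txid in list(self.tx):
            reclaimable += self.abort(txid)
        return reclaimable


ssd = TxFtl()
ssd.write(0, lpn=7)
ssd.commit(0)                  # committed: LPN 7 -> PPN 0
ssd.write(1, lpn=7)            # PPN 1, visible only inside tx 1
ssd.write(1, lpn=7)            # PPN 2, appended after the first entry
print(ssd.read(1, 7), ssd.read(2, 7))   # tx 1 sees PPN 2, others see PPN 0
print(ssd.commit(1), ssd.ftl)
```

Because each transaction owns its table, commit is a single pass over that table with no per-entry state check, which is the property the text contrasts with earlier approaches.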

B. ENHANCED EXT4 (E2XT4)
Our enhanced ext4 (e2xt4) file system is based on the transaction model (i.e., single compound transaction) and the ordered mode of the existing ext4. However, e2xt4 does not perform redundant metadata writes to the SSD. For metadata, e2xt4 redirects the journaled metadata to its original location and disables checkpointing. When a transaction commits, e2xt4 performs transactional write operations for the data and metadata. To do this, when the data and metadata are transferred to storage, e2xt4 obtains the current transaction ID from the existing journaling scheme, remaps it to an unused txID of TxSSD, and transfers the ID through I/O descriptors such as the BIO structure. Then, e2xt4 issues a flush command (footnote 3); however, because of the supercapacitors, TxSSD may not flush all pages from the device buffer to the flash memory (footnote 4). Finally, e2xt4 creates a commit block and writes it to TxSSD.
When e2xt4 writes the commit block, it sends the block to TxSSD with the txID and a commit flag (a transactional write operation). When TxSSD receives the commit block, it applies the mapping information in the tx table to the FTL mapping table. TxSSD writes all pages to the flash memory according to its policy. After the commit operation, TxSSD empties the tx table and allows garbage collection for the old pages. Before the commit operation, if the written data is not in the page cache or a direct I/O is performed, transactional read operations are performed. E2xt4 reads the data by referencing the tx table because the mapping information associated with the data exists in the tx table. If an I/O operation fails, e2xt4 performs an abort operation for the txID.
When a system failure or a power outage occurs, TxSSD performs an emergency save of all information, including the tx tables and the FTL mapping table, to an extra area of the flash memory called the safe region. If the last entry in the tx table is a commit block, TxSSD remaps all mapping information of the tx table before shutdown. When a recovery operation (i.e., mount procedure) is performed, TxSSD discards all mapping information in the uncommitted tx tables.
Figure 2 shows an example of an fsync call and the commit procedure after write calls in e2xt4 on top of TxSSD. As shown in the figure, an application updates the data (pages D1, D2, and D3), which is written to the page cache in the file system. The LPNs of D1, D2, and D3 are 2, 3, and 4, respectively. E2xt4 also updates the metadata page (M) with LPN 5. The old pages for D1, D2, D3, M, and the commit block (C) are stored in TxSSD. From the FTL mapping table in Figure 2(a), we can see that the PPNs of the old pages for D1, D2, D3, M, and C are 0, 1, 2, 3, and 4, respectively. If a running transaction exists, the write operations are merged into the current transaction; otherwise, e2xt4 starts a new transaction. After the updates, the application calls fsync to permanently store the data and metadata by committing the transaction. When the write operations are performed by a page flusher or the fsync call, e2xt4 either assigns the updated data and metadata pages in the page cache to a transaction with a newly created txID or merges them into an existing transaction with an existing txID.
Then, e2xt4 writes the data and metadata pages (D1, D2, D3, and M) with the transactional write operation. When TxSSD receives the pages, it creates LPN/PPN mappings for the pages in the tx table of the device. In the tx table of Figure 2(b), the PPNs for the D1, D2, D3, and M pages are 5, 6, 7, and 8, respectively. Then, e2xt4 issues a flush command, creates a commit block (the C page in Figure 2(c)), and writes it to TxSSD. To write the commit block, e2xt4 performs a transactional write operation with the txID and the commit flag. When TxSSD receives the page for the commit block (the PPN for the C page is now 9), TxSSD propagates the LPN/PPN mapping information in the tx table to the FTL mapping table. Later, TxSSD stores all pages (D1, D2, D3, M, and C) to the flash memory according to the write policy of TxSSD. After that, TxSSD empties the tx table and permits garbage collection for the old pages. From the description above, we see that the write traffic of e2xt4 is approximately the same as that of ext4 without journaling, but e2xt4 provides strong consistency with data integrity, as the data journaling mode does (crash consistency).
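The Figure 2 walkthrough can be traced with plain dictionaries. The LPNs and PPNs for D1, D2, D3, and M and the PPNs for C come from the text; the LPN of the commit block is not stated, so the value `C_LPN = 6` below is a hypothetical placeholder chosen only for illustration.

```python
# Trace of the Figure 2 example. C_LPN is a HYPOTHETICAL value; the text
# gives the PPNs of the commit block (4, then 9) but not its LPN.
C_LPN = 6

# Figure 2(a): committed FTL mapping before the transaction (LPN -> PPN).
ftl = {2: 0, 3: 1, 4: 2, 5: 3, C_LPN: 4}

# Figure 2(b): transactional writes of D1, D2, D3, and M land in the tx table.
tx_table = [(2, 5), (3, 6), (4, 7), (5, 8)]

# Figure 2(c): the commit block is written with the txID and the commit flag.
tx_table.append((C_LPN, 9))
pages_written = len(tx_table)

# Commit: propagate the tx table to the FTL and collect the stale pages.
reclaimable = []
for lpn, ppn in tx_table:
    reclaimable.append(ftl[lpn])
    ftl[lpn] = ppn
tx_table.clear()

print(ftl)
print(reclaimable, pages_written)
```

Each of the five pages is written exactly once, and the five old pages become eligible for garbage collection only after the commit.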

C. ENHANCED BTRFS (EBTRFS)
Enhanced btrfs (ebtrfs) is based on the nodatacow mode and the transaction model of btrfs. Unlike the nodatacow mode, however, ebtrfs does not perform copy-on-write for metadata either; it handles metadata the way the nodatacow mode handles data. Ebtrfs writes both metadata and data to their original locations without garbage collection. The commit procedure in ebtrfs is similar to that of e2xt4: ebtrfs maintains txIDs and keeps the order between the metadata and data updates. When ebtrfs performs a commit operation, it writes metadata and data to TxSSD as transactional operations. Then, ebtrfs issues a flush command and writes the superblock as a commit block with a commit flag. The other operations of ebtrfs (e.g., read, abort, and recovery) are similar to those of e2xt4.

D. ENHANCED LUSTRE (ELUSTRE)
Our elustre is based on the ordered mode and the transaction model of ldiskfs. However, elustre does not perform redundant write operations for metadata to SSD in both MDT and OST. Thus, elustre supports transactions and crash consistency without any redundant writes for both metadata and data like e2xt4 and ebtrfs. The metadata managed by MDT contains the layout about the OSTs on which the file objects are located. This layout information is stored in the extended attribute (xattr) section of the inode in the MDT. Elustre processes the transactions for the metadata stored in the MDT and both metadata and data stored in the OST. The metadata operations in the MDT and the OST are similar to those of e2xt4. We write the metadata in a transaction to its original location and disable the checkpoint operation. When a commit operation for the transaction occurs, the metadata gets associated with the transaction by using the txID and written according to the transaction table.
For the data operations in the OST, we allocate a txID to an object. When a new transaction is generated, the transaction associates the object to be written to storage with the txID. Thus, the objects associated with the transaction are written to storage, and the transaction table related to the txID is updated, before their metadata is written to storage, as in the ordered mode of ext4. Meanwhile, the metadata and data are processed as one transaction by using the same txID and a commit block for the transaction.
Figure 3 shows an example of a write operation in elustre. As shown in the figure, each MDT and OST has a TxSSD-aware file system and a TxSSD. When a client writes an object (object A) for a file, the client looks up the file and creates a transaction (tx1) in the MDT. The TxSSD-aware file system in the MDT allocates a txID (e.g., txID: 1) and associates the txID with the modified metadata of the file. Then, the MDT returns the layout of the file to the client. The client gets the layout information of the file, requests the write operation to OST1, and creates a transaction in OST1. The TxSSD-aware file system in the OST allocates a txID (e.g., txID: 1) and associates the txID with the modified metadata and data. The modified metadata in the MDT and the modified metadata and data in the OST are flushed to the transactional SSDs with the commit blocks when the transactions are committed.
As shown in the figure, the modified metadata and the commit block in the MDT are included in a transaction; when the two blocks are written to TxSSD, the LPNs/PPNs of the two blocks are mapped to 1/11 and 2/21, respectively, and written in the transaction table. When the transaction commits, the FTL mapping table is modified. This procedure is also performed in the OST. In the OST, the LPNs/PPNs of the three blocks are mapped to 1/11, 2/21, and 3/31, respectively, and written in the transaction table. Similar to the case of the MDT, the FTL mapping table is modified when the transaction commits. Consequently, our elustre supports the transaction processing for both metadata and data without redundant writes by using the transactional SSDs.

E. IMPLEMENTATION
To enable the enhanced file systems to work with TxSSD, we create a table with 32 entries that maps the transaction IDs of the file systems to the txIDs of TxSSD. The existing block I/O subsystem cannot issue the transactional operations of TxSSD since the subsystem cannot access the txID number, commit flag, and recovery/abort flag. Thus, we add three fields to four descriptors (struct bio, struct request, struct scsi_cmnd, and struct ata_queued_cmd) in the block I/O subsystem. Additionally, we modify the block I/O subsystem to pass the added fields from its upper layer to its lower layer. The modified lines of code in e2xt4, ebtrfs, elustre, and the block I/O subsystem are 193, 202, 214, and 109, respectively. This demonstrates that our scheme requires only small modifications and can therefore be easily applied to other file systems.
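The 32-entry mapping table between file-system transaction IDs and TxSSD txIDs could look like the following. This is a sketch under assumptions: the text states only that such a table exists, so the class `TxIdMap`, the first-free-slot allocation policy, and the release-on-commit behavior are all hypothetical.

```python
# Sketch of a 32-entry table mapping file-system transaction IDs to txIDs.
# ASSUMPTION: slots are allocated first-free and released after commit/abort.
class TxIdMap:
    SLOTS = 32   # number of txIDs supported by TxSSD (from the text)

    def __init__(self):
        self.slot_owner = [None] * self.SLOTS   # txID -> file-system tid
        self.tid_to_slot = {}                   # file-system tid -> txID

    def acquire(self, fs_tid):
        """Return the txID for fs_tid, allocating an unused one if needed."""
        if fs_tid in self.tid_to_slot:
            return self.tid_to_slot[fs_tid]
        for slot, owner in enumerate(self.slot_owner):
            if owner is None:
                self.slot_owner[slot] = fs_tid
                self.tid_to_slot[fs_tid] = slot
                return slot
        raise RuntimeError("all 32 txIDs are in use")

    def release(self, fs_tid):
        """Free the txID once the transaction has committed or aborted."""
        slot = self.tid_to_slot.pop(fs_tid)
        self.slot_owner[slot] = None


txmap = TxIdMap()
running = txmap.acquire(1001)      # running transaction
committing = txmap.acquire(1002)   # committing transaction
txmap.release(1002)                # commit finished; txID can be reused
print(running, committing, txmap.acquire(1003))
```

With a single compound transaction scheme, at most two slots are occupied at any time, so the 32-slot limit is never reached in this usage.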

IV. PERFORMANCE EVALUATION
A. EXPERIMENTAL SETUP
To evaluate local file systems, we use a machine, which has two Intel Xeon CPU E5-2670 (2.6 GHz) (total 16 physical cores), 8 GiB DRAM, SATA 3 interface, and Linux 3.14.3. A separate client machine is used for the OLTP evaluation. The machine has two Intel Xeon CPU E7-8837 (2.67 GHz) with 16 physical cores each (64 cores in total with hyperthreading). For TxSSD, we used SM843TN developed by Samsung. It has a capacity of 240 GiB and is designed for high-performance servers and storage in demanding datacenters by providing a powerful controller and power-loss protection. For comparison, we use unmodified SM843TN SSD with ext4 and btrfs and TxSSD with e2xt4 and ebtrfs. We use the FIO benchmark [14] to measure the file I/O performance and the OLTP benchmark [19] to measure the database performance (i.e., transactions per minute (TPM)).
To evaluate the Lustre file system, we use a cluster system that consists of 8 identical machines connected by a network. Each machine has an Intel Core CPU i7-4790 (3.60 GHz) with 4 physical cores (8 cores in total with hyperthreading), 32 GiB DRAM, a SATA 3 interface, and a 10 GbE network card. We configure one node as an MDS with one MDT backed by a TxSSD, six nodes as OSSs, each with one OST backed by a TxSSD, and one node as a client. All servers run CentOS 7 with a Linux 3.10 kernel patched for Lustre. To measure the performance, we use the IOR benchmark [1] and mdtest [2], which are data-intensive and metadata-intensive workloads, respectively. All experimental results report the average value of ten runs.

B. LOCAL FILE SYSTEM PERFORMANCE
1) FILE I/O PERFORMANCE
We present the performance evaluation of our enhanced local file systems on TxSSD using the FIO benchmark [14]. We configure FIO to perform random write operations using 32 threads. Each thread writes a 1 GiB file with a 4 KiB request size and a varying number of writes per fsync call.

a: ENHANCED EXT4
We compare e2xt4 with the ext4 file system in three journaling configurations: data journaling mode, ordered mode, and journal off mode. We measure the bandwidth, the total amount of writes, and the runtime, as shown in Figure 4. The performance of the ordered mode is similar to that of the journal off mode even though the ordered mode has additional overheads such as redundant metadata writes and flush commands. This is because the workload is data-intensive and the SSD with a supercapacitor mostly returns to the host without flushing the data in the device cache, resulting in almost no flushing overhead. E2xt4 achieves a bandwidth of 137 MiB/s, 1.46X that of ext4 in the data journaling mode. However, the bandwidth of e2xt4 is 10.5% and 13.5% lower than those of the ordered and journal off modes, respectively. This indicates the presence of transactional support overhead, such as remapping transaction tables in TxSSD.
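The relative figures above imply approximate absolute bandwidths for the three ext4 configurations. The calculation below is a back-of-the-envelope derivation, not reported data: it assumes that 137 MiB/s is the e2xt4 bandwidth and that the percentages are relative to each ext4 mode. The variable names are ours.

```python
# Bandwidths implied by the ratios in the text (derived estimates, not
# measurements). ASSUMPTION: 137 MiB/s is the absolute e2xt4 bandwidth.
e2xt4_bw = 137.0   # MiB/s

data_journal_bw = e2xt4_bw / 1.46         # e2xt4 is 1.46X data journaling
ordered_bw = e2xt4_bw / (1 - 0.105)       # e2xt4 is 10.5% below ordered
journal_off_bw = e2xt4_bw / (1 - 0.135)   # e2xt4 is 13.5% below journal off

for name, bw in [("data journaling", data_journal_bw),
                 ("ordered", ordered_bw),
                 ("journal off", journal_off_bw),
                 ("e2xt4", e2xt4_bw)]:
    print(f"{name:16s} {bw:6.1f} MiB/s")
```

Under these assumptions, e2xt4 sits well above data journaling and within roughly 15 to 20 MiB/s of the two modes that do not protect data.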
In terms of the total amount of writes, Figure 4(b) shows that the data journaling mode writes 72.9 GiB, whereas the ordered and journal off modes write approximately 34.2 GiB. This indicates that the amount of metadata journaling and its overhead are almost negligible compared to data journaling. As expected, e2xt4 writes almost the same total amount of data as the journal off mode. In terms of runtime, Figure 4(c) shows that the data journaling mode takes 1.66X and 1.70X more time than the ordered and journal off modes, respectively. The runtime of e2xt4 is 10.8% longer than that of the ordered mode, which is consistent with the bandwidth results in Figure 4(a). These results show that providing crash consistency in the existing ext4 carries a significant performance trade-off in terms of bandwidth, write amplification, and runtime. Meanwhile, e2xt4 shows a relatively small overhead while keeping strong consistency with data integrity.
Figure 6 shows the FIO performance under different fsync parameters. The fsync parameter specifies how many I/O operations are performed before an fsync is issued [14]. An fsync operation flushes the dirty metadata and data; thus, a higher fsync frequency results in lower performance due to high I/O overhead. As shown in the figure, performance drops by 9.4%, 10.1%, 8.2%, and 11.0% as the fsync parameter is reduced from 100 to 10 in the data journaling, ordered, and journal off modes and in e2xt4, respectively. As the number of write operations per fsync is reduced from 10 to 1, we observe performance reductions of 28.1%, 34.5%, 37.6%, and 35.3%. E2xt4 delivers 1.50X, 1.51X, 1.47X, and 1.32X the performance of the data journaling mode. Comparing e2xt4 with the ordered mode, the overhead of the transactional support in TxSSD is 9.7%, 9.4%, 10.5%, and 10.3% for 1000, 100, 10, and 1 write operations per fsync, respectively.
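The write totals reported for Figure 4(b) can be related to the volume the benchmark issues. The sketch below assumes that each of the 32 FIO threads writes its 1 GiB file exactly once, so the application issues 32 GiB; that assumption and the variable names are ours, while the 72.9 GiB and 34.2 GiB totals come from the text.

```python
# Host-level write amplification implied by the reported totals.
# ASSUMPTION: the workload issues 32 threads x 1 GiB = 32 GiB of writes.
THREADS = 32
FILE_GIB = 1.0
issued_gib = THREADS * FILE_GIB

written_gib = {
    "data journaling": 72.9,        # reported total
    "ordered / journal off": 34.2,  # reported total (approximate)
}

amplification = {mode: gib / issued_gib for mode, gib in written_gib.items()}
extra_gib = written_gib["data journaling"] - written_gib["ordered / journal off"]

for mode, factor in amplification.items():
    print(f"{mode:22s} {factor:.2f}x of issued data")
print(f"extra writes from data journaling: {extra_gib:.1f} GiB")
```

Under this assumption, data journaling writes a little more than twice the issued data, while the other modes, and e2xt4 with them, stay close to 1x.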
Figure 5 shows the performance of ebtrfs and btrfs using FIO. We evaluate btrfs with the datacow and nodatacow modes. As shown in Figure 5(a), ebtrfs and btrfs with the nodatacow mode improve random write performance by 7.1X and 8.32X, respectively, compared with the default configuration of btrfs. This result shows that btrfs incurs a significant performance overhead in exchange for crash consistency. Figure 5(b) shows the total amount of written data and runtime. As shown in the figure, there is a large gap between btrfs with datacow and nodatacow/ebtrfs. This is because datacow not only copies data but also performs garbage collection on obsolete pages by performing discard operations. Figure 7 shows the performance of ebtrfs and btrfs under different fsync parameters. Similar to the case of ext4, as the number of write operations per fsync is reduced, the overall performance decreases. It also indicates that more frequent copy and garbage collection increases fsync frequency, which results in a larger drop in performance. This result demonstrates that ebtrfs achieves higher performance while keeping the same level of consistency as default btrfs over a wide range of write operations per fsync.
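The fsync parameter discussed above can be illustrated with a minimal sketch. It is not the FIO implementation; the function name `write_with_fsync_interval` and its arguments are hypothetical, and the writes are sequential rather than random for brevity. It only shows how the number of fsync calls grows as the number of writes per fsync shrinks.

```python
import os
import tempfile


def write_with_fsync_interval(path, num_writes, writes_per_fsync, block_size=4096):
    """Write num_writes blocks and call fsync after every writes_per_fsync writes.

    Returns the number of fsync calls issued (hypothetical helper, not FIO code).
    """
    fsync_calls = 0
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for i in range(1, num_writes + 1):
            os.write(fd, block)
            if i % writes_per_fsync == 0:
                os.fsync(fd)  # force dirty data and metadata to the device
                fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls


def fsync_counts(num_writes, intervals):
    """Return {writes_per_fsync: number of fsync calls} for each interval."""
    counts = {}
    with tempfile.TemporaryDirectory() as tmp:
        for interval in intervals:
            path = os.path.join(tmp, f"data_{interval}")
            counts[interval] = write_with_fsync_interval(path, num_writes, interval)
    return counts
```

For a fixed number of writes, reducing the interval from 100 to 10 to 1 multiplies the number of fsync calls by ten at each step, which is the source of the flush and commit work that the figures quantify.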

2) OLTP PERFORMANCE
We ran the sysbench OLTP benchmark with MySQL 5.6.21 and InnoDB for e2xt4 and ebtrfs to show the performance under a more realistic application. Table 2 shows the experimental parameters; all other parameters use their default values. We configure the page size as 4 KiB instead of the default page size (16 KiB) since a smaller page size leads to better performance [17]. InnoDB supports a technique called Double-Write Buffer (DWB), which guarantees atomicity by performing redundant writes. InnoDB first writes and flushes data to a double write buffer area and then writes and flushes each piece of data to its original location. E2xt4 and ebtrfs can disable this technique because they perform transaction processing within TxSSD.
We report the OLTP results on ext4 (ordered mode) with or without DWB, and on e2xt4 without DWB. As shown in the figure, in the case of the ordered mode with DWB, the TPM increases by 36% from 16 to 32 threads but does not scale after 32 threads. In the case of 64 and 128 threads, the TPM decreases to 52729 (65.9%) and 39384 (49.6%), respectively. The reason for the performance reduction is the increased I/O traffic and contention due to DWB. That is, DWB does not have any parallelism and the DWB operation writes to a DWB area sequentially, which harms the scalability. Meanwhile, the performance of ext4 and e2xt4 without DWB scales well since there is no I/O contention on DWB. Ext4 without DWB improves the performance by 2.21X and 3.17X compared to ext4 with DWB in the case of 64 and 128 threads, respectively. The large performance gap shows that enabling DWB incurs a significant overhead in exchange for consistency. E2xt4 without DWB improves the performance over ext4 with DWB by 1.23X, 1.19X, 1.95X, and 2.82X in the case of 16, 32, 64, and 128 threads, respectively, while providing the same level of consistency as ext4 with DWB. Compared to ext4 without DWB, e2xt4 without DWB achieves 96.3%, 93.7%, 87.9%, and 89.0% of the TPM with 16, 32, 64, and 128 threads, respectively.
This result indicates the transaction processing overhead present in TxSSD.

b: ENHANCED BTRFS
Figure 9 shows the OLTP results on btrfs and ebtrfs with four configurations: default (datacow) without DWB, nodatacow with DWB, nodatacow without DWB, and ebtrfs without DWB. The default mode without DWB and the nodatacow mode with DWB maintain the same level of consistency but with different performance implications. As shown in the figure, the nodatacow mode with DWB shows higher performance than the default mode without DWB with 16 and 32 threads. Meanwhile, in the case of 64 and 128 threads, nodatacow with DWB achieves only 91.3% and 77.9% of the TPM of the default btrfs without DWB, respectively. This result shows that DWB has a more negative impact on performance than datacow when the number of threads increases.
Nodatacow without DWB outperforms the default mode without DWB and the nodatacow mode with DWB because it performs no redundant write operations for data. Ebtrfs without DWB outperforms the default without DWB by 1.24X, 1.38X, 1.43X, and 1.49X and it also outperforms nodatacow with DWB by 1.14X, 1.22X, 1.57X, and 1.89X in the case of 16, 32, 64, and 128 threads, respectively. Owing to the transactional processing overhead, ebtrfs without DWB incurs an overhead of 6.1%, 4.5%, 7.5%, and 2.2% compared with nodatacow without DWB in the case of 16, 32, 64, and 128 threads, respectively. However, this overhead is much smaller than the overheads generated by datacow or DWB. Consequently, our results demonstrate that ebtrfs keeps the same level of consistency as datacow or DWB with small overhead.
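The redundant writes of the double-write technique described in this subsection can be made concrete with a toy model. This is a sketch only, not InnoDB code: `CountingDevice`, `DWB_BASE`, and both flush functions are hypothetical names, and the device is an in-memory dictionary that merely counts writes and flushes.

```python
DWB_BASE = 1_000_000  # hypothetical start address of the double-write area


class CountingDevice:
    """In-memory stand-in for a block device that counts writes and flushes."""

    def __init__(self):
        self.blocks = {}
        self.writes = 0
        self.flushes = 0

    def write(self, addr, data):
        self.blocks[addr] = data
        self.writes += 1

    def flush(self):
        self.flushes += 1


def flush_pages_with_dwb(dev, pages):
    """Write each page twice: first to the double-write area, then in place."""
    # Step 1: write all pages sequentially to the double-write area and flush.
    for slot, data in enumerate(pages.values()):
        dev.write(DWB_BASE + slot, data)
    dev.flush()
    # Step 2: write each page to its original location and flush again.
    for addr, data in pages.items():
        dev.write(addr, data)
    dev.flush()


def flush_pages_without_dwb(dev, pages):
    """Write each page once; atomicity must then come from elsewhere."""
    for addr, data in pages.items():
        dev.write(addr, data)
    dev.flush()
```

In this model the double-write path issues twice as many writes and flushes as the direct path, and its first step is a single sequential stream, which matches the contention described above for higher thread counts.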

3) RECOVERY PERFORMANCE
To measure recovery time, we cut the power of the machine while it was executing the OLTP benchmark. Table 3 shows the recovery time after rebooting the machine. The recovery time of e2xt4 is 0.8 ms, whereas the recovery time of the ordered mode with DWB is 249.3 ms. The recovery procedure of the ordered mode performs the scan and replay operations for the metadata. The datacow mode of btrfs increases the recovery time since it reconstructs its tree nodes for both data and metadata. Meanwhile, the nodatacow mode of btrfs with DWB reduces the recovery time since it only recovers the metadata. Similar to e2xt4, the recovery time of ebtrfs is only 0.7 ms, whereas the recovery time of the datacow mode without DWB and the nodatacow mode with DWB in btrfs is 1379.5 ms and 20.6 ms, respectively. In summary, the recovery time of e2xt4 and ebtrfs is the shortest while providing the same level of crash consistency. The reason is that e2xt4 and ebtrfs only issue the recovery operation to TxSSD, in which all mapping information in the uncommitted tx tables is simply discarded.

C. DISTRIBUTED PARALLEL FILE SYSTEM PERFORMANCE
We use mdtest and IOR, which are widely used to measure the performance of distributed parallel file systems. Mdtest is an MPI-coordinated metadata-intensive benchmark. Each task creates, stats, and removes the specified number of directories and/or files and measures the performance in operations per second. IOR is an MPI-coordinated data-intensive benchmark with various interfaces and access patterns. We configured both IOR and mdtest with various options and numbers of threads. We note that the existing Lustre file system does not support the data journaling mode due to the performance issue. Thus, we compare our elustre file system with the existing Lustre file system in the ordered mode, and the consistency level of our enhanced Lustre file system is higher than that of the existing Lustre file system.

1) MDTEST PERFORMANCE
We evaluate the metadata I/O performance of the existing and enhanced Lustre file systems by using the mdtest benchmark. In our evaluation, we set a branching factor of five, a depth of five, and ten items, which means that mdtest first creates a directory tree with a branching factor of five and a depth of five, and ten items (files or directories) are created in each tree node. We set the number of bytes to write to each file after it is created to 4 KiB. Figure 10 shows the operations per second for directory/file creation and directory/file removal as the number of threads increases. Figure 10(a) shows the performance of the directory creation operations. As shown in the figure, elustre improves the performance by 8%, 7%, 7%, 6%, 18%, 13%, and 12% compared to Lustre at each number of threads, respectively. The reason for the improvement is that mdtest generates metadata-intensive operations, and elustre processes the transaction for the metadata without redundant writes while Lustre processes the transaction with redundant writes. Thus, elustre improves the metadata-intensive performance compared to Lustre in the ordered mode. Figure 10(b) shows the performance of file creation operations. The results of the two file systems are similar. When the number of threads is 32, we improve the performance by up to 4.4%. Figure 10(c) shows the performance of directory removal operations. As shown in this figure, elustre improves the performance by 4.3%, 0.3%, 38%, 10%, 13.7%, 8.4%, and 8.3% at each number of threads, respectively, compared to Lustre. Figure 10(d) shows the performance of the file removal operations. Elustre improves the performance by 3.3%, 0.2%, 4.1%, 24.7%, 27.9%, 32.9%, and 42% at each number of threads, respectively, compared to Lustre. When the number of threads is 64, elustre achieves the maximum improvement. Consequently, we show a similar or better performance compared to the existing file system while providing a higher consistency level. The consistency level is higher because the existing file system provides transactions for the metadata in the MDT and the OSTs but does not provide transactions for the file data written to the OSTs.
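To give a sense of the size of the mdtest workload described above, the following sketch counts the tree nodes and items for a given branching factor, depth, and number of items per node. It assumes that the root directory counts as a tree node and that the depth counts the levels below the root; the function name `mdtest_tree_size` is hypothetical and is not part of mdtest.

```python
def mdtest_tree_size(branch, depth, items_per_node):
    """Return (number of tree nodes, total number of items).

    Assumes a complete tree: the root is level 0 and each node on levels
    0..depth-1 has `branch` children (an assumption about the layout).
    """
    nodes = sum(branch ** level for level in range(depth + 1))
    return nodes, nodes * items_per_node


# Configuration from the text: five branches, depth five, ten items per node.
nodes, items = mdtest_tree_size(branch=5, depth=5, items_per_node=10)
```

Under these assumptions the tree has 3906 nodes and 39060 items, so each run issues tens of thousands of small metadata operations, which is why redundant metadata writes are visible in the results.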

2) IOR PERFORMANCE
We evaluate the existing and enhanced Lustre file systems by using the IOR benchmark. Figure 11(a) shows the random write performance with different request sizes when the number of threads is eight and the number of files is one per thread, which means each thread creates a single file with six stripes, and these stripes are distributed across six OSTs. The overall performance increases as the number of threads increases, but the performance decreases at 64 threads. As shown in the figure, the performance of the two file systems is similar. This is because IOR generates data-intensive operations, and most I/O operations are performed in the OSTs with small metadata I/Os in both the MDT and the OSTs, unlike the case of mdtest. Unlike the result of local file systems, the overhead of TxSSD is hidden since the many layers and components of Lustre and the network overhead generate a longer latency compared to a local file system. In terms of the consistency level, elustre provides a higher consistency level than Lustre since Lustre in the ordered mode only supports transactions for metadata, but elustre supports transactions for both metadata and data.
Figure 11(b) shows the result of random writes with different numbers of threads. In this evaluation, we set the request size to 16 KiB. As shown in the figure, the performance of the two file systems is similar, as in the results with different request sizes. When the number of threads is 32, the two file systems show the highest bandwidth. The bandwidth of elustre and Lustre is 393.92 MiB/s and 384.11 MiB/s, respectively. This demonstrates that the overhead of TxSSD is completely hidden, and elustre shows a slightly better performance even with fewer metadata operations.
Figure 12 shows the performance when fsync is called per random write operation with different numbers of threads. The frequent fsync call decreases the overall performance. Meanwhile, the performance increases as the number of threads increases. Similar to the other results of the IOR benchmark, the performance of the two file systems is similar. This result shows that, unlike the case of local file systems, the fsync call affects the performance less in the distributed parallel file system.

D. EXPERIMENTAL ANALYSIS
We isolate the overhead of the transactional support in our file systems from the overhead inside TxSSD. To do this, we disable the transactional functionality of TxSSD. Then, we compare our modified version with the unmodified version in order to expose the overhead of our implementation. Table 4 shows that there is almost no overhead of the transaction support in our file systems. Thus, the overhead is attributed to the transactional support inside the TxSSD. To support this claim, we measure the latency of normal and transactional operations with a request size of 4 KiB using one thread. To reduce the overhead of measurement, we use a raw device and the direct I/O mode. In the raw device, we use commit-on-flush to commit a transaction with the flush command, which remaps the entries in the tx table to the entries in the FTL mapping table because there is no commit block. Figure 13 depicts the average latency of transactional operations (TxWrite/Read, Commit, Abort, and Recovery) and normal operations (NormalWrite/Read/Flush) under different numbers of entries. For instance, 10 means that 10 entries are updated in TxWrite and remapped to the FTL mapping table in the commit operation. There is almost no performance gap between TxWrite/Read and NormalWrite/Read. This result demonstrates that update and search operations hardly affect performance in TxSSD.
Meanwhile, the commit operations produce a considerable overhead compared with the normal flush. In the SSD firmware, the time taken by the remap function is 2474 us, which is similar to the result measured on the host side. In the case of commodity or Open SSDs, the time taken by the flush command is a few milliseconds. Meanwhile, in the case of our SSD, the flush command takes about 30 us because it returns instantly due to supercapacitors. Thus, the overhead of remap operations becomes more noticeable to the host. It affects the performance of applications as the number of entries increases.
The latency of the recovery operation is longer than that of the abort operation because the recovery operation discards mapping information in all tx tables. Even when the number of entries increases, the latencies of the abort and recovery operations do not increase. The reason is that TxSSD simply deallocates the tx tables that hold the uncommitted entries.
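The behavior measured in this subsection can be summarized with a toy model of the tx tables and the FTL mapping table. This is a sketch under simplifying assumptions, not the TxSSD firmware: the class `ToyTxSSD`, its method names, and the one-dictionary-per-transaction layout are all hypothetical, and physical pages are modeled as increasing integers.

```python
class ToyTxSSD:
    """Toy model: per-transaction tx tables plus one committed FTL mapping table."""

    def __init__(self):
        self.ftl = {}        # committed mapping: logical block -> physical page
        self.tx_tables = {}  # tx id -> {logical block -> physical page}
        self.next_page = 0

    def _alloc_page(self):
        # Out-of-place update: every write lands on a fresh physical page.
        page = self.next_page
        self.next_page += 1
        return page

    def tx_write(self, tx_id, lba):
        """Record the new mapping in the tx table only; the FTL is untouched."""
        page = self._alloc_page()
        self.tx_tables.setdefault(tx_id, {})[lba] = page
        return page

    def tx_read(self, tx_id, lba):
        """Search the tx table first, then fall back to the committed mapping."""
        table = self.tx_tables.get(tx_id, {})
        if lba in table:
            return table[lba]
        return self.ftl.get(lba)

    def commit(self, tx_id):
        """Remap every tx-table entry into the FTL; cost grows with the entries."""
        table = self.tx_tables.pop(tx_id, {})
        for lba, page in table.items():
            self.ftl[lba] = page
        return len(table)

    def abort(self, tx_id):
        """Drop one tx table as a whole, regardless of how many entries it has."""
        self.tx_tables.pop(tx_id, None)

    def recover(self):
        """After a crash, discard all uncommitted tx tables."""
        self.tx_tables.clear()
```

In this model only `commit` iterates over the entries, which is consistent with the observation that the commit latency depends on the number of entries while the abort and recovery latencies do not.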

V. RELATED WORK

A. FILE SYSTEMS FOR FLASH-BASED SSDs
There are many studies on file systems for flash SSDs. F2FS [20] is a file system designed for flash SSDs. F2FS devises a flash-friendly on-disk layout to avoid unnecessary data copying and multi-head logging for optimizing the write performance. ParaFS [47] is a log-structured file system for flash SSD to exploit internal parallelism inside SSD while ensuring efficient garbage collection. ParaFS coordinates the garbage collection at both the file system and FTL levels. It also schedules read, write, and erase requests over multiple channels to achieve consistent performance.
SpanFS [16] is a scalable file system for flash-based SSDs. SpanFS consists of a collection of micro file system services called domains to improve the scalability of file systems on many cores. It distributes files and directories among the domains and provides a global file system view on top of the domains. This article is in line with the previous works [16], [20], [47] in terms of improving the performance of file systems on flash-based SSDs. In contrast, we focus on improving and evaluating the performance of file systems on flash-based SSDs that support transactional functionality.

B. DISTRIBUTED PARALLEL FILE SYSTEMS
There have been several studies on distributed parallel file systems. Devulapalli and Wyckoff [11] analyze strategies for file creation in file systems that distribute metadata across multiple servers. They present designs that reduce the message complexity of the create operation and increase the performance. IndexFS [35] provides scalable high-performance operations on the metadata and small files for existing file systems, such as PVFS, Lustre, and HDFS. IndexFS uses a table-based architecture and an optimized log-structured layout that stores the metadata and small files efficiently.
PLFS [6] is a parallel log-structured file system. PLFS remaps an application's preferred data layout into one that is optimized for the underlying file system. The layer of indirection and reorganization reduces the checkpoint time. Piernas et al. [31] propose a user-space implementation of active storage for Lustre and compare it with the traditional kernel-based implementations. They show that the user-space approach proves to be faster, more flexible, portable, and deployable than the kernel-space approaches.
Oral et al. [29] found that journaling in Lustre for the object store considerably affects the overall performance. To increase the overall performance of the file system, they provide a hardware solution using external journaling devices and propose a software-based optimization to remove the synchronous commit. Our work is in line with these studies [6], [11], [29], [31], [35] in terms of investigating the distributed parallel file systems and their performance. In contrast, we focus on improving the performance using TxSSDs and investigate their implications.

C. TRANSACTIONAL SSDs
There are many studies on supporting transactional functionality in SSDs. The transactional SSD concept was first introduced by TxFlash [33], which provides cyclic commit protocols. TxFlash links all pages in each transaction in one cyclic list by keeping pointers in the page metadata. The cyclic commit uses per-page metadata to remove the need for a separate commit record, which would otherwise be required to judge whether a transaction is committed. However, this scheme can be inappropriate for current enterprise SSDs. The reason is that the metadata area can be reserved for multiple purposes, such as holding error correcting code (ECC) data, bad block information, etc. [42].
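The idea of judging the commit state from a cycle of page pointers can be sketched as follows. This is a deliberately simplified illustration and not the TxFlash protocols themselves: it ignores how real cyclic commit handles leftover pages of aborted transactions, and the `flash` dictionary layout and the function `is_committed` are hypothetical.

```python
def is_committed(flash, start_addr):
    """Return True if the pages reachable from start_addr form a complete cycle.

    `flash` maps a page address to (tx_id, next_addr), standing in for the
    per-page metadata. A missing page or a broken link means the transaction
    did not finish writing all of its pages.
    """
    if start_addr not in flash:
        return False
    tx_id, addr = flash[start_addr]
    seen = {start_addr}
    while addr != start_addr:
        # A missing page or a loop that bypasses the start breaks the cycle.
        if addr not in flash or addr in seen:
            return False
        page_tx, next_addr = flash[addr]
        if page_tx != tx_id:
            return False
        seen.add(addr)
        addr = next_addr
    return True


# Three pages of transaction 7 linked in a cycle: 0 -> 1 -> 2 -> 0.
complete = {0: (7, 1), 1: (7, 2), 2: (7, 0)}
# The same transaction interrupted before page 2 reached flash.
interrupted = {0: (7, 1), 1: (7, 2)}
```

In this sketch `is_committed(complete, 0)` holds and `is_committed(interrupted, 0)` does not, so no separate commit record is needed, but every page must carry a pointer in its metadata area, which is the requirement the text identifies as problematic for enterprise SSDs.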
LightTx [22], [23] supports transaction flexibility using a lightweight embedded transaction design. LightTx uses a commit protocol that determines the transaction state solely inside each transaction in order to support parallel transaction execution. In addition, LightTx periodically retires the dead transactions to reduce the transaction state tracking cost. DiffTx [24] is an embedded transaction protocol that differentially logs partial page updates in a write-ahead logging way and writes full page updates in a shadow paging way, aiming at low write amplification. TxCache [25] is an embedded transaction mechanism for SSDs with a non-volatile disk cache. The TxCache design leverages the non-volatile disk cache to efficiently support transactions inside SSDs. It persists new-version data in the non-volatile disk cache in a shadow way while protecting old-version data from being overwritten. LightTx, DiffTx, and TxCache are designed and implemented on a trace-driven SSD simulator based on DiskSim. In contrast with these studies, we aim to show experimental results by using SSDs that are enterprise-class and use a standard interface.
Shi et al. [39] propose Mobius, a transactional SSD design that provides different types of transactional primitives to support static and dynamic transactions separately. The Mobius flash translation layer (mFTL) combines a normal FTL with transaction processing by storing mapping and transaction information together in a physical flash page by using the out-of-band (OOB) area. This design may not be applicable to other flash-based SSDs because the OOB area can be used for error correction code (ECC) and metadata by each vendor. Mobius is designed and implemented on the OpenSSD platform, which is not an enterprise-class SSD. One of the main differences between OpenSSD and an enterprise SSD is whether a supercapacitor is supported. An enterprise SSD has a supercapacitor, so the flush command overhead is very low. Thus, the evaluation results can be totally different between OpenSSD and an enterprise SSD. This paper discloses the evaluation results and analyzes them.
FusionIO [30] presents an atomic write in an enterprise flash-based SSD. The atomic write puts a batch of multiple I/O operations into a single logical group, which is either persisted successfully or rolled back upon a failure. FusionIO modifies MySQL to use the atomic write call. However, there is no performance evaluation on file systems. In contrast, we focus on the performance of a transactional SSD based on SATA and a firmware FTL rather than a host-based FTL that consumes host resources. X-FTL [18] improves the performance of SQLite by exploiting the transactional atomicity provided by SSDs. SQLite is a DBMS used by Android phones, which relies on costly page-oriented journaling to support transactional atomicity. This results in slow responses for mobile applications. CFS [27] is a file system built on X-FTL [18]. CFS guarantees crash consistency at the application level by enabling applications to declare arbitrary code regions that require crash consistency. CFS improves the performance of SQLite and solves the problem of false sharing of metadata. Unfortunately, the performance evaluation of these studies [18], [27] is based on OpenSSD [3], which is not an enterprise-class SSD.
The SHARE interface [28] provides an abstraction that allows applications to change the address mapping inside flash storage. It allows the applications to achieve write atomicity without write amplification. The goal of the previous studies [18], [22]-[24], [27], [28], [30], [33], [39] is similar to our goal in terms of supporting transactional functionality inside SSDs. Meanwhile, to the best of our knowledge, this work is the first evaluation study on transactional SSDs that are both enterprise-class and SATA-based, providing performance results in a more realistic and standard fashion. Furthermore, we evaluate the effectiveness of transactional SSDs in a distributed parallel file system, as well as local file systems.

VI. SUMMARY AND IMPLICATION
We summarize the implications of our evaluation study. In particular, we make the following main findings and insights from our measurements and observations.
• Redundant writes for providing crash consistency affect the performance and endurance, especially in btrfs, while the overhead from flush operations is small due to the supercapacitor.
• The overhead of memory copy operations inside TxSSD during the commit operation is noticeable because the overhead of the flush command is low. Except for the commit operations, the latency of other transactional operations on TxSSD is the same as those on the original SSD.
• E2xt4 and ebtrfs on TxSSD improve I/O and recovery performance compared with the original ext4 and btrfs while providing the same consistency level. However, the I/O performance in the data-intensive workloads is slightly lower than that of the weak-consistent modes (ordered and nodatacow) due to the overhead of transactional support inside TxSSD.
• Furthermore, elustre with multiple TxSSDs improves the I/O performance in the metadata-intensive workload and shows a similar performance in the data-intensive workload but provides a higher consistency level compared with Lustre.

VII. CONCLUSION
In this article, we investigated the implications of transactional SSDs. We evaluated the effect of transactional SSDs with diverse file systems and configurations. Then we found insights and observations through our evaluation.
Our results show that TxSSD-aware file systems increase the performance compared with crash-consistent modes while maintaining crash consistency. Additionally, they improve the endurance of SSDs because the total number of writes decreases. Finally, we show that providing transactional functionality in SSDs may incur overhead. It must be considered carefully when we design transactional functionality or in-storage computing.