A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System

5 Author(s)
S. Chandrasekar ; SSN College of Engineering, Kalavakkam, Tamilnadu, India ; R. Dakshinamurthy ; P G Seshakumar ; B. Prabavathy

Hadoop Distributed File System (HDFS) is designed for reliable storage and management of very large files. All files in HDFS are managed by a single server, the NameNode, which keeps metadata in its main memory for every file stored in HDFS. As a consequence, HDFS suffers a performance penalty as the number of small files increases: storing and managing a large number of small files imposes a heavy burden on the NameNode, and the number of files that can be stored in HDFS is constrained by the size of the NameNode's main memory. Further, HDFS does not take the correlation among files into account, and it does not provide any prefetching mechanism to improve I/O performance. To improve the efficiency of storing and accessing small files on HDFS, we propose a solution based on the work of Dong et al., namely Extended Hadoop Distributed File System (EHDFS). In this approach, a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism has been built to access the individual files within the corresponding combined file. Further, index prefetching is provided to improve I/O performance and minimize the load on the NameNode. The experimental results indicate that EHDFS reduces the metadata footprint in the NameNode's main memory by 16% and also improves the efficiency of storing and accessing a large number of small files.
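
The general idea of combining small files and indexing them can be illustrated with a minimal in-memory sketch. This is not the EHDFS implementation and uses no Hadoop API; the names `CombinedFile`, `combine`, and `read` are hypothetical, and the index layout (file name mapped to offset and length) is an assumption made only for illustration. Prefetching and correlation detection are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class CombinedFile:
    """Hypothetical container: one large blob plus an index of its members."""
    data: bytes
    index: Dict[str, Tuple[int, int]]  # file name -> (offset, length)


def combine(files: Dict[str, bytes]) -> CombinedFile:
    """Concatenate small files and record where each one sits in the blob."""
    index: Dict[str, Tuple[int, int]] = {}
    chunks = []
    offset = 0
    for name, content in files.items():
        index[name] = (offset, len(content))
        chunks.append(content)
        offset += len(content)
    return CombinedFile(data=b"".join(chunks), index=index)


def read(combined: CombinedFile, name: str) -> bytes:
    """Look up one small file through the index and slice it out of the blob."""
    offset, length = combined.index[name]
    return combined.data[offset:offset + length]


# Three small files become a single stored object with a three-entry index.
small_files = {"a.txt": b"alpha", "b.txt": b"be", "c.txt": b"gamma!"}
combined = combine(small_files)
print(read(combined, "b.txt"))  # b'be'
```

In this sketch the storage layer would track one object instead of three, while per-file lookups go through the index rather than through separate metadata entries, which is the kind of trade-off the abstract describes.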

Published in:

Computer Communication and Informatics (ICCCI), 2013 International Conference on

Date of Conference:

4-6 Jan. 2013