====== Small file problem ======

===== Memory overhead =====

  * Every file, directory, and block is represented as an object in the namenode's memory and occupies roughly 150 bytes.
  * 10,000,000 small files, each occupying its own block => (10,000,000 file inodes + 10,000,000 blocks) * 150 bytes ≈ 3 GB of namenode memory. (Replication does not multiply these objects; block replicas are tracked within the single block object.)
  * The namenode is therefore limited by the size of its main memory.

===== Solutions =====

  * HAR (Hadoop archive): layered file system on top of HDFS; used for archiving (reads are slow because of the extra index lookup).
  * SequenceFile: pack the small files into one large file, with the file name as key and the file contents as value (see the sketch after this list).
  * Consolidator: merges many small files into fewer large files.
  * HBase: stores data in indexed SequenceFiles.
  * Spark compaction: https://github.com/KeithSSmith/spark-compaction
  * Filecrush: https://github.com/asdaraujo/filecrush
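A minimal sketch of the SequenceFile approach, assuming the Hadoop Java API (''org.apache.hadoop.io.SequenceFile'') and a hypothetical output path; it packs a set of local files into a single SequenceFile, using each file name as the key and the raw file bytes as the value:

<code java>
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical output location; adjust to your cluster layout.
    Path target = new Path("/user/example/packed.seq");

    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(target),
          SequenceFile.Writer.keyClass(Text.class),            // key: original file name
          SequenceFile.Writer.valueClass(BytesWritable.class)); // value: file contents

      // Append each local file passed on the command line as one key/value record.
      for (String localFile : args) {
        byte[] contents = Files.readAllBytes(Paths.get(localFile));
        writer.append(new Text(localFile), new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
</code>

Downstream jobs can then read the one packed file via ''SequenceFileInputFormat'' instead of opening millions of small files. For the HAR option, the archive is built from the command line, e.g. ''hadoop archive -archiveName files.har -p /input/dir /output/dir'' (paths here are placeholders).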