====== Small file problem ======

===== Memory overhead =====

  * Every file, directory, and block is represented as an object in the namenode's memory, and each object occupies roughly 150 bytes.
  * 10,000,000 small files, each occupying its own block => (10,000,000 block objects + 10,000,000 file inodes) * 150 bytes ≈ 3 GB of namenode heap. (Replication multiplies the data on disk, not the number of namenode objects.)
  * The namenode is limited by main memory, so the number of files it can manage is capped regardless of total cluster disk capacity.

===== Solutions =====

  * HAR (Hadoop archive): a layered file system on top of HDFS; suited to archiving, since reads are slow (each access goes through archive index lookups) — see the command example below
  * Sequence file: pack small files into one large file, with the file name as the key and the file contents as the value (see the writer sketch below)
  * Consolidator: a batch job that merges many small files into fewer large ones
  * HBase: stores data in indexed SequenceFiles, so many small records end up in a few large files
  * Spark compaction: https://github.com/KeithSSmith/spark-compaction (the coalesce sketch below shows the underlying idea)
  * Filecrush: https://github.com/asdaraujo/filecrush
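
A HAR archive is built with the ''hadoop archive'' tool and then read through the ''har:%%//%%'' scheme; the paths below are placeholders:

<code bash>
# Pack the directory "input" (relative to the parent path /user/hadoop)
# into a single archive stored under /user/hadoop/out.
hadoop archive -archiveName files.har -p /user/hadoop input /user/hadoop/out

# The archive is then exposed as a layered file system via the har:// scheme.
hdfs dfs -ls har:///user/hadoop/out/files.har/input
</code>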
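
A minimal sketch of the SequenceFile approach, assuming the small files are readable from the local file system and using a hypothetical HDFS output path; the key is the file name, the value is the raw bytes:

<code java>
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path out = new Path("hdfs:///user/hadoop/packed.seq"); // assumed target path

    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(out),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(BytesWritable.class));
      // One record per small file: key = file name, value = file contents.
      for (String name : args) {
        byte[] contents = Files.readAllBytes(Paths.get(name));
        writer.append(new Text(name), new BytesWritable(contents));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
</code>

Many small files collapse into one HDFS file (one namenode object per block instead of per file), and the packed file is splittable for MapReduce/Spark jobs.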
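
The spark-compaction tool linked above has its own CLI; the sketch below only illustrates the underlying idea with plain Spark, using assumed input/output paths and an assumed target file count:

<code java>
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Compact {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("compact").getOrCreate();

    // Read a directory full of small text files...
    Dataset<Row> df = spark.read().text("hdfs:///data/small-files/");

    // ...and rewrite it as 8 larger files. coalesce() reduces the partition
    // count without a full shuffle; pick the count from total input size
    // divided by the HDFS block size.
    df.coalesce(8).write().text("hdfs:///data/compacted/");

    spark.stop();
  }
}
</code>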