Small file problem
Memory overhead
- Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, each taking roughly 150 bytes (rule of thumb). 10,000,000 small files, each occupying its own block ⇒ (10,000,000 file inodes + 10,000,000 blocks) × 150 bytes ≈ 3 GB of heap. (Replication adds storage on the DataNodes and replica locations per block, but it does not multiply the number of block objects on the NameNode.)
- The NameNode keeps the entire namespace in main memory, so the number of files it can manage is bounded by its RAM.
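As a quick sanity check, the rule of thumb can be turned into a small estimator; a minimal sketch assuming ~150 bytes per namespace object and one block per file:

```java
// Rough NameNode heap estimate for a given number of files (rule-of-thumb
// figures only, not exact accounting of NameNode data structures).
public class NameNodeHeapEstimate {
    static final long BYTES_PER_OBJECT = 150;  // approx. size of one inode or block object

    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;  // file inodes + block objects
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10,000,000 small files, one block each -> about 3 GB of NameNode heap
        System.out.printf("%.1f GB%n", estimateBytes(10_000_000L, 1) / 1e9);
    }
}
```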
Solutions
- HAR (Hadoop Archives): a layered file system on top of HDFS that packs many small files into a single archive file; intended for archiving, since reads go through an extra index layer and are slower (see the command example after this list)
- SequenceFile: pack the small files into one large file, using the file name as the key and the file contents as the value (see the writer sketch after this list)
- Consolidator: a batch job that periodically merges many small files into fewer, larger ones
- HBase: stores data in MapFiles (indexed SequenceFiles); a good fit when the access pattern is random lookups of small records rather than streaming reads of whole files
- Spark compaction: https://github.com/KeithSSmith/spark-compaction — uses Spark to rewrite many small files into fewer large ones (a generic sketch follows after this list)
- Filecrush: https://github.com/asdaraujo/filecrush
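For the HAR approach, a minimal usage sketch with placeholder paths; `hadoop archive` launches a MapReduce job to build the archive, and the result is read back through the `har://` filesystem:

```bash
# pack /user/alice/logs into logs.har under /user/alice/archives (placeholder paths)
hadoop archive -archiveName logs.har -p /user/alice logs /user/alice/archives

# the archived files remain visible through the archive
hdfs dfs -ls har:///user/alice/archives/logs.har
```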
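For the SequenceFile approach, a minimal Java writer sketch: it packs a list of local files into one HDFS SequenceFile, with the file name as a Text key and the raw bytes as a BytesWritable value (the output path and file list are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // one large SequenceFile instead of many small HDFS files
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (String name : new String[] {"a.txt", "b.txt"}) {  // placeholder file list
                byte[] bytes = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```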
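For Spark compaction, a generic sketch (not the exact logic of the linked spark-compaction project): read many small text files and rewrite them as a handful of larger ones; paths and the target file count are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Compact {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("compact").getOrCreate();
        Dataset<Row> lines = spark.read().text("/data/small-files/");  // many small inputs
        lines.coalesce(8)                                              // target number of output files
             .write().text("/data/compacted/");                       // few large outputs
        spark.stop();
    }
}
```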