Small file problem

Memory overhead

  • Every file, directory, and block is represented as an object in the namenode's memory and occupies roughly 150 bytes. 10,000,000 small files, each occupying one block ⇒ (10,000,000 file inodes + 10,000,000 blocks) * 150 bytes = 3 GB. (The namenode tracks each block once; replication adds replica locations to the block object, not extra objects.)
  • The namenode is therefore limited by the main memory of a single machine.

Solutions

  • HAR (Hadoop Archive): a layered file system on top of HDFS, used for archiving many small files into one; reads are slow because each access goes through the archive's index files (see the command sketch after this list)
  • SequenceFile: pack the small files into one large file, using each file name as the key and its contents as the value (see the writer sketch after this list)
  • Consolidator: a batch job that merges many small files into fewer, larger files
  • HBase: stores data in MapFiles (indexed SequenceFiles); a good fit when the small files are really small records that need keyed lookups
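
A HAR is created with the hadoop archive tool, which runs a MapReduce job to pack the files. A minimal sketch (all paths are hypothetical):

  hadoop archive -archiveName files.har -p /user/data dir1 dir2 /user/archive
  hdfs dfs -ls har:///user/archive/files.har/dir1

The resulting archive is immutable, and each read goes through the archive's two index files first, which is why HAR reads are slower than plain HDFS reads.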
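
For the SequenceFile approach, a minimal writer sketch, assuming the Hadoop client libraries are on the classpath (class name, paths, and arguments are illustrative):

  import java.io.File;
  import java.nio.file.Files;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Packs every file in a local directory into one SequenceFile:
  // one record per small file, file name as key, raw contents as value.
  public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
          File sourceDir = new File(args[0]);   // local directory of small files
          Path target = new Path(args[1]);      // e.g. hdfs:///data/packed.seq
          Configuration conf = new Configuration();
          try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                  SequenceFile.Writer.file(target),
                  SequenceFile.Writer.keyClass(Text.class),
                  SequenceFile.Writer.valueClass(BytesWritable.class))) {
              for (File f : sourceDir.listFiles()) {
                  byte[] contents = Files.readAllBytes(f.toPath());
                  writer.append(new Text(f.getName()), new BytesWritable(contents));
              }
          }
      }
  }

Because SequenceFiles are splittable and support block compression, downstream MapReduce jobs can process the packed records in parallel instead of opening millions of tiny files.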