distributed_computing:data_processing:hadoop:hdfs:small_files

  
Memory overhead
  * Every file, directory, and block is represented as an object in the namenode's memory and occupies roughly 150 bytes. 10,000,000 small files => (10,000,000 blocks + 10,000,000 file inodes) * 150 bytes => ~3 GB, before counting the per-replica block references that replication adds (see the estimate sketch after this list).
  * Namenode is limited by main memory
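
A quick back-of-the-envelope check of the figure above, in Python. The 150-bytes-per-object constant is the usual rule of thumb, and the helper function is purely illustrative, not part of any Hadoop API:

<code python>
# Rough namenode heap estimate using the ~150 bytes/object rule of thumb.
# Hypothetical helper for illustration only; not an actual HDFS API.
BYTES_PER_OBJECT = 150  # approximate cost of one file, directory, or block object

def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate heap for file inodes plus block objects (replication not counted)."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks) * BYTES_PER_OBJECT

print(namenode_heap_bytes(10_000_000) / 10**9)  # 3.0 -> ~3 GB for 10M small files
</code>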
  
Solutions
  * Consolidator
  * HBase: stores data in indexed SequenceFiles
  * Spark compaction: https://github.com/KeithSSmith/spark-compaction (see the sketch after this list)
  * Filecrush: https://github.com/asdaraujo/filecrush
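
A minimal compaction sketch in PySpark, illustrating the general idea behind tools like spark-compaction rather than reproducing that tool itself. The input and output paths are placeholders, and the partition count is an assumption; a real job would derive it from the total input size and the HDFS block size:

<code python>
# Minimal small-file compaction sketch (illustrative): read many small
# files and rewrite them as a handful of larger ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

df = spark.read.text("hdfs:///data/small_files/")   # placeholder input path

# coalesce() merges partitions without a full shuffle; aim for output
# files near the HDFS block size (e.g. 128 MB each).
df.coalesce(16).write.mode("overwrite").text("hdfs:///data/compacted/")
</code>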