distributed_computing:data_processing:formats

distributed_computing:data_processing:formats [2019/10/28 22:36] – created phreazer
distributed_computing:data_processing:formats [2019/10/28 22:48] (current) – [Parquet] phreazer
Line 5: Line 5:
   * Column-oriented   * Column-oriented
   * Dictionary encoding   * Dictionary encoding
 +  * Record-shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
 +
 +  * N columns
 +  * M row-groups
 +  * File metadata: records the start locations of all column metadata
 +  * Metadata is written after the data, which allows single-pass writing
 +  * Readers should read the file metadata first to locate the column chunks
 +  * Nulls are encoded in the definition levels with run-length encoding; in a non-nested schema, a column with 1000 nulls is stored as (0, 1000 times)
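
Because the metadata is written after the data, a reader starts at the end of the file. The following is a minimal sketch of that first step, assuming an unencrypted file that starts and ends with the 4-byte magic ''PAR1'' and stores the metadata length as a 4-byte little-endian integer just before the trailing magic. The function name ''locate_footer'' and the synthetic byte string are hypothetical and only illustrate the layout; the metadata itself is Thrift-encoded and is not parsed here.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes at the start and end of an unencrypted Parquet file


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the serialized file metadata inside `data`.

    Hypothetical helper: it only finds the footer, it does not decode it.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic bytes missing")
    # The 4 bytes before the trailing magic hold the metadata length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length is larger than the file")
    return offset, length


# Synthetic stand-in for a real file: magic, data, metadata, metadata length, magic.
fake_data = b"column-chunk-bytes"
fake_meta = b"thrift-encoded-metadata"
blob = MAGIC + fake_data + fake_meta + struct.pack("<I", len(fake_meta)) + MAGIC

offset, length = locate_footer(blob)
footer_bytes = blob[offset:offset + length]
```

On a real file only the last 8 bytes and then the footer need to be read before seeking to individual column chunks, which is what makes single-pass writing compatible with selective reads.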
 +
 +==== Encodings ====
 +
 +  * Plain = 0
 +  * Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
 +  * Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
 +  * ...
 +
 +https://github.com/apache/parquet-format/blob/master/Encodings.md
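
The ideas behind dictionary encoding and run-length encoding can be sketched in a few lines of Python. This is a simplified illustration only, not the on-disk byte layout described in Encodings.md (the real RLE / bit-packing hybrid uses headers and bit-packed runs that are omitted here). The function names and sample values are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each value by its index in a dictionary of distinct values."""
    dictionary: List[str] = []
    index_of: Dict[str, int] = {}
    indices: List[int] = []
    for value in values:
        if value not in index_of:
            # First occurrence: append to the dictionary and remember its index.
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(levels: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(levels)]


# Repeated strings shrink to a small dictionary plus integer indices.
dictionary, indices = dictionary_encode(["DE", "US", "DE", "DE", "US"])

# Definition levels of a non-nested column holding 1000 nulls: one run of zeros.
null_runs = run_length_encode([0] * 1000)
```

The dictionary indices are themselves small integers with many repeats, which is why the two techniques combine well.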
 +
  
 ==== Implementations ==== ==== Implementations ====
  
-* Read/Write in Spark +  * Read/Write in Spark 
-* Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/+  * Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/
  
  • Last modified: 2019/10/28 22:36
  • by phreazer