
====== Formats ======

===== Parquet =====

  * Column-oriented
  * Dictionary encoding
  * Record-shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
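
A toy illustration of the dictionary-encoding idea: each distinct value is stored once and the column becomes a list of small integer indices. The function names are made up for this page, and the sketch ignores Parquet's actual page layout.

```python
def dictionary_encode(values):
    """Toy dictionary encoding: distinct values plus one integer index per row."""
    dictionary = []   # distinct values in order of first appearance
    positions = {}    # value -> index into `dictionary`
    indices = []      # one index per input row
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["DE", "FR", "DE", "DE", "US", "FR"]
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['DE', 'FR', 'US']
print(indices)     # [0, 1, 0, 0, 2, 1]
```

The payoff grows with repetition: a low-cardinality column shrinks to a short dictionary plus indices that compress well.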

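  * A file holds N columns split into M row groups; each row group stores one column chunk per column
  * File metadata: records the start locations of all column metadata
  * The metadata is written after the data, which allows single-pass writing
  * Readers should read the file metadata first to locate the column chunks they need
  * Non-nested schema: nulls are encoded in the definition levels with run-length encoding, e.g. a column of 1000 nulls becomes (0, 1000 times)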
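
Because the metadata sits at the end of the file, a reader starts from the tail. A Parquet file ends with the file metadata, a 4-byte little-endian length of that metadata, and the magic bytes ''PAR1''. The sketch below locates the footer using only the standard library; the helper names and the fake file contents are assumptions for illustration, and real metadata is a Thrift-serialized structure that this code does not parse.

```python
import struct

MAGIC = b"PAR1"


def footer_length(data: bytes) -> int:
    """Return the metadata size stored in the 4 bytes before the trailing magic."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (length,) = struct.unpack("<I", data[-8:-4])
    return length


def footer_bytes(data: bytes) -> bytes:
    """Slice out the raw (still serialized) file metadata."""
    length = footer_length(data)
    end = len(data) - 8  # skip the length field and the trailing magic
    return data[end - length:end]


# Synthetic stand-in for a real file: magic, data, metadata, length, magic.
fake_metadata = b"column chunk offsets"
fake_file = (
    MAGIC
    + b"row group data"
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)
print(footer_length(fake_file) == len(fake_metadata))  # True
```

Once the footer is decoded, the reader knows where each column chunk starts and can seek straight to the columns it needs.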

==== Encodings ====

  * Plain (PLAIN = 0)
  * Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
  * Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
  * ...

https://github.com/apache/parquet-format/blob/master/Encodings.md
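
A minimal sketch of the run-length half of the hybrid encoding, applied to definition levels (0 = null, 1 = value present in a non-nested schema). It works on Python lists and does not reproduce the byte layout from the specification; the real encoding also switches to bit-packed runs when values change too often for runs to pay off. The function names are assumptions for this page.

```python
from itertools import groupby


def run_length_encode(levels):
    """Collapse consecutive equal values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(levels)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the flat list."""
    return [value for value, count in runs for _ in range(count)]


all_null = [0] * 1000          # a column chunk holding only nulls
mixed = [1, 1, 1, 0, 0, 1]     # three values, two nulls, one value

print(run_length_encode(all_null))  # [(0, 1000)]
print(run_length_encode(mixed))     # [(1, 3), (0, 2), (1, 1)]
```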


==== Implementations ====

  * Read/Write in Spark
  * Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/