
Formats

  • Parquet file with N columns
  • Split into M row groups, each holding one column chunk per column
  • File metadata: stores the start locations of all column metadata
  • Metadata is written after the data, which allows single-pass writing
  • Readers should read the file metadata first to locate the column chunks they need
  • Nulls are encoded in the definition levels, not in the data; in a non-nested schema, a column with 1000 nulls is run-length encoded as (0, 1000 times)
  • Plain encoding (PLAIN = 0)
  • Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
  • Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
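
The metadata notes above (written after the data, read first) can be illustrated with a minimal sketch. It assumes an unencrypted Parquet file held entirely in memory, whose last 8 bytes are a 4-byte little-endian metadata length followed by the magic bytes PAR1. The function name locate_file_metadata is hypothetical, and the sketch does not decode the Thrift-serialized metadata itself.

```python
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def locate_file_metadata(data):
    """Return (start, length) of the serialized file metadata in `data`.

    Hypothetical helper: `data` is the complete file content as bytes.
    """
    if len(data) < 12 or data[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # The 4 bytes before the magic hold the metadata length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - length
    if start < 4:
        raise ValueError("metadata length exceeds file size")
    return start, length
```

A reader would then seek to start, decode the metadata, and use the column metadata locations in it to fetch only the column chunks it needs.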

https://github.com/apache/parquet-format/blob/master/Encodings.md
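
As a rough illustration of the "(0, 1000 times)" null example, the sketch below encodes a single RLE run as described in the linked Encodings document: a ULEB128 varint header holding the run length shifted left by one bit (low bit 0 marks an RLE run), followed by the repeated value in ceil(bit_width / 8) bytes. The function names uleb128 and encode_rle_run are hypothetical. Bit-packed runs and any length prefix that a page places before the level data are left out.

```python
def uleb128(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_rle_run(value, count, bit_width):
    """Encode one RLE run of the RLE / bit-packing hybrid (sketch only)."""
    header = uleb128(count << 1)  # low bit 0 = RLE run
    value_bytes = (bit_width + 7) // 8  # value padded to whole bytes
    return header + value.to_bytes(value_bytes, "little")


# 1000 null entries: definition level 0 repeated 1000 times, bit width 1
encoded = encode_rle_run(0, 1000, 1)
```

Under these assumptions the 1000 definition levels take three bytes: two for the varint header and one for the value.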

  • Last modified: 2019/10/28 22:48
  • by phreazer