Table of Contents
Formats
Parquet
Encodings
Implementations
Formats
Parquet
Column-oriented
Dictionary encoding
Record-shredding and assembly algorithm (Dremel encoding):
https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
N columns
M row-groups
File metadata: stores the start location of every column's metadata
Metadata is written after the data, which allows single-pass writing
Readers first read the file metadata to find the column chunks they need
Non-nested schema: a column of 1000 nulls is run-length encoded in the definition levels as (0, 1000 times), with nothing else stored
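The null handling above can be sketched in a few lines of Python. This is only an illustration, not Parquet's actual byte layout: the helper names definition_levels and run_length_encode are made up for this example, and the runs are kept as plain (level, count) tuples instead of the bit-level RLE/bit-packing format.

```python
from itertools import groupby


def definition_levels(values):
    """For a flat optional column: 0 marks a null, 1 marks a present value."""
    return [0 if value is None else 1 for value in values]


def run_length_encode(levels):
    """Collapse consecutive equal levels into (level, run_length) pairs."""
    return [(level, sum(1 for _ in run)) for level, run in groupby(levels)]


# A column holding 1000 nulls collapses to a single run.
all_nulls = [None] * 1000
runs = run_length_encode(definition_levels(all_nulls))
print(runs)  # [(0, 1000)]

# A mixed column yields one run per stretch of nulls or values.
mixed_runs = run_length_encode(definition_levels([7, None, None, 9]))
print(mixed_runs)  # [(1, 1), (0, 2), (1, 1)]
```

Because only the levels are stored for nulls, an all-null column costs one run no matter how many rows it has.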
Encodings
Plain (PLAIN = 0)
Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
…
https://github.com/apache/parquet-format/blob/master/Encodings.md
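As a rough illustration of the idea behind dictionary encoding, the sketch below replaces each value with an index into a list of distinct values. The function names and the sample column are hypothetical, and the result is kept as Python lists; in a real Parquet file the dictionary lives in a dictionary page and the indices are written with the RLE / bit-packing hybrid.

```python
def dictionary_encode(values):
    """Return (dictionary, indices), with distinct values in first-seen order."""
    dictionary = []
    positions = {}  # value -> index into dictionary
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column by looking up each index."""
    return [dictionary[i] for i in indices]


column = ["DE", "FR", "DE", "DE", "US", "FR"]
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['DE', 'FR', 'US']
print(indices)     # [0, 1, 0, 0, 2, 1]
```

The encoding pays off when a column has few distinct values: small integer indices replace repeated long values, and runs of equal indices compress further under run-length encoding.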
Implementations
Read/Write in Spark
Parallel column read in pyarrow:
https://wesmckinney.com/blog/python-parquet-multithreading/