Formats
Parquet
- Column-oriented
- Dictionary encoding
- Record shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
- N columns
- M row groups, each holding one column chunk per column
- File metadata: records the start location of every column's metadata
- Metadata is written after the data, so a file can be written in a single pass
- Readers should read the file metadata first to locate the column chunks they need
- Non-nested schema: nulls are recorded in the definition levels, which are run-length encoded; a column with 1000 nulls is stored as (0, 1000 times)
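
The "metadata after the data" layout can be illustrated with a small sketch. It relies on the documented file layout: a 4-byte "PAR1" magic at both ends, with a 4-byte little-endian metadata length just before the trailing magic. The function name locate_footer and the synthetic demo buffer are assumptions made for illustration; the fake metadata is not real Thrift-encoded metadata.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # 4-byte magic at both the start and the end of a Parquet file


def locate_footer(buf: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the serialized file metadata inside buf."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the metadata length
    # as a little-endian unsigned 32-bit integer.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_start, meta_len


# Synthetic stand-in for a file (hypothetical payloads, not real Parquet data).
fake_data = b"\x00" * 100
fake_meta = b"not-real-thrift-metadata"
demo = MAGIC + fake_data + fake_meta + struct.pack("<I", len(fake_meta)) + MAGIC

offset, length = locate_footer(demo)
footer_bytes = demo[offset:offset + length]
```

A real reader would seek to the end of the file, read these last 8 bytes, then read and deserialize the metadata before touching any column chunk.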
Encodings
- Plain encoding (PLAIN = 0)
- Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
- Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
- …
https://github.com/apache/parquet-format/blob/master/Encodings.md
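
A rough sketch of the idea behind dictionary and run-length encoding follows. It works on Python lists and (value, run length) pairs only; it does not reproduce the on-disk byte layout or the bit-packing half of the hybrid described in Encodings.md. The helper names and sample values are assumptions made for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Map each value to an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in dictionary
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def run_length_encode(ints):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(ints)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [v for v, n in runs for _ in range(n)]


# Dictionary-encode a repetitive column, then run-length encode the indices.
cities = ["Oslo"] * 3 + ["Lima"] * 2 + ["Oslo"]
dictionary, indices = dictionary_encode(cities)
runs = run_length_encode(indices)

# Definition levels of a non-nested column holding 1000 nulls: a single run.
null_levels = run_length_encode([0] * 1000)
```

Here dictionary is ["Oslo", "Lima"], runs is [(0, 3), (1, 2), (0, 1)], and null_levels is [(0, 1000)], which matches the "(0, 1000 times)" note on nulls.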
Implementations
- Read/Write in Spark
- Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/