Formats
Parquet
- Column-oriented
- Dictionary encoding
- Record shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
- N columns, each split across the row groups
- M row groups, each holding one column chunk per column
- Metadata: the file metadata contains the start locations of all column metadata
- Metadata is written after the data, which allows single-pass writing
- Readers should read the file metadata first to locate the column chunks they need
- Non-nested schema: nulls are encoded in the definition levels with run-length encoding (e.g. a column of 1000 nulls becomes "0, 1000 times")
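The dictionary and run-length encoding bullets above can be illustrated with a small sketch. This is a simplified illustration only, not Parquet's actual on-disk encoding (which uses a hybrid of run-length encoding and bit-packing); the function names `dictionary_encode` and `run_length_encode` and the sample data are assumptions made for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indices = []      # one small integer per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(levels):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(level, sum(1 for _ in run)) for level, run in groupby(levels)]


# Flat optional column: definition level 0 means null, 1 means a value is present.
column = [None] * 1000
definition_levels = [0 if v is None else 1 for v in column]
print(run_length_encode(definition_levels))  # [(0, 1000)]

cities = ["Oslo", "Paris", "Oslo", "Oslo", "Paris"]
print(dictionary_encode(cities))  # (['Oslo', 'Paris'], [0, 1, 0, 0, 1])
```

A column of 1000 nulls collapses to a single (0, 1000) pair, which is why all-null or mostly-null columns cost almost nothing to store.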
Implementations
- Read/Write in Spark
- Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/