distributed_computing:data_processing:formats [AE Wiki]

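Column-oriented: the values of each column are stored contiguously
Dictionary encoding: repeated values are stored once in a dictionary and referenced by index
Record shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper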
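
Dictionary encoding can be illustrated with a minimal Python sketch. The function names `dictionary_encode` and `dictionary_decode` are hypothetical and chosen for illustration only; real columnar formats store the dictionary and the indices in a compact binary form rather than as Python lists.

```python
def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one small integer per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


countries = ["DE", "US", "DE", "DE", "FR"]
dictionary, indices = dictionary_encode(countries)
restored = dictionary_decode(dictionary, indices)
```

The saving comes from columns with few distinct values: each long or repeated value is stored once, and the column itself shrinks to a list of small integers.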

N columns
M row groups
Metadata: records the start locations of all column metadata
Metadata is written after the data, which allows single-pass writing
Readers should read the metadata first to locate the column chunks
Non-nested schema: nulls are encoded with run-length encoding, e.g. 1000 consecutive nulls as (0, 1000 times)
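
The layout notes above (metadata after the data, metadata read first, run-length-encoded nulls) can be sketched as a toy file format in Python. This is an illustration under stated assumptions, not the real on-disk encoding: the JSON payloads, the 4-byte length trailer, and the names `write_file` and `read_column` are all hypothetical.

```python
import io
import json
import struct


def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into a flat list."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


def write_file(buf, columns):
    """Single pass: write each column chunk, then a footer with their offsets."""
    footer = {}
    for name, values in columns.items():
        # Flat schema assumed: 0 marks a null, 1 marks a present value.
        present = rle_encode([0 if v is None else 1 for v in values])
        non_null = [v for v in values if v is not None]
        payload = json.dumps({"present": present, "values": non_null}).encode()
        footer[name] = {"offset": buf.tell(), "length": len(payload)}
        buf.write(payload)
    footer_bytes = json.dumps(footer).encode()
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # trailer: footer length


def read_column(buf, name):
    """Read the footer first, then seek straight to the requested chunk."""
    buf.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    buf.seek(-4 - footer_len, io.SEEK_END)
    footer = json.loads(buf.read(footer_len))
    entry = footer[name]
    buf.seek(entry["offset"])
    chunk = json.loads(buf.read(entry["length"]))
    values = iter(chunk["values"])
    return [next(values) if flag else None for flag in rle_decode(chunk["present"])]


buf = io.BytesIO()
write_file(buf, {"a": [None] * 1000, "b": [1, None, 2]})
column_b = read_column(buf, "b")
```

Because the writer only learns each chunk's offset after writing it, placing the metadata at the end avoids a second pass over the data. A reader that wants only column `b` never touches the bytes of column `a`.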

Read/Write in Spark
Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/

Last modified: 2019/10/28 22:45
by phreazer