Differences
This shows you the differences between two versions of the page.
distributed_computing:data_processing:formats [2019/10/28 21:36] – created phreazer | distributed_computing:data_processing:formats [2019/10/28 21:48] (current) – [Parquet] phreazer | ||
---|---|---|---|
Line 5: | Line 5: | ||
* Column-oriented | * Column-oriented | ||
* Dictionary encoding | * Dictionary encoding | ||
+ | * Record-shredding and assembly algorithm (Dremel encoding): https:// | ||
+ | |||
+ | * N columns | ||
+ | * M row-groups | ||
+ | * Metadata: start locations of all column metadata | ||
+ | * Metadata is written after the data to allow single-pass writing | ||
+ | * Readers should read the metadata first to locate the column chunks | ||
+ | * Non-nested schema: nulls are encoded in the definition levels with run-length encoding, e.g. 1000 nulls become (0, 1000 times) | ||
+ | |||
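The "metadata after the data" layout in the list above can be sketched with a toy file. This is a minimal illustration, not a real Parquet reader: it assumes only the documented framing (4-byte magic `PAR1` at both ends, and a 4-byte little-endian footer length just before the trailing magic). The names `build_file` and `read_footer` are hypothetical, and the footer here is placeholder bytes, whereas a real footer is Thrift-serialized file metadata.

```python
import struct

MAGIC = b"PAR1"  # 4-byte magic number at the start and end of the file


def build_file(data: bytes, footer: bytes) -> bytes:
    """Toy writer: data first, footer last, so one pass over the data suffices."""
    return MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC


def read_footer(buf: bytes) -> bytes:
    """Toy reader: start from the tail to find where the metadata begins."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the footer length.
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    footer_end = len(buf) - 8
    if footer_len > footer_end - 4:
        raise ValueError("footer length exceeds file size")
    return buf[footer_end - footer_len:footer_end]


toy = build_file(b"column chunk bytes", b"placeholder-metadata")
print(read_footer(toy))
```

Because the writer only learns the final chunk offsets once all data has been emitted, putting the metadata at the end avoids seeking back to patch a header.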
+ | ==== Encodings ==== | ||
+ | |||
+ | * Plain = 0 | ||
+ | * Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) | ||
+ | * Run Length Encoding / Bit-Packing Hybrid (RLE = 3) | ||
+ | * ... | ||
+ | |||
+ | https:// | ||
+ | |||
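The ideas behind dictionary encoding and run-length encoding can be sketched in a few lines. This is a simplified illustration of the concepts only: it does not produce Parquet's actual on-disk byte format (the RLE / bit-packing hybrid), and the function names `dictionary_encode` and `run_length_encode` are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value by its index into a dictionary of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


dictionary, indices = dictionary_encode(["de", "de", "us", "de", "us", "us"])
print(dictionary, indices)

# Definition levels of 1000 nulls in a non-nested column: a single run.
print(run_length_encode([0] * 1000))
```

Dictionary indices are small integers with many repeats, which is why they combine well with run-length encoding and bit-packing.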
==== Implementations ==== | ==== Implementations ==== | ||
- | * Read/Write in Spark | + | |
- | * Parallel column read in pyarrow: https:// | + | * Parallel column read in pyarrow: https:// |