Differences

This shows you the differences between two versions of the page.

--- distributed_computing:data_processing:formats [2019/10/28 22:36] – created phreazer
+++ distributed_computing:data_processing:formats [2019/10/28 22:48] (current) – [Parquet] phreazer
@@ Line 5: / Line 5: @@
   * Column-oriented
   * Dictionary encoding
+  * Record-shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
+  * N columns, each stored as one column chunk per row group
+  * M row groups (horizontal partitions of the rows)
+  * Metadata: records the start locations of all column metadata
+  * Metadata is written after the data, which allows single-pass writing
+  * Readers should read the metadata first to locate the column chunks
+  * Non-nested schema: nulls are encoded in the definition levels with run-length encoding (e.g. 0 repeated 1000 times)
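The "metadata after the data" layout noted above can be illustrated with a small sketch. This is a toy format, not a Parquet implementation: the `PAR1` magic and the 4-byte little-endian footer length follow the Parquet file layout, but the JSON metadata (real Parquet uses Thrift), the single implicit row group, and the helper names `write_toy_file` and `read_column` are assumptions made for illustration.

```python
import io
import json
import struct

MAGIC = b"PAR1"  # Parquet's 4-byte magic; the rest of this format is simplified


def write_toy_file(columns):
    """Write column chunks in one pass, then the metadata, then the footer."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    metadata = {}
    for name, values in columns.items():
        chunk = json.dumps(values).encode("utf-8")
        # The chunk offset is only known at write time, so it is collected
        # here and written out after all the data.
        metadata[name] = {"offset": buf.tell(), "length": len(chunk)}
        buf.write(chunk)
    meta_bytes = json.dumps(metadata).encode("utf-8")
    buf.write(meta_bytes)
    # Footer: 4-byte little-endian metadata length, then the magic again.
    buf.write(struct.pack("<I", len(meta_bytes)))
    buf.write(MAGIC)
    return buf.getvalue()


def read_column(data, name):
    """Read the footer first, then jump straight to one column chunk."""
    if data[-4:] != MAGIC:
        raise ValueError("missing trailing magic")
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    meta_start = len(data) - 8 - meta_len
    metadata = json.loads(data[meta_start:meta_start + meta_len])
    entry = metadata[name]
    chunk = data[entry["offset"]:entry["offset"] + entry["length"]]
    return json.loads(chunk)


toy = write_toy_file({"id": [1, 2, 3], "city": ["Oslo", "Rome", "Oslo"]})
print(read_column(toy, "city"))
```

Because the writer never seeks backwards, it can stream to an append-only sink. The reader pays for this with one extra read at the end of the file, after which it can skip every column it does not need.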
+==== Encodings ====
+  * Plain encoding (PLAIN = 0)
+  * Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
+  * Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
+  * ...
+https://github.com/apache/parquet-format/blob/master/Encodings.md
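The ideas behind dictionary and run-length encoding can be sketched in a few lines. This is a conceptual illustration only: it does not produce Parquet's on-disk byte layout (the real RLE / bit-packing hybrid adds varint run headers and bit-packed runs), and the function names `dictionary_encode`, `run_length_encode` and `run_length_decode` are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Return (dictionary, indices): each distinct value is stored once."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


dictionary, indices = dictionary_encode(["Oslo", "Rome", "Oslo", "Oslo"])
null_levels = run_length_encode([0] * 1000)  # 1000 nulls in a flat column
```

The two encodings compose well: dictionary indices are small integers that often repeat, so the index stream is itself a good candidate for run-length encoding.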
 ==== Implementations ====
-* Read/Write in Spark
+  * Read/Write in Spark
-* Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/
+  * Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/