====== Formats ======

===== Parquet =====

  * Column-oriented
  * Dictionary encoding
  * Record-shredding and assembly algorithm (Dremel encoding): https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
  * N columns
  * M row groups
  * File metadata: stores the start locations of all column metadata
  * Metadata is written after the data, so a file can be written in a single pass
  * Readers should read the file metadata first to locate the column chunks
  * Non-nested schema: nulls are encoded with run-length encoding, e.g. 1000 consecutive nulls are stored as one run (0, 1000 times)

==== Encodings ====

  * Plain (PLAIN = 0)
  * Dictionary encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
  * Run Length Encoding / Bit-Packing Hybrid (RLE = 3)
  * ...

https://github.com/apache/parquet-format/blob/master/Encodings.md

==== Implementations ====

  * Read/Write in Spark
  * Parallel column read in pyarrow: https://wesmckinney.com/blog/python-parquet-multithreading/

distributed_computing/data_processing/formats.txt Last modified: 2019/10/28 21:48 by phreazer
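
The two ideas noted above for a non-nested column, run-length-encoded nulls and dictionary encoding, can be illustrated with a small pure-Python sketch. This is only a conceptual model: it does not reproduce Parquet's actual byte layout (the RLE / bit-packing hybrid), and all function names (''definition_levels'', ''rle_encode'', ''rle_decode'', ''dictionary_encode'', ''decode_column'') are hypothetical helpers made up for this note, not part of pyarrow or any Parquet library. It assumes a flat optional column, where a definition level of 0 means null and 1 means a value is present.

```python
from itertools import groupby


def definition_levels(values):
    """Return 0 for a null and 1 for a present value (flat optional column)."""
    return [0 if v is None else 1 for v in values]


def rle_encode(levels):
    """Collapse consecutive equal levels into (level, run_length) pairs."""
    return [(level, sum(1 for _ in run)) for level, run in groupby(levels)]


def rle_decode(runs):
    """Expand (level, run_length) pairs back into a flat list of levels."""
    levels = []
    for level, count in runs:
        levels.extend([level] * count)
    return levels


def dictionary_encode(values):
    """Replace each non-null value by its index in a dictionary of distinct values."""
    dictionary = []   # distinct values in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one index per non-null value
    for v in values:
        if v is None:
            continue  # nulls are represented by the definition levels only
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def decode_column(dictionary, indices, runs):
    """Rebuild the original column from dictionary, indices and level runs."""
    remaining = iter(indices)
    return [
        dictionary[next(remaining)] if level == 1 else None
        for level in rle_decode(runs)
    ]


# 1000 nulls followed by three values, two of them equal.
column = [None] * 1000 + ["a", "b", "a"]
runs = rle_encode(definition_levels(column))
dictionary, indices = dictionary_encode(column)
restored = decode_column(dictionary, indices, runs)
```

In this sketch the 1000 leading nulls collapse into the single run ''(0, 1000)'', and the three strings are stored as two dictionary entries plus three small integer indices. Real files can be inspected for the encodings actually used via the column chunk metadata exposed by Parquet readers.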