Is it possible for Parquet to compress the summary file (_metadata) in an MR job?


I'm using a MapReduce job to convert data and store the result in Parquet format.

The summary file (_metadata) is generated correctly, but it is very large (over 5 GB). Is there a way to reduce its size?

Credit to Alex Levenson and Ryan Blue:

Alex Levenson:

You can push the reading of the summary file down to the mappers instead of reading it on the submitter node:

ParquetInputFormat.setTaskSideMetaData(conf, true);

(Ryan Blue: This is the default from 1.6.0 forward.)

Or by setting "parquet.task.side.metadata" to true in your configuration. We had a similar issue: by default, the client reads the summary file on the submitter node, which takes a lot of time and memory. This flag fixes the issue by instead reading each individual file's metadata from the file footer in the mappers (each mapper reads only the metadata it needs).
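A minimal sketch of a job driver that enables task-side metadata via the configuration key quoted above. The job name and the commented-out setup steps are placeholders; the helper-method signature may vary between parquet-mr versions, so the string property is used here:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskSideMetadataDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Have each mapper read the footer of its own input file,
        // instead of the submitter node loading the whole _metadata
        // summary file into memory up front.
        conf.setBoolean("parquet.task.side.metadata", true);

        Job job = Job.getInstance(conf, "parquet-convert");
        // ... set input format (ParquetInputFormat), mapper,
        //     output format, and input/output paths as usual.
    }
}
```

On parquet-mr 1.6.0 and later this is already the default, so the flag mainly matters for older versions or when the property has been overridden.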

Another option, which we've talked about in the past, is to disable creating the metadata file at all. We've seen that creating it can be expensive too, and if you use the task-side metadata approach, it's never used.

(Ryan Blue: There's an option to suppress the files, which I recommend. File metadata is handled on the tasks, so there's no need for the summary files.)
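A sketch of the suppression option Ryan Blue mentions, assuming the "parquet.enable.summary-metadata" output-side property used by older parquet-mr releases (newer versions replace it with "parquet.summary.metadata.level"; check your version's documentation):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SuppressSummaryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Skip writing the _metadata / _common_metadata summary files
        // at job commit. With task-side metadata enabled, readers never
        // consult them, so nothing is lost.
        conf.setBoolean("parquet.enable.summary-metadata", false);

        Job job = Job.getInstance(conf, "parquet-convert-no-summary");
        // ... set mapper, ParquetOutputFormat, and paths as usual.
    }
}
```

This avoids both the cost of writing a multi-gigabyte summary file and the cost of reading it back.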

