Is it possible for Parquet to compress the summary file (_metadata) in the MR job?
Right now I'm using a MapReduce job to convert data and store the result in Parquet format.
The summary file (_metadata) is generated correctly, but the problem is that it is big (over 5 GB). Is there a way to reduce its size?
Credits to Alex Levenson and Ryan Blue:
Alex Levenson:
You can push the reading of the summary file to the mappers instead of reading it on the submitter node:

ParquetInputFormat.setTaskSideMetadata(conf, true);
(Ryan Blue: this is the default from 1.6.0 forward.)
Or set "parquet.task.side.metadata" to true in the configuration. We had a similar issue: by default, the client reads the summary file on the submitter node, which takes a lot of time and memory. This flag fixes the issue by instead reading each individual file's metadata from the file footer in the mappers, so each mapper reads only the metadata it needs.
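The two equivalent ways of enabling task-side metadata described above might look like this in a driver (a sketch, not a full job; depending on your parquet-mr version the package may be `parquet.hadoop` rather than `org.apache.parquet.hadoop`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetInputFormat;

Configuration conf = new Configuration();

// Option 1: via the ParquetInputFormat helper.
ParquetInputFormat.setTaskSideMetadata(conf, true);

// Option 2: set the property directly (equivalent).
conf.setBoolean("parquet.task.side.metadata", true);
```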
Another option, which we've talked about in the past, is to disable creating the metadata file at all. We've seen that creating it can be expensive too, and if you use the task-side metadata approach, it's never used.
(Ryan Blue: There's an option to suppress the files, which I recommend. File metadata is handled on the tasks, so there's no need for the summary files.)
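Suppressing the summary files on the write side is a configuration setting; a minimal sketch, assuming parquet-mr (older versions expose the boolean key "parquet.enable.summary-metadata", while newer versions replace it with the level-based "parquet.summary.metadata.level"):

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();

// Older parquet-mr: disable writing the _metadata summary files entirely.
conf.setBoolean("parquet.enable.summary-metadata", false);

// Newer parquet-mr: the equivalent level-based setting.
// conf.set("parquet.summary.metadata.level", "NONE");
```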