Computing metrics is slow for the .parquet format

datauser Registered Posts: 2
Hello,

My DSS instance computes metrics on datasets stored in HDFS:

- a 6 GB JSON file takes around 1 minute

- a 4.5 GB Parquet file takes more than 12 minutes

Is it possible to reduce this delay for .parquet files?



Thanks for your answer.

Best Regards

Answers

  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539

    Hello,

    What is the computation setting for your metrics? You can find it under Dataset > Status > Edit > Edit computation settings.

    If your dataset comes from HDFS (your case), I advise selecting only the Hive or Impala engine (check with your Hadoop admin if Impala is installed). Note that Impala should be way faster than Hive.
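    If you prefer to script this instead of using the UI, the sketch below shows the general idea with the public dataikuapi client. It is a minimal sketch, assuming a recent DSS version: the host, API key, project key, and dataset name are placeholders, and the exact layout of the raw settings JSON (the "metrics" key in particular) is an assumption to verify on your instance.

    ```python
    # Minimal sketch using the dataikuapi client (pip install dataiku-api-client).
    # Host, API key, project key, and dataset name are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")
    dataset = project.get_dataset("my_parquet_dataset")

    # Inspect the dataset settings; the "metrics" key holding the engine
    # configuration is an assumption -- check the raw JSON on your instance.
    settings = dataset.get_settings()
    print(settings.get_raw().get("metrics"))

    # Trigger a metrics computation run and print the raw result.
    result = dataset.compute_metrics()
    print(result)
    ```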

    If your dataset comes from a regular filesystem instead, then indeed the only way for DSS to compute metrics such as the record count is to stream the entire file, which can take time.
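    As an aside, if you just want to sanity-check the record count of a Parquet file outside DSS, the Parquet footer already stores the row count, so it can be read without streaming the data pages. A minimal sketch with pyarrow, where the file path is a placeholder (for a file on HDFS you would additionally pass a pyarrow filesystem object):

    ```python
    # Minimal sketch: read the record count from a Parquet footer with pyarrow.
    # Only the footer is read, so no data pages are streamed.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("/path/to/file.parquet")  # placeholder path
    print(pf.metadata.num_rows)        # total record count
    print(pf.metadata.num_row_groups)  # number of row groups in the file
    ```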

    Cheers,

    Alex

  • datauser Registered Posts: 2
    Hello,
    Thanks for your answer.

    In the current configuration, I unselected Hive because it is not installed, so checking it generates an error. Stream, SQL, and Impala are checked, but Impala is not installed either, so I think DSS falls back to the Stream engine.
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539
    In that case, the best option is to ask your Hadoop admin, at least for a Hive configuration. You can point them to the documentation: https://doc.dataiku.com/dss/latest/hadoop/hive.html
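    Once HiveServer2 is running, a quick connectivity check from the DSS node can confirm that DSS will be able to reach it. Here is a minimal sketch with the PyHive library; the hostname and port are placeholders, and a secured cluster will need authentication options not shown here:

    ```python
    # Minimal connectivity check against HiveServer2 with PyHive
    # (pip install "pyhive[hive]"; this may also pull in thrift/sasl deps).
    # Host and port are placeholders; secured clusters need extra auth options.
    from pyhive import hive

    conn = hive.connect(host="hiveserver2.example.com", port=10000)
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")  # list tables visible to this session
    print(cursor.fetchall())
    cursor.close()
    conn.close()
    ```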