
Compute metrics are slow for the .parquet format

Level 1
Hello,

My DSS instance computes metrics from HDFS:

- a 6 GB JSON file takes around 1 minute

- a 4.5 GB Parquet file takes more than 12 minutes

Is it possible to reduce the computation time for .parquet files?



Thanks for your answer.

Best Regards
3 Replies
Dataiker

Hello,



What is the computation setting for your metrics? You can find it in Dataset > Status > Edit > Edit computation settings.

If your dataset comes from HDFS (your case), I advise selecting only the Hive or Impala engine (check with your Hadoop admin whether Impala is installed). Note that Impala should be way faster than Hive.
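
If you prefer to check or drive this outside the UI, the public Python API client can read the dataset settings and trigger a metrics run. A minimal sketch, assuming a reachable DSS instance and an API key; the host, project key, and dataset name are placeholders, and the exact layout of the "metrics" block in the raw settings JSON may differ between DSS versions:

import dataikuapi

# Placeholder connection details -- replace with your own
client = dataikuapi.DSSClient("http://localhost:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("my_parquet_dataset")

# The metrics configuration (probes, engines) lives in the raw
# settings JSON; the "metrics" key layout is an assumption here
print(dataset.get_settings().get_raw().get("metrics", {}))

# Trigger a metrics computation and inspect the returned run report
print(dataset.compute_metrics())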



If your dataset came from a regular filesystem, then indeed the only way for DSS to compute metrics such as the record count is to stream the entire file, which can take time.
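
As an aside on why Parquet feels paradoxically slow here: the format itself stores the row count in its footer metadata, so a metadata read is nearly free, while a streaming scan has to decompress and decode every row group. A minimal sketch outside DSS, assuming pyarrow is installed and the file has been copied locally (the path is a placeholder); it illustrates the format, not how the DSS Stream engine works internally:

import pyarrow.parquet as pq

# Cheap: the row count is stored in the Parquet footer,
# so no data pages are read or decoded
print(pq.ParquetFile("/tmp/mydata.parquet").metadata.num_rows)

# Expensive: a full read decompresses and decodes every
# row group, which is roughly what a streaming count costs
print(pq.read_table("/tmp/mydata.parquet").num_rows)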



Cheers,



Alex

Level 1
Author
Hello,
Thanks for your answer.

In the current configuration, I unselected Hive because it is not installed, so it generates an error if I check it.
Stream, SQL, and Impala are checked, but Impala is not installed either, so I think it uses the Stream mode.
Dataiker
The best option, then, is to ask your Hadoop admin, at least for a Hive database configuration. You can point them to the documentation: https://doc.dataiku.com/dss/latest/hadoop/hive.html
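
Once Hive is configured, the count is pushed down to the cluster as a single query instead of being streamed through DSS. A minimal sketch of that idea using PyHive directly, assuming a reachable HiveServer2 (host, port, and table name are placeholders); inside DSS you would simply select the Hive engine in the computation settings instead:

from pyhive import hive

# Placeholder connection details for HiveServer2
conn = hive.Connection(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# The count runs on the cluster; only the single result row
# travels back, instead of the whole file being streamed
cursor.execute("SELECT COUNT(*) FROM my_parquet_table")
print(cursor.fetchone()[0])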