Compute metrics are slow for the .parquet format

datauser
Level 1
Hello,

My DSS instance computes metrics on data stored in HDFS:

- a 6 GB JSON file takes about 1 minute

- but a 4.5 GB Parquet file takes more than 12 minutes

Is it possible to reduce the delay for .parquet files?

Thanks for your answer.

Best Regards
Alex_Combessie
Dataiker Alumni

Hello,

What are the computation settings for your metrics? You can find them under Dataset > Status > Edit > Edit computation settings:

[screenshot: Edit computation settings dialog]
If your dataset comes from HDFS (your case), I advise selecting only the Hive or Impala engine (check with your Hadoop admin whether Impala is installed). Note that Impala should be way faster than Hive.
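If you want to compare timings without clicking through the UI, here is a minimal sketch using the Dataiku public API (dataikuapi) to re-trigger the metrics computation; the host URL, API key, project key, and dataset name are placeholders to adapt to your instance:

```python
# Minimal sketch, assuming the dataikuapi package and placeholder credentials.
# Re-runs the dataset's metrics with whatever engines are selected under
# Dataset > Status > Edit > Edit computation settings, so you can compare
# timings before and after enabling Hive or Impala.
import dataikuapi

client = dataikuapi.DSSClient("https://my-dss-host:11200", "MY_API_KEY")  # placeholders
project = client.get_project("MY_PROJECT")
dataset = project.get_dataset("my_parquet_dataset")

result = dataset.compute_metrics()  # runs with the configured engines
print(result)
```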

If your dataset came from a regular filesystem, then indeed the only way for DSS to compute metrics like the record count is to stream the entire file, which can take time.
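As a side note (outside DSS), the reason an engine helps so much on Parquet is that the format stores row counts in its footer metadata, so counting records does not require reading the data pages at all. A minimal sketch with pyarrow, assuming the file is accessible on a local path (reading directly from HDFS would need a filesystem object):

```python
# Minimal sketch, assuming pyarrow is installed and the file is on a local path.
# The row count comes from the Parquet footer; no data pages are scanned.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/path/to/file.parquet")  # placeholder path
print("rows:", pf.metadata.num_rows)            # footer metadata only
print("row groups:", pf.metadata.num_row_groups)
```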

Cheers,

Alex

datauser
Level 1
Author
Hello,
thanks for your answer.

In the current configuration, I unselected Hive because it is not installed, so it generates an error if I check it. Stream, SQL, and Impala are checked, but Impala is not installed either, so I think it falls back to the Stream mode.
Alex_Combessie
Dataiker Alumni
Best would be to ask your Hadoop admin then, at least to set up a Hive configuration. You can point them to the documentation: https://doc.dataiku.com/dss/latest/hadoop/hive.html