Computing metrics is slow for the .parquet format
datauser
Hello,
My DSS instance computes metrics from HDFS:
- for a 6 GB JSON file, it takes around 1 minute
- but for a 4.5 GB Parquet file, it takes more than 12 minutes
Is it possible to reduce the delay for .parquet files?
Thanks for your answer.
Best Regards
Answers
-
Hello,
What is the computation setting for your metrics? You can find it in Dataset > Status > Edit > Edit computation settings.
If your dataset comes from HDFS (your case), I advise selecting only the Hive or Impala engine (check with your Hadoop admin if Impala is installed). Note that Impala should be way faster than Hive.
If your dataset were stored on a regular filesystem, then indeed the only way for DSS to compute metrics such as the record count would be to stream the entire file, which can take time.
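For reference, you can also check which engines are currently enabled on the dataset from the public API. This is only a rough sketch assuming the dataikuapi Python client; the host, API key, project key and dataset name are placeholders, and the exact layout of the raw settings can vary between DSS versions, so dump it and see what is there:

```python
# Rough sketch (assumptions: dataikuapi client installed; host, API key, project key
# and dataset name are placeholders; the "metrics" section layout may differ between
# DSS versions, so inspect the full settings dump if the key is not found).
import json
import dataikuapi

client = dataikuapi.DSSClient("http://dss-host:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("my_parquet_dataset")

# Dump the dataset settings and look at the metrics configuration,
# which includes the engines enabled for metric computation.
raw_settings = dataset.get_settings().get_raw()
print(json.dumps(raw_settings.get("metrics", {}), indent=2))
```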
Cheers,
Alex
-
Hello,
thanks for your answer.
In the current configuration, I unselected Hive because it is not installed, so checking it generates an error.
Stream, SQL and Impala are checked, but Impala is not installed either, so I think it uses the Stream mode.
-
The best option is then to ask your Hadoop admin, at least for a Hive database configuration. You can point them to the documentation: https://doc.dataiku.com/dss/latest/hadoop/hive.html
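Once Hive (or Impala) is set up, a quick way to compare timings is to trigger the metrics run from a script instead of the UI. Again a minimal sketch assuming the dataikuapi client; the host, API key, project key and dataset name are placeholders:

```python
# Minimal sketch to time a metrics run; the engine used is whatever is configured
# on the dataset (Stream today, Hive/Impala once your admin has set it up).
import time
import dataikuapi

client = dataikuapi.DSSClient("http://dss-host:11200", "YOUR_API_KEY")
dataset = client.get_project("MYPROJECT").get_dataset("my_parquet_dataset")

start = time.time()
dataset.compute_metrics()  # runs the metrics configured on the dataset
print("Metrics computed in %.1f seconds" % (time.time() - start))
```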