Computing metrics is slow for the .parquet format

datauser Registered Posts: 2
Hello,

My DSS instance computes metrics on datasets stored in HDFS:

- a 6 GB JSON file takes around 1 minute

- a 4.5 GB Parquet file takes more than 12 minutes

Is it possible to reduce this delay for .parquet files?



Thanks for your answer.

Best Regards

Answers

  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539

    Hello,

    What is the computation setting for your metrics? You can find it under Dataset > Status > Edit > Edit computation settings.

    If your dataset comes from HDFS (your case), I advise selecting only the Hive or Impala engine (check with your Hadoop admin if Impala is installed). Note that Impala should be way faster than Hive.
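    If you prefer to script this instead of using the UI, the sketch below shows the general idea with the public dataikuapi client. It is a minimal sketch, assuming a recent DSS version: the host, API key, project key, and dataset name are placeholders, and the exact layout of the raw settings JSON (the "metrics" key in particular) is an assumption to verify on your instance.

    ```python
    # Minimal sketch using the dataikuapi client (pip install dataiku-api-client).
    # Host, API key, project key, and dataset name are placeholders.
    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MYPROJECT")
    dataset = project.get_dataset("my_parquet_dataset")

    # Inspect the dataset settings; the "metrics" key holding the engine
    # configuration is an assumption -- check the raw JSON on your instance.
    settings = dataset.get_settings()
    print(settings.get_raw().get("metrics"))

    # Trigger a metrics computation run and print the raw result.
    result = dataset.compute_metrics()
    print(result)
    ```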

    If your dataset comes from a regular filesystem instead, then indeed the only way for DSS to compute metrics such as the record count is to stream the entire file, which can take time.
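    As an aside, if you just want to sanity-check the record count of a Parquet file outside DSS, the Parquet footer already stores the row count, so it can be read without streaming the data pages. A minimal sketch with pyarrow, where the file path is a placeholder (for a file on HDFS you would additionally pass a pyarrow filesystem object):

    ```python
    # Minimal sketch: read the record count from a Parquet footer with pyarrow.
    # Only the footer is read, so no data pages are streamed.
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("/path/to/file.parquet")  # placeholder path
    print(pf.metadata.num_rows)        # total record count
    print(pf.metadata.num_row_groups)  # number of row groups in the file
    ```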

    Cheers,

    Alex

  • datauser Registered Posts: 2
    Hello,
    Thanks for your answer.

    In the current configuration, I unselected Hive because it is not installed, so checking it generates an error. Stream, SQL, and Impala are checked, but Impala is not installed either, so I think DSS falls back to the Stream engine.
  • Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539
    In that case, the best option is to ask your Hadoop admin, at least for a Hive configuration. You can point them to the documentation: https://doc.dataiku.com/dss/latest/hadoop/hive.html
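    Once HiveServer2 is running, a quick connectivity check from the DSS node can confirm that DSS will be able to reach it. Here is a minimal sketch with the PyHive library; the hostname and port are placeholders, and a secured cluster will need authentication options not shown here:

    ```python
    # Minimal connectivity check against HiveServer2 with PyHive
    # (pip install "pyhive[hive]"; this may also pull in thrift/sasl deps).
    # Host and port are placeholders; secured clusters need extra auth options.
    from pyhive import hive

    conn = hive.connect(host="hiveserver2.example.com", port=10000)
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")  # list tables visible to this session
    print(cursor.fetchall())
    cursor.close()
    conn.close()
    ```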