Does a Spark/Hive/Impala job always make a local copy of the data?

gnaldi62 (Neuron · Posts: 79)

Hi,

a quick question: when running a Spark job, is the target table always created as an external table? That is, does the job always run the SELECT first, then download the data locally, and finally load the data into the external table?

Giuseppe

Answers

  • JordanB (Dataiker · Posts: 296)

    Hi @gnaldi62,

    It depends on the setup. I recommend taking a look at the documentation on Spark Pipelines: https://doc.dataiku.com/dss/latest/spark/pipelines.html

    Here is an additional article that you might find helpful: https://knowledge.dataiku.com/latest/kb/data-prep/where-compute-happens.html

    Please let us know if you have any further questions.

    Thanks!

    Jordan

  • gnaldi62 (Neuron · Posts: 79)

    Hi Jordan,

    thank you for your reply, but we're missing a point. We can see that the computation is indeed performed by Spark
    on the cluster, but when the target table is populated we see the data pass through the local DSS node.

    With a Hive source table, a visual recipe executed with the Spark engine, and a target Parquet table,
    what we observe is the following:

    1) the computation (actually a SELECT) is performed by Spark;
    2) then chunks of data are saved locally on the DSS machine;
    3) a CREATE EXTERNAL TABLE is run and the table is populated with the chunked data stored locally
       (roughly the pattern sketched below).
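
    To make step 3 concrete, here is a rough sketch of the pattern we think we are seeing versus the one we would
    like. All database, table and path names are placeholders (not the actual DDL generated by DSS), and we assume
    `spark` is the active SparkSession and `result_df` the DataFrame produced by the SELECT in step 1:

    ```python
    # Illustration only: placeholder names, not the actual statements DSS generates.

    # Step 3 as observed: an external table is declared over a location whose data
    # files were first staged on the local DSS machine and then uploaded.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS tgt_db.target_table (
            id BIGINT,
            event_date STRING
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/tgt_db.db/target_table'
    """)

    # What we would like instead: the executors write the Parquet files straight to
    # the table location on HDFS, so nothing is streamed through the DSS node.
    result_df.write.mode("overwrite").parquet("hdfs:///warehouse/tgt_db.db/target_table")
    ```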

    Consider that the source table has almost 900,000,000 records and the target one will have about 100,000,000
    records; if the data are streamed through the local machine, the process takes a very long time, and we want
    to avoid this.

    If we convert the visual recipe to SparkSQL or HiveQL, the time to populate the table is drastically reduced,
    but in that case the last step (probably due to a coalesce or repartition) allocates just one executor to
    produce the files for the table.

    The documentation you referenced discusses the choice of compute engine, but we've already confirmed that the
    engine is Spark; we just want to avoid streaming the data locally through DSS when using visual recipes.
    Alternatively, we'd like more executors to be allocated when the files are created via SparkSQL or HiveQL
    (see the second sketch below).
    Hope this clarifies.
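
    As an afterthought, a minimal illustration (again with placeholder names, assuming `result_df` is the
    DataFrame computed by the recipe) of why a final coalesce would leave a single task writing the files:

    ```python
    # Illustration only: the number of partitions at write time is the number of
    # parallel tasks that produce the output files.
    result_df.coalesce(1).write.mode("overwrite").parquet("hdfs:///tmp/out_single")        # one writer task
    result_df.repartition(32).write.mode("overwrite").parquet("hdfs:///tmp/out_parallel")  # 32 writer tasks
    ```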

    Txs. Rgds.

    Giuseppe

  • JordanB (Dataiker · Posts: 296)

    Not all tasks have the data go through local files. HDFS and S3 datasets fully benefit from the Spark distributed nature out of the box.

    In order for Spark jobs to read HDFS datasets directly, you need to make sure that the user running the Spark job has the “Details readable by” permission on the connection.

    Having this flag allows the Spark job to access the URI of the HDFS dataset, which permits it to access the filesystem directly. If this flag is not enabled, DSS has to fall back to the slow path, in which the data is streamed through DSS, and this will very strongly degrade the performance of the Spark job. For further details, please see the following document: https://doc.dataiku.com/dss/latest/spark/datasets.html#interacting-with-dss-datasets
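
    For reference, the recipe code itself does not change between the two paths. A typical DSS PySpark recipe
    looks roughly like the sketch below (dataset names are placeholders); whether the read and write go through
    the direct HDFS path or the slow path depends only on the connection permission described above:

    ```python
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Placeholder dataset names. The same code runs on the fast path (executors read
    # and write HDFS directly) or the slow path (data streamed through DSS); the
    # "Details readable by" permission on the connection decides which one is used.
    input_ds = dataiku.Dataset("my_hive_input")
    df = dkuspark.get_dataframe(sqlContext, input_ds)

    output_ds = dataiku.Dataset("my_parquet_output")
    dkuspark.write_with_schema(output_ds, df)
    ```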

    Depending on the job, executors are dynamically launched and removed by the driver as required. If you are only seeing one executor allocated, that indicates that the cluster does not have enough resources to fulfill the entire request for multiple executors.
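
    If it helps, these are the standard Spark properties that govern how many executors a job can request (a
    minimal sketch, not DSS-specific; in DSS they would normally be set in the recipe's Spark configuration
    rather than in code):

    ```python
    from pyspark.sql import SparkSession

    # With dynamic allocation, the driver asks for executors as tasks queue up, within
    # the min/max bounds below. The cluster manager can still only grant what the
    # cluster has free, which is how a busy cluster ends up serving a job with a
    # single executor.
    spark = (
        SparkSession.builder
        .appName("executor-allocation-example")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "20")
        .config("spark.shuffle.service.enabled", "true")  # usually required with dynamic allocation on YARN
        .getOrCreate()
    )
    ```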

    Thanks!

    Jordan

  • gnaldi62 (Neuron · Posts: 79)

    Jordan,

    thanks for the clarification. Just a doubt: where is that flag?

    Giuseppe

  • JordanB (Dataiker · Posts: 296)

    Hi @gnaldi62,

    The flag is in the data connection settings (Administration >> Connections), under "Security settings", as shown in the screenshot below.

    [Screenshot: the connection's "Security settings" section]

    Please let us know if you have any additional questions.

    Thanks again!

    Jordan
