Does a Spark/Hive/Impala job always make a local copy of data ?

gnaldi62

Hi,

  a doubt: when running a Spark job, is the target table always created as an external table? That is, does
  the job always first run the SELECT, then download the data locally, and finally load the data into
  the external table?

Giuseppe

JordanB
Dataiker

Hi @gnaldi62,

It depends on the setup. I recommend taking a look at the documentation on Spark Pipelines: https://doc.dataiku.com/dss/latest/spark/pipelines.html

Here is an additional article that you might find helpful: https://knowledge.dataiku.com/latest/kb/data-prep/where-compute-happens.html

Please let us know if you have any further questions.

Thanks!

Jordan

 

gnaldi62
Author

Hi Jordan,

  thank you for your reply, but one point is still unclear. We see that the computation is indeed performed by Spark on the
  cluster, but when the target table is populated we see the data passing through the local DSS host.

  What we see with a Hive source table, a visual recipe executed with the Spark engine and a target Parquet table
  is the following:

  1) the computation (actually a SELECT) is performed by Spark;
  2) then chunks of data are saved locally on the DSS machine;
  3) a CREATE EXTERNAL TABLE is run and the table is populated with the chunked data stored locally.
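
  For illustration, what we would expect is something closer to the plain Spark flow below (just a sketch
  on our side; database, table and path names are made up), where the result goes from the executors
  straight to the target files, with no local staging on the DSS host:

```python
from pyspark.sql import SparkSession

# Just a sketch: database, table and path names below are placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1) the SELECT runs distributed on the cluster
result = spark.table("src_db.big_table").where("event_year = 2022")

# 2) + 3) the executors write the Parquet files straight to HDFS and an
# external table is registered on top of them; nothing is copied to the
# DSS host in between
(result.write
       .mode("overwrite")
       .option("path", "hdfs:///warehouse/tgt_db/target_table")  # external location
       .saveAsTable("tgt_db.target_table"))
```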

  Consider that the source table has almost 900,000,000 records and the target one will have about 100,000,000 records;
  if the data are streamed through the local DSS, the process will take a very long time. We want to avoid this.
  If we convert the visual recipe to SparkSQL or HiveQL, the time to populate the table is drastically
  reduced, but in that case the last step (probably due to a coalesce or repartition) allocates
  just one executor to produce the files for the table.
  The documentation you referenced mentions the compute engine, but we've already seen that this is
  Spark; we just want to avoid streaming the data through the local DSS host when using visual recipes.
  Alternatively, we'd like more executors to be allocated when the files are created with SparkSQL or Hive
  (see the sketch below for the kind of workaround we have in mind).
  Hope this clarifies.
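
  To be concrete, here is a rough sketch of that workaround (only a sketch on our side; all table, column
  and path names are placeholders): forcing a repartition just before the write should spread the file
  creation over several tasks instead of the single one that a coalesce(1) produces.

```python
from pyspark.sql import SparkSession

# Just a sketch: table, column and path names are placeholders.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
result = spark.table("src_db.big_table").where("event_year = 2022")

# coalesce(1) funnels all the output through a single task (one executor);
# repartition(N) lets N tasks write the Parquet files in parallel instead.
(result.repartition(200)   # choose N based on data volume / desired file size
       .write
       .mode("overwrite")
       .parquet("hdfs:///warehouse/tgt_db/target_table"))
```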

Txs. Rgds.

  Giuseppe

JordanB
Dataiker
 
Not all jobs have the data go through local files: HDFS and S3 datasets fully benefit from Spark's distributed nature out of the box.
 

In order for Spark jobs to read HDFS datasets directly, you need to make sure that the user running the Spark job has the "Details readable by" permission on the connection.

Having this flag allows the Spark job to access the URI of the HDFS dataset, which permits it to access the filesystem directly. If this flag is not enabled, DSS has to fall back to the slow path of streaming the data through the DSS host, which will very strongly degrade the performance of the Spark job. For further details, please see the following document: https://doc.dataiku.com/dss/latest/spark/datasets.html#interacting-with-dss-datasets
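
For reference, with that flag granted, a PySpark recipe reading the dataset looks roughly like the snippet below (a sketch along the lines of that documentation page; the dataset names are placeholders), and the HDFS files are then read directly by the executors rather than streamed through the DSS backend:

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Placeholder dataset names. With "Details readable by" granted on the
# connection, this DataFrame is backed directly by the HDFS files.
input_ds = dataiku.Dataset("my_hive_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Writing to an HDFS/Parquet output dataset stays distributed as well.
output_ds = dataiku.Dataset("my_parquet_output")
dkuspark.write_with_schema(output_ds, df)
```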

Depending on the job, executors are dynamically launched and removed by the driver as required. If you are only seeing one executor allocated, that indicates that the cluster does not have enough resources to fulfill the full request for multiple executors.
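
If you do want to raise the ceiling explicitly, the usual knobs are the Spark dynamic allocation settings, roughly as in the sketch below (values are purely illustrative; in DSS you would normally put them in the Spark configuration used by the recipe rather than in code):

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune to your cluster's capacity.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.executor.cores", "2")
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```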

Thanks!

Jordan

gnaldi62
Author

Jordan,

  thanks for the clarification. Just a doubt: where is that flag?

Giuseppe

JordanB
Dataiker

Hi @gnaldi62,

The flag is in the data connection settings (Administration >> Connections), under "Security settings", as shown below.

[Screenshot: "Security settings" section of the connection settings]

Please let us know if you have any additional questions.

Thanks again!

Jordan
