Does a Spark/Hive/Impala job always make a local copy of data?
Hi,
a doubt: when running a Spark job, is the target table always created as an External Table? That is, does the job always first run the SELECT, then download the data locally, and finally load the data into the External table?
Giuseppe
Answers
-
JordanB, Dataiker, Posts: 297
Hi @gnaldi62,
It depends on the setup. I recommend taking a look at the documentation on Spark Pipelines: https://doc.dataiku.com/dss/latest/spark/pipelines.html
Here is an additional article that you might find helpful: https://knowledge.dataiku.com/latest/kb/data-prep/where-compute-happens.html
Please let us know if you have any further questions.
Thanks!
Jordan
-
gnaldi62, Partner, Neuron, Posts: 79
Hi Jordan,
thank you for your reply, but we're missing a point. We see that the computation is indeed performed by Spark on the cluster, but when the target table is populated we see the data pass through the local DSS machine. With a Hive source table, a visual recipe executed with the Spark engine and a target Parquet table, what we observe is the following:
1) the computation (actually a SELECT) is performed by Spark;
2) then chunks of data are saved locally on the DSS machine;
3) a CREATE EXTERNAL TABLE is run and the table is populated with the chunked data stored locally.
Consider that the source table has almost 900,000,000 records and the target one will have 100,000,000 records; if the data are streamed locally, the process will take a very long time. We want to avoid this.
If we convert the visual recipe to SparkSQL or HiveQL, the time to populate the table drops drastically, but in that case the last step (probably due to a coalesce or repartition) allocates just one executor to produce the files for the table.
The documentation you referenced mentions the compute engine, but we've already verified that this is Spark; we just want to avoid streaming the data locally through DSS when using visual recipes. Alternatively, we'd like more executors to be allocated when the files are created using SparkSQL or Hive, along the lines of the sketch below.
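For illustration only, this is roughly the kind of write step we would like to end up with; the table names, path and partition count are made up, and it is a plain PySpark sketch rather than anything DSS actually generates:

```python
from pyspark.sql import SparkSession

# Hypothetical names throughout; this only illustrates repartition() vs coalesce(1).
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The heavy SELECT runs entirely on the cluster.
df = spark.sql("SELECT * FROM db.source_table")

# Repartitioning before the write lets many executors produce the Parquet files
# in parallel, instead of a single executor writing everything after coalesce(1).
(df.repartition(200)
   .write
   .mode("overwrite")
   .format("parquet")
   .option("path", "hdfs:///data/target_table")   # external location, made up
   .saveAsTable("db.target_table"))
```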
Hope this clarifies. Txs. Rgds.
Giuseppe
-
JordanB, Dataiker, Posts: 297
Hi @gnaldi62,
Not all tasks have the data go through local files. HDFS and S3 datasets fully benefit from Spark's distributed nature out of the box. In order for Spark jobs to read HDFS datasets directly, you need to make sure that the user running the Spark job has the "Details readable by" permission on the connection.
Having this flag allows the Spark job to access the URI of the HDFS dataset, which permits it to access the filesystem directly. If this flag is not enabled, DSS falls back to the slow path you described (streaming the data through the DSS machine), which very strongly degrades the performance of the Spark job. For further details, please see: https://doc.dataiku.com/dss/latest/spark/datasets.html#interacting-with-dss-datasets
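For reference, here is a minimal sketch of a PySpark recipe reading and writing DSS datasets through the dataiku.spark helpers; the dataset names are placeholders. With the "Details readable by" flag granted on the HDFS connection, the read and write below can stay on the cluster:

```python
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input dataset; with "Details readable by" on the connection, the
# executors read the underlying HDFS files directly (the fast path).
input_ds = dataiku.Dataset("my_hdfs_input")       # placeholder dataset name
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Write the output dataset the same way, rather than streaming through DSS.
output_ds = dataiku.Dataset("my_parquet_output")  # placeholder dataset name
dkuspark.write_with_schema(output_ds, df)
```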
Depending on the job, executors are dynamically launched and removed by the driver as required. If you only ever see one executor, that usually indicates that the cluster does not have enough resources to fulfill the full request for multiple executors.
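As a side note, the dynamic allocation behaviour is driven by standard Spark properties. Below is a minimal sketch with example values only; in DSS you would normally set these on the Spark configuration used by the recipe rather than hard-code them:

```python
from pyspark.sql import SparkSession

# Example values only; tune them to your cluster. In DSS, prefer setting these
# on the named Spark configuration or the recipe's advanced Spark settings.
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "20")
         .config("spark.executor.memory", "4g")
         .getOrCreate())
```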
Thanks!
Jordan
-
gnaldi62, Partner, Neuron, Posts: 79
Jordan,
thanks for the clarification. Just a doubt: where is that flag?
Giuseppe
-
JordanB, Dataiker, Posts: 297
Hi @gnaldi62,
The flag is in the data connection settings (Administration >> Connections), under "Security settings".
Please let us know if you have any additional questions.
Thanks again!
Jordan