A question: when running a Spark job, is the target table always created as an External Table? That is, does
the job always first run the SELECT, then download the data locally, and finally load the data into
the External table?
It depends on the setup. I recommend taking a look at the documentation on Spark Pipelines: https://doc.dataiku.com/dss/latest/spark/pipelines.html
Here is an additional article that you might find helpful: https://knowledge.dataiku.com/latest/kb/data-prep/where-compute-happens.html
Please let us know if you have any further questions.
Thank you for your reply, but we're still missing a point. We see that the computation is actually performed by Spark on the
cluster, but when the target table is populated we see the data pass through the local DSS instance.
With a Hive source table, a visual recipe executed with the Spark engine, and a target Parquet table, what we observe
is the following:
1) the computation (actually a SELECT) is performed by Spark;
2) then chunks of data are saved locally on the DSS machine;
3) a CREATE EXTERNAL TABLE is run and the table is populated with the chunked data stored locally.
Consider that the source table has almost 900,000,000 records and the target one will have 100,000,000 records;
if the data are streamed locally, the process will take a very long time. We want to avoid this.
If we convert the visual recipe to SparkSQL or HiveQL, the time to populate the table is drastically
reduced, but in this case the last step (probably due to a coalesce or repartition) allocates
just one executor to produce the files for the table.
The documentation you referenced discusses the compute engine, but we've already seen that this is
Spark; what we want to avoid is the step of streaming the data locally through DSS when using visual recipes.
Alternatively, we'd like more executors to be allocated when the files are created using SparkSQL or Hive.
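For example, what we'd like the final write step to look like is something along these lines (a rough PySpark sketch; the query, partition count and paths are just placeholders):

```python
# Rough sketch only; the query, the partition count and the paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The SELECT computed on the cluster by the recipe (placeholder query)
df = spark.sql("SELECT * FROM source_table")

# Repartition before writing so that several executors produce the Parquet
# files, instead of a coalesce(1)-style write done by a single executor
(df.repartition(64)
   .write
   .mode("overwrite")
   .parquet("hdfs:///path/to/target_table"))
```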
Hope this clarifies.
In order for Spark jobs to read HDFS datasets directly, you need to make sure that the user running the Spark job has the “Details readable by” permission on the connection.
Having this flag allows the Spark job to access the URI of the HDFS dataset, which permits it to access the filesystem directly. If this flag is not enabled, DSS falls back to a slow path in which the data is streamed through the DSS backend, which very strongly degrades the performance of the Spark job. For further details, please see the following document: https://doc.dataiku.com/dss/latest/spark/datasets.html#interacting-with-dss-datasets
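To illustrate, here is a minimal sketch of a PySpark recipe using the standard dataiku.spark helpers (dataset and column names are placeholders). The code is the same in both cases; with the flag granted, the read and write below operate directly on the HDFS files, whereas without it the data is streamed through the DSS backend:

```python
# Minimal DSS PySpark recipe sketch; dataset and column names are placeholders.
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the input HDFS dataset as a Spark DataFrame
input_ds = dataiku.Dataset("source_dataset")
df = dkuspark.get_dataframe(sqlContext, input_ds)

# Transformations are computed on the cluster
result = df.filter(df["some_column"].isNotNull())

# Write the output HDFS dataset
output_ds = dataiku.Dataset("target_dataset")
dkuspark.write_with_schema(output_ds, result)
```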
Depending on the job, executors are dynamically launched and removed by the driver as required. If you are only seeing one executor allocated, that indicates that the cluster does not have enough resources to fulfill the entire request for multiple executors.
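If you need to influence how many executors the driver can request, the standard Spark dynamic allocation properties can be set in the Spark configuration used by the recipe. For instance (values are purely illustrative):

```python
# Illustrative only: standard Spark dynamic allocation properties, here set
# programmatically via SparkConf. In DSS they would typically be entered as
# key/value pairs in the Spark configuration used by the recipe.
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")   # required for dynamic allocation on YARN
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20"))
```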
The flag can be found in the data connection settings (Administration >> Connections), under "Security settings".
Please let us know if you have any additional questions.