Parallel processing in Dataiku visual recipes
I am using Dataiku 12.5.2. How can I enable parallel processing when using a Sync recipe for the following cases:
- Filesystem Dataset to Filesystem Dataset
- JDBC Dataset to JDBC Dataset
- Or between Filesystem Dataset and JDBC Dataset?
Are the only available options duplicating the flow, partitioning the data, or using code recipes?
Operating system used: CentOS
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron
Those are indeed your options, but I think you are looking at this from the wrong angle. Neither JDBC nor Filesystem datasets support parallel processing natively. Note that partitioning in Dataiku is not really aimed at improving performance; it is meant to let you process specific partitions of the data without having to reprocess everything. There are also lots of caveats around using partitions in Dataiku.
So you should look at moving to technologies that support parallel processing natively, such as Databricks, Spark or Kubernetes.
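If you do go the code recipe route anyway, one common pattern is to split the data on a key range and copy each range in its own worker. Below is a minimal sketch in plain Python, assuming the data has a numeric `id` column. The names `split_ranges`, `copy_range` and `parallel_copy` are hypothetical helpers, not part of the Dataiku API, and the in-memory list only stands in for the real source reads and target writes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(min_id, max_id, n_chunks):
    """Split the inclusive range [min_id, max_id] into contiguous (lo, hi) ranges."""
    total = max_id - min_id + 1
    if total <= 0:
        return []
    size = -(-total // n_chunks)  # ceiling division
    return [(lo, min(lo + size - 1, max_id))
            for lo in range(min_id, max_id + 1, size)]


def copy_range(source_rows, bounds):
    """Hypothetical worker: return the rows whose id falls inside bounds.

    In a real recipe this is where each worker would read its slice from the
    source and write it to the target (assumption: one connection per worker).
    """
    lo, hi = bounds
    return [row for row in source_rows if lo <= row["id"] <= hi]


def parallel_copy(source_rows, min_id, max_id, n_chunks=4, max_workers=4):
    """Copy all rows by processing each id range in a thread pool."""
    ranges = split_ranges(min_id, max_id, n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input ranges
        parts = pool.map(lambda bounds: copy_range(source_rows, bounds), ranges)
        return [row for part in parts for row in part]
```

Calling `parallel_copy(rows, 1, 10, n_chunks=3)` on ten rows with ids 1 to 10 would process the ranges (1, 4), (5, 8) and (9, 10) in separate workers. Whether this is actually faster depends on the source and target, since the database or disk may be the bottleneck rather than the recipe.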
-
SangHoon Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 15 ✭✭✭
Thank you so much for the perfect answer.
-
WoW Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 4 ✭✭✭
Hello,
I also have a similar question.
Among the visual recipes, it seems that the Prepare recipe exposes thread and parallelism settings, but the other recipes have no thread settings.
Additionally, the setting does not seem to be applied, so I'd like to know how to make it take effect.
Thank you.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,090 Neuron
As per the documentation, "DSS creates an activity per dataset per partition", so the setting you refer to only gets used on partitioned datasets. Non-partitioned datasets don't run concurrent activities, which is the whole point of this post.
https://doc.dataiku.com/dss/latest/flow/limits.html#limiting-concurrent-executions
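To make the quoted rule concrete, here is a toy model in plain Python. It is not how DSS schedules jobs internally; `run_build` is a hypothetical function that runs one simulated "activity" per partition under a concurrency limit and reports the peak number running at once. A non-partitioned dataset is modelled as a single activity, so the limit never comes into play.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_build(partitions, max_concurrent):
    """Run one simulated activity per partition.

    Returns (number of activities, peak number running at the same time).
    """
    # Assumption: a non-partitioned dataset behaves like a single activity.
    activities = list(partitions) or ["NP"]
    lock = threading.Lock()
    running = 0
    peak = 0

    def activity(name):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)  # stand-in for the actual work
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        list(pool.map(activity, activities))
    return len(activities), peak
```

With an empty partition list the model runs exactly one activity regardless of `max_concurrent`, while four partitions with a limit of 2 run as four activities with at most two in flight at a time.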