Currently, only a few recipes support building all partitions. Other recipes require users to manually list the partitions to build. While it's possible to write a script that generates such a list from all the column values, it's time-consuming to repeat this step for every recipe, especially when the partition list changes frequently and the partition values need to be updated downstream. In the sync recipe, it's possible to select "all available" as the partitions to build setting. It would be great if this option were available for all recipes. This would enable multithreaded execution of complex flows.
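For anyone stuck with the script workaround today, here is a rough sketch of the "generate the list from the column values" step. It is plain Python and does not call any Dataiku API; the function names are made up for illustration, and the comma separator is an assumption about how a list of discrete partition values is entered, so adjust it to whatever your recipe expects.

```python
from typing import Iterable, List


def distinct_partition_values(column_values: Iterable[object]) -> List[str]:
    """Return the sorted, de-duplicated, non-empty values as strings."""
    seen = set()
    for value in column_values:
        if value is None:
            continue  # skip missing values, they cannot name a partition
        text = str(value).strip()
        if text:
            seen.add(text)
    return sorted(seen)


def to_partition_list(column_values: Iterable[object], separator: str = ",") -> str:
    """Join the distinct values into one string (separator is an assumption)."""
    return separator.join(distinct_partition_values(column_values))


if __name__ == "__main__":
    # Hypothetical values read from the partitioning column
    print(to_partition_list(["EU", "US", "EU", None, " APAC "]))
```

The pain point is that this has to be re-run and the result pasted into every recipe whenever the values change, which is exactly what an "all available" option would avoid.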
With Teradata datasets, many large tables simply don't fit into spool space because Teradata tables are highly distributed: users often have less than 1 GB of memory on any given Teradata AMP (node). Queries often need to be filtered before they can execute successfully at all. As a result, to operate against such datasets in Dataiku, it's usually necessary to partition on one or more columns to reduce the memory consumed by any given query. This can create thousands of partitions, which are updated regularly as the underlying data changes. With this feature, I could quickly execute Dataiku flows against this type of large dataset. Other datasets that would benefit from partitioning would also get faster execution and more detailed per-recipe progress in the job logs. The current configuration makes partitioning inconvenient for most applications, since it's rare that I want to run a flow against only one partition value. It would also be ideal if I could set the partition values across an entire pipeline, allowing me to quickly test against a single partition, then change back to all partitions once I'm ready to execute.
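To illustrate why the filtering matters, here is a minimal sketch of the "one filtered query per partition value" idea. The table and column names are hypothetical, the `?` placeholder assumes a database driver that uses qmark-style parameters, and nothing here is Dataiku's own mechanism; it only shows the shape of the workload that partitioning produces.

```python
import re
from typing import Iterable, Iterator, Tuple

# Only simple unqualified identifiers are accepted in this sketch
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def per_partition_queries(
    table: str, column: str, values: Iterable[object]
) -> Iterator[Tuple[str, tuple]]:
    """Yield one (sql, params) pair per partition value.

    Each query touches a single partition value, so it needs far less
    spool space than an unfiltered scan of the whole table.
    """
    for name in (table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    sql = f"SELECT * FROM {table} WHERE {column} = ?"
    for value in values:
        yield sql, (value,)


if __name__ == "__main__":
    # Hypothetical table, column, and partition values
    for sql, params in per_partition_queries("sales", "region", ["EU", "US"]):
        print(sql, params)
```

With thousands of values, maintaining that list by hand for each recipe is the part that doesn't scale.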