Build all partitions for all recipes
Currently, a few recipes support building all partitions, but most recipes require users to manually list the desired partitions to build. While it's possible to write a script that generates such a list from the column values, it's time-consuming to repeat this step for every recipe, especially when the partition list changes frequently and the partition values need to be updated downstream. In the sync recipe, it's possible to select "all available" as the partitions-to-build setting; it would be great if this option were available for all recipes. This would enable multithreaded execution of complex flows.
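For reference, here is roughly the kind of script I mean, as a minimal sketch using the public dataikuapi client; the host, API key, project key, and dataset name are all placeholders:

```python
import dataikuapi

# Sketch only: connection details and names are placeholders.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
project = client.get_project("MYPROJECT")
dataset = project.get_dataset("my_partitioned_dataset")

# Enumerate every partition currently present on the dataset...
partitions = dataset.list_partitions()

# ...and force a non-recursive build of all of them in a single job.
job = project.new_job("NON_RECURSIVE_FORCED_BUILD")
for p in partitions:
    job.with_output("my_partitioned_dataset", partition=p)
job.start_and_wait()
```

The pain point is that this boilerplate has to be repeated, and re-run, for every recipe whose partition list changes.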
With Teradata datasets, many large tables simply don't fit into spool space due to the highly distributed nature of Teradata tables; users often have less than 1 GB of memory on any given Teradata AMP (node). Queries often need to be filtered before they can be executed at all. As a result, to operate against such datasets in Dataiku, it's usually necessary to partition on one or more columns to reduce the memory consumed by any given query. This can create thousands of partitions that are updated regularly as the underlying data changes.

With this feature, I could quickly execute Dataiku flows against this type of large dataset, and other datasets that benefit from partitioning would gain faster execution and more detailed per-recipe progress in the job logs. The current configuration makes partitioning inconvenient for most applications, since it's rare that I want to run a flow against only one partition value. It would also be ideal if I could set the partition values across an entire pipeline, allowing me to quickly test against a single partition, then switch back to all partitions once I'm ready to execute.
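As a rough stopgap for the pipeline-wide case, I've considered stashing the partition spec in a project variable and referencing it (e.g. as ${partition_spec}) from scenario build steps. A minimal sketch with the public dataikuapi client follows; the variable name, partition values, and dataset name are made up, and I'm assuming variable expansion is accepted in the partition fields of scenario steps:

```python
import dataikuapi

# Sketch only: host, API key, and names are placeholders.
client = dataikuapi.DSSClient("https://dss.example.com:11200", "API_KEY")
project = client.get_project("MYPROJECT")

def set_partition_spec(spec):
    """Store the partition spec in a project variable that scenario
    build steps could reference as ${partition_spec}."""
    variables = project.get_variables()
    variables["standard"]["partition_spec"] = spec
    project.set_variables(variables)

# Quick test against a single partition value...
set_partition_spec("2020-01-01")

# ...then switch back to every partition currently on the dataset.
all_parts = project.get_dataset("my_partitioned_dataset").list_partitions()
set_partition_spec(",".join(all_parts))
```

Flipping one variable between a single test partition and the full list would at least approximate the "test on one, run on all" workflow.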
Comments
-
Ashley (Dataiker, Product Ideas Manager)
Thanks for your idea, @natejgardner
Your idea meets the criteria for submission; we'll reach out should we require more information.
If you’re reading this and would love to see this improvement in the partitioning experience, be sure to kudos the original post or leave a comment!
Take care,
Ashley
-
Has this been looked at? I'm also trying to read a flat file and write it to a partitioned dataset using PySpark, and I don't know what to write in the partition value input box to indicate that I want to rebuild all partitions.
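The only workaround I can think of is generating the full comma-separated partition list and pasting it into that box. A minimal sketch, assuming the internal dataiku package is available (the dataset name is a placeholder):

```python
import dataiku

# Print a comma-separated partition list to paste into the
# "partitions to build" box (dataset name is a placeholder).
partitions = dataiku.Dataset("my_partitioned_output").list_partitions()
print(",".join(partitions))
```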