Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I'd like to prepare my data for each partition of a partitioned table using SQL (with either a Spark SQL recipe or a SQL query ; note : our SQL query recipes run on Athena in AWS).
However, dataiku asks the user to explicitly specify the partitions on which the recipe will run:
To avoid laboriously listing all partitions in this field, is there any way to tell dataiku to run the recipe on all partitions (with a keyword such as "ALL" or "*", but those two keywords didn't work out for me)?
Operating system used: AWS EC2
There is no option to build a dataset or run a recipe on all partitions. We aware of this and we have captured this in our backlog. For the time being what you can do to build all partitions:
1) With time range dimension you can use the date range directly in the recipe or scenario, in an example where you have a monthly partition, you can use: e.g 2021-01 / 2021-12. If you want to know when your partitions start/end you can explore the Dataset and under Status, you will be able to sort asc/desc to determine this.
2) For discreet dimensions you can use Python steps in a scenario with an execute python code step:
from dataiku.scenario import Scenario import dataiku #replace dataset name here dataset_name = "bulldozer_small_extract_copy" scenario = Scenario() dataset = dataiku.Dataset(dataset_name) partitions = dataset.list_partitions() # get all partitions from input dataset # for all available partitions in all dimensions partitions_str = ','.join(partitions) # concatenate print("all partitions are:", partitions_str) #when some dimensions are defined but another dimensions requires ALL include in your example partitions you want will start with '2020Q4|Pricing|L4L_Monthly' #partitions_str = ','.join([item for item in partitions if item.startswith('2020Q4|Pricing|L4L_Monthly')]) scenario.build_dataset(dataset_name, partitions=partitions_str)
Let me know if this works for you.
Hi @AlexT , thank you for your answer.
Our team is indeed already using python the way you did in your example to retrieve all partitions, but I was looking for a "native" solution inside an SQL (or Spark SQL) recipe.
I forgot to mention that I am using a discrete dimension (I understand the feature "ALL" is not yet available, I would highly appreciate if it could be released asap 😇).
While we are at it, when using a time dimension, I did not find a syntax allowing to launch a computation by listing several time periods (e.g. : "2019-01/2019-12 + 2021-01/2021/12"). At the time being, it seems we can only launch a computation by listing either specific time partitions or a unique time period (as in your example). This additionnal flexibility would also be appreciated.