SQL on partitioned table - how to launch computation on ALL partitions
Hi,
I'd like to prepare my data for each partition of a partitioned table using SQL (with either a Spark SQL recipe or a SQL query ; note : our SQL query recipes run on Athena in AWS).
However, dataiku asks the user to explicitly specify the partitions on which the recipe will run:
To avoid laboriously listing all partitions in this field, is there any way to tell dataiku to run the recipe on all partitions (with a keyword such as "ALL" or "*", but those two keywords didn't work out for me)?
Operating system used: AWS EC2
Answers
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,238 Dataiker
@tanguy
,There is no option to build a dataset or run a recipe on all partitions. We aware of this and we have captured this in our backlog. For the time being what you can do to build all partitions:
1) With time range dimension you can use the date range directly in the recipe or scenario, in an example where you have a monthly partition, you can use: e.g 2021-01 / 2021-12. If you want to know when your partitions start/end you can explore the Dataset and under Status, you will be able to sort asc/desc to determine this.
2) For discreet dimensions you can use Python steps in a scenario with an execute python code step:
from dataiku.scenario import Scenario import dataiku #replace dataset name here dataset_name = "bulldozer_small_extract_copy" scenario = Scenario() dataset = dataiku.Dataset(dataset_name) partitions = dataset.list_partitions() # get all partitions from input dataset # for all available partitions in all dimensions partitions_str = ','.join(partitions) # concatenate print("all partitions are:", partitions_str) #when some dimensions are defined but another dimensions requires ALL include in your example partitions you want will start with '2020Q4|Pricing|L4L_Monthly' #partitions_str = ','.join([item for item in partitions if item.startswith('2020Q4|Pricing|L4L_Monthly')]) scenario.build_dataset(dataset_name, partitions=partitions_str)
Let me know if this works for you.
-
Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 124 Neuron
Hi @AlexT
, thank you for your answer.
Our team is indeed already using python the way you did in your example to retrieve all partitions, but I was looking for a "native" solution inside an SQL (or Spark SQL) recipe.
I forgot to mention that I am using a discrete dimension (I understand the feature "ALL" is not yet available, I would highly appreciate if it could be released asap). While we are at it, when using a time dimension, I did not find a syntax allowing to launch a computation by listing several time periods (e.g. : "2019-01/2019-12 + 2021-01/2021/12"). At the time being, it seems we can only launch a computation by listing either specific time partitions or a unique time period (as in your example). This additionnal flexibility would also be appreciated.