SQL on partitioned table - how to launch computation on ALL partitions

tanguy
SQL on partitioned table - how to launch computation on ALL partitions

Hi,

I'd like to prepare my data for each partition of a partitioned table using SQL (with either a Spark SQL recipe or a SQL query ; note : our SQL query recipes run on Athena in AWS).

However, dataiku asks the user to explicitly specify the partitions on which the recipe will run:

Sans titre.jpg

To avoid laboriously listing all partitions in this field, is there any way to tell dataiku to run the recipe on all partitions (with a keyword such as "ALL" or "*", but those two keywords didn't work out for me)? 


Operating system used: AWS EC2

0 Kudos
2 Replies
AlexT
Dataiker

@tanguy ,

There is no option to build a dataset or run a recipe on all partitions. We aware of this and we have captured this in our backlog. For the time being what you can do to build all partitions:

1) With time range dimension you can use the date range directly in the recipe or scenario, in an example where you have a monthly partition, you can use: e.g  2021-01 / 2021-12.  If you want to know when your partitions start/end you can explore the Dataset and under Status, you will be able to sort asc/desc to determine this.

Screenshot 2021-12-11 at 13.13.22.png

 

2) For discreet dimensions you can use Python steps in a scenario with an execute python code step:

from dataiku.scenario import Scenario
import dataiku

#replace dataset name here

dataset_name = "bulldozer_small_extract_copy"

scenario = Scenario()
dataset = dataiku.Dataset(dataset_name)
partitions = dataset.list_partitions() # get all partitions from input dataset

# for all available partitions in all dimensions 
partitions_str = ','.join(partitions) # concatenate 

print("all partitions are:", partitions_str)
#when some dimensions are defined but another dimensions requires ALL include in your example partitions you want will start with '2020Q4|Pricing|L4L_Monthly'
#partitions_str = ','.join([item for item in partitions if item.startswith('2020Q4|Pricing|L4L_Monthly')])

scenario.build_dataset(dataset_name, partitions=partitions_str)

Screenshot 2021-12-11 at 13.23.10.png

Let me know if this works for you. 

0 Kudos
tanguy
Author

Hi @AlexT , thank you for your answer.

Our team is indeed already using python the way you did in your example to retrieve all partitions, but I was looking for a "native" solution inside an SQL (or Spark SQL) recipe.

I forgot to mention that I am using a discrete dimension (I understand the feature "ALL" is not yet available, I would highly appreciate if it could be released asap ๐Ÿ˜‡).

While we are at it, when using a time dimension, I did not find a syntax allowing to launch a computation by listing several time periods (e.g. : "2019-01/2019-12 + 2021-01/2021/12"). At the time being, it seems we can only launch a computation by listing either specific time partitions or a unique time period (as in your example). This additionnal flexibility would also be appreciated.

0 Kudos