SQL on partitioned table - how to launch computation on ALL partitions

tanguy · Neuron · Posts: 134

Hi,

I'd like to prepare my data for each partition of a partitioned table using SQL (with either a Spark SQL recipe or a SQL query recipe; note: our SQL query recipes run on Athena in AWS).

However, Dataiku asks the user to explicitly specify the partitions on which the recipe will run:

[Screenshot: Sans titre.jpg — partition selection field in the recipe settings]

To avoid laboriously listing every partition in this field, is there any way to tell Dataiku to run the recipe on all partitions (e.g. with a keyword such as "ALL" or "*"? Neither of those worked for me).


Operating system used: AWS EC2

Answers

  • AlexT · Dataiker · Posts: 1,254
    edited July 2024

    @tanguy,

    There is currently no option to build a dataset or run a recipe on all partitions. We are aware of this and have captured it in our backlog. For the time being, here is what you can do to build all partitions:

    1) With a time range dimension, you can use a date range directly in the recipe or scenario. For example, with monthly partitions you can use 2021-01/2021-12. If you want to know where your partitions start and end, explore the dataset: under Status you can sort the partitions ascending/descending to find out (a sketch using this syntax follows the screenshot below).

    [Screenshot: Screenshot 2021-12-11 at 13.13.22.png — dataset Status tab listing partitions]
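    For example, assuming the Python scenario API accepts the same range syntax as the UI (the dataset name below is a placeholder), a single scenario step can build the whole year:

    from dataiku.scenario import Scenario

    # Build all monthly partitions from 2021-01 through 2021-12 at once.
    # "my_monthly_dataset" is a placeholder; the range syntax is assumed to
    # match the one accepted in the recipe/scenario UI above.
    Scenario().build_dataset("my_monthly_dataset", partitions="2021-01/2021-12")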

    2) For discrete dimensions, you can use an "Execute Python code" step in a scenario:

    from dataiku.scenario import Scenario
    import dataiku

    # Replace with your dataset name
    dataset_name = "bulldozer_small_extract_copy"

    scenario = Scenario()
    dataset = dataiku.Dataset(dataset_name)
    partitions = dataset.list_partitions()  # get all partitions of the input dataset

    # Concatenate all available partitions (across all dimensions) into one build spec
    partitions_str = ','.join(partitions)
    print("all partitions are:", partitions_str)

    # If some dimensions are fixed but another should cover everything, filter the
    # list instead, e.g. keep only partitions starting with '2020Q4|Pricing|L4L_Monthly':
    # partitions_str = ','.join(p for p in partitions if p.startswith('2020Q4|Pricing|L4L_Monthly'))

    scenario.build_dataset(dataset_name, partitions=partitions_str)

    [Screenshot: Screenshot 2021-12-11 at 13.23.10.png — scenario step building the listed partitions]

    Let me know if this works for you.

  • tanguy · Neuron · Posts: 134

    Hi @AlexT, thank you for your answer.

    Our team already uses Python the way you show in your example to retrieve all partitions, but I was looking for a "native" solution inside a SQL (or Spark SQL) recipe.

    I forgot to mention that I am using a discrete dimension (I understand the "ALL" feature is not yet available; I would highly appreciate it being released as soon as possible).

    While we are at it: when using a time dimension, I did not find a syntax allowing me to launch a computation over several time periods (e.g. "2019-01/2019-12 + 2021-01/2021-12"). For the time being, it seems we can only launch a computation by listing either specific time partitions or a single time period (as in your example). This additional flexibility would also be appreciated.
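
    In the meantime, a possible workaround (just a sketch, not a native feature) is to expand each period into explicit monthly partition IDs in a scenario step and pass the comma-separated list to build_dataset(), as in your example; the dataset name below is a placeholder:

    from dataiku.scenario import Scenario
    import pandas as pd

    dataset_name = "my_partitioned_dataset"  # placeholder name
    periods = [("2019-01", "2019-12"), ("2021-01", "2021-12")]

    # Expand each period into one partition ID per month ("MS" = month start)
    partition_ids = []
    for start, end in periods:
        months = pd.date_range(start=start, end=end, freq="MS")
        partition_ids.extend(m.strftime("%Y-%m") for m in months)

    Scenario().build_dataset(dataset_name, partitions=",".join(partition_ids))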
