SQL on partitioned table - how to launch computation on ALL partitions

Options
Tanguy
Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 112 Neuron

Hi,

I'd like to prepare my data for each partition of a partitioned table using SQL (with either a Spark SQL recipe or a SQL query ; note : our SQL query recipes run on Athena in AWS).

However, dataiku asks the user to explicitly specify the partitions on which the recipe will run:

Sans titre.jpg

To avoid laboriously listing all partitions in this field, is there any way to tell dataiku to run the recipe on all partitions (with a keyword such as "ALL" or "*", but those two keywords didn't work out for me)?


Operating system used: AWS EC2

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17
    Options

    @tanguy
    ,

    There is no option to build a dataset or run a recipe on all partitions. We aware of this and we have captured this in our backlog. For the time being what you can do to build all partitions:

    1) With time range dimension you can use the date range directly in the recipe or scenario, in an example where you have a monthly partition, you can use: e.g 2021-01 / 2021-12. If you want to know when your partitions start/end you can explore the Dataset and under Status, you will be able to sort asc/desc to determine this.

    Screenshot 2021-12-11 at 13.13.22.png

    2) For discreet dimensions you can use Python steps in a scenario with an execute python code step:

    from dataiku.scenario import Scenario
    import dataiku
    
    #replace dataset name here
    
    dataset_name = "bulldozer_small_extract_copy"
    
    scenario = Scenario()
    dataset = dataiku.Dataset(dataset_name)
    partitions = dataset.list_partitions() # get all partitions from input dataset
    
    # for all available partitions in all dimensions 
    partitions_str = ','.join(partitions) # concatenate 
    
    print("all partitions are:", partitions_str)
    #when some dimensions are defined but another dimensions requires ALL include in your example partitions you want will start with '2020Q4|Pricing|L4L_Monthly'
    #partitions_str = ','.join([item for item in partitions if item.startswith('2020Q4|Pricing|L4L_Monthly')])
    
    scenario.build_dataset(dataset_name, partitions=partitions_str)

    Screenshot 2021-12-11 at 13.23.10.png

    Let me know if this works for you.

  • Tanguy
    Tanguy Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron, Dataiku DSS Adv Designer, Registered, Dataiku DSS Developer, Neuron 2023 Posts: 112 Neuron
    Options

    Hi @AlexT
    , thank you for your answer.

    Our team is indeed already using python the way you did in your example to retrieve all partitions, but I was looking for a "native" solution inside an SQL (or Spark SQL) recipe.

    I forgot to mention that I am using a discrete dimension (I understand the feature "ALL" is not yet available, I would highly appreciate if it could be released asap ).

    While we are at it, when using a time dimension, I did not find a syntax allowing to launch a computation by listing several time periods (e.g. : "2019-01/2019-12 + 2021-01/2021/12"). At the time being, it seems we can only launch a computation by listing either specific time partitions or a unique time period (as in your example). This additionnal flexibility would also be appreciated.

Setup Info
    Tags
      Help me…