Deleting Partitions with the API

cmjurs · ‎07-13-2023

I'm partitioning a table by date and using it as an archive. I've written code to extract a list of partitions to remove (because they are old), but I can't find a method that takes this list and removes the partitions from the existing table.

Here are the current partitions in the dataset. Each partition is about 22M rows and 15 columns, so fairly big:

archive = dataiku.Dataset("archive")
archive.list_partitions()
>>>['2023-07-13',
 '2023-06-17',
 '2023-05-12',
 '2023-04-11',
 '2023-03-14',
 '2023-03-12']

I'd like to remove the following:

old_dates = ['2023-04-11', '2023-03-14', '2023-03-12']

I'd expect a method like:

archive.clear_partitions(old_dates), but it doesn't exist.

I'd like to avoid having to open the entire archive, filter it, and resave it; that would be heavy.

Any ideas of how I can accomplish this?

Thanks

CJ

Operating system used: ubuntu

SarinaS · ‎07-14-2023

Hi @cmjurs,

The dataset.clear() method can take in a list of partitions:
https://developer.dataiku.com/latest/api-reference/python/datasets.html#dataikuapi.dss.dataset.DSSDa...

For example:

import dataiku
client = dataiku.api_client()
project = client.get_default_project()
ds = project.get_dataset('archive')

ds.clear(partitions=['2023-04-11', '2023-03-14', '2023-03-12'])
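If the list of old partitions should be computed rather than typed out, here is a minimal sketch of one way to do it. The helper name `partitions_older_than` and the cutoff date are assumptions made for illustration, not part of the Dataiku API; the sketch only assumes the partition identifiers are `YYYY-MM-DD` strings, as in the listing from the question.

```python
from datetime import date, datetime


def partitions_older_than(partitions, cutoff):
    """Return the partition ids (YYYY-MM-DD strings) dated strictly before cutoff."""
    return [
        p for p in partitions
        if datetime.strptime(p, "%Y-%m-%d").date() < cutoff
    ]


# Partition ids as returned by list_partitions() in the question.
current = ['2023-07-13', '2023-06-17', '2023-05-12',
           '2023-04-11', '2023-03-14', '2023-03-12']

# Hypothetical cutoff: anything before 1 May 2023 counts as old.
old_dates = partitions_older_than(current, date(2023, 5, 1))
print(old_dates)

# With the ds handle from the example above (not run here):
# ds.clear(partitions=old_dates)
```

With the cutoff above, this selects the same three dates as the hand-written list.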

Thanks!
Sarina

cmjurs · ‎07-14-2023

Thanks @SarinaS! Perfect

Follow-up question: is there a way to read partitions using the ds object you defined above? I've found that the following method works inside a notebook:

archive = dataiku.Dataset("archive")
archive.add_read_partitions(partition)
archive.get_dataframe()

It does NOT work when run from the flow. I get:

Job failed: Error in Python process: At line 91: <class 'Exception'>: You cannot explicitly set partitions when running within Dataiku Flow

Update: using a Python script in a scenario seems to work. Is this the right way to do this?

SarinaS · ‎07-26-2023

Hi @cmjurs,

Apologies for my delay! Indeed, your understanding is correct. When running Python from the flow, the partitions to read are defined by the partition identifiers in the recipe.

To define the partitions via code, a custom Python step in a scenario is indeed the way to go.
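As a rough sketch of what such a scenario step could look like, the helper below wraps the same `add_read_partitions` and `get_dataframe` calls from the notebook snippet. The function name `read_partitions` is hypothetical, and the sketch assumes one `add_read_partitions` call per partition identifier, which may not be the only accepted form.

```python
def read_partitions(dataset, partitions):
    """Select the given partitions on a dataiku.Dataset-like object and load them."""
    for partition in partitions:
        # One call per partition id, mirroring the notebook snippet.
        dataset.add_read_partitions(partition)
    return dataset.get_dataframe()


# Hypothetical usage inside a scenario's custom Python step (not run here):
# import dataiku
# df = read_partitions(dataiku.Dataset("archive"), ["2023-07-13", "2023-06-17"])
```

The same helper would still raise the "cannot explicitly set partitions" error inside a flow recipe, since the restriction comes from where the code runs, not from how the calls are wrapped.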

Thank you,
Sarina
