Accessing Partition of Dataset in Python Recipe

DV
DV Partner, Registered Posts: 1 Partner

My goal is to iterate through the different partitions of a dataset, but I'm having trouble accessing the partitions that exist. For more context, I have a set of functions that manipulate a dataframe passed to them. I would like to loop over the partitions, load each one into a dataframe, and pass that dataframe to the functions.

I tried using the function iter_rows and specifying the partition spec, but I receive an error saying that the function does not accept the argument "partitions". Could you help me understand why the partitions argument is not working, or suggest an alternative way to access a single partition of a dataset? Is there a way to select only one partition when running the get_dataframe function?

Thank you!

Answers

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    edited July 17

    Hi,

    Selecting partitions is done on the Dataset object, not at the time of iterating or getting dataframes:

    import dataiku

    grp = dataiku.Dataset("mydataset")

    # Select partition "1" on the Dataset object before reading from it
    grp.add_read_partitions(["1"])

    for x in grp.iter_rows():
        # This will only retrieve rows of partition 1
        do_stuff(x)
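
Because the selection lives on the Dataset object, looping over several partitions can be sketched as creating one Dataset handle per partition and reading a dataframe from each. The helper below is only an illustration under stated assumptions: iter_partition_dataframes and FakeDataset are hypothetical names, and FakeDataset is a stand-in (returning plain lists) so that the sketch runs outside DSS. The only methods it relies on are the add_read_partitions call shown above and the get_dataframe call mentioned in the question.

```python
def iter_partition_dataframes(make_dataset, partition_ids):
    """Yield (partition_id, data) pairs, reading one partition at a time.

    make_dataset is any zero-argument callable returning an object that offers
    add_read_partitions() and get_dataframe(), as dataiku.Dataset does.
    """
    for pid in partition_ids:
        ds = make_dataset()            # fresh handle, so selections do not accumulate
        ds.add_read_partitions([pid])  # same call form as in the answer above
        yield pid, ds.get_dataframe()


class FakeDataset:
    """Hypothetical stand-in for dataiku.Dataset; returns lists, not DataFrames."""

    _rows = {"1": [10, 11], "2": [20]}  # made-up partition contents

    def __init__(self):
        self._selected = []

    def add_read_partitions(self, spec):
        self._selected.extend(spec)

    def get_dataframe(self):
        return [value for pid in self._selected for value in self._rows[pid]]


frames = dict(iter_partition_dataframes(FakeDataset, ["1", "2"]))
print(frames)
```

Inside DSS, the factory would presumably be something like lambda: dataiku.Dataset("mydataset"), and the partition identifiers could come from the dataset's list_partitions() method. Both are assumptions to check against your DSS version, as is whether a recipe requires constructing the Dataset with ignore_flow=True before partitions can be set by hand.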

  • carlhyde
    carlhyde Registered Posts: 2 ✭✭✭✭

    Iterating through pandas DataFrame objects is generally slow. Row-by-row iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should do only when you have exhausted every other option. It is better to look for a vectorized solution, a list comprehension, or the DataFrame.apply() method.

    Example of a pandas DataFrame loop using a list comprehension:

    result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]
