My goal is to iterate through the different partitions of a dataset, but I'm having trouble accessing the partitions that exist. For more context, I have a set of functions to manipulate the dataframe that is passed through. I would like to loop through each partition and set that to a dataframe that can be passed through the functions.
I tried using the function iter_rows and specifying the partition spec, but I receive an error that the function does not have the argument "partitions". Could you help me understand why the partitions argument is not working and/or an alternative to accessing a partition of a dataset? Is there a way to only choose a partition when running the get_dataframe function?
Thank you!
Hi,
Selecting partitions is done on the Dataset object, not at the time of iterating or getting dataframes:
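import dataiku

grp = dataiku.Dataset("mydataset")
# Restrict reads on this Dataset object to partition "1"
grp.add_read_partitions(["1"])
for x in grp.iter_rows():
    # This will only retrieve rows of partition 1
    do_stuff()  # placeholder for your own processing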
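To get one dataframe per partition, as asked in the question, the same idea can be wrapped in a loop. This is only a sketch under some assumptions: it relies on list_partitions(), add_read_partitions() and get_dataframe() of dataiku.Dataset, it creates a fresh Dataset object for every partition so that the read partitions never accumulate, and it assumes the code runs somewhere (for example a notebook) where setting read partitions explicitly is allowed. The helper name iter_partition_dataframes and its dataset_factory parameter are made up for illustration; the parameter exists only so the loop can be tried without a DSS instance.

```python
def iter_partition_dataframes(dataset_name, dataset_factory=None):
    """Yield (partition_id, dataframe) for every partition of a dataset.

    dataset_factory is a hypothetical hook for testing; by default the
    real dataiku.Dataset class is used.
    """
    if dataset_factory is None:
        import dataiku  # only available inside a Dataiku environment
        dataset_factory = dataiku.Dataset

    # Ask the dataset which partitions exist
    partitions = dataset_factory(dataset_name).list_partitions()

    for partition_id in partitions:
        # Fresh object per partition, so only this partition is read
        ds = dataset_factory(dataset_name)
        ds.add_read_partitions([partition_id])
        yield partition_id, ds.get_dataframe()
```

You could then call it like this, where my_function stands for one of your own dataframe functions:

```python
for partition_id, df in iter_partition_dataframes("mydataset"):
    my_function(df)
```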
Iterating over the rows of a pandas DataFrame is generally slow and defeats the purpose of using a DataFrame. It is an anti-pattern that you should only fall back on once you have exhausted every other option. It is better to look for a vectorized solution, a list comprehension, or the DataFrame.apply() method.
Example of looping over a pandas DataFrame with a list comprehension:
result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]
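For comparison, here is a rough sketch of the other two options mentioned above, a vectorized expression and DataFrame.apply(). The column names come from the example above, but the sample values and the derived columns NextGrade and Label are invented purely for illustration.

```python
import pandas as pd

# Made-up sample data using the column names from the example above
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Promoted": [True, False, True],
    "Grade": [3, 5, 4],
})

# Vectorized: operate on whole columns at once, no Python-level loop.
# Promoted rows move up one grade, the others keep their grade.
df["NextGrade"] = df["Grade"] + df["Promoted"].astype(int)

# apply(): call a function once per row (axis=1). Usually slower than
# a vectorized expression, but handy when the logic is hard to vectorize.
df["Label"] = df.apply(lambda row: f"{row['Name']}:{row['Grade']}", axis=1)
```

The vectorized form is normally the one to try first; apply() with axis=1 still calls a Python function for each row.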