Reading partitions one at a time from Python

ankitmat45
ankitmat45 Registered Posts: 1 ✭✭✭✭

Hi, I am trying to read a partitioned dataset using Python. I got the list of partitions with the following code, but I do not know how to read those partitions one by one as dataframes.

import dataiku
mydataset = dataiku.Dataset("Data")
partitions = mydataset.list_partitions(raise_if_empty=True)  # a list of partition identifiers, not a dataframe

Kindly suggest

Thanks

Answers

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    edited July 17

    Hi,

    In a notebook, you would do something like:

    import dataiku
    ds = dataiku.Dataset('Data')
    for p in ds.list_partitions():
        ds.read_partitions = [p]
        df = ds.get_dataframe()
        print(df.shape)

    Note that in a recipe, the read_partitions field will be filled by DSS automatically when you create the Dataset object.

    Regards,

    Frederic

  • Nicolas_Servel
    Nicolas_Servel Dataiker Posts: 37 Dataiker
    edited July 17

    Hello,

    First, are you trying to do this from a Python recipe (in the Flow) or from a notebook? In a Python recipe, you usually cannot choose the partitions yourself, as they are managed by the recipe parameters. However, you can override that by passing ignore_flow=True every time you create a Dataset. It would look like:

    import dataiku
    my_dataset = dataiku.Dataset("my_dataset_name", ignore_flow=True)

    Then, to read only some partitions of a dataset, use the add_read_partitions method of the Dataset class. However, this method only adds a partition; it does not remove the ones added previously. Therefore, each time you call it, you need to reset the read partitions by using a new instance of Dataset. This would look like:

    import dataiku
    partitions = dataiku.Dataset("Data").list_partitions(raise_if_empty=True)
    dataset_partitions_df = {}
    for partition in partitions:
        dataset = dataiku.Dataset("Data")  # reinitializing the read partition
        dataset.add_read_partitions(partition)
        dataset_partition_df = dataset.get_dataframe()
        dataset_partitions_df[partition] = dataset_partition_df
    print(dataset_partitions_df)
    # => outputs {"part1": .. df of part1 .., "part2": .. df of part2 .., ...}

    Hope this helps

    Best regards
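
    The loop above keeps every partition's dataframe in memory at once. A minimal sketch of a generator variant follows, which hands back one partition at a time instead. The names iter_partition_frames and make_dataset are hypothetical and not part of the dataiku package; make_dataset is assumed to be a zero-argument callable that returns a fresh Dataset each time, for example lambda: dataiku.Dataset("Data", ignore_flow=True).

```python
from typing import Any, Callable, Iterator, Tuple


def iter_partition_frames(make_dataset: Callable[[], Any]) -> Iterator[Tuple[str, Any]]:
    """Yield (partition_id, dataframe) pairs, one partition at a time.

    make_dataset is a hypothetical factory returning a new Dataset-like
    object on every call, so read partitions never accumulate.
    """
    for partition in make_dataset().list_partitions():
        dataset = make_dataset()                # fresh instance: no read partitions set yet
        dataset.add_read_partitions(partition)  # restrict reads to this single partition
        yield partition, dataset.get_dataframe()
```

    In a notebook this could be used as: for pid, df in iter_partition_frames(lambda: dataiku.Dataset("Data")): print(pid, df.shape). Each dataframe can be garbage-collected once the loop moves on, which may help when the partitions are large.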

  • Skanda Gurunathan
    Skanda Gurunathan Registered Posts: 8 ✭✭✭

    Why does list_partitions() take so long? How does Dataiku handle this internally?

    Say I want to read only the latest partition: is there any way to get that quickly?
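
    One possible way to pick the latest partition, as a sketch only: if the dataset is partitioned on a time dimension with zero-padded identifiers such as "2024-01-15" (an assumption, since the thread does not say how "Data" is partitioned), string order matches chronological order and the latest identifier is simply the maximum. The helper name latest_partition is hypothetical. This does not make list_partitions() itself any faster; it only selects from the list once it is available.

```python
from typing import Iterable, Optional


def latest_partition(partition_ids: Iterable[str]) -> Optional[str]:
    """Return the greatest identifier, or None when there are none.

    Assumes zero-padded time identifiers (e.g. "2024-01-15"), for which
    lexicographic order is also chronological order.
    """
    return max(partition_ids, default=None)
```

    The result could then be passed to add_read_partitions on a fresh Dataset instance, as in the answers above, so that get_dataframe() reads only that partition.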
