Selective reading from a dataset into a pandas dataframe

SuhailS7
SuhailS7 Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 4

I have a big Dataiku dataset in the Flow which I am reading into a pandas dataframe using the get_dataframe() method. The problem is that I have to read 4 such datasets into memory, which slows down the processing. I don't need to read each entire dataset; only a subset has to be read into the pandas dataframe. Is there a way I can do this?

Currently, I am reading the entire dataset into pandas and then slicing the dataframe, and I would like to avoid doing this.
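
For reference, here is a minimal sketch of my current two-step approach (the dataset name and the filter column are placeholders):

import dataiku

# Step 1: read the whole dataset into memory
ds = dataiku.Dataset("my_big_dataset")      # placeholder dataset name
df = ds.get_dataframe()                     # loads all rows into a pandas dataframe

# Step 2: slice the dataframe down to the subset I actually need
subset = df[df["my_column"] == "my_value"]  # placeholder filter condition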

Thanks!

Answers

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

    Hi @SuhailS7,

    Yes, you can read the dataset in chunks of the desired number of rows with the code below.

    import dataiku

    mydataset = dataiku.Dataset("myname")

    for df in mydataset.iter_dataframes(chunksize=10000):
        # df is a pandas dataframe of at most 10,000 rows
        ...  # process each chunk here

    You can refer to the following doc for more info - https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
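
    In case it is useful, here is a rough sketch of how the chunks could be combined while stopping early once enough rows have been read (the chunk size and row limit below are only examples):

    import dataiku
    import pandas as pd

    mydataset = dataiku.Dataset("myname")

    chunks = []
    rows_read = 0
    rows_needed = 50000                     # example limit, adjust as needed
    for df in mydataset.iter_dataframes(chunksize=10000):
        chunks.append(df)
        rows_read += len(df)
        if rows_read >= rows_needed:
            break                           # stop reading once enough rows are loaded

    result = pd.concat(chunks, ignore_index=True).head(rows_needed)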

    Best,

    Madhuleena

  • SuhailS7
    SuhailS7 Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 4

    Thanks @nmadhu20, but what if I need a specific set of rows for my processing? For example, I need only the rows that contain a specific value. If the Dataiku dataset has 1,000,000 records, there may be only 500 rows that satisfy the condition I am looking for, and I don't want to read all 1,000,000 rows into a pandas dataframe for this purpose.

    Currently I am performing two operations, loading the data into memory and then slicing the dataframe, and I was hoping it is possible to do this in one operation.

  • nmadhu20
    nmadhu20 Neuron, Registered, Neuron 2022, Neuron 2023 Posts: 35 Neuron

    To my knowledge, you can use the iter_rows() method of the Dataset object to iterate over the rows without loading the entire dataframe into memory, since you would not need get_dataframe(). You can then apply the required filter, but this means the processing happens at the level of each row.

    iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)

    import dataiku

    mydataset = dataiku.Dataset("myname")

    for row in mydataset.iter_rows():
        # row is a dict-like object mapping column names to values;
        # apply the required filter on each row here

    This returns each row as a dictionary. More info in this link - https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
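
    For example, here is a rough sketch of how the matching rows could be collected into a small pandas dataframe (the column name and value are placeholders, and it assumes the dict-like rows can be converted with dict()):

    import dataiku
    import pandas as pd

    mydataset = dataiku.Dataset("myname")

    matching = []
    for row in mydataset.iter_rows():
        # placeholder condition: keep only rows where my_column equals my_value
        if row["my_column"] == "my_value":
            matching.append(dict(row))      # assumes the dict-like row converts with dict()

    # only the matching rows (e.g. ~500 out of 1,000,000) end up in memory
    filtered_df = pd.DataFrame(matching)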

    Hope this helps!

    Madhuleena
