Pandas Iter dataframes

Sajid_Khan
Sajid_Khan Partner, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 12 Partner

Hello,

I am trying to load a Snowflake table as a pandas dataframe. Because the data is huge, the kernel stops with a memory error. What can I do to avoid this? Is there a better way to load huge datasets?

I tried using the iter_dataframes functions.

There are two types:

1) iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None)

This still produces the memory error. A possible reason is that pandas tries to detect column data types from the values in those columns. Since my columns hold different types of data and the table is huge, the kernel stops with a memory error. So I tried the function below.

2) iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None)

In this function, I passed column names and their respective data types as a dictionary, "{column_name:str}". But four arguments are required (names, dtypes, parse_date_columns, chunksize), so I passed the column names as a list for the "names" argument and the data types as a list for the "dtypes" argument (both lists ordered to match each other). I am not sure what value has to be passed for "parse_date_columns". This is where I am stuck. I tried passing boolean values (True, False), date formats (MM/dd/yyyy HH:mm:ss), and None. Nothing worked.

Can anyone direct me towards a better solution or approach?

Thank You,

Sajid


Operating system used: Windows

Answers

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 317 Neuron

    Hi @Sajid_Khan,

    Have you tried iter_dataframes with infer_with_pandas set to False? If not, I'd try that. With that setting, the column data types derived from the Snowflake table should be used rather than pandas trying to detect data types.

    Marlan
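
A minimal sketch of the suggestion above, assuming the code runs inside a Dataiku DSS Python notebook or recipe where the dataiku package is available. The dataset name "my_snowflake_table" and the helper names process_in_chunks and snowflake_chunks are hypothetical and only for illustration. The idea is to handle one chunk at a time and keep only a small result per chunk, so the full table never has to sit in memory.

```python
def process_in_chunks(chunks, handle_chunk):
    """Consume an iterable of dataframe chunks one at a time.

    Each chunk is passed to handle_chunk and then dropped, so memory use
    stays close to the size of a single chunk. Returns the total row count.
    """
    total_rows = 0
    for chunk in chunks:
        handle_chunk(chunk)       # e.g. aggregate, filter, or write out
        total_rows += len(chunk)  # len() of a dataframe is its row count
    return total_rows


def snowflake_chunks(dataset_name, chunksize=10000):
    """Return a chunk iterator over a DSS dataset (hypothetical helper).

    infer_with_pandas=False asks DSS to use the dataset schema for the
    column types instead of letting pandas detect them from the values.
    """
    import dataiku  # imported here because it only exists inside DSS

    dataset = dataiku.Dataset(dataset_name)
    return dataset.iter_dataframes(chunksize=chunksize, infer_with_pandas=False)


# Example use inside DSS (dataset name is an assumption):
# row_counts = []
# total = process_in_chunks(
#     snowflake_chunks("my_snowflake_table"),
#     lambda df: row_counts.append(len(df)),
# )
```

If the per-chunk work can be pushed down to Snowflake instead (for example as a SQL aggregation), that may avoid moving the rows into Python at all.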
