
Pandas Iter dataframes

Sajid_Khan

Hello,

I am trying to load a Snowflake table as a pandas dataframe. Because the table is huge, the kernel stops with a memory error. What can I do to avoid this? Is there a better way to load huge datasets?

I tried using the iter_dataframes functions.

There are two types:

1) iter_dataframes(chunksize=10000, infer_with_pandas=True, sampling='head', sampling_column=None, parse_dates=True, limit=None, ratio=None, columns=None, bool_as_str=False, float_precision=None)

This still shows the memory error. A possible reason is that pandas tries to detect each column's data type from the values in that column. Since my columns hold different types of data and the table is huge, the kernel stops with a memory error. So I tried the function below.

2) iter_dataframes_forced_types(names, dtypes, parse_date_columns, chunksize=10000, sampling='head', sampling_column=None, limit=None, ratio=None, float_precision=None)

In this function, I passed the column names and their respective data types as a dictionary, "{column_name: str}". But values are required for four arguments (names, dtypes, parse_date_columns, chunksize), so I passed the column names as a list for the "names" argument and the data types as a list for the "dtypes" argument, with both lists in the same order so they match. I am not sure what value has to be passed for "parse_date_columns". This is where I am stuck. I tried passing boolean values (True, False), date formats (MM/dd/yyyy HH:mm:ss), and None. Nothing worked.

Can anyone direct me towards a better solution or approach?

Thank You,

Sajid


Operating system used: Windows

Marlan

Hi @Sajid_Khan,

Have you tried iter_dataframes with infer_with_pandas set to False? If not, I'd try that. With that setting, the column data types derived from the Snowflake table should be used rather than pandas trying to detect them.
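
Something along these lines might be a starting point. It is only a rough sketch: the helper name count_rows_in_chunks and the dataset name "my_snowflake_table" are made up for illustration, and the only Dataiku call it relies on is iter_dataframes with the chunksize and infer_with_pandas arguments from the signature you posted. The idea is to process one chunk at a time and keep only small running results, so the full table never has to sit in memory at once.

```python
def count_rows_in_chunks(dataset, chunksize=10000):
    """Iterate over a dataset chunk by chunk, keeping only running totals.

    `dataset` is assumed to expose iter_dataframes(chunksize=...,
    infer_with_pandas=...), as a dataiku.Dataset does.
    """
    total_rows = 0
    n_chunks = 0
    # infer_with_pandas=False asks for the dataset's schema types to be used
    # instead of letting pandas guess the types from the values.
    for chunk in dataset.iter_dataframes(chunksize=chunksize,
                                         infer_with_pandas=False):
        # Replace this with the real per-chunk work (filter, aggregate,
        # write out). Avoid appending every chunk to a list, since that
        # rebuilds the whole table in memory.
        total_rows += len(chunk)
        n_chunks += 1
    return total_rows, n_chunks


# Inside a Dataiku notebook or recipe (the dataset name is hypothetical):
# import dataiku
# rows, chunks = count_rows_in_chunks(dataiku.Dataset("my_snowflake_table"))
```

If the memory error persists even with this, a smaller chunksize would be the next thing I'd experiment with.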

Marlan