DSS Python pandas heavy datasets problem

Houssam_2000 · January 29

Hello,

I am struggling with a problem in DSS:
- I am trying to read and process some tables stored in Hive using Python pandas. The tables are quite big, but I am trying to optimize my process. Here is what I see in the log:

Columns (53,139) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(f.read())

I get many lines like this, and they take a long time to execute. Any ideas on how to speed up the processing?

Thank you

Operating system used: Linux

Turribeach · January 29

The issue that you have is described in this thread:

https://community.dataiku.com/t5/General-Discussion/get-dataset-loading-strings-as-floats/m-p/40379#M2874

This is a pandas issue, not a Dataiku issue. pandas infers data types based on the first rows of the dataset. This can lead to pandas choosing a data type that is unsuitable for storing all the data in that column. As per the thread I posted, you can call my_dataset.get_dataframe(infer_with_pandas=False) to prevent pandas from doing so.
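To illustrate the effect, here is a minimal sketch that runs with plain pandas outside DSS. The CSV text and the column names (id, zip_code) are made up for the example. It only shows how type inference can change values and how declaring the types up front avoids that. The Dataiku call from the reply is shown as a comment because it needs a DSS environment, and "my_dataset" is a placeholder name.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large table:
# zip_code looks numeric but is really text with leading zeros.
csv_text = "id,zip_code\n1,00501\n2,02134\n"

# Default behaviour: pandas infers the column types itself,
# so zip_code is parsed as integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: pandas skips inference for the declared columns,
# so zip_code stays a string column with its original values.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip_code": str})

print(inferred["zip_code"].tolist())
print(explicit["zip_code"].tolist())

# Inside a DSS recipe or notebook, the step suggested in the reply would be
# (assumption: "my_dataset" is replaced by your own dataset name):
#   import dataiku
#   df = dataiku.Dataset("my_dataset").get_dataframe(infer_with_pandas=False)
```

If the dataset schema in DSS already has the right column types, disabling pandas inference should also make the mixed-types warnings go away, since pandas no longer has to guess a type for each column.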
