DSS Python pandas heavy datasets problem

Houssam_2000
Level 1

Hello,

I am struggling with a problem in DSS:
- I am trying to read and process some tables stored in Hive using Python pandas. The tables are quite big, but I am trying to optimize my process. Here is what I see in the log:

Columns (53,139) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(f.read())


I get many lines like this and they take a long time to execute. Any ideas on how to speed up the processing?

 

Thank you


Operating system used: Linux

 

1 Reply
Turribeach

The issue that you have is described in this thread:

https://community.dataiku.com/t5/General-Discussion/get-dataset-loading-strings-as-floats/m-p/40379#...

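This is a pandas issue, not a Dataiku issue. pandas infers each column's data type from the data it reads (chunk by chunk on large inputs) rather than from the dataset schema. This can lead to pandas choosing a data type that is unsuitable for storing all the data in that column, which is what the "mixed types" warning is telling you. As per the thread I posted, you can call my_dataset.get_dataframe(infer_with_pandas=False) to prevent pandas from doing so; the column types then come from the dataset schema in DSS.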
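For reference, here is a minimal sketch of how that call could look in a Python recipe. The dataset name "my_hive_table" and the column names are hypothetical placeholders, and the helper functions are only for illustration. Passing columns= to read just the columns you need is my own suggestion for large tables, not something from the linked thread.

```python
def load_dataframe(dataset, columns=None):
    """Read a DSS dataset using its schema types instead of pandas inference."""
    # infer_with_pandas=False is the option mentioned in the reply above.
    # columns= (optional) limits the read to the listed columns, which is
    # an assumption about what helps here: less data to load and convert.
    return dataset.get_dataframe(columns=columns, infer_with_pandas=False)


def load_input(name, columns=None):
    """Open a dataset by name and load it (only works inside DSS)."""
    # Imported lazily because the dataiku package only exists inside DSS.
    import dataiku
    return load_dataframe(dataiku.Dataset(name), columns=columns)


# Hypothetical usage inside a recipe or notebook:
# df = load_input("my_hive_table", columns=["customer_id", "amount"])
```

If the schema types in DSS are themselves wrong for a column (for example a code stored as a number), it is probably worth fixing the schema on the dataset first, since with this option pandas will follow it.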
