DSS Python pandas heavy datasets problem


I am struggling with a problem in DSS:
- I am trying to read and process some tables stored in Hive using Python pandas. The tables are quite big, and I am trying to optimize my process. Here is what I see in the log:

Columns (53,139) have mixed types.Specify dtype option on import or set low_memory=False.

I get many lines of this type and they take a long time to execute. Any ideas on how to speed up the processing?


Thank you

Operating system used: Linux


1 Reply

The issue that you have is described in this thread:


This is a pandas issue, not a Dataiku issue. pandas infers data types from the data as it reads it rather than from the dataset schema. This can lead to pandas choosing a data type that is unsuitable for storing all the data in that column, which is what the mixed-types warning reports. As per the thread I posted, you can call my_dataset.get_dataframe(infer_with_pandas=False) to prevent pandas from doing so, so that the column types come from the dataset schema instead.
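To illustrate the other suggestion in the warning itself ("Specify dtype option on import"), here is a minimal pandas-only sketch. The sample data and the column names `id` and `code` are made up for illustration, and the tiny sample is too small to trigger the warning; it only shows how explicit types replace inference. The Dataiku call appears as a comment because it only runs inside DSS, and the dataset name there is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample: "code" is numeric in most rows and text in one,
# which is the kind of column that ends up with mixed types.
csv_text = "id,code\n1,100\n2,200\n3,A7\n4,400\n"

# Default behaviour: pandas infers each column's type from the values it reads.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit types: declare the type of each column up front,
# so pandas does not have to guess.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int64", "code": str})

print(inferred.dtypes)
print(explicit.dtypes)

# Inside a DSS recipe or notebook, the equivalent idea from the reply would be
# (dataset name is hypothetical):
#   import dataiku
#   df = dataiku.Dataset("my_dataset").get_dataframe(infer_with_pandas=False)
```

With explicit types, every value in `code` is read as text, so the column has a single consistent type regardless of what the individual rows contain.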
