DSS Python pandas heavy datasets problem
Hello,
I am struggling with a problem in DSS:
- I am trying to read and process some tables stored in Hive using Python pandas. The tables are quite big, and I am trying to optimize my process. Here is what I see in the log:
Columns (53,139) have mixed types.Specify dtype option on import or set low_memory=False. exec(f.read())
I get many lines like this and they take a long time to execute. Any ideas on how to speed up the processing?
Thank you
Operating system used: Linux
Best Answer
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,072 Neuron
The issue that you have is described in this thread:
This is a pandas issue, not a Dataiku issue. pandas infers data types from the first rows of the dataset, which can lead it to choose a data type that is unsuitable for storing all the data in that column. As described in the thread I posted, you can call my_dataset.get_dataframe(infer_with_pandas=False) to prevent pandas from doing so.
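To make the idea concrete, here is a minimal sketch rather than a definitive recipe. The first part reproduces the general principle in plain pandas: when column types are declared up front, pandas does not have to guess them. The sample data and the column names `id` and `code` are invented for illustration. The second part wraps the call mentioned above in a hypothetical helper, `load_with_dss_schema`; it assumes the code runs inside DSS, where the `dataiku` package is available, and that the dataset schema in DSS already has sensible column types.

```python
import io

import pandas as pd

# Invented sample: "code" looks numeric at first, then a text value appears.
csv_text = "id,code\n1,100\n2,200\n3,A300\n"

# Declaring dtypes explicitly means pandas does not infer them from the data.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int64", "code": str})
print(df.dtypes)


def load_with_dss_schema(dataset_name):
    """Hypothetical helper: read a DSS dataset without pandas type inference.

    Only works inside DSS, so the import is kept local to the function.
    """
    import dataiku

    dataset = dataiku.Dataset(dataset_name)
    # infer_with_pandas=False asks DSS to apply the dataset schema types
    # instead of letting pandas guess them.
    return dataset.get_dataframe(infer_with_pandas=False)
```

Inside a DSS Python recipe or notebook you would then call something like `load_with_dss_schema("my_dataset")`, replacing the name with your own dataset.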