DSS Python pandas heavy datasets problem

Solved!
Houssam_2000

Hello,

I am struggling with a problem in DSS: I am trying to read and process some tables stored in Hive using Python pandas. The tables are quite big, and even while trying to optimize my process, here is what I see in the log:

Columns (53,139) have mixed types. Specify dtype option on import or set low_memory=False.
  exec(f.read())


I get many lines of this type and they take a long time to execute. Any ideas on how to speed up the processing?


Thank you


Operating system used: Linux

1 Solution
Turribeach

The issue that you have is described in this thread:

https://community.dataiku.com/t5/General-Discussion/get-dataset-loading-strings-as-floats/m-p/40379#...

This is a pandas issue, not a Dataiku issue. pandas infers data types from the first rows of the dataset, which can lead it to choose a dtype that is unsuitable for storing all the data in that column. As per the thread I posted, you can call my_dataset.get_dataframe(infer_with_pandas=False) to prevent pandas from doing so.
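For context, here is a minimal plain-pandas sketch (outside DSS, with made-up data) of why the DtypeWarning appears and how declaring the dtype up front avoids it:

```python
import io
import pandas as pd

# Hypothetical CSV: the "code" column starts with values that look numeric
# but later contains strings, so pandas' chunk-by-chunk inference can end
# up with a mixed-type column (the "Columns (...) have mixed types" warning).
csv = io.StringIO("id,code\n1,100\n2,200\n3,A17\n")

# Declaring the dtype explicitly means pandas never has to guess,
# which also avoids the slow re-inference triggered by low_memory mode.
df = pd.read_csv(csv, dtype={"code": str})
print(df["code"].tolist())  # ['100', '200', 'A17']
```

Inside DSS, get_dataframe(infer_with_pandas=False) has a similar effect: the DataFrame types come from the dataset schema instead of pandas' guessing.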
