Using DASK for merging in Dataiku
I am trying to join two big dataframes in Dataiku using Dask, since pandas gives a dead-kernel error. However, using Dask gives me this error: "AttributeError: 'BlockManager' object has no attribute 'arrays'", which seems to be an internal Dask error.
Operating system used: Windows
Best Answer
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,215 Dataiker
Hi @oou,
The error is likely caused by incompatible pandas and Dask versions.
Can you please share the code you are running and the versions of pandas and Dask?
Note that if you are getting a kernel error, it is most likely because your join requires more memory than you have available.
It's unclear whether doing the same join in Dask will help. You could try using Spark instead.
Or try the DSS engine / visual Join recipe instead.
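If memory is the bottleneck, one pandas-only mitigation worth trying before switching engines is converting repeated string join keys to categoricals, which can shrink the frames considerably before the merge. A minimal sketch (the column names here are illustrative assumptions):

```python
import pandas as pd

# Toy frames standing in for the large inputs.
left = pd.DataFrame({"key": ["a", "b", "c"] * 1000, "x": range(3000)})
right = pd.DataFrame({"key": ["a", "b", "c"], "y": [1, 2, 3]})

# Categorical keys store each distinct string once, cutting memory use
# for columns with many repeated values.
left["key"] = left["key"].astype("category")
right["key"] = right["key"].astype("category")

merged = left.merge(right, on="key", how="left")
print(merged.shape)
```

This does not change the join semantics, only the in-memory representation of the key column.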
Answers
Thanks Alex, that is correct.
I've switched to using SparkSQL, and thank you for your response. The process turned out to be quite extensive. After completing it, I had to rerun the SQL query. Assuming that Dataiku would overwrite the existing file, I didn't alter the output name. However, I've encountered a new error when attempting to read the file from Dataiku into Jupyter Notebook: "ValueError: Duplicate names are not allowed." I've verified the column names in the generated data, and there are no duplicates. It seems that somewhere within Dataiku, the data isn't being properly overwritten. Any thoughts on this?
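One common cause of "ValueError: Duplicate names are not allowed" is a stale or duplicated header row in the underlying file rather than in the logical schema, so it can be worth inspecting the raw header directly. A minimal sketch, assuming a CSV-backed dataset (the file content here is a made-up example):

```python
# Simulated raw CSV text whose header repeats a column name -- the
# situation that triggers the duplicate-names error on read.
csv_text = "id,value,value\n1,10,20\n2,30,40\n"

# Inspect the first line of the raw file for repeated column names.
header = csv_text.splitlines()[0].split(",")
dupes = {c for c in header if header.count(c) > 1}
print(dupes)
```

If this turns up duplicates that the visual schema does not show, clearing the output dataset and rebuilding it (rather than relying on overwrite) is a reasonable next step.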