Using DASK for merging in Dataiku

Solved!
oou
Level 1

I am trying to join two big dataframes in Dataiku using Dask, since pandas gives a dead-kernel error. However, Dask gives me this error: "AttributeError: 'BlockManager' object has no attribute 'arrays'", which seems to be an internal Dask error.


Operating system used: Windows

1 Solution
AlexT
Dataiker

Hi @oou ,
The error seems to be caused by incompatible pandas/Dask versions. Can you please share the code you are running and your pandas/Dask versions?

Note that if you are getting a dead-kernel error, your join most likely requires more memory than you have available.

It is unclear whether doing the same join in Dask will help. You could try using Spark instead, or the DSS engine / visual Join recipe.
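To grab the versions quickly, you can run this in a notebook cell:

```python
import pandas as pd
import dask

# Print the installed versions so they can be compared against
# the pandas range supported by this Dask release
print("pandas:", pd.__version__)
print("dask:", dask.__version__)
```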


2 Replies
oou
Level 1
Author

Thanks Alex, that is correct.

I've switched to SparkSQL; thank you for your response. The process turned out to be quite lengthy, and after completing it I had to rerun the SQL query. Assuming that Dataiku would overwrite the existing file, I didn't change the output name. However, I've hit a new error when reading the file from Dataiku into a Jupyter notebook: "ValueError: Duplicate names are not allowed." I've checked the column names in the generated data, and there are no duplicates. It seems that somewhere within Dataiku the data isn't being properly overwritten. Any thoughts on this?
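A quick sanity check of the raw header, independent of Dataiku, could look like this; pandas commonly raises "Duplicate names are not allowed" when it is handed a repeated column name, so listing any repeats in the file's first row narrows things down (the file path below is a placeholder):

```python
import csv
from collections import Counter


def duplicate_columns(path):
    """Return the header names that occur more than once in the file's first row."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    return [name for name, count in Counter(header).items() if count > 1]


# Example (placeholder path):
# duplicate_columns("/path/to/exported_dataset.csv")
```

If this reports a repeat that the dataset's schema view doesn't show, the stale output schema rather than the data itself is the likely culprit.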

