My DSS server exhibits a strange behaviour:
Every table generated by a recipe contains duplicates: if I feed in a 100,000-line table without performing any operations on it, the result is a 100,000-line output, but with duplicated rows inside instead of the original data.
Has anyone already experienced the same issue?
Below is a simple example:
The input table:
In / Out recipe
Output with duplicates:
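A quick way to confirm the symptom described above is to count duplicate rows in the output dataset with pandas. This is a generic sketch, not DSS-specific; the toy data and column names are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the recipe output: 6 rows, where the last 3
# have been overwritten with copies of earlier rows.
out = pd.DataFrame({
    "id": [1, 2, 3, 1, 2, 3],
    "value": ["a", "b", "c", "a", "b", "c"],
})

n_rows = len(out)                      # total rows written
n_unique = len(out.drop_duplicates())  # distinct rows
n_dupes = out.duplicated().sum()       # rows that repeat an earlier one

print(n_rows, n_unique, n_dupes)       # 6 3 3
```

If `n_unique` is much smaller than `n_rows` while the input had no duplicates, the output really was overwritten with repeated data rather than just reordered.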
Thanks for your answer; indeed, the option wasn't activated.
Looks like the problem is with my Python env, as it doesn't occur with virtualenvs.
I found out the issue.
You need to select "Install mandatory set of packages (you won't be able to use Dataiku APIs without this)"
We were missing the pandas package, and that caused the error.
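Since the root cause was a missing package in the environment, a small check like the one below can tell you whether the interpreter DSS is using actually has the packages you expect. This is a generic sketch; the `required` list is an assumption based on pandas being the package that mattered here:

```python
import importlib.util
import sys

# Packages the code environment is expected to provide (assumption:
# pandas was the missing one in this thread; add others as needed).
required = ["pandas"]

missing = [name for name in required
           if importlib.util.find_spec(name) is None]

if missing:
    print(f"{sys.executable} is missing: {', '.join(missing)}")
else:
    print(f"{sys.executable} has all required packages")
```

Running this from a notebook or recipe inside the suspect environment shows immediately whether the "mandatory set of packages" option actually took effect.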
Thanks for your insights. I think I have been having a similar problem on one of my DSS instances. You can find a description of these problems at https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure... . Does this sound like the problem you are describing? Data coming out of a Python recipe after 100,000 records is being “duplicated”, or perhaps more accurately, overwritten by later data.
What version of DSS are you all using? I’ve seen my similar problem with DSS v9.0.5 on macOS. I’ve also not seen the problem on DSS v10.0.0 or v10.0.2.
Can you say a bit more about the steps you took to resolve your issue? I’m fairly sure that I have pandas and the correct DSS libraries installed in all of the right places on my instance. I’m wondering what steps you took to reach your resolution.
Thanks for sharing your insights
I am not sure we have exactly the same issue.
I changed my Python environment (we have multiple) and used the same Python script that was not working before.
The output dataset was now suddenly correct. Then I analysed the differences between the Python environments. The main difference was the "Install mandatory packages" option.
My DSS is version: 9.0.3
Thanks for sharing further details.
What suggested to me that we might be looking at a similar situation is that we are both trying to write out a dataset from a Python recipe. Our datasets seem to get corrupted when we try to export more than 100,000 records. (My records happen to describe a file system.) However, I've determined that that part of my code seems to be doing the right thing. Where I run into problems is when trying to write out more than 100,000 records from a Python recipe.
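One way to narrow down whether the write step is the culprit is a round-trip sanity check: build more than 100,000 rows, write them out, read them back, and compare counts. The sketch below uses an in-memory CSV as a stand-in for the DSS dataset writer (a real recipe would write through the dataiku API instead), so it only tests the pattern, not DSS itself:

```python
import io

import pandas as pd

# Build a frame larger than the 100,000-record threshold mentioned above.
n = 150_000
df = pd.DataFrame({"id": range(n), "payload": [f"row-{i}" for i in range(n)]})

# Stand-in for the dataset write/read cycle (illustration only; the real
# recipe would use the dataiku dataset writer rather than CSV).
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

# If the write step is healthy, counts and uniqueness survive the trip.
assert len(back) == n
assert back["id"].is_unique
print("round trip OK:", len(back), "rows,", back["id"].nunique(), "unique ids")
```

If this kind of check passes outside DSS but the same data comes back duplicated from the dataset writer, that points at the environment or the writer rather than your own transformation code.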