Hi everyone,
My DSS server exhibits a strange behaviour:
Every table generated by a recipe contains duplicates: if I feed in a 100,000-line table without doing any operation on it, the output still has 100,000 lines, but with duplicated rows in place of some of the original data.
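Not part of the original post, but a quick way to quantify this kind of duplication once the output is loaded as a pandas DataFrame (the columns below are purely hypothetical):

```python
import pandas as pd

# Hypothetical recipe output: same row count as the input,
# but some original rows replaced by copies of other rows.
df = pd.DataFrame({"id": [1, 2, 2, 3, 3, 3],
                   "value": ["a", "b", "b", "c", "c", "c"]})

n_rows = len(df)                      # total rows
n_unique = len(df.drop_duplicates())  # distinct rows
n_dupes = df.duplicated().sum()       # rows that repeat an earlier row

print(f"{n_rows} rows, {n_unique} unique, {n_dupes} duplicates")
```

If `n_unique` is much smaller than `n_rows` while the row count matches the input, the symptom matches what is described above.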
Has anyone already experienced the same issue?
Below is a simple example:
The input table:
In / Out recipe
Output with duplicates:
Best regards,
Baptiste
Hi Baptiste,
In your Python recipe, can you go to the "Input/Output" tab and check that for your output dataset, the option "Append instead of overwrite" is not activated?
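To illustrate the hypothesis behind this check (a plain-pandas sketch, not Dataiku's actual implementation): with append mode active, each run adds the recipe's output to whatever the dataset already contains, so rerunning the same recipe duplicates every row.

```python
import pandas as pd

input_df = pd.DataFrame({"id": [1, 2, 3]})

# Overwrite mode: the output is replaced on every run.
output_overwrite = input_df.copy()   # first run
output_overwrite = input_df.copy()   # second run, still 3 rows

# Append mode: each run concatenates onto the existing output.
output_append = input_df.copy()                                          # first run
output_append = pd.concat([output_append, input_df], ignore_index=True)  # second run

print(len(output_overwrite), len(output_append))
```

After two runs, the overwritten dataset still has 3 rows while the appended one has 6, every row duplicated once.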
Hi Jeremie,
Thanks for your answer; indeed, the option wasn't activated.
Looks like the problem is with my Python env, as it doesn't occur with virtualenvs.
Baptiste
Hi @Batpig
I have exactly the same issue!
What did you change in your python env to solve this?
I have many packages installed and don't know which is the issue.
Thx
Ruben
I found out the issue.
You need to select "Install mandatory set of packages (you won't be able to use Dataiku APIs without this)".
We were missing the pandas package, and that caused the error.
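A quick way to confirm whether a given code environment actually has pandas (the package that turned out to be missing here), without triggering an ImportError:

```python
import importlib.util

# find_spec returns a module spec if the package is importable
# in the current environment, or None if it is missing.
spec = importlib.util.find_spec("pandas")
if spec is None:
    print("pandas is NOT installed in this code environment")
else:
    print("pandas found at:", spec.origin)
```

Running this from inside the recipe tells you what the recipe's own environment sees, which can differ from what is installed system-wide.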
Thanks for your insights. I think I have been having a similar problem on one of my DSS instances. You can find a description of these problems at https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure... . Does this sound like the problem you are describing? Data coming out of a Python recipe after 100,000 records is being "duplicated", or perhaps a better description, overwritten by later data.
What version of DSS are you all using? I've seen my similar problem with DSS v9.0.5 on Mac OS. I've also not seen the problem on DSS v10.0.0 or v10.0.2.
Can you say a bit more about the steps you took to resolve your issue? I'm fairly sure that I have pandas and the correct DSS libraries installed in all of the correct places on my instance. I'm wondering what steps you took to get to your resolution.
Thanks for sharing your insights
Hi,
I am not sure we have exactly the same issue.
I changed my Python environment (we have multiple) and reran the same Python script that was not working before.
The output dataset was now suddenly correct. Then I analysed the differences between the Python environments. The main difference was the "install mandatory packages" option.
My DSS is version: 9.0.3
Kind regards
Thanks for sharing further details.
What suggests to me that we might be looking at a similar situation is that we are also trying to write out a dataset from a Python recipe. Our datasets seem to get corrupted when we try to export more than 100,000 records. (My records happen to describe a file system.) I've determined that the part of my code that builds the data seems to be doing the right thing; the problem only appears when writing out more than 100,000 records from the Python recipe.
--Tom