
Duplicates generated by recipe

Batpig
Level 1

Hi everyone,

My DSS server exhibits a strange behaviour:

Every table generated by a recipe contains duplicates. If I feed in a 100,000-line table without performing any operation on it, the output also has 100,000 lines, but some of them are duplicates that replace the original data.

Has anyone already experienced the same issue?

Below is a simple example:

The input table:

DSS_1.GIF

In / Out recipe

DSS_2.GIF

 

Output with duplicates:

DSS_3.GIF

Best regards,

Baptiste
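
One way to put numbers on the symptom described above is to compare the input and output tables as pandas DataFrames. This is only a minimal sketch: the function name `summarize_duplicates` and the sample data are hypothetical, and it assumes both tables have already been loaded into DataFrames with the same columns.

```python
import pandas as pd


def summarize_duplicates(df_in: pd.DataFrame, df_out: pd.DataFrame) -> dict:
    """Count rows, exact duplicate rows, and input rows absent from the output."""
    # Left-join the distinct input rows onto the distinct output rows
    # (on all shared columns) to find input rows that were lost.
    merged = df_in.drop_duplicates().merge(
        df_out.drop_duplicates(), how="left", indicator=True
    )
    return {
        "rows_in": len(df_in),
        "rows_out": len(df_out),
        # Rows that repeat an earlier row exactly (all columns equal).
        "duplicates_in": int(df_in.duplicated().sum()),
        "duplicates_out": int(df_out.duplicated().sum()),
        # Distinct input rows that never appear in the output.
        "missing_from_out": int((merged["_merge"] == "left_only").sum()),
    }


# Hypothetical data: same row count, but rows 3 and 4 were replaced by copies.
sample_in = pd.DataFrame({"id": [1, 2, 3, 4], "value": ["a", "b", "c", "d"]})
sample_out = pd.DataFrame({"id": [1, 2, 1, 2], "value": ["a", "b", "a", "b"]})
print(summarize_duplicates(sample_in, sample_out))
```

Equal row counts together with a non-zero `missing_from_out` would match the behaviour reported here, whereas an output with more rows than the input would point to appended data instead.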

7 Replies
JeremieP
Dataiker

Hi Baptiste,

In your Python recipe, can you go to the "Input/Output" tab and check that the option "Append instead of overwrite" is not activated for your output dataset?

 

Batpig
Level 1
Author

Hi Jeremie,

Thanks for your answer. Indeed, the option wasn't activated.

It looks like the problem is with my Python env, as it doesn't occur with virtualenvs.

Baptiste

Rubenl92
Level 2

Hi @Batpig 

I have exactly the same issue!

What did you change in your Python env to solve this?

I have many packages installed and don't know which one is causing the issue.

Thx

Ruben

Rubenl92
Level 2

I found the issue.

 

You need to select "Install mandatory set of packages (you won't be able to use Dataiku APIs without this)"

 

We were missing the pandas package, and that caused the error.
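
A quick way to confirm whether a package such as pandas is visible to the interpreter a recipe runs under is to ask Python directly. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the module names in the example call are simply the ones discussed in this thread.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found by the current interpreter."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Run inside the recipe so the check uses that recipe's code environment.
# An empty list means every listed module can be located.
print(missing_packages(["pandas", "dataiku"]))
```

This only checks that the modules can be located, not that their versions are the ones DSS expects.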

tgb417
Neuron

@Rubenl92 ,

Thanks for your insights. I think I have been having a similar problem on one of my DSS instances. You can find a description of these problems at https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure... . Does this sound like the problem you are describing? Data coming out of a Python recipe after 100,000 records is being “duplicated” or, perhaps more accurately, overwritten by later data.

What version of DSS are you all using? I’ve seen my similar problem with DSS v9.0.5 on macOS, but I have not seen it on DSS v10.0.0 or v10.0.2.

Can you say a bit more about the steps you took to resolve your issue? I’m fairly sure that I have pandas and the correct DSS libraries installed in all of the correct places on my instance, so I’m wondering what steps you took to get to your resolution.

Thanks for sharing your insights 

--Tom
Rubenl92
Level 2

Hi,

I am not sure we have exactly the same issue.

 

I switched to a different Python environment (we have several) and ran the same Python script that had not been working before.

The output dataset was now suddenly correct. I then compared the two Python environments. The main difference was the "install mandatory packages" option.

My DSS version is 9.0.3.

Kind regards

tgb417
Neuron

@Rubenl92 

Thanks for sharing further details.

What suggests to me that we might be looking at a similar situation is that we are both trying to write out a dataset from a Python recipe. My datasets seem to get corrupted when I try to export more than 100,000 records. (My records happen to describe a file system.) I've determined that the part of my code that builds the records seems to be doing the right thing. Where I run into problems is when trying to write out more than 100,000 records from a Python recipe.

--Tom
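
To check whether the corruption really starts near the 100,000-record mark, it can help to locate the first output row that repeats an earlier one. The following is a rough sketch, assuming the written dataset has been read back into a pandas DataFrame and that the source data contains no legitimate duplicate rows. The function name `first_repeated_row` is hypothetical.

```python
import pandas as pd


def first_repeated_row(df: pd.DataFrame):
    """Return the 0-based position of the first row that repeats an earlier row, or None."""
    # duplicated(keep="first") marks every row that exactly matches an earlier row.
    repeated = df.duplicated(keep="first").to_numpy()
    if not repeated.any():
        return None
    # argmax on a boolean array gives the position of the first True.
    return int(repeated.argmax())


# Hypothetical small example: the fourth row (position 3) repeats the first.
sample = pd.DataFrame({"path": ["/a", "/b", "/c", "/a", "/b"]})
print(first_repeated_row(sample))
```

A position at or just past 100,000 on the real output would be consistent with the threshold described in this thread. A much earlier position would suggest the duplicates come from somewhere else.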
