Duplicates generated by recipe

Batpig
Batpig Registered Posts: 2 ✭✭✭

Hi everyone,

My DSS server exhibits a strange behaviour:

Every table generated by a recipe contains duplicates: if I feed in a 100,000-line table without performing any operation on it, the output still has 100,000 lines, but some rows are duplicated in place of the original data.

Has anyone already experienced the same issue?

Below is a simple example:

The input table:

[screenshot: DSS_1.GIF]

The In/Out recipe:

[screenshot: DSS_2.GIF]

The output, with duplicates:

[screenshot: DSS_3.GIF]

Best regards,

Baptiste

Answers

  • JeremieP
    JeremieP Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner Posts: 7 Dataiker

    Hi Baptiste,

    In your Python recipe, can you go to the "Input/Output" tab and check that, for your output dataset, the option "Append instead of overwrite" is not activated?
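    For reference, a typical Python recipe write looks roughly like the sketch below (the dataset names are placeholders, not your actual ones). write_with_schema() replaces the output on each run; if "Append instead of overwrite" is activated, each run instead adds its rows on top of the existing ones, which produces duplicates.

        import dataiku

        # Placeholder dataset names, for illustration only
        input_ds = dataiku.Dataset("my_input")
        output_ds = dataiku.Dataset("my_output")

        df = input_ds.get_dataframe()

        # Overwrites the output dataset on each run, unless the recipe's
        # "Append instead of overwrite" option is activated in the Input/Output tab
        output_ds.write_with_schema(df)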

  • Batpig
    Batpig Registered Posts: 2 ✭✭✭

    Hi Jeremie,

    Thanks for your answer; indeed, the option wasn't activated.

    Looks like the problem is with my Python environment, as the problem doesn't occur with virtualenvs.

    Baptiste

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    Hi @Batpig

    I have exactly the same issue!

    What did you change in your python env to solve this?

    I have many packages installed and don't know which one is causing the issue.

    Thx

    Ruben

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    I found out the issue.

    You need to select "Install mandatory set of packages (you won't be able to use Dataiku APIs without this)".

    We were missing the pandas package, and that caused the error.
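    If it helps, a quick sanity check you can put at the top of a Python recipe to confirm that the selected code environment actually has the mandatory packages (this is just a sketch):

        import sys

        # Fail loudly if the code environment is missing a package that the
        # Dataiku APIs depend on (pandas in particular)
        try:
            import pandas as pd
            import dataiku
            print("Python:", sys.executable, "- pandas", pd.__version__)
        except ImportError as exc:
            raise RuntimeError("Code environment is missing a mandatory package: %s" % exc)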

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @Rubenl92,

    Thanks for your insights. I think I have been having a similar problem on one of my DSS instances. You can find a description of these problems at https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure-that-list-paths-in/m-p/21317#M1330. Does this sound like the problem you are describing? Data coming out of a Python recipe after 100,000 records is being “duplicated”, or, perhaps a better description, overwritten by later data.
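    A quick way I have been checking whether an output really contains duplicated rows, just a sketch with a placeholder dataset name:

        import dataiku

        # Placeholder dataset name; read the recipe output back and compare
        # the total row count with the distinct row count
        df = dataiku.Dataset("my_output").get_dataframe()
        print("total rows:   ", len(df))
        print("distinct rows:", len(df.drop_duplicates()))

        # Show a few of the rows that appear more than once
        print(df[df.duplicated(keep=False)].head(20))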

    What version of DSS are you all using? I’ve seen my similar problem with DSS v9.0.5 on macOS. I’ve not seen the problem on DSS v10.0.0 or v10.0.2.

    Can you say a bit more about the steps you took to resolve your issue? I’m fairly sure that I have pandas and the correct DSS libraries installed in all of the correct places on my instance. I’m wondering what steps you took to get to your resolution.

    Thanks for sharing your insights

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    Hi,

    I am not sure we have exactly the same issue.

    I changed my Python environment (we have multiple) and used the same Python script that was not working before.

    The output dataset was now suddenly correct. Then I analysed the differences between the Python environments; the main difference was the "Install mandatory set of packages" option.

    My DSS version is 9.0.3.

    Kind regards

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @Rubenl92

    Thanks for sharing further details.

    What suggested to me that we might be looking at a similar situation is that we are both trying to write out a dataset from a Python recipe. Our datasets seem to get corrupted when we try to export more than 100,000 records. (My records happen to describe a file system.) However, I've determined that that part of my code seems to be doing the right thing; where I run into problems is when trying to write out more than 100,000 records from the Python recipe.
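    For what it's worth, here is a rough sketch of the kind of write I am describing, done in chunks so it is easier to see at which record count things start to go wrong (the dataset names and chunk size are placeholders):

        import dataiku

        # Placeholder dataset names
        input_ds = dataiku.Dataset("my_input")
        output_ds = dataiku.Dataset("my_output")

        df = input_ds.get_dataframe()
        output_ds.write_schema_from_dataframe(df)

        # Write the dataframe in chunks and log progress, so a failure past
        # ~100,000 rows is easier to spot
        writer = output_ds.get_writer()
        try:
            chunk_size = 50000
            for start in range(0, len(df), chunk_size):
                writer.write_dataframe(df.iloc[start:start + chunk_size])
                print("wrote rows", start, "to", min(start + chunk_size, len(df)))
        finally:
            writer.close()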

    --Tom
