Duplicates generated by recipe

Batpig
Batpig Registered Posts: 2 ✭✭✭

Hi everyone,

My DSS server exhibits a strange behaviour:

Every table generated by a recipe contains duplicates: if I feed in a 100,000-line table without doing any operations on it, the result is a 100,000-line output, but with duplicates inside instead of the original data.

Has anyone already experienced the same issue?

Below is a simple example:

The input table:

[screenshot: DSS_1.GIF]

The In/Out recipe:

[screenshot: DSS_2.GIF]

The output with duplicates:

[screenshot: DSS_3.GIF]

Best regards,

Baptiste

Answers

  • JeremieP
    JeremieP Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner Posts: 7 Dataiker

    Hi Baptiste,

    In your Python recipe, can you go to the "Input/Output" tab and check that, for your output dataset, the option "Append instead of overwrite" is not activated?
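
    For reference, a minimal pass-through recipe would look something like the sketch below (the dataset names are placeholders for yours). With the append option unchecked, write_with_schema replaces the output contents on every run instead of adding to them:

        # Minimal sketch of a pass-through Python recipe.
        # "input_table" and "output_table" are placeholder dataset names.
        import dataiku

        input_ds = dataiku.Dataset("input_table")
        output_ds = dataiku.Dataset("output_table")

        # Read the whole input as a pandas DataFrame and write it back out.
        # With "Append instead of overwrite" unchecked, this overwrites the
        # output dataset on each run rather than appending to it.
        df = input_ds.get_dataframe()
        output_ds.write_with_schema(df)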

  • Batpig
    Batpig Registered Posts: 2 ✭✭✭

    Hi Jeremie,

    Thanks for your answer; indeed, the option wasn't activated.

    Looks like the problem is with my Python env, as it doesn't occur with virtualenvs.

    Baptiste

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    Hi @Batpig

    I have exactly the same issue!

    What did you change in your python env to solve this?

    I have many packages installed and don't know which is the issue.

    Thx

    Ruben

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    I found out the issue.

    You need to select "Install mandatory set of packages (you won't be able to use Dataiku APIs without this)".

    We were missing the pandas package, and that caused the error.
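
    If anyone wants to confirm their code environment is complete, a rough sanity check like this inside a Python recipe will fail immediately if pandas or the dataiku package is missing:

        # Quick sanity check for the code environment used by the recipe.
        # Both imports must succeed for dataset reads/writes to work properly.
        import pandas as pd
        import dataiku

        print("pandas version:", pd.__version__)
        print("dataiku package imported OK")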

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @Rubenl92,

    Thanks for your insights. I think I have been having a similar problem on one of my DSS instances. You can find a description of these problems at https://community.dataiku.com/t5/Plugins-Extending-Dataiku/Is-there-a-limit-to-a-directory-structure-that-list-paths-in/m-p/21317#M1330 . Does this sound like the problem you are describing? Data coming out of a Python recipe after 100,000 records is being "duplicated", or, maybe a better description, overwritten by later data.

    What version of DSS are you all using? I've seen my similar problem with DSS v9.0.5 on macOS. I've not seen the problem on DSS v10.0.0 or v10.0.2.

    Can you say a bit more about the steps you took to resolve your issue? I'm fairly sure that I have pandas and the correct DSS libraries installed in all of the correct places on my instance, so I'm wondering what got you to your resolution.

    Thanks for sharing your insights

  • Rubenl92
    Rubenl92 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Registered Posts: 4 ✭✭✭✭

    Hi,

    I am not sure we have exactly the same issue.

    I changed my Python environment (we have multiple) and reran the same Python script that was not working before.

    The output dataset was now suddenly correct. Then I analysed the differences between the Python environments; the main difference was the "Install mandatory set of packages" option.

    My DSS version is 9.0.3.

    Kind regards

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @Rubenl92

    Thanks for sharing further details.

    What suggested to me that we might be looking at a similar situation is that we are both trying to write out a dataset from a Python recipe, and our datasets seem to get corrupted when we try to export more than 100,000 records. (My records happen to describe a file system.) However, I've determined that that part of my code seems to be doing the right thing; where I get into problems is when writing out more than 100,000 records from the Python recipe.
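
    In case it helps compare symptoms, a rough check along these lines (with "output_table" standing in for the real output dataset name) would quantify how many rows come out duplicated:

        # Count how many rows in the output are duplicates of earlier rows.
        # "output_table" is a placeholder for the actual output dataset name.
        import dataiku

        df = dataiku.Dataset("output_table").get_dataframe()
        total = len(df)
        distinct = len(df.drop_duplicates())
        print(f"{total} rows total, {distinct} distinct, {total - distinct} duplicated")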

    --Tom
