Error in a deduplication Python recipe

UserBird (Dataiker, Alpha Tester) Posts: 535

Hello,

I'm trying to remove duplicates in a dataset using a Python recipe in the form of "unique_records_for_cols = XXX_df.drop_duplicates(cols=['AAA', 'BBB'])"

My recipe seems correct (I'm successfully using a similar one on another dataset), yet the build keeps failing after a couple of minutes, with the following log:


java.io.IOException: Process return code is 137
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedRecipeRunner.execute(AbstractCodeBasedRecipeRunner.java:213)
at com.dataiku.dip.dataflow.exec.AbstractCodeBasedRecipeRunner.execute(AbstractCodeBasedRecipeRunner.java:196)
at com.dataiku.dip.dataflow.exec.AbstractPythonRecipeRunner.executeScript(AbstractPythonRecipeRunner.java:29)
at com.dataiku.dip.recipes.code.PythonRecipeRunner.run(PythonRecipeRunner.java:73)
at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:303)

I've run out of ideas as to where that might come from. Any suggestions?

Thanks in advance,

Julien

Answers

  • jrouquie (Dataiker Alumni, Posts: 87)
    Hi dear user with the same initials as me ;-)

    Error 137 means that the Python process was killed (137 = 128 + signal 9, SIGKILL), most likely by the kernel because it was consuming too much memory; see http://stackoverflow.com/questions/19189522/what-does-killed-mean

    To drop duplicates on a large dataset, you could add memory to the server, use clever Python code that loads only part of the dataset into memory at a time (for instance only some of the columns, or one chunk at a time), or switch to another processing engine such as SQL, or Hadoop if the dataset gets very large. A minimal sketch of the chunked-Python approach is below.
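
    For the "only part of the dataset in memory" route, here is a minimal sketch of chunked deduplication with pandas, assuming the data can be read from a file (the path, key columns and chunk size are placeholder assumptions; inside DSS you would read the input dataset through the Dataiku Python API instead):

    import pandas as pd

    # Placeholders (hypothetical): adjust the path, key columns and chunk size to your data.
    INPUT_PATH = "my_dataset.csv"
    KEY_COLS = ["AAA", "BBB"]
    CHUNK_SIZE = 100_000

    seen_keys = set()      # (AAA, BBB) pairs kept so far
    unique_chunks = []

    for chunk in pd.read_csv(INPUT_PATH, chunksize=CHUNK_SIZE):
        # Deduplicate inside the chunk (recent pandas uses `subset`, not `cols`).
        chunk = chunk.drop_duplicates(subset=KEY_COLS)
        # Drop rows whose key already appeared in an earlier chunk.
        keys = list(chunk[KEY_COLS].itertuples(index=False, name=None))
        keep = [k not in seen_keys for k in keys]
        chunk = chunk[keep]
        seen_keys.update(k for k, kept in zip(keys, keep) if kept)
        unique_chunks.append(chunk)

    unique_records_for_cols = pd.concat(unique_chunks, ignore_index=True)

    Only the keys seen so far and the deduplicated rows are held in memory between chunks, which is usually far smaller than loading the whole dataset at once.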
  • JBR (Registered, Posts: 6)
    Thanks a lot for the extremely quick answer, Jean-Baptiste; I'll try to find a workaround.

    Have a nice day.

    Julien