Removing duplicates returns 0 rows in the end

Solved!
impossibletovi
Level 2

Hello! I've been trying to create a Python recipe that removes duplicates based on a column and keeps the last occurrence, but when I run it on my dataset it removes every single row in the dataframe.

 

My code was:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd

# Recipe inputs
df = dataiku.Dataset("OTD_2020-2023").get_dataframe()

#DROP duplicates based in `processo` column.
df.drop_duplicates(subset=["PROCESSO"], inplace=True)

# Recipe outputs
out = dataiku.Dataset("OTD_2020-2023").write_with_schema(df)

 

I also tried, in another version of the code, adding an IF condition so that drop_duplicates only runs if a specific column contains duplicates.

Does anyone know what is going on here?

0 Kudos
1 Solution
konathan
Level 3

Hi @impossibletovi !

 

I've noticed that you are writing your output to the same dataset that you have as input. Could you try to write the output to another dataset and check if the issue is resolved?

You also mentioned that you want to keep the last occurrences of the duplicated records, so you need to add keep='last' in the drop_duplicates() because, by default, it keeps the first occurrences.
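
As a minimal sketch of the `keep='last'` behaviour in plain pandas (the sample rows, the `STATUS` column, and the helper name `dedupe_keep_last` are made up for illustration; only the `PROCESSO` column name comes from the question):

```python
import pandas as pd


def dedupe_keep_last(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame keeping only the last row for each value in `column`."""
    # keep="last" overrides the default keep="first";
    # the input frame is left untouched because inplace is not used.
    return df.drop_duplicates(subset=[column], keep="last").reset_index(drop=True)


# Hypothetical sample data, not taken from the real dataset.
sample = pd.DataFrame(
    {
        "PROCESSO": ["A1", "B2", "A1", "C3", "B2"],
        "STATUS": ["open", "open", "closed", "open", "closed"],
    }
)

result = dedupe_keep_last(sample, "PROCESSO")
print(result)
```

In the recipe itself, the result would then be written to a separate output dataset, for example `dataiku.Dataset("OTD_2020-2023_dedup").write_with_schema(result)`, where the output name is only a placeholder. A separate IF check for duplicates should not be needed, since drop_duplicates() simply returns all rows when there are none.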

 

-Konstantina

3 Replies
impossibletovi
Level 2
Author

Thank you! I set a different dataset for the output and it worked.

0 Kudos

Glad I could help! 😊