Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Added on March 9, 2016 10:58PM
Likes: 0
Replies: 6
There is currently no way to do that in a visual preparation recipe* (because a visual recipe more or less works row by row, and it cannot work on a full column, as it is designed for big data).
It's possible to do so in a visual GROUP recipe: click “Show mass actions”, select all columns, click “use as grouping keys”. If the csv is very big, I suggest synchronizing to a SQL DB first.
You can also do so in coding recipes:
* There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:
I hope that helps,
Jeremy
Hi,
In a python recipe I would do:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()
df.drop_duplicates(subset=["my_key_colum"], inplace=True)
# or
# df.drop_duplicates(inplace=True)
# to use all columns to compare for duplicates
# Recipe outputs
out = dataiku.Dataset("output_dataset").write_with_schema(df)
Matt
Hello, sorry bringing the thread up, I'd need support because I'm trying deduplicating my output with Python but I can't understand how to edit the code. I need to deduplicate TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined because - as you can see from the screen below - I have for the same ID (first column) duplicate values. Indeed the column CD_BOOKING also has the same repeating values.
This is the Pyton code that I haven't touched yet.
# -*- coding: utf-8 -*- import dataiku import pandas as pd, numpy as np from dataiku import pandasutils as pdu # Read recipe inputs TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined = dataiku.Dataset("TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined") TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined.get_dataframe() # Compute recipe outputs from inputs # TODO: Replace this part by your actual code that computes the output, as a Pandas dataframe # NB: DSS also supports other kinds of APIs for reading and writing data. Please see doc. phone_deduplicated_df = TDD_FT_CUSTOMER_JOURNEY_joined_filtered_joined_df # For this sample code, simply copy input to output # Write recipe outputs phone_deduplicated = dataiku.Dataset("phone_deduplicated") phone_deduplicated.write_with_schema(phone_deduplicated_df)
Hi Chemical, I suggest you start a new thread since your problem is different. Also you say you want to do it in SQL but you posted Python code.
Ciao Turribeach, I've slightly adjusted my post in order to be better understandable now, I think my question fits enough with what was asked in this thread.
Hi, because you are not the original poster of the question you can't mark the question as answered or select the correct response. So I suggest you repost as a new thread.