There is currently no way to remove duplicate rows in a visual preparation recipe* (a visual recipe works more or less row by row and cannot operate on a full column, because it is designed for big data).
It's possible to do so in a visual GROUP recipe: click “Show mass actions”, select all columns, click “use as grouping keys”. If the CSV is very big, I suggest synchronizing to a SQL DB first.
You can also do so in coding recipes:
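As a rough illustration of what a SQL-based recipe would compute, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the SQL DB. The table name, column names and sample rows are hypothetical; in Dataiku only the SELECT statement would go into the SQL recipe. It also shows that grouping on every column, as in the GROUP recipe above, returns the same rows as SELECT DISTINCT.

```python
import sqlite3

# In-memory database standing in for the SQL DB the dataset was synced to.
conn = sqlite3.connect(":memory:")

# Hypothetical table and columns, for illustration only.
conn.execute("CREATE TABLE input_dataset (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO input_dataset VALUES (?, ?)",
    [("alice", "Paris"), ("alice", "Paris"), ("bob", "Lyon")],
)

# Option 1: SELECT DISTINCT keeps one copy of each fully identical row.
distinct_rows = conn.execute(
    "SELECT DISTINCT customer, city FROM input_dataset ORDER BY customer"
).fetchall()

# Option 2: grouping on all columns, which is what the GROUP recipe does
# when every column is used as a grouping key.
grouped_rows = conn.execute(
    "SELECT customer, city FROM input_dataset "
    "GROUP BY customer, city ORDER BY customer"
).fetchall()

conn.close()
```

Both queries push the work to the database, which is why synchronizing a large file to SQL first tends to help.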
* There is actually one way to do it in a visual preparation recipe, with a custom Python function, but that will not work all the time (if the recipe is multi-threaded), so I would not recommend this trick:
I hope that helps,
Jeremy
Hi,
In a Python recipe I would do:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd
# Recipe inputs
df = dataiku.Dataset("input_dataset").get_dataframe()
# Keep the first row for each value of the key column
df.drop_duplicates(subset=["my_key_column"], inplace=True)
# or, to compare all columns when looking for duplicates:
# df.drop_duplicates(inplace=True)
# Recipe outputs
out = dataiku.Dataset("output_dataset")
out.write_with_schema(df)
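To see the difference between the two drop_duplicates variants without a Dataiku instance, here is a small sketch on a toy DataFrame. The column names and values are made up for illustration; only pandas is assumed.

```python
import pandas as pd

# Toy data with hypothetical columns, not taken from a real dataset.
df = pd.DataFrame(
    {
        "my_key_column": ["a", "a", "b", "b"],
        "value": [1, 1, 2, 3],
    }
)

# Compare only the key column: one row per key, the first occurrence is kept.
by_key = df.drop_duplicates(subset=["my_key_column"])

# Compare every column: only fully identical rows are collapsed.
by_row = df.drop_duplicates()
```

With subset, the row ("b", 3) is dropped because its key was already seen; without it, that row survives since it differs from ("b", 2) in the value column.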
Matt