Remove duplicates in Prepare recipe

I am trying to remove duplicates in a Prepare recipe, but as far as I can tell that is not possible, even though I would think it's a pretty basic feature. My Prepare recipe contains an expression, so I cannot simply replace it with a Distinct recipe. I could apply a Distinct recipe before my Prepare recipe, but I am trying to avoid storing a large intermediate dataset. If I could somehow apply both steps in one, I would be happy. (I don't know whether enabling the merging of several Spark recipes into a single Spark job would help with that?)

Highly appreciate any advice here.






@NickPedersen,

Welcome to the Dataiku community.

You might want to take a look at Spark pipelines. Or, if your data lives in a SQL database, you can use SQL pipelines, described further in "How to enable SQL pipelines in the Flow". These are apparently implemented as views that get added to the SQL database, so the intermediate datasets are not created as tables taking up more space.

You will find that not all features are available when working this way. Here is a little more about "Where ... it all happens" that might be helpful to you.
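
To make the idea concrete, here is a minimal plain-Python sketch of what "both steps in one" means: rows are de-duplicated and the expression is applied in a single streaming pass, so the de-duplicated intermediate dataset is never written anywhere. This is only an illustration and not Dataiku's API; `distinct_then_prepare`, `add_price_with_tax` and the sample columns are hypothetical names I made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def distinct_then_prepare(
    rows: Iterable[Row],
    key_columns: List[str],
    expression: Callable[[Row], Row],
) -> Iterator[Row]:
    """Yield each distinct row once, with the expression applied.

    Rows are streamed one at a time; only the set of keys already seen
    is kept in memory, not the de-duplicated dataset itself.
    """
    seen = set()
    for row in rows:
        # The key values must be hashable (strings, numbers, ...).
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            continue  # duplicate: drop it before any further work
        seen.add(key)
        yield expression(row)


def add_price_with_tax(row: Row) -> Row:
    # Hypothetical stand-in for the expression in the Prepare recipe.
    prepared = dict(row)
    prepared["price_with_tax"] = row["price"] * 1.25
    return prepared


rows = [
    {"id": 1, "price": 10.0},
    {"id": 1, "price": 10.0},  # exact duplicate of the first row
    {"id": 2, "price": 4.0},
]

result = list(distinct_then_prepare(rows, ["id", "price"], add_price_with_tax))
```

With the sample rows above, `result` holds two rows, one per distinct `(id, price)` pair, each with the extra `price_with_tax` column. The trade-off is that the set of seen keys still has to fit in memory, which is one reason why pushing this kind of work down to Spark or the SQL database is usually preferable for large datasets.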


