Remove duplicates in Prepare recipe
Hi,
I am trying to remove duplicates in a Prepare recipe, but as far as I can tell that is not possible, even though I would think it's a pretty basic feature. I have an expression in the Prepare recipe, so I cannot use the Distinct recipe on its own. I could apply a Distinct recipe before my Prepare recipe, but I am trying to avoid storing a large intermediate dataset. If I could somehow apply both steps in one, I would be happy. (I don't know whether enabling the merging of several Spark recipes into a single Spark job would help with that?)
Highly appreciate any advice here.
Thanks
Answers
tgb417
Welcome to the Dataiku community.
You might want to take a look at Spark pipelines. Or, if your data lives in a SQL database, you can use SQL pipelines, described further in "How to enable SQL pipelines in the Flow". These are apparently implemented as views that get added to the SQL database, so the intermediate datasets don't get created as tables taking up more space.
You will find that not all features are available when working in this way. Here is a little more about "Where ... it all happens" that might be helpful to you.
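To illustrate the idea of a view standing in for the intermediate dataset, here is a minimal sketch using Python's built-in sqlite3 module. This is not how Dataiku actually builds its SQL pipelines, just the general principle as I understand it. The table name raw_orders, the view name prepared_orders, the columns and the cleanup expression are all made up for the example.

```python
import sqlite3

# In-memory database standing in for the SQL database behind the Flow (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical input dataset containing rows that become duplicates once cleaned.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" alice", 10.0), ("Alice ", 10.0), ("bob", 5.0), ("bob", 5.0)],
)

# The "Prepare" step written as a view: the expression is evaluated on read,
# so no intermediate table is stored.
conn.execute(
    """
    CREATE VIEW prepared_orders AS
    SELECT lower(trim(customer)) AS customer, amount
    FROM raw_orders
    """
)

# The "Distinct" step reads straight from the view.
rows = conn.execute(
    "SELECT DISTINCT customer, amount FROM prepared_orders ORDER BY customer"
).fetchall()

# Only the input table exists as a stored table; the prepared data is just a view.
tables = [
    r[0]
    for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

print(rows)
print(tables)
```

The order of the two steps can be swapped if you want to deduplicate before the expression is applied. Either way, only the final result needs to be written out.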