Join that REMOVES matching rows
1 A
1 B
2 B
2 C
3 A
The first column with numerical values represents users and the second column with characters are their transactions. Each user has a unique identifier, but it is repeated in the file for every transaction.
I am trying to create a dataset where all users that purchased a certain item, for instance A, are completely removed. If I just delete the rows containing A (e.g. using a Prepare recipe), User 1 is still in the dataset via purchase B. I had a thought that this could be accomplished by identifying the list of unique users who had purchased A, and then using that list of users as a filter on the original dataset. I am not sure of the best way to accomplish this -- it seemed like the simplest way might be to do something like a left join, based on the filtered list, but set the condition that the output dataset REMOVES the rows that match.
Is this possible, or is there a better method? Or can this only be accomplished using SQL?
Best Answer
-
Hi,
What you describe is often described as the "ANTI JOIN" concept in SQL. I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.
Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:
SELECT *
FROM actor a
WHERE NOT EXISTS (
SELECT * FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.
Hope it helps,
Alex
Answers
-
Thank you, Alex.
-
AshleyW Dataiker, Alpha Tester, Dataiku DSS Core Designer, Registered, Product Ideas Manager Posts: 161 Dataiker
Hi @acrow6
,We've recently released a new version of Dataiku, 11.3, and you can now add additional output datasets to a Join recipe that contain all the unmatched rows from either a left, right, or inner join. This will let you easily create datasets with rows that don't match the join!
Anti-joins/exclusion joins are on our roadmap.
Cheers,
Ashley