Join that REMOVES matching rows

Solved!
acrow6
Level 1
Join that REMOVES matching rows
Hi, I have a dataset where some combination of columns results in a unique key:

1 A

1 B

2 B

2 C

3 A

The first column with numerical values represents users and the second column with characters are their transactions. Each user has a unique identifier, but it is repeated in the file for every transaction.

I am trying to create a dataset where all users that purchased a certain item, for instance A, are completely removed. If I just delete the rows containing A (e.g. using a Prepare recipe), User 1 is still in the dataset via purchase B. I had a thought that this could be accomplished by identifying the list of unique users who had purchased A, and then using that list of users as a filter on the original dataset. I am not sure of the best way to accomplish this -- it seemed like the simplest way might be to do something like a left join, based on the filtered list, but set the condition that the output dataset REMOVES the rows that match.

Is this possible, or is there a better method? Or can this only be accomplished using SQL?
0 Kudos
1 Solution
Alex_Combessie
Dataiker Alumni

Hi,



What you describe is often described as the "ANTI JOIN" concept in SQL.  I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.



Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:




SELECT *
FROM actor a
WHERE NOT EXISTS (
SELECT * FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)


This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.



Hope it helps,



Alex

View solution in original post

0 Kudos
3 Replies
Alex_Combessie
Dataiker Alumni

Hi,



What you describe is often described as the "ANTI JOIN" concept in SQL.  I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.



Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:




SELECT *
FROM actor a
WHERE NOT EXISTS (
SELECT * FROM film_actor fa
WHERE a.actor_id = fa.actor_id
)


This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.



Hope it helps,



Alex

0 Kudos
acrow6
Level 1
Author
Thank you, Alex.
0 Kudos
AshleyW
Dataiker

Hi @acrow6 ,

We've recently released a new version of Dataiku, 11.3, and you can now add additional output datasets to a Join recipe that contain all the unmatched rows from either a left, right, or inner join. This will let you easily create datasets with rows that don't match the join!

Anti-joins/exclusion joins are on our roadmap.

Cheers,

Ashley

0 Kudos