Join that REMOVES matching rows

acrow6 Registered Posts: 3 ✭✭✭✭
Hi, I have a dataset where some combination of columns results in a unique key:

1 A

1 B

2 B

2 C

3 A

The first column with numerical values represents users and the second column with characters are their transactions. Each user has a unique identifier, but it is repeated in the file for every transaction.

I am trying to create a dataset where all users that purchased a certain item, for instance A, are completely removed. If I just delete the rows containing A (e.g. using a Prepare recipe), User 1 is still in the dataset via purchase B. I had a thought that this could be accomplished by identifying the list of unique users who had purchased A, and then using that list of users as a filter on the original dataset. I am not sure of the best way to accomplish this -- it seemed like the simplest way might be to do something like a left join, based on the filtered list, but set the condition that the output dataset REMOVES the rows that match.

Is this possible, or is there a better method? Or can this only be accomplished using SQL?

Best Answer

  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    edited July 18 Answer ✓


    What you describe is often described as the "ANTI JOIN" concept in SQL. I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.

    Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:

    SELECT *
    FROM actor a
    SELECT * FROM film_actor fa
    WHERE a.actor_id = fa.actor_id

    This is documented online in this article: For other SQL flavours, similar articles certainly exist to help you write the right script.

    Hope it helps,



Setup Info
      Help me…