Visual Join Recipe: Join on a Constant condition not a column value in the other dataset
User Story:
As an analyst who is selectively joining data through the visual prepare recipe, I'd like to be able to add a constant as one of the join conditions, so that I can make more selective joins through the visual prepare recipes without having to write SQL from scratch.
Conditions of Satisfaction (COS):
- Well documented
- Warning if only a constant condition is used in order to avoid Cartesian product results.
Notes:
Today you can join on a column value coming from either dataset. But you can not pick a constant. I have a use case where a join has to be done only if a value in a column is equal to a certain constant value and that value does not appear in the table being joined. For example I can not model the following SQL in the visual Join recipe (without writing a minimally documented advanced join).
Select *
From t_customer as tc
left join t_address as ta
on tc.customer_no = ta.customer_no
and ta.address_type = 'primary_address'
In a very real case like this for me, the constant is being used because the database has other address types and we want all customer records. We do not want to duplicate customer records by adding result records for different address types like 'home_address'. Finally even if we do not have a primary address defined. We do want all of the customer records. This kind of join can not be done with a where clause.
I do note that one can use a "custom SQL condition" in a visual prepare recipe. But it is hard for the "analyst" persona to figure out how to use this more advanced setup. For example to do the above join, one would use the following condition. Assuming that t_customer is $left and t_address is $right:
$left."customer_no" = $right."customer_no"
and $left."address_type" = 'primary_address'
There appears to be little or no documentation on the "Advanced join" other than this comment in the knowledge base.
Advanced join | Provides custom options for row selection and deduplication for when none of the other options are suitable. |
And these some what cryptic notes on the actual dialog box for custom join conditions.
Write a custom SQL condition. It will be copied in the request as:
SELECT (...) FROM (...) JOIN (...) ON <your_code_here>
You should use $left and $right to refer to both datasets.
The $left and $right tokens will be replaced by the corresponding quoted
table names.
It is nice that we can do this. This is a request for at least better documentation of this advanced feature. But preferably and option to use a constant rather than a table for the right part of the join conditions.
Comments
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
After further research on this topic. I note that a "Pre Filter" might work for this use case. In this case one would pre-filter the right table t_address to only permit joins on records for which address_type = 'primary_address'. Still there was a bunch or research needed to come to this conclusion.
However, if you are trying to do the constant from the left table in a left join. This approach would not work. -
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,161 Neuron
I am able to pre-filter on both sides of the join in a Join recipe.
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
The case that this would be "required" is for example when you need to do a left join and select which records to do the left join on from the left table.
So maybe this is a very small case. But, I'll leave it here for now.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,161 Neuron
That's what I mean Tom, the join recipe pre-filter allows you to pre-filter any of the joined tables by any condition. So I don't how you wouldn't be able to add a predicate to any join table like address_type = 'primary_address'. Is there any issues as to why you can't do the pre-filter on the left join table?
-
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
If the value is in the left table and you want to keep all of the records from the left table.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,161 Neuron
Ok I got you now.