Filter recipe
Hi Team,
So I have pretty huge database and I need to filter only part of it, basically I need 2 countries out of 15, some part of a rx_otc column and dates after 2017.
Can I combine all 3 filters in one recipe? and as for country should I keep country equals = Germany and equals = France, or maybe instead of equals I should use contains to get both countries? [attached screen]
Or maybe it would be good to divide it into 2 filter recipes to get better performance? So first Select only 2 Countries, and then add this 2nd filter with date and Rx_otc? does it matter?
Thanks
Mateusz
Best Answers
-
Hi,
The way filter recipe works is that based on your conditions DSS will generate a SQL query that will be executed in your database.
Before executing a query databases build what's called an execution plan, at this stage they will try to decide how to query the data in the most efficient way. While doing it DB may reorganize your query, that's why it doesn't matter whether you have multiple "equals" conditions or on "contains" - most likely DB will convert it to the same execution plan.
Breaking one filter recipe into 2 is most likely a bad idea though, not only you will lose some time on writing intermediate result, your DB won't be able to choose the best way to filter data (like filtering first on a column with a higher selectivity).
You may also consider partitioning your dataset based on the columns you've mentioned. In this case your flow will be working with only a subset of data.
https://doc.dataiku.com/dss/latest/partitions/index.html
Regards,
-
Hi @emate
,I guess I wasn't clean enough in my previous answer. I was explaining the difference between 2 types of queries:
country = "France" or country = "Germany"
vs
country in ("France", "Germany")
However in your case you're trying to combine some conditions with "and" and some with "or".
With the visual editor you can only combine all conditions with "or" or with "and"
The only way to do it in a filter recipe would be to use "SQL expression" that would look like:
"country" in ('France', 'Germany') and "rx_otc" = 'RX' and date > '2020-01-31'
Regards
Answers
-
yes, it makes sense to leave it as it is. (I just spotted that in the screenshot you have recordId next to France, not country - might be a small bug).
Also, think about adding indices on the columns used for filtering, it'll improve the query performance
-
Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭
Hi @Andrey
I am wondering, because once I converted this recipe to a sparkSQL after Where clause it is saying rx_otc=rx and date > 2018.... and country = GERMANY and country = FRANCE - shouldnt we have "or" between Germany and France to capture all rows for France and/or Germany and only rx_otc=rx and this timeframe?
Thanks
Mateusz
-
Mateusz Dataiku DSS Core Designer, Neuron 2020, Registered, Neuron 2021, Neuron 2022 Posts: 91 ✭✭✭✭✭✭
Thank you for explaining - it worked with the SQL query. BTW i think SQL query works much faster than filter recipe? is that possible?