Try your hand at analyzing royal sentiment in Dataiku DSS! Learn more

Filter recipe

Level 4
Filter recipe

Hi Team,

So I have pretty huge database and I need to filter only part of it, basically I need 2 countries out of 15, some part of a rx_otc column and dates after 2017.

Can I combine all 3 filters in one recipe? and as for country should I keep country equals = Germany and equals = France, or maybe instead of equals I should use contains to get both countries? [attached screen]

Or maybe it would be good to divide it into 2 filter recipes to get better performance? So first Select only 2 Countries, and then add this 2nd filter with date and Rx_otc? does it matter?

Thanks

Mateusz

 

 

 

 

6 Replies
Dataiker
Dataiker

Hi,

The way filter recipe works is that based on your conditions DSS will generate a SQL query that will be executed in your database.

Before executing a query databases build what's called an execution plan, at this stage they will try to decide how to query the data in the most efficient way. While doing it DB may reorganize your query, that's why it doesn't matter whether you have multiple "equals" conditions or on "contains" - most likely DB will convert it to the same execution plan.

Breaking one filter recipe into 2 is most likely a bad idea though, not only you will lose some time on writing intermediate result, your DB won't be able to choose the best way to filter data (like filtering first on a column with a higher selectivity).

 

You may also consider partitioning your dataset based on the columns you've mentioned. In this case your flow will be working with only a subset of data.

https://doc.dataiku.com/dss/latest/partitions/index.html

 

Regards,

Andrey Avtomonov
R&D Engineer @ Dataiku
Level 4
Author

Hi @Andrey 

So just to clarify, if I will set it as it is right now, my output will have rows:

France or Germany & rx_otc = rx & date after mentioned timeframe?

Thanks

Mateusz

Dataiker
Dataiker

yes, it makes sense to leave it as it is. (I just spotted that in the screenshot you have recordId next to France, not country - might be a small bug). 

Also, think about adding indices on the columns used for filtering, it'll improve the query performance

Andrey Avtomonov
R&D Engineer @ Dataiku
Level 4
Author

Hi @Andrey 

I am wondering, because once I converted this recipe to a sparkSQL after Where clause it is saying rx_otc=rx and date > 2018.... and country = GERMANY and country = FRANCE - shouldnt we have "or" between Germany and France to capture all rows for France and/or Germany and only rx_otc=rx and this timeframe?

Thanks

Mateusz

0 Kudos
Dataiker
Dataiker

Hi @emate ,

I guess I wasn't clean enough in my previous answer. I was explaining the difference between 2 types of queries:

country = "France" or country = "Germany"

vs

country in ("France", "Germany")

 

 

However in your case you're trying to combine some conditions with "and" and some with "or".

With the visual editor you can only combine all conditions with "or" or with "and"

The only way to do it in a filter recipe would be to use "SQL expression" that would look like:

"country" in ('France', 'Germany') and "rx_otc" = 'RX' and date > '2020-01-31'

 

Regards

 

Andrey Avtomonov
R&D Engineer @ Dataiku
Level 4
Author

Thank you for explaining - it worked with the SQL query. BTW i think SQL query works much faster than filter recipe? is that possible?

0 Kudos