Complex filters or queries on a dataset

Solved!
cdutoit
Level 3
Complex filters or queries on a dataset

Hi - 

I'd appreciate pointers on how to perform more complex queries on a dataset. I'm trying calibrate my thinking in terms of how Dataiku may approach it versus some of the other tools I've used.

In Knime, I'm able to use an index node to create indexes on my data, and then an index query node will let me do a more complex query on my dataset such as:

gender:M
AND (firstname:A* OR firstname:B* OR firstname:C*)
AND NOT (lastname:A* OR lastname:B* OR lastname:C*)
AND age:[20 TO 40]
AND PLZ:[30000 TO 40000]

My questions:

1. What approach should I consider in Dataiku? The Filter recipe doesn't support complex nested filters. Should I be using a Python recipe?

2. Do I first need to somehow create an index on my data? Assuming I had say 300k rows or something - or does this happen automagically 'behind the scenes'?

Thanks for any guidance or hints,

 

Chris

 

 

0 Kudos
1 Solution
tgb417

@cdutoit 

Nice to see another Knime user around here.

So having an index in your dataset if it does not already have an index is a lovely thing to have.  But you do not need one for the type of filtering you are describing.

If I really wanted/needed an index for some reason, for that size dataset you are describing, I'd be likely to use a python Step in a Prepare Recipe.

Here are the basics on using the Python Code recipe in a dataset.

Here is a quick bit of Python code I've used in the past for up to a million rows.

 

Dataiku Prepare Recipe, Showing a Create column with Python Code recipe .  This just creates a index columnDataiku Prepare Recipe, Showing a Create column with Python Code recipe . This just creates a index column

 count = 0
def process(row):
global count
count = count + 1
return count

That being said.  To filter you don't really need an index column.  Again in the prepare recipe there is a type of step called filter rows / cells with formula.  This type of step can handle the level of complexity you are describing.  Here is the documentation on that kind of step.

https://doc.dataiku.com/dss/10.0/preparation/processors/filter-on-formula.html 

Hope that helps.  Let us all know how you are getting along.

 

--Tom

View solution in original post

2 Replies
tgb417

@cdutoit 

Nice to see another Knime user around here.

So having an index in your dataset if it does not already have an index is a lovely thing to have.  But you do not need one for the type of filtering you are describing.

If I really wanted/needed an index for some reason, for that size dataset you are describing, I'd be likely to use a python Step in a Prepare Recipe.

Here are the basics on using the Python Code recipe in a dataset.

Here is a quick bit of Python code I've used in the past for up to a million rows.

 

Dataiku Prepare Recipe, Showing a Create column with Python Code recipe .  This just creates a index columnDataiku Prepare Recipe, Showing a Create column with Python Code recipe . This just creates a index column

 count = 0
def process(row):
global count
count = count + 1
return count

That being said.  To filter you don't really need an index column.  Again in the prepare recipe there is a type of step called filter rows / cells with formula.  This type of step can handle the level of complexity you are describing.  Here is the documentation on that kind of step.

https://doc.dataiku.com/dss/10.0/preparation/processors/filter-on-formula.html 

Hope that helps.  Let us all know how you are getting along.

 

--Tom
cdutoit
Level 3
Author

Hi @tgb417 

Thank you for your reply. Somehow I completely missed the ability to use a formula in the filter step, in addition to just doing it in the prepare recipe. One the major differences coming from Knime is that in Dataiku the prepare can do so much all within a single "node".

I also hadn't thought of the ability to use python within a prepare! Thanks for your sample.

Thanks again for the help