Complex filters or queries on a dataset

cdutoit
cdutoit Registered Posts: 12 ✭✭✭✭

Hi -

I'd appreciate pointers on how to perform more complex queries on a dataset. I'm trying calibrate my thinking in terms of how Dataiku may approach it versus some of the other tools I've used.

In Knime, I'm able to use an index node to create indexes on my data, and then an index query node will let me do a more complex query on my dataset such as:

gender:M
AND (firstname:A* OR firstname:B* OR firstname:C*)
AND NOT (lastname:A* OR lastname:B* OR lastname:C*)
AND age:[20 TO 40]
AND PLZ:[30000 TO 40000]

My questions:

1. What approach should I consider in Dataiku? The Filter recipe doesn't support complex nested filters. Should I be using a Python recipe?

2. Do I first need to somehow create an index on my data? Assuming I had say 300k rows or something - or does this happen automagically 'behind the scenes'?

Thanks for any guidance or hints,

Chris

Tagged:

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,607 Neuron
    edited July 2024 Answer ✓

    @cdutoit

    Nice to see another Knime user around here.

    So having an index in your dataset if it does not already have an index is a lovely thing to have. But you do not need one for the type of filtering you are describing.

    If I really wanted/needed an index for some reason, for that size dataset you are describing, I'd be likely to use a python Step in a Prepare Recipe.

    Here are the basics on using the Python Code recipe in a dataset.

    Here is a quick bit of Python code I've used in the past for up to a million rows.

    Dataiku Prepare Recipe, Showing a Create column with Python Code recipe .  This just creates a index column

     count = 0
    def process(row):
    global count
    count = count + 1
    return count

    That being said. To filter you don't really need an index column. Again in the prepare recipe there is a type of step called filter rows / cells with formula. This type of step can handle the level of complexity you are describing. Here is the documentation on that kind of step.

    https://doc.dataiku.com/dss/10.0/preparation/processors/filter-on-formula.html

    Hope that helps. Let us all know how you are getting along.

Answers

  • cdutoit
    cdutoit Registered Posts: 12 ✭✭✭✭

    Hi @tgb417

    Thank you for your reply. Somehow I completely missed the ability to use a formula in the filter step, in addition to just doing it in the prepare recipe. One the major differences coming from Knime is that in Dataiku the prepare can do so much all within a single "node".

    I also hadn't thought of the ability to use python within a prepare! Thanks for your sample.

    Thanks again for the help

Setup Info
    Tags
      Help me…