Keep only rows that has timestamp between start and stop columns from another dataset

Fragan
Fragan Registered Posts: 3 ✭✭✭✭
edited July 18 in Using Dataiku

Hi, im kinda new to DSS.



I have a huge Dataset (with 20M+ rows), this dataset has a timestamp column, and 20 other columns.

I have another Dataset with like 15 rows, this one has 3 columns start, stop and flag.

Foreach row in my first Dataset i want to check if the timestamp is between start and stop of a row in the second dataset, if it's i want to enrich the first dataset with the flag from the second. And if a row in my first Dataset isn't between any of the start and stop, i don't want to keep it.



I've done this with a Python recipe using Pandas, but im trying to do it only with DSS recipes.

I tried to do it with a Join recipe with this SQL code :


$left.timestamp>=$right.start AND $left.timestamp<=$right.stop;

but it's not working..

I found one way but it takes too much disk space and time: Do a cross join between my two Datasets and then do a filter with the value of start and stop

Can you guys suggest me a good way to do this ?

Best Answer

  • Liev
    Liev Dataiker Alumni Posts: 176 ✭✭✭✭✭✭✭✭
    edited July 18 Answer ✓

    Hi Fragan,

    Please use the following syntax on your join clause


    "Test_dataset"."timestamp">="testa"."start" AND "Test_dataset"."timestamp"<="testa"."stop"

    mind the quotes around the field names.

Setup Info
    Tags
      Help me…