The Dataiku Frontrunner Awards have just launched to recognize your achievements! Submit Your Entry

Keep only rows that has timestamp between start and stop columns from another dataset

Solved!
Fragan
Level 1
Keep only rows that has timestamp between start and stop columns from another dataset

Hi, im kinda new to DSS.



I have a huge Dataset (with 20M+ rows), this dataset has a timestamp column, and 20 other columns.



I have another Dataset with like 15 rows, this one has 3 columns start, stop and flag.



Foreach row in my first Dataset i want to check if the timestamp is between start and stop of a row in the second dataset, if it's i want to enrich the first dataset with the flag from the second. And if a row in my first Dataset isn't between any of the start and stop, i don't want to keep it.



I've done this with a Python recipe using Pandas, but im trying to do it only with DSS recipes.



I tried to do it with a Join recipe with this SQL code : 




$left.timestamp>=$right.start AND $left.timestamp<=$right.stop;


but it's not working..



I found one way but it takes too much disk space and time:  Do a cross join between my two Datasets and then do a filter with the value of start and stop



 



Can you guys suggest me a good way to do this ?

0 Kudos
1 Solution
Liev
Dataiker
Dataiker

Hi Fragan, 



Please use the following syntax on your join clause




"Test_dataset"."timestamp">="testa"."start" AND "Test_dataset"."timestamp"<="testa"."stop"


mind the quotes around the field names.



 

View solution in original post

0 Kudos
1 Reply
Liev
Dataiker
Dataiker

Hi Fragan, 



Please use the following syntax on your join clause




"Test_dataset"."timestamp">="testa"."start" AND "Test_dataset"."timestamp"<="testa"."stop"


mind the quotes around the field names.



 

View solution in original post

0 Kudos
Labels (2)
A banner prompting to get Dataiku DSS