Join recipe by date issue

Solved!
RomeoT
Level 1
Join recipe by date issue

I have datasets with timestamp columns in CET/CEST and I parse those to UTC. So far, so good. But when I try to join those datasets by date, using those parsed date, I get unexpected matches.

Here is a minimal (non)-working example.

Dataset A: date = [2020-10-25T02:00:00+02:00, 2020-10-25T02:00:00+01:00], value = [1.0, 2.0]. Dataset B identical to A. Parsing the dates to UTC yields date_UTC = [2020-10-25T00:00:00.000Z, 2020-10-25T01:00:00.000Z]. Joining both datasets by date_UTC yields four rows, all of them having the same value for date_UTC: 2020-10-25T01:00:00.000Z.

Has anyone else experienced this?

Thanks in advance,

Romeo

0 Kudos
1 Solution
Turribeach

OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.

View solution in original post

0 Kudos
6 Replies
Turribeach

Can you please show screen shots of your dataset date, join condition and result dataset?

0 Kudos
RomeoT
Level 1
Author

Sure thing.

Input datasets (post parsing)Input datasets (post parsing)

Join conditionJoin condition

Resulting datasetResulting dataset

(prefix set in the join recipe, under "Selected columns")

0 Kudos
Turribeach

I can't reproduce your issue. I get only 2 rows back as expected. What version of Dataiku are you using and where are these datasets stored on? (what technology). 

Capture.PNG

0 Kudos
RomeoT
Level 1
Author

I am running the Windows version (I am aware that it is an experimental version, but I don't have an alternative at the moment), release 12.5.0. I created the initial datasets with the "Editable" feature and they are subsequently store on "Server's Filesystem/filesystem_managed" (the only option available with my current setup). I also encountered the issue with datasets imported from CSV files (though the "upload your files" functionality).

I am unable to reproduce the issue when using the 14-day free trial (which is on version 12.4.2).

0 Kudos
Turribeach

OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.

0 Kudos
RomeoT
Level 1
Author

I completely agree with you. I'll talk with my colleagues and hopefully we'll be able to migrate to a proper database in the near future.

For now, I parse the dates to align them (because of course, not all datasets are in CET/CEST, that would be too easy) and then change their type back to string.

Anyway, thanks a lot for your time and help, I'll mark this as resolved.

0 Kudos