Join recipe by date issue
I have datasets with timestamp columns in CET/CEST and I parse those to UTC. So far, so good. But when I try to join those datasets by date, using those parsed date, I get unexpected matches.
Here is a minimal (non)-working example.
Dataset A: date = [2020-10-25T02:00:00+02:00, 2020-10-25T02:00:00+01:00], value = [1.0, 2.0]. Dataset B identical to A. Parsing the dates to UTC yields date_UTC = [2020-10-25T00:00:00.000Z, 2020-10-25T01:00:00.000Z]. Joining both datasets by date_UTC yields four rows, all of them having the same value for date_UTC: 2020-10-25T01:00:00.000Z.
Has anyone else experienced this?
Thanks in advance,
Romeo
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,063 Neuron
OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,063 Neuron
Can you please show screen shots of your dataset date, join condition and result dataset?
-
Sure thing.
(prefix set in the join recipe, under "Selected columns")
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,063 Neuron
I can't reproduce your issue. I get only 2 rows back as expected. What version of Dataiku are you using and where are these datasets stored on? (what technology).
-
I am running the Windows version (I am aware that it is an experimental version, but I don't have an alternative at the moment), release 12.5.0. I created the initial datasets with the "Editable" feature and they are subsequently store on "Server's Filesystem/filesystem_managed" (the only option available with my current setup). I also encountered the issue with datasets imported from CSV files (though the "upload your files" functionality).
I am unable to reproduce the issue when using the 14-day free trial (which is on version 12.4.2).
-
I completely agree with you. I'll talk with my colleagues and hopefully we'll be able to migrate to a proper database in the near future.
For now, I parse the dates to align them (because of course, not all datasets are in CET/CEST, that would be too easy) and then change their type back to string.
Anyway, thanks a lot for your time and help, I'll mark this as resolved.