Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I have datasets with timestamp columns in CET/CEST and I parse those to UTC. So far, so good. But when I try to join those datasets by date, using those parsed date, I get unexpected matches.
Here is a minimal (non)-working example.
Dataset A: date = [2020-10-25T02:00:00+02:00, 2020-10-25T02:00:00+01:00], value = [1.0, 2.0]. Dataset B identical to A. Parsing the dates to UTC yields date_UTC = [2020-10-25T00:00:00.000Z, 2020-10-25T01:00:00.000Z]. Joining both datasets by date_UTC yields four rows, all of them having the same value for date_UTC: 2020-10-25T01:00:00.000Z.
Has anyone else experienced this?
Thanks in advance,
Romeo
OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.
Can you please show screen shots of your dataset date, join condition and result dataset?
Sure thing.
(prefix set in the join recipe, under "Selected columns")
I can't reproduce your issue. I get only 2 rows back as expected. What version of Dataiku are you using and where are these datasets stored on? (what technology).
I am running the Windows version (I am aware that it is an experimental version, but I don't have an alternative at the moment), release 12.5.0. I created the initial datasets with the "Editable" feature and they are subsequently store on "Server's Filesystem/filesystem_managed" (the only option available with my current setup). I also encountered the issue with datasets imported from CSV files (though the "upload your files" functionality).
I am unable to reproduce the issue when using the 14-day free trial (which is on version 12.4.2).
OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.
I completely agree with you. I'll talk with my colleagues and hopefully we'll be able to migrate to a proper database in the near future.
For now, I parse the dates to align them (because of course, not all datasets are in CET/CEST, that would be too easy) and then change their type back to string.
Anyway, thanks a lot for your time and help, I'll mark this as resolved.