Join recipe by date issue

RomeoT
RomeoT Registered Posts: 4

I have datasets with timestamp columns in CET/CEST and I parse those to UTC. So far, so good. But when I try to join those datasets by date, using those parsed date, I get unexpected matches.

Here is a minimal (non)-working example.

Dataset A: date = [2020-10-25T02:00:00+02:00, 2020-10-25T02:00:00+01:00], value = [1.0, 2.0]. Dataset B identical to A. Parsing the dates to UTC yields date_UTC = [2020-10-25T00:00:00.000Z, 2020-10-25T01:00:00.000Z]. Joining both datasets by date_UTC yields four rows, all of them having the same value for date_UTC: 2020-10-25T01:00:00.000Z.

Has anyone else experienced this?

Thanks in advance,

Romeo

Best Answer

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron
    Answer ✓

    OK I can reproduce the issue using file system datasets. Looks like a bug to me. But to be honest what you are doing seems pretty edgy: most people will use Dataiku with a database or cloud storage and your joins on date/time columns are not that common. In any case I tested that if you use the original date column (the one defined as a string) for the join it will return the right results. And in your case it doesn't really matter to use a string column since there are no indexes in file system datasets. You should install PostgreSQL in your system so you can have a proper database for Dataiku flows.

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    Can you please show screen shots of your dataset date, join condition and result dataset?

  • RomeoT
    RomeoT Registered Posts: 4

    Sure thing.

    Input datasets (post parsing)

    Join condition

    Resulting dataset

    (prefix set in the join recipe, under "Selected columns")

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,024 Neuron

    I can't reproduce your issue. I get only 2 rows back as expected. What version of Dataiku are you using and where are these datasets stored on? (what technology).

    Capture.PNG

  • RomeoT
    RomeoT Registered Posts: 4

    I am running the Windows version (I am aware that it is an experimental version, but I don't have an alternative at the moment), release 12.5.0. I created the initial datasets with the "Editable" feature and they are subsequently store on "Server's Filesystem/filesystem_managed" (the only option available with my current setup). I also encountered the issue with datasets imported from CSV files (though the "upload your files" functionality).

    I am unable to reproduce the issue when using the 14-day free trial (which is on version 12.4.2).

  • RomeoT
    RomeoT Registered Posts: 4

    I completely agree with you. I'll talk with my colleagues and hopefully we'll be able to migrate to a proper database in the near future.

    For now, I parse the dates to align them (because of course, not all datasets are in CET/CEST, that would be too easy) and then change their type back to string.

    Anyway, thanks a lot for your time and help, I'll mark this as resolved.

Setup Info
    Tags
      Help me…