Identical timestamp on all rows

johntarr
Level 2
I've got a prep recipe that appends to its output dataset.

I want to add a timestamp to the records each time new rows are appended.

I've tried using now(), but that results in a slightly different timestamp on each row instead of the same timestamp for all records. 

Does anyone know how to get the exact same timestamp on all rows?

6 Replies
Ignacio_Toledo

Hi @johntarr,

I can reproduce the behavior you are seeing in DSS 10.0.6 when using a "prepare" recipe with the local stream processor. This is not the behavior one would expect!

Apparently, "now" is calculated for each row individually, and we are seeing the delay between the calculations for successive rows or batches of rows.

I was wondering if this kind of behavior could be replicated in other cases, like when using Python and pandas... and voilà, you can also see it in some cases. Doing this:

 

df['now'] = pd.Timestamp.now()

 

produces the behavior one usually expects, where a new column is added to the dataframe with a constant value for "now", as one can verify with "df.now.unique()":

 

array(['2022-06-10T19:07:34.906616000'], dtype='datetime64[ns]')

 

However, using this other method:

 

df['now'] = df.apply(lambda x: pd.Timestamp.now(), axis=1)

 

behaves in the same way as the visual recipe in Dataiku! And when checking with unique, I get a different timestamp for each row.

 

array(['2022-06-10T19:09:53.711466000', '2022-06-10T19:09:53.711627000',
       '2022-06-10T19:09:53.711638000', ...,
       '2022-06-10T19:09:54.276323000', '2022-06-10T19:09:54.276329000',
       '2022-06-10T19:09:54.276335000'], dtype='datetime64[ns]')

 

 

As a "coder", this is something that perhaps one should know, especially if nanoseconds are your thing, and it is kind of obvious once you understand how ".apply" works: the function is called once per row. But for a "visual" user, this is not at all the expected behavior.
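
On the pandas side, a minimal sketch of how to avoid the per-row drift: capture the timestamp once, before any row is processed, and reuse that value. The function name stamp_rows and the column name load_ts are made up for illustration; this only covers pandas and is not a workaround for the visual Prepare recipe.

```python
import pandas as pd


def stamp_rows(df, column="load_ts"):
    """Return a copy of df in which every row carries the same timestamp.

    `stamp_rows` and `load_ts` are hypothetical names used for illustration.
    """
    # Evaluate the clock exactly once, before touching any row.
    run_ts = pd.Timestamp.now()
    out = df.copy()
    # Even with a row-wise apply, the lambda only returns the captured value;
    # it never calls pd.Timestamp.now() again.
    out[column] = out.apply(lambda row: run_ts, axis=1)
    return out


df = pd.DataFrame({"id": range(1000)})
stamped = stamp_rows(df)
print(stamped["load_ts"].nunique())
```

The apply call is kept only to show that the row-wise loop itself is harmless once the clock is read outside of it; a plain scalar assignment (out[column] = run_ts) gives the same result.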

Perhaps this is a "bug" to report?

Hope this helps, even though I can't provide you a workaround using the Prepare recipe.

Nicolas_Servel
Dataiker

Edited: I updated this answer after checking which version introduced the proposed solution (DSS 10.0.4).

Hello John,

As Ignacio mentioned, the DSS local stream engine processes rows one by one and calls the "now()" function for each row, hence giving a slightly different value each time.

What you want is to retrieve a piece of global information, i.e. the build date of your dataset. Since DSS 10.0.4, it is accessible through the "Enrich record with build information" step. You can specify a "Build date column", which will hold a single value per run, corresponding to the build date.

Then you can easily extract the timestamp from this date column.

 

Hope this helps,

Best,

Nicolas Servel

 

PS: if you were to run your prepare recipe with the SQL engine (meaning that your input and output are SQL datasets and all the steps are SQL-compatible), your solution with "now()" would work, because in that case "now()" is evaluated only once in the SQL query.

Ignacio_Toledo

Thanks for the proposed solution, @Nicolas_Servel. And for the tip about using a SQL engine instead... I was going to test it myself, but you provided the answer first 🙂

johntarr
Level 2
Author

Appreciate the info on the enrich build info, but the SQL piece appears to be incorrect. I was getting different results for now(), even with the SQL engine. 

Ignacio_Toledo

Interesting! Maybe it depends on what SQL database is being used (PostgreSQL, Oracle, MySQL, etc.)?
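
For what it's worth, the documented semantics do differ between databases: PostgreSQL's now() returns the start time of the current transaction (clock_timestamp() is the one that keeps moving during a statement), while MySQL's NOW() is fixed at the start of the statement and SYSDATE() returns the time at which it actually executes. A database-independent way to sidestep the question is to compute the timestamp once on the client and pass it in as a bound parameter. Below is a rough sketch using Python's built-in sqlite3 module; the table names staging and target, the column load_ts and the function append_with_batch_ts are all made up for illustration, and this is not the SQL that DSS generates.

```python
import sqlite3
from datetime import datetime, timezone


def append_with_batch_ts(conn, batch_ts=None):
    """Append every row of `staging` to `target` with one shared timestamp.

    Table and column names are hypothetical, chosen for this example only.
    """
    if batch_ts is None:
        # Read the clock once on the client, not once per row in the database.
        batch_ts = datetime.now(timezone.utc).isoformat()
    # The bound parameter is a constant for the whole INSERT ... SELECT.
    conn.execute(
        "INSERT INTO target (id, load_ts) SELECT id, ? FROM staging",
        (batch_ts,),
    )
    conn.commit()
    return batch_ts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER)")
conn.execute("CREATE TABLE target (id INTEGER, load_ts TEXT)")
conn.executemany("INSERT INTO staging (id) VALUES (?)", [(i,) for i in range(500)])

ts = append_with_batch_ts(conn)
distinct = conn.execute("SELECT COUNT(DISTINCT load_ts) FROM target").fetchone()[0]
print(distinct)
```

Because the value is a literal by the time the database sees it, the result no longer depends on how a given engine evaluates its own "now" function.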

Nicolas_Servel
Dataiker

Hello, would you be able to share the SQL code generated for your prepare recipe so that I can verify on my side what is going on?

You can find it by clicking on "View query" above the run button of the prepare recipe.

 

Best,

Nicolas
