How to get newly appended rows in sync recipe ?
Hello, this is not a huge problem but I'm sure there is a way to do this properly.
I'm syncing rows that I append to an SQL table. The "id" column of this table is auto incremented, and thus before syncing, the corresponding row is set to null (in order for it the have a correct id once in the SQL table).
In the case "id" is the only unique identifier, how do I know rows I've just inserted ?
Do I have to fill the id column manually ??? Is there a simple "hacky" way of doing things (getting the last n rows of the table...) or is there an option for this ?
Thank you
Best Answers
-
Grixis PartnerApplicant, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 74 ✭✭✭✭✭
Hey,
I don't think there is a way to manage it by design with dataiku visual components or something from your 'post-sync' dataset settings.
For me, the solution could be a custom code script for select the max/last id and descending your dataset. to avoid adding a step for this, perhaps you could do an SQL step directly to SYNC your output table.
-
For anyone reading this and having a similar problem, I just made a python recipe as suggested by this comment. It counts the rows before syncing and outputs the last n rows of the SQL table sorted on id. Pretty inefficient I think but it works.
However, you still lose some information, for example for two rows like :
id | date | description (table being talked about here)
101 | 11/02/24 | empty
102 | 11/02/24 | empty
—id | name (this one is only here to give some context)
101 | a
102 | b
It becomes clear that these two ids are interchangeable in the upper table as long as you're coherent.
Thank you all for your answers
Answers
-
LouisDHulst Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Registered, Neuron 2023 Posts: 53 Neuron
Hey @raphaffou ,
Is adding a column in your SQL with the timestamp of the sync possible? That way you can track when data was added. You could do it in a prepare recipe before the Sync, and would have to alter your SQL table.
You can also try using Metrics. If you calculate the max of "id" and set the metric to be computed each time the SQL table is built, you'll have an a lower and upper bound for "id" at each date. It's a bit more hacky but should work.