Sync Data - Delta Load Automation
Hi All,
I have a source table in SQL Server and a destination table in Hadoop. I would like to perform only delta loads into the destination table, based on a timestamp. Does the Sync recipe have this option?
Thanks
Best Answer
-
Hi,
To partition your SQL dataset in a way DSS can recognize, you will need a column that follows exactly this date format: yyyy-MM-dd(-hh).
If you do not have such a column, you will need a preliminary processing step to add it. You can use a Prepare recipe with the "Format date with custom format" step.
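For reference, here is a minimal sketch of an equivalent conversion done directly in SQL Server instead of a Prepare recipe (the table and column names are placeholders, not from your setup):

    -- Add a DSS-compatible partition column (yyyy-MM-dd) from a datetime column.
    -- Style 120 produces 'yyyy-mm-dd hh:mi:ss'; keeping 10 characters leaves the date part.
    SELECT
        t.*,
        CONVERT(varchar(10), t.my_timestamp, 120) AS partition_day
    FROM my_source_table t;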
Hope it helps,
Alex
Answers
-
Hi,
A similar behaviour to "delta loading" can be achieved through "partitioning". You can find more on this page: https://doc.dataiku.com/dss/latest/partitions/index.html.
In a nutshell, you would define date partitioning on your SQL dataset, and use a Sync recipe to transfer the partition of "today" to the Hadoop dataset.
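To make the mechanism concrete: when DSS builds a single partition of a date-partitioned SQL dataset, the read boils down to a filtered query, roughly like this sketch (the exact query DSS generates may differ, and the names are placeholders):

    -- Reading only the "2019-12-02" partition of a date-partitioned SQL dataset
    SELECT *
    FROM my_source_table
    WHERE partition_day = '2019-12-02';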
Hope it helps,
Alex
-
Thanks Alex for your reply. How do I automate the "today" partition? Assume the source and destination tables are already partitioned.
-
Hi,
In a scenario, you can use a number of keywords as the partition identifier to build, such as "CURRENT_DAY", "PREVIOUS_DAY" or "CURRENT_MONTH". See the screenshot attached to the original post.
Hope it helps,
Alex
-
Thanks Alex. Is there a way to do this if the partition values are discrete?
-
If the partition is discrete, you can use project variables to store the partition to build. Then, use the syntax "${variable}" in a scenario "Build dataset" step to build a specific partition.
Having said that, if you have incremental data and want to perform delta loading, we recommend using date partitioning instead of discrete partitioning.
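As an aside, DSS also expands project variables inside SQL recipes, so a discrete partition value stored in a variable can be referenced like this (a sketch; myVar and the table/column names are placeholders):

    -- DSS substitutes ${myVar} with the variable's value before the query runs
    SELECT *
    FROM my_source_table
    WHERE department = '${myVar}';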
-
Thanks Alex for your support. I would like to go with date partitioning, but I'm facing the following issue:
the date is in this format: yyyy-mm-dd sssss
When I choose partitioning by day, it picks the correct partitions. However, the Sync does not run successfully, because the query it generates is like "SELECT ... WHERE timestamp = 'yyyy-mm-dd'" and returns 0 rows, since the column still contains the time part and never equals a bare date.
-
Hi,
I have done some more testing on my side with the setup you describe.
The best way to do the date format processing in your case is to use a SQL recipe instead of a Prepare recipe. This way, there is no need to redispatch partitions through DSS or to compute an explicit list of partitions: run the SQL recipe with any partition identifier and it will populate the output with all the partitions.
Second, you can use the Sync recipe from this new SQL-partitioned dataset to your HDFS dataset.
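For instance, assuming Sync_Timestamp is a string starting with yyyy-MM-dd (as in your case), the intermediate SQL recipe could be as simple as this sketch (table name is a placeholder):

    -- Keep all columns and add a DSS-compatible partition column (yyyy-MM-dd)
    SELECT
        t.*,
        LEFT(t.Sync_Timestamp, 10) AS partition_day
    FROM my_source_table t;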
Please find attached a video recording of this setup to clarify.
Hope it helps,
Alex
-
Thanks Alex for your great effort. I think I'm going to face an issue with this solution for delta loading.
-
Hi Bader,
Could you detail the issue with that proposed solution?
After more research, I found an alternative solution without the SQL code recipe, but with a Prepare recipe that can be translated to SQL under the hood. That's only one step.
I have tested it on a SQL Server on Azure and it works as expected to sync data from the SQL Server to a filesystem. Please find attached the project export as an example.
Do not hesitate if you have further questions.
Cheers,
Alex
-
Many thanks Alex, I will detail the issue.
1- Let's assume the example source tables below, from a SQL database.
Note: Sync_Timestamp is of string type.
2- I would like first to do a full load into HDFS, partitioned by Sync_Timestamp.
3- I'm expecting a result like this:
"projectName/2019-12-02/data"
"projectName/2019-12-03/data"
"projectName/2019-05-02/data"
4- Once everything is fine, I would like to do the delta load.
Table 1
id   name    Sync_Timestamp
1    Name1   2019-12-02 11:01:10.076543
2    Name2   2019-12-03 11:01:10.076543
3    Name3   2019-05-02 11:01:10.076543
4    Name4   2019-12-02 11:01:10.076543
5    Name5   2019-07-02 11:01:10.076543

Table 2
ID   Department   Sync_Timestamp
1    Dep1         12/24/2019 18:59:20.993939
2    Dep2         03/05/2019 18:59:20.993469
3    Dep3         12/24/2019 18:59:20.993439
4    Dep4         03/05/2019 18:59:20.993939
5    Dep5         12/20/2019 18:59:20.998889

Table 3
ID   Location   Sync_Timestamp
1    Loc1       20191029141148
2    Loc2       20181129141148
3    Loc3       20191029141148
4    Loc4       20181129141148
5    Loc5       20171029141148
-
Cont.: just to let you know, my timestamps are of string type.
-
Hi,
For the partitioning in DSS to work, you will need the timestamp either as a date type or as a "DSS-partition" string (yyyy-MM-dd). So you will need an intermediate step to convert your string timestamp into one of these; this step can be a SQL recipe or a Prepare recipe.
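To illustrate on your three sample tables, a SQL Server recipe could normalize each string format to yyyy-MM-dd along these lines (a sketch; Table1/Table2/Table3 stand in for your actual table names):

    -- Table 1: '2019-12-02 11:01:10.076543' already starts with yyyy-MM-dd
    SELECT t.*, LEFT(t.Sync_Timestamp, 10) AS partition_day
    FROM Table1 t;

    -- Table 2: '12/24/2019 18:59:20.993939' has an MM/dd/yyyy prefix (style 101)
    SELECT t.*,
           CONVERT(varchar(10), CONVERT(date, LEFT(t.Sync_Timestamp, 10), 101), 120) AS partition_day
    FROM Table2 t;

    -- Table 3: '20191029141148' has a yyyyMMdd prefix (style 112)
    SELECT t.*,
           CONVERT(varchar(10), CONVERT(date, LEFT(t.Sync_Timestamp, 8), 112), 120) AS partition_day
    FROM Table3 t;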
Regarding your requirement to do a full load and then delta loads, this can be done with two scenarios:
1. "Full rebuild" scenario, with the following steps:
- compute the list of all available partitions in your partitioned SQL dataset (see the sketch after this list),
- store it in a project variable (call it "myVar" for instance),
- build the output HDFS dataset for the partition ${myVar}.
2. "Incremental build" scenario: build the output HDFS dataset for the partition CURRENT_DAY.
I hope that's clear enough. Let me know if you have further questions, happy to detail each step and provide examples.
Cheers,
Alex