Sync Data - Delta Load Automation
Hi All,
I have a source table in SQL Server and a destination table in Hadoop. I would like to perform only delta loads into the destination table, based on a timestamp. Does the Sync recipe have this option?
Thanks
Best Answer
-
Hi,
To partition your SQL dataset in a way DSS can recognize, you will need a column that follows exactly this date format: yyyy-MM-dd(-hh).
If you do not have such a column, you will need a preliminary processing step to add it. You can use a Prepare recipe with the "Format date with custom format" step.
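For reference, here is a minimal sketch of an equivalent conversion done directly in SQL Server instead of a Prepare recipe (the table and column names are placeholders, not from your setup):

    -- Add a DSS-compatible partition column (yyyy-MM-dd) from a datetime column.
    -- Style 120 produces 'yyyy-mm-dd hh:mi:ss'; keeping 10 characters leaves the date part.
    SELECT
        t.*,
        CONVERT(varchar(10), t.my_timestamp, 120) AS partition_day
    FROM my_source_table t;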
Hope it helps,
Alex
Answers
-
Hi,
A similar behaviour to "delta loading" can be achieved through "partitioning". You can find more on this page: https://doc.dataiku.com/dss/latest/partitions/index.html.
In a nutshell, you would define date partitioning on your SQL dataset, and use a Sync recipe to transfer the partition of "today" to the Hadoop dataset.
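To make the mechanism concrete: when DSS builds a single partition of a date-partitioned SQL dataset, the read boils down to a filtered query, roughly like this sketch (the exact query DSS generates may differ, and the names are placeholders):

    -- Reading only the "2019-12-02" partition of a date-partitioned SQL dataset
    SELECT *
    FROM my_source_table
    WHERE partition_day = '2019-12-02';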
Hope it helps,
Alex
-
Thanks Alex for your reply. How do I automate the "today" partition? Assume the source and destination tables are already partitioned.
-
Hi,
In a scenario, you can use a number of keywords as the partition identifier to build, such as "CURRENT_DAY", "PREVIOUS_DAY" or "CURRENT_MONTH". See the screenshot attached to the original post.
Hope it helps,
Alex
-
Thanks Alex. Is there a way to do this if the partition values are discrete?
-
If the partition is discrete, you can use project variables to store the partition to build. Then, use the syntax "${variable}" in a scenario "Build dataset" step to build a specific partition.
Having said that, if you have incremental data and want to perform delta loading, we recommend using date partitioning instead of discrete partitioning.
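As an aside, DSS also expands project variables inside SQL recipes, so a discrete partition value stored in a variable can be referenced like this (a sketch; myVar and the table/column names are placeholders):

    -- DSS substitutes ${myVar} with the variable's value before the query runs
    SELECT *
    FROM my_source_table
    WHERE department = '${myVar}';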
-
Thanks Alex for your support. I would like to go with date partitioning, but I'm facing the following issue:
the date is in this format: yyyy-mm-dd sssss
When I choose partitioning by day, it picks the correct partitions. However, the Sync does not run successfully, because the query it generates is like "SELECT ... WHERE timestamp = 'yyyy-mm-dd'" and returns 0 rows, since the column still contains the time part and never equals a bare date.
-
Hi,
I have done some more testing on my side with the setup you describe.
The best way to do the date format processing in your case is to use a SQL recipe instead of a Prepare recipe. This way, there is no need to redispatch partitions through DSS or to compute an explicit list of partitions: run the SQL recipe with any partition identifier and it will populate the output with all the partitions.
Second, you can use the Sync recipe from this new SQL-partitioned dataset to your HDFS dataset.
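For instance, assuming Sync_Timestamp is a string starting with yyyy-MM-dd (as in your case), the intermediate SQL recipe could be as simple as this sketch (table name is a placeholder):

    -- Keep all columns and add a DSS-compatible partition column (yyyy-MM-dd)
    SELECT
        t.*,
        LEFT(t.Sync_Timestamp, 10) AS partition_day
    FROM my_source_table t;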
Please find attached a video recording of this setup to clarify.
Hope it helps,
Alex
-
Thanks Alex for your great effort. I think I'm going to face an issue with this solution for delta loading.
-
Hi Bader,
Could you detail the issue with that proposed solution?
After more research, I found an alternative solution without the SQL code recipe, but with a Prepare recipe that can be translated to SQL under the hood. That's only one step.
I have tested it on a SQL Server on Azure and it works as expected to sync data from the SQL Server to a filesystem. Please find attached the project export as an example.
Do not hesitate if you have further questions.
Cheers,
Alex
-
Many thanks Alex, I will detail the issue.
1- Let's assume the example source tables below, from a SQL database.
Note: Sync_Timestamp is of string type.
2- I would like first to do a full load into HDFS, partitioned by Sync_Timestamp.
3- I'm expecting a result like this:
"projectName/2019-12-02/data"
"projectName/2019-12-03/data"
"projectName/2019-05-02/data"
4- Once everything is fine, I would like to do the delta load.
Table 1
id   name    Sync_Timestamp
1    Name1   2019-12-02 11:01:10.076543
2    Name2   2019-12-03 11:01:10.076543
3    Name3   2019-05-02 11:01:10.076543
4    Name4   2019-12-02 11:01:10.076543
5    Name5   2019-07-02 11:01:10.076543

Table 2
ID   Department   Sync_Timestamp
1    Dep1         12/24/2019 18:59:20.993939
2    Dep2         03/05/2019 18:59:20.993469
3    Dep3         12/24/2019 18:59:20.993439
4    Dep4         03/05/2019 18:59:20.993939
5    Dep5         12/20/2019 18:59:20.998889

Table 3
ID   Location   Sync_Timestamp
1    Loc1       20191029141148
2    Loc2       20181129141148
3    Loc3       20191029141148
4    Loc4       20181129141148
5    Loc5       20171029141148
-
Cont.: just to let you know, my timestamps are of string type.
-
Hi,
For the partitioning in DSS to work, you will need the timestamp either as a date type or as a "DSS-partition" string (yyyy-MM-dd). So you will need an intermediate step to convert your string timestamp into one of these; this step can be a SQL recipe or a Prepare recipe.
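To illustrate on your three sample tables, a SQL Server recipe could normalize each string format to yyyy-MM-dd along these lines (a sketch; Table1/Table2/Table3 stand in for your actual table names):

    -- Table 1: '2019-12-02 11:01:10.076543' already starts with yyyy-MM-dd
    SELECT t.*, LEFT(t.Sync_Timestamp, 10) AS partition_day
    FROM Table1 t;

    -- Table 2: '12/24/2019 18:59:20.993939' has an MM/dd/yyyy prefix (style 101)
    SELECT t.*,
           CONVERT(varchar(10), CONVERT(date, LEFT(t.Sync_Timestamp, 10), 101), 120) AS partition_day
    FROM Table2 t;

    -- Table 3: '20191029141148' has a yyyyMMdd prefix (style 112)
    SELECT t.*,
           CONVERT(varchar(10), CONVERT(date, LEFT(t.Sync_Timestamp, 8), 112), 120) AS partition_day
    FROM Table3 t;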
Regarding your requirement to do a full load and then delta loads, this can be done with two scenarios:
1. "Full rebuild" scenario, with the following steps:
- compute the list of all available partitions in your partitioned SQL dataset (see the sketch after this list),
- store it in a project variable (call it "myVar" for instance),
- build the output HDFS dataset for the partition ${myVar}.
2. "Incremental build" scenario: build the output HDFS dataset for the partition CURRENT_DAY.
I hope that's clear enough. Let me know if you have further questions, happy to detail each step and provide examples.
Cheers,
Alex