Sync Data - Delat load Automation

Solved!
Bader
Level 3
Sync Data - Delat load Automation

Hi All, 

 

I have source table in SQL server and destination table in Hadoop. I would like only perform the delta load into destination table based on Timestamp. Does Sync has this option ? 

Thanks   

0 Kudos
1 Solution
Alex_Combessie
Dataiker Alumni

Hi,

To partition your SQL dataset in the way DSS can recognize, you will need a column which follows exactly this date format: yyyy-MM-dd(-hh).

If you do not have such a column, you will need preliminary processing to add it. You can use a Prepare recipe with the "Format date with custom format" step.

Hope it helps,

Alex

View solution in original post

13 Replies
Alex_Combessie
Dataiker Alumni

Hi,

A similar behaviour to "delta loading" can be achieved through "partitioning". You can find more on this page: https://doc.dataiku.com/dss/latest/partitions/index.html.

In a nutshell, you would define date partitioning on your SQL dataset, and use a Sync recipe to transfer the partition of "today" to the Hadoop dataset.

Hope it helps,

Alex

0 Kudos
Bader
Level 3
Author

Thanks Alex for your reply, How I automate "Today" partition ?  Assuming source and destination table ( already partitioned)  is done.

0 Kudos
Alex_Combessie
Dataiker Alumni

Hi,

In a Scenario, you can use a number of keywords to build the partition for the "CURRENT_DAY", "PREVIOUS_DAY" or "CURRENT_MONTH". See screenshot below:

Screenshot 2020-03-14 at 20.31.56.png

Hope it helps,

Alex

0 Kudos
Bader
Level 3
Author

Thanks Alex. is there any way if the partition value is discrete ?   

0 Kudos
Alex_Combessie
Dataiker Alumni

If the partition is discrete, you can use project variables to store the partition to build. Then, use the syntax "${variable}" in a scenario "Build dataset" step to build a specific partition.

Having said that, if you have incremental data and want to perform delta loading, we recommend using date partitioning instead of discrete partitioning.

0 Kudos
Bader
Level 3
Author

Thanks Alex, for your support. I would like to go on date partitioning but I'm facing the following issue :  

the date is this format : yyyy-mm-dd sssss

When i choose the partition by day, its picking the correct partitioning. However,  the sync doesn't sync successfully because the query its doing like this "select etc where timestamp = yyyy-mm--dd" and returning 0 rows  

0 Kudos
Alex_Combessie
Dataiker Alumni

Hi,

To partition your SQL dataset in the way DSS can recognize, you will need a column which follows exactly this date format: yyyy-MM-dd(-hh).

If you do not have such a column, you will need preliminary processing to add it. You can use a Prepare recipe with the "Format date with custom format" step.

Hope it helps,

Alex

Alex_Combessie
Dataiker Alumni

Hi,

I have done some more testing on my side with the setup you describe.

The best way to do the date format processing in your case is to use a SQL recipe instead of a Prepare recipe. This way, no need to redispatch partitions through DSS or to compute an explicit list of partitions. Just use the SQL recipe with any partition identifier and it will populate the output with all the partitions.

Second, you can use the Sync recipe from this new SQL-partitioned dataset to your HDFS dataset.

Please find a video recording of this setup to clarify:

Hope it helps,

Alex

0 Kudos
Bader
Level 3
Author

Thanks Alex for your great effort. I think i'm going to face an issue with this solution in Delta Loading 

0 Kudos
Alex_Combessie
Dataiker Alumni

Hi Bader,

Could you detail the issue with that proposed solution?

After more research, I found an alternative solution without the SQL code recipe, but with a Prepare recipe which can be translated to SQL behind the hood. That's only one step ๐Ÿ™‚

I have tested it on a SQL Server on Azure and it works as expected to sync data from the SQL Server to a filesystem. Please find attached the project export as an example.

Do not hesitate if you have further questions.

Cheers,

Alex

0 Kudos
Bader
Level 3
Author

Many thanks Alex, I will details the issue.  

1- Lets assume the example below  source tables  from SQL database

Note: Sync_Timestamp is in String type

2- I would like first to do full load into hdfs with partition by Sync_Timestamp 

3- I'm expecting result like this  " peojectName/ 2019-12-02/ data" 

                                                                  " peojectName/ 2019-12-03/ data"

                                                                   "projectName/2019-05-02/data"

4- Once everything is fine, I would like to do the Delta load 

 

Table 1 

 

id

name

Sync_Timestamp

1

Name1

2019-12-02 11:01:10.076543

2

Name2

2019-12-03 11:01:10.076543

3

Name3

2019-05-02 11:01:10.076543

4

Name4

2019-12-02 11:01:10.076543

5

Name5

2019-07-02 11:01:10.076543

Table 2

ID

Department

Sync_Timestamp

1

Dep1

12/24/2019 18:59:20.993939

2

Dep2

03/05/2019 18:59:20.993469

3

Dep3

12/24/2019 18:59:20.993439

4

Dep4

03/05/2019 18:59:20.993939

5

Dep5

12/20/2019 18:59:20.998889

Table 3

ID

Location

Sync_Timestamp

 

 

 

 

1

Loc1

20191029141148

 

 

 

 

2

Loc2

20181129141148

 

 

 

 

3

Loc3

20191029141148

 

 

 

 

4

Loc4

20181129141148

 

 

 

 

5

Loc5

20171029141148

 

 

 

 

 

0 Kudos
Bader
Level 3
Author

Cont.. just to let you know my timestamp in string type 

0 Kudos
Alex_Combessie
Dataiker Alumni

Hi,

For the partition in DSS to work, you will need the timestamp in date type or in "DSS-partition" string type (YYYY-MM-DD). So you will need an intermediary step to convert your string timestamp into one of these types. This step can be a SQL recipe or a Prepare recipe.

Regarding your requirements to do a full load then a delta load, this can be done with 2 scenarios:

1. "Full rebuild scenario": create the following scenario steps

- compute the list of all available partitions in your partitioned SQL dataset,

- store it in a project variable (call it "myVar" for instance),

- build the output HDFS dataset for the partition ${myVar}

2. "Incremental build scenario": build the output HDFS dataset for the partition CURRENT_DAY

I hope that's clear enough. Let me know if you have further questions, happy to detail each step and provide examples.

Cheers,

Alex

0 Kudos