Sync Data - Delta Load Automation

Bader (Registered, Posts: 46 ✭✭✭✭✭)

Hi All,

I have a source table in SQL Server and a destination table in Hadoop. I would like to perform only a delta load into the destination table, based on a timestamp. Does the Sync recipe have this option?

Thanks

Best Answer

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)
    Answer ✓

    Hi,

    To partition your SQL dataset in a way DSS can recognize, you will need a column that follows exactly this date format: yyyy-MM-dd(-hh).

    If you do not have such a column, you will need preliminary processing to add it. You can use a Prepare recipe with the "Format date with custom format" step.
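    If it helps to make the target format concrete, here is a minimal sketch in Python/pandas of the same conversion (the column name Sync_Timestamp and the sample value are taken from later in this thread; the output column name day_partition is just an example):

        # Sketch: derive a partition column in the exact yyyy-MM-dd format DSS expects,
        # from a string timestamp such as "2019-12-02 11:01:10.076543".
        import pandas as pd

        df = pd.DataFrame({"Sync_Timestamp": ["2019-12-02 11:01:10.076543"]})
        df["day_partition"] = pd.to_datetime(df["Sync_Timestamp"]).dt.strftime("%Y-%m-%d")
        print(df["day_partition"].iloc[0])  # prints: 2019-12-02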

    Hope it helps,

    Alex

Answers

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    Hi,

    A behaviour similar to "delta loading" can be achieved through "partitioning". You can find more on this page: https://doc.dataiku.com/dss/latest/partitions/index.html.

    In a nutshell, you would define date partitioning on your SQL dataset, and use a Sync recipe to transfer the partition of "today" to the Hadoop dataset.

    Hope it helps,

    Alex

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Thanks Alex for your reply. How do I automate the "today" partition? Assume the source and destination tables are already partitioned.

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    Hi,

    In a scenario, you can use a number of keywords to build the partition for "CURRENT_DAY", "PREVIOUS_DAY" or "CURRENT_MONTH". See the screenshot below:

    (Screenshot attachment: Screenshot 2020-03-14 at 20.31.56.png)
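    If you prefer to script this rather than use the visual Build step, a custom Python step in the scenario along the following lines should be equivalent to the CURRENT_DAY keyword (the dataset name is a placeholder):

        # Custom Python step in a DSS scenario: build only today's partition.
        from datetime import date
        from dataiku.scenario import Scenario

        scenario = Scenario()
        today = date.today().strftime("%Y-%m-%d")  # matches the yyyy-MM-dd partition scheme
        scenario.build_dataset("my_hdfs_dataset", partitions=today)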

    Hope it helps,

    Alex

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Thanks Alex. Is there a way to do this if the partition value is discrete?

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    If the partition is discrete, you can use project variables to store the partition to build. Then, use the syntax "${variable}" in a scenario "Build dataset" step to build a specific partition.
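    As a sketch, such a project variable can also be set programmatically, for instance from a Python step that runs before the "Build dataset" step (the project key, variable name and value below are purely illustrative):

        # Store the discrete partition to build in a project variable, then
        # reference it as ${myPartition} in a scenario "Build dataset" step.
        import dataiku

        client = dataiku.api_client()
        project = client.get_project("MY_PROJECT")  # hypothetical project key

        variables = project.get_variables()
        variables["standard"]["myPartition"] = "dept_A"  # hypothetical discrete value
        project.set_variables(variables)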

    Having said that, if you have incremental data and want to perform delta loading, we recommend using date partitioning instead of discrete partitioning.

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Thanks Alex for your support. I would like to go with date partitioning, but I'm facing the following issue:

    the date is in this format: yyyy-mm-dd sssss

    When I choose partitioning by day, it picks the correct partitions. However, the Sync recipe doesn't sync successfully, because the query it runs is of the form "SELECT ... WHERE timestamp = 'yyyy-mm-dd'" and returns 0 rows.

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    Hi,

    I have done some more testing on my side with the setup you describe.

    The best way to do the date format processing in your case is to use a SQL recipe instead of a Prepare recipe. This way, there is no need to redispatch partitions through DSS or to compute an explicit list of partitions. Just use the SQL recipe with any partition identifier, and it will populate the output with all the partitions.

    Second, you can use the Sync recipe from this new SQL-partitioned dataset to your HDFS dataset.

    Please find a video recording of this setup attached, to clarify.

    Hope it helps,

    Alex

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Thanks Alex for your great effort. I think I'm going to face an issue with this solution for delta loading.

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    Hi Bader,

    Could you detail the issue with that proposed solution?

    After more research, I found an alternative solution without the SQL code recipe, using a Prepare recipe which can be translated to SQL under the hood. That is only one step.

    I have tested it on a SQL Server on Azure and it works as expected to sync data from the SQL Server to a filesystem. Please find attached the project export as an example.

    Do not hesitate if you have further questions.

    Cheers,

    Alex

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Many thanks Alex, I will detail the issue.

    1. Let's assume the example source tables below, from a SQL database.

    Note: Sync_Timestamp is of string type.

    2. I would first like to do a full load into HDFS, partitioned by Sync_Timestamp.

    3. I'm expecting a result like this:

    "projectName/2019-12-02/data"

    "projectName/2019-12-03/data"

    "projectName/2019-05-02/data"

    4. Once everything is fine, I would like to do the delta load.

    Table 1

    id | name  | Sync_Timestamp
    ---+-------+---------------------------
    1  | Name1 | 2019-12-02 11:01:10.076543
    2  | Name2 | 2019-12-03 11:01:10.076543
    3  | Name3 | 2019-05-02 11:01:10.076543
    4  | Name4 | 2019-12-02 11:01:10.076543
    5  | Name5 | 2019-07-02 11:01:10.076543

    Table 2

    ID | Department | Sync_Timestamp
    ---+------------+---------------------------
    1  | Dep1       | 12/24/2019 18:59:20.993939
    2  | Dep2       | 03/05/2019 18:59:20.993469
    3  | Dep3       | 12/24/2019 18:59:20.993439
    4  | Dep4       | 03/05/2019 18:59:20.993939
    5  | Dep5       | 12/20/2019 18:59:20.998889

    Table 3

    ID | Location | Sync_Timestamp
    ---+----------+---------------
    1  | Loc1     | 20191029141148
    2  | Loc2     | 20181129141148
    3  | Loc3     | 20191029141148
    4  | Loc4     | 20181129141148
    5  | Loc5     | 20171029141148

  • Bader (Registered, Posts: 46 ✭✭✭✭✭)

    Continued: just to let you know, my timestamp is of string type.

  • Alex_Combessie (Alpha Tester, Dataiker Alumni, Posts: 539 ✭✭✭✭✭✭✭✭✭)

    Hi,

    For the partitioning in DSS to work, you will need the timestamp in date type or in the "DSS-partition" string type (yyyy-MM-dd). So you will need an intermediary step to convert your string timestamp into one of these types. This step can be a SQL recipe or a Prepare recipe.

    Regarding your requirement to do a full load and then a delta load, this can be done with 2 scenarios:

    1. "Full rebuild scenario": create the following scenario steps

    - compute the list of all available partitions in your partitioned SQL dataset,

    - store it in a project variable (call it "myVar" for instance),

    - build the output HDFS dataset for the partition ${myVar}

    2. "Incremental build scenario": build the output HDFS dataset for the partition CURRENT_DAY

    I hope that's clear enough. Let me know if you have further questions, happy to detail each step and provide examples.

    Cheers,

    Alex
