write_with_schema exists, how can I write only if the data is new?

info-rchitect
info-rchitect Registered Posts: 189 ✭✭✭✭✭✭

Hi,

I have a Python recipe that creates datasets I want to write to two separate Snowflake tables. I would like to only write to the snowflake tables if the data is new. So, I need to be able to check if new data is in the table and then write if it is not.

thx


Operating system used: Windows 10

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
    Answer ✓

    Hi,

    There is no way to really compare the data during the "write_with_schema" call. So you need to do this before.

    If you can use partitions and build a new partition e.g Hourly if your data has a timestamp that you can use as a partition and you know that data is always new.

    If not you can trigger a scenario based on dataset change or SQL query, these triggers may not be available depending on your license type.

    Another approach is using metrics on your input dataset eg. count number of rows if a number of rows > the previous count then run the scenario

    For the below to work, you would need to compute metrics on the dataset from the scenario.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import json


    client = dataiku.api_client()
    project = client.get_default_project()


    dataset = project.get_dataset("dataset_name")

    historical_metrics_previous = dataset.get_metric_history('records:COUNT_RECORDS')['values'][1]
    historical_metrics_current = dataset.get_metric_history('records:COUNT_RECORDS')['values'][0]

    # add logic / conditions here
    print(historical_metrics_previous)
    print(historical_metrics_current)

    Let me know if that helps!

Setup Info
    Tags
      Help me…