write_with_schema exists, how can I write only if the data is new?

info-rchitect Registered Posts: 169 ✭✭✭✭✭✭


I have a Python recipe that creates datasets I want to write to two separate Snowflake tables. I would like to only write to the snowflake tables if the data is new. So, I need to be able to check if new data is in the table and then write if it is not.


Operating system used: Windows 10

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓


    There is no way to really compare the data during the "write_with_schema" call. So you need to do this before.

    If you can use partitions and build a new partition e.g Hourly if your data has a timestamp that you can use as a partition and you know that data is always new.

    If not you can trigger a scenario based on dataset change or SQL query, these triggers may not be available depending on your license type.

    Another approach is using metrics on your input dataset eg. count number of rows if a number of rows > the previous count then run the scenario

    For the below to work, you would need to compute metrics on the dataset from the scenario.

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import json

    client = dataiku.api_client()
    project = client.get_default_project()

    dataset = project.get_dataset("dataset_name")

    historical_metrics_previous = dataset.get_metric_history('records:COUNT_RECORDS')['values'][1]
    historical_metrics_current = dataset.get_metric_history('records:COUNT_RECORDS')['values'][0]

    # add logic / conditions here

    Let me know if that helps!

Setup Info
      Help me…