write_with_schema exists, how can I write only if the data is new?
Hi,
I have a Python recipe that creates datasets I want to write to two separate Snowflake tables. I would like to only write to the snowflake tables if the data is new. So, I need to be able to check if new data is in the table and then write if it is not.
thx
Operating system used: Windows 10
Best Answer
-
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
There is no way to really compare the data during the "write_with_schema" call. So you need to do this before.
If you can use partitions and build a new partition e.g Hourly if your data has a timestamp that you can use as a partition and you know that data is always new.
If not you can trigger a scenario based on dataset change or SQL query, these triggers may not be available depending on your license type.
Another approach is using metrics on your input dataset eg. count number of rows if a number of rows > the previous count then run the scenario
For the below to work, you would need to compute metrics on the dataset from the scenario.
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import json
client = dataiku.api_client()
project = client.get_default_project()
dataset = project.get_dataset("dataset_name")
historical_metrics_previous = dataset.get_metric_history('records:COUNT_RECORDS')['values'][1]
historical_metrics_current = dataset.get_metric_history('records:COUNT_RECORDS')['values'][0]
# add logic / conditions here
print(historical_metrics_previous)
print(historical_metrics_current)Let me know if that helps!