
write_with_schema exists, how can I write only if the data is new?

Solved!
info-rchitect
Level 3

Hi,

I have a Python recipe that creates datasets I want to write to two separate Snowflake tables. I would like to write to the Snowflake tables only when the data is new, so I need a way to check whether the data is already in the table and write only if it is not.

Thanks


Operating system used: Windows 10

1 Solution
AlexT
Dataiker

Hi,

There is no way to compare the existing data during the write_with_schema() call itself, so you need to do this check before you write.

If your data has a timestamp, you can partition the dataset on it and build only a new partition (e.g., hourly), since you know that partition's data is always new.

If not, you can trigger a scenario based on a dataset change or a SQL query; note that these triggers may not be available depending on your license type.

Another approach is to use metrics on your input dataset, e.g., count the number of rows, and run the scenario only if the current count is greater than the previous count.

For the code below to work, you would need to compute metrics on the dataset from the scenario.
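For example, one way to do the check inside the recipe itself is to anti-join the incoming rows against the existing table on a key column before writing (a minimal sketch with plain pandas; the `id` column and dataframes are hypothetical):

```python
import pandas as pd

def keep_new_rows(new_df, existing_df, key):
    """Return only the rows of new_df whose key is not already in existing_df."""
    return new_df[~new_df[key].isin(existing_df[key])]

# hypothetical data: in a real recipe these would come from get_dataframe()
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
incoming = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

new_rows = keep_new_rows(incoming, existing, "id")
# only the id=3 row is new; call write_with_schema only if new_rows is non-empty
```

This keeps write_with_schema's overwrite semantics intact: you simply skip the write when `new_rows` is empty.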

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

dataset = project.get_dataset("dataset_name")

# The metric history is a list of computed values; as indexed here,
# values[0] is the most recent computation and values[1] the one before it.
history = dataset.get_metric_history('records:COUNT_RECORDS')
historical_metrics_previous = history['values'][1]
historical_metrics_current = history['values'][0]

# add logic / conditions here
print(historical_metrics_previous)
print(historical_metrics_current)
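The comparison itself can be a small pure-Python check on the two counts (a sketch; the function name is hypothetical, and your scenario step would run the write only when it returns True):

```python
def has_new_data(current_count, previous_count):
    """True when the current row count is greater than the previous one,
    or when there is no previous computation to compare against."""
    return current_count is not None and (
        previous_count is None or current_count > previous_count
    )

print(has_new_data(120, 100))  # True: the row count grew, so there is new data
print(has_new_data(100, 100))  # False: nothing new, skip the write
```

A plain greater-than check assumes rows are only ever appended; if rows can also be deleted, compare a checksum or max-of-timestamp metric instead of a raw count.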

Let me know if that helps!
