
write_with_schema exists, how can I write only if the data is new?

Solved!
info-rchitect
Level 3

Hi,

I have a Python recipe that creates datasets I want to write to two separate Snowflake tables. I would like to write to the Snowflake tables only when the data is new, so I need a way to check whether the data is already in the table and write only if it is not.

Thanks


Operating system used: Windows 10

1 Solution
AlexT
Dataiker

Hi,

There is no way to compare the existing data during the write_with_schema() call itself, so you need to do this check before you write.

If your data has a timestamp, you can partition the dataset on it and build only a new partition (e.g., hourly), since you know that partition's data is always new.

If not, you can trigger a scenario based on a dataset change or a SQL query; note that these triggers may not be available depending on your license type.

Another approach is to use metrics on your input dataset, e.g., count the number of rows, and run the scenario only if the current count is greater than the previous count.

For the code below to work, you would need to compute metrics on the dataset from the scenario.
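For example, one way to do the check inside the recipe itself is to anti-join the incoming rows against the existing table on a key column before writing (a minimal sketch with plain pandas; the `id` column and dataframes are hypothetical):

```python
import pandas as pd

def keep_new_rows(new_df, existing_df, key):
    """Return only the rows of new_df whose key is not already in existing_df."""
    return new_df[~new_df[key].isin(existing_df[key])]

# hypothetical data: in a real recipe these would come from get_dataframe()
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
incoming = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

new_rows = keep_new_rows(incoming, existing, "id")
# only the id=3 row is new; call write_with_schema only if new_rows is non-empty
```

This keeps write_with_schema's overwrite semantics intact: you simply skip the write when `new_rows` is empty.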

import dataiku

client = dataiku.api_client()
project = client.get_default_project()

dataset = project.get_dataset("dataset_name")

# The metric history is a list of computed values; as indexed here,
# values[0] is the most recent computation and values[1] the one before it.
history = dataset.get_metric_history('records:COUNT_RECORDS')
historical_metrics_previous = history['values'][1]
historical_metrics_current = history['values'][0]

# add logic / conditions here
print(historical_metrics_previous)
print(historical_metrics_current)
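The comparison itself can be a small pure-Python check on the two counts (a sketch; the function name is hypothetical, and your scenario step would run the write only when it returns True):

```python
def has_new_data(current_count, previous_count):
    """True when the current row count is greater than the previous one,
    or when there is no previous computation to compare against."""
    return current_count is not None and (
        previous_count is None or current_count > previous_count
    )

print(has_new_data(120, 100))  # True: the row count grew, so there is new data
print(has_new_data(100, 100))  # False: nothing new, skip the write
```

A plain greater-than check assumes rows are only ever appended; if rows can also be deleted, compare a checksum or max-of-timestamp metric instead of a raw count.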

Let me know if that helps!
