Data auto refresh

unkxowx Registered Posts: 19

Hey,

I have a flow with multiple source tables imported from Snowflake. The flow contains a lot of recipes (with outputs stored in filesystem_managed) that originate from these source tables, and some of the downstream datasets have metrics and checks in place (auto-compute on dataset build) that are used on a dashboard in Dataiku.

I am trying to find the most feasible way of doing a data refresh in Dataiku and have tried two ways.

First Way:
Here I use recipes directly on my source tables. I do not create a 'sync' recipe first and then run the other recipes on the synced copy.

I have a scenario - named refresh - that has 2 steps:
1) Build the source datasets only
2) Build the downstream flow from these source datasets
This scenario is exposed as a button on the dashboard. The thing is, each time I click refresh, I can see the date and time change on the metrics and checks of the datasets downstream of the source tables. However, the metrics on my source tables (row and column count) do not change until I manually go into the dataset -> Status -> Compute metrics; only then do the dashboard tiles show the refreshed values.

Second Way:
Here, I have used a 'sync' recipe on each of my source datasets, and all my other recipes start from the synced copies. As a result, my scenario has only one step: build the downstream flow from these source datasets. In this case, metrics like row and column count are set up on the 'synced' datasets instead of the sources. What I have noticed is that upon clicking the refresh scenario in the dashboard, everything gets updated; I don't have to go to each source and manually compute the metrics first. However, syncing takes too long: the sources have millions of rows, and it takes hours just to refresh everything this way.

What I need:
1) An efficient way to refresh my data so that the downstream flow has a reliable source and, of course, the changes get reflected in the dashboard.
2) Sometimes, with my First Way, if after a refresh I change something in my Snowflake table (say, delete all records) and then in Dataiku go to the source dataset -> Status -> Compute metrics, it shows 0 in the record count, but the Explore tab still shows records from the previous refresh. So I am looking for a way in which my data gets refreshed properly.

Thank you for your assistance with this :))

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,726 Neuron

Can you post a picture of your flow, in particular the first part where you have the sources, and indicate which datasets don't get built in your "first way"? Also a screenshot of step 1) "Build the source datasets only" in the scenario. Thanks

  • unkxowx

    Hey sure:

    1) First way:

    1.PNG

The source table (highlighted) does not get updated.

    2) Second way:

    2.PNG

In this case, my metrics are on the synced dataset.

    3) Build only source:

    3.PNG

  • Turribeach

You have a number of fundamental misconceptions about how Dataiku works, which lead to the problems you describe in your post. To begin with, you do not see your metrics updated on your Snowflake dataset because that dataset is never built by your scenario. In Dataiku, building a dataset means executing the recipe that generates that dataset (i.e. building the recipe output). Your dataset is an external table (what Dataiku calls an "unmanaged" or "external" dataset, though most people tend to call them "input" datasets, as they are the inputs to your flow), so it is not produced by anything in your flow and there is nothing to build; despite you asking for it to be built in your scenario, Dataiku will ignore the request and do nothing. If you go to your scenario, run it, then go to Last runs and click the Triggered job Scenario_build_XXX link of the step that builds your Snowflake dataset, you should see a message that says "There was nothing to do for this job", which will confirm what I am saying. Now that you understand there is nothing to build, you can rephrase your question as: how can I update the metrics and checks on my input datasets? Simple: add a "Compute metrics" step to your scenario and add your input datasets to that step, so the metrics are calculated as part of the scenario run.
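To make the distinction concrete, here is an illustrative Python sketch (not Dataiku's actual code; all names are made up) of why a "build" request is a no-op for input datasets: only a dataset that is the output of some recipe has anything to execute.

```python
# Toy model of a build request: a dataset is "buildable" only if some
# recipe in the flow produces it as an output.

def plan_build(dataset, recipes_by_output):
    """Return what a build request would do for `dataset`."""
    recipe = recipes_by_output.get(dataset)
    if recipe is None:
        # External/unmanaged (input) dataset: no recipe produces it,
        # so the job reports "There was nothing to do for this job".
        return "nothing to do"
    return f"run recipe {recipe}"

# Hypothetical flow: two managed datasets, each built by a recipe.
recipes_by_output = {"synced_ds": "sync_1", "prepared_ds": "prepare_1"}

print(plan_build("snowflake_source", recipes_by_output))  # nothing to do
print(plan_build("prepared_ds", recipes_by_output))       # run recipe prepare_1
```

This is why the scenario step updates everything downstream but never touches the Snowflake source: the source has no recipe to run, so only an explicit "Compute metrics" step refreshes its metrics.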

With regards to your second issue ("it will show me 0 in record count but explore tab still shows me records from prev refresh"), you have another misconception. The Explore tab is not your dataset, nor a live representation of it; it's simply a static data sample, usually only 10,000 rows. This sample is merely there to aid flow development and data exploration; it does not represent your whole dataset. For managed datasets that get rebuilt when you run a recipe, the sample is updated automatically when the data or schema changes. However, this is not done for input datasets because, as per my explanation above, they are never built by Dataiku. Furthermore, the data sample of a dataset does not need to match the metrics and checks, nor be from the same point in time. So if you wish to update the dataset sample, just click the Sample button and refresh it. Depending on the dataset size and the available memory you can sometimes sample 100% of a dataset, but it's not recommended. Of course, when you run a recipe the FULL dataset is used, irrespective of the state of the data sample, which is only used in the Explore tab.
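The sample/metric divergence you saw can be sketched in a few lines of Python (again an illustration of the concept, not Dataiku internals): the sample is a cached slice taken at one point in time, while a metric is recomputed against the current data, so truncating the table changes one but not the other.

```python
# Why a stale Explore sample can disagree with a freshly computed metric.

SAMPLE_SIZE = 10_000

def take_sample(rows):
    # A static copy of the first rows, kept until explicitly refreshed.
    return rows[:SAMPLE_SIZE]

def row_count_metric(rows):
    # Metrics are always recomputed against the current data.
    return len(rows)

data = list(range(50_000))      # the "table" at first refresh
sample = take_sample(data)      # cached when the dataset was explored

data.clear()                    # table truncated upstream (e.g. in Snowflake)

print(row_count_metric(data))   # 0     -> metric reflects current reality
print(len(sample))              # 10000 -> cached sample still shows old rows
```

Refreshing the sample from the GUI is the equivalent of calling `take_sample` again on the current data.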

Finally, we come to the "an efficient way to refresh my data" question. The easiest way to automate the refresh of a flow like yours is to add a SQL query change trigger to your scenario. This SQL query can be very simple, either doing a row count or selecting the max date from your Snowflake table, so that when the returned value changes, Dataiku triggers the scenario and refreshes your flow and your dashboard.
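The trigger's behaviour can be sketched roughly like this (an illustrative Python sketch under my own assumptions, not Dataiku's implementation; the probe values are made up): the last query result is remembered, and the scenario fires whenever a new result differs from it.

```python
# Sketch of a "SQL query change" trigger: run a cheap probe query on a
# schedule, compare the result with the previous run, fire on change.

def should_trigger(previous, current):
    """Fire when the probe result changes; never fire on the first poll."""
    return previous is not None and current != previous

last_seen = None
# Successive results of e.g. SELECT COUNT(*) FROM MY_TABLE (hypothetical).
for probe_result in [1000, 1000, 1250]:
    if should_trigger(last_seen, probe_result):
        print("change detected -> run scenario")
    last_seen = probe_result
```

Because the probe is a single aggregate query, it stays cheap even on tables with millions of rows, which is what makes this approach much lighter than syncing the data.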

  • unkxowx

    Understood and following on what you said above:
I initially didn't add a Compute metrics step because the metrics were on auto-compute, and it made sense to me to "build" the sources first; I clearly had a different definition of "build", and the same goes for the 0-row sample. Of course the dataset shows a sample, but it "technically" made sense to me that it should show nothing when there are no rows. I see what you're saying, and that clarifies everything. Also, I had set up a refresh with a SQL query change trigger earlier but disabled it while trying a few other things; I think it'll be great to re-enable it.

  • Turribeach

I agree that it can be confusing to manually compute metrics, get zero rows, and then go to the Explore tab and see data. You might think that in this case Dataiku could trigger a sample refresh in the background so that the sample/Explore tab also shows no rows. But I think this would be a slippery slope: how many other situations would require the data sample to be updated? New columns found by the column count metric? New values seen in the metrics? And what about the other way around, when the data sample shows zero rows but the row count metric says there are rows and is outdated? Furthermore, when people refresh a data sample they don't expect metrics to be updated, and likewise when they update metrics they don't expect data samples to change. You can see it quickly becomes a really complicated issue if you want one feature to depend on the other, and for that reason the current behaviour is the most sensible one, once you understand how the features work and their intended purpose. So you really need to treat data samples and metrics as completely different things, even though in some cases they may update at the same time.

If you want to see live data from your SQL datasets, you are better off using a SQL notebook and running a live SQL query on the table, which, by the way, will also retrieve only the first 10,000 rows by default. For other datasets, use a Jupyter notebook and load the dataset into a dataframe so you can inspect it there. Don't expect the data sample to be anything other than what it is, and if you want an updated sample, request it from the GUI.
