Retrieve last build date via API

LoicM
LoicM Registered Posts: 5 ✭✭✭✭
edited July 16 in Using Dataiku

Hello,

I am looking to retrieve the last time a dataset was built using the API.

This information is readily available in the web app:

Screenshot 2020-08-24 at 15.01.03.png

I can even click the link of the last build to get the exact datetime:

Screenshot 2020-08-24 at 15.01.19.png

For my most recent datasets this is relatively straightforward: I can look into the latest metric values:

from dataiku import api_client

# api_client is a function: call it to obtain a client handle
dataset = api_client().get_project("myproject").get_dataset("mydataset")
last_metrics = dataset.get_last_metric_values()
last_build_datetime = last_metrics.get_metric_by_id("reporting:BUILD_START_DATE")
# The result holds the last build date, in UTC

However, on older datasets this metric is not present, so the lookup raises:

Exception:  Metric reporting:BUILD_START_DATE not found among: ['basic:COUNT_COLUMNS', 'records:COUNT_RECORDS']
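A minimal sketch of a guard around that lookup, so that a missing metric yields None instead of an exception. It only relies on get_all_ids() and get_metric_by_id(), both used elsewhere in this thread; the helper name build_start_metric_or_none is hypothetical.

```python
BUILD_START_METRIC = "reporting:BUILD_START_DATE"


def build_start_metric_or_none(last_metrics, metric_id=BUILD_START_METRIC):
    """Return the metric entry, or None when the dataset has no such metric.

    `last_metrics` is expected to be the object returned by
    dataset.get_last_metric_values(). The helper name is hypothetical.
    """
    # Check the available ids first, so that no exception is raised
    # for datasets that were never rebuilt since the metric was introduced.
    if metric_id not in last_metrics.get_all_ids():
        return None
    return last_metrics.get_metric_by_id(metric_id)
```

A None result then marks the datasets for which the build date has to come from another source.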

As the info is shown in the web app for ALL datasets, I assume it is stored somewhere. However, I am at a loss as to how to get it from the API for older tables.

We made the transition from DSS 5.0 to 7.0 about a year ago. It seems, though I am not 100% certain, that the tables built using DSS 5.0 are the ones whose build date is not retrievable.

Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    Hi,

    BUILD_START_DATE is a magic metric that cannot be "computed" since it is only ever "set" by actually building a dataset.

    You can otherwise obtain the information about the last builds of datasets by using the "Internal stats" dataset, in the "Objects state" view. This dataset will then contain one line per dataset partition with the last build time. You can then load the dataframe corresponding to this Internal stats dataset in your own Python code and look up the build time there.
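A rough sketch of that lookup, assuming the rows of the Internal stats dataset have already been loaded (for instance with get_dataframe() followed by to_dict("records")). The column names project_key, dataset_name and last_build_time are placeholders, not confirmed names: check the actual schema of the "Objects state" view and adjust them. Since there is one line per partition, the sketch keeps the most recent build time per dataset.

```python
def last_build_per_dataset(rows,
                           project_col="project_key",      # assumed column name
                           dataset_col="dataset_name",     # assumed column name
                           time_col="last_build_time"):    # assumed column name
    """Map (project, dataset) to the most recent build time across partitions.

    `rows` is an iterable of dicts, one per dataset partition. Build times
    must be mutually comparable (datetimes, or ISO 8601 strings in the same
    format and timezone). Rows whose build time is None are skipped; when
    the rows come from pandas, drop missing values beforehand.
    """
    latest = {}
    for row in rows:
        built_at = row.get(time_col)
        if built_at is None:
            continue  # this partition has no recorded build
        key = (row[project_col], row[dataset_col])
        if key not in latest or built_at > latest[key]:
            latest[key] = built_at
    return latest


# Inside DSS, the rows could be obtained along these lines, where
# "internal_stats" is a hypothetical name for the Internal stats dataset:
#   import dataiku
#   rows = dataiku.Dataset("internal_stats").get_dataframe().to_dict("records")
```

Building the mapping once and querying it for every dataset avoids reloading the Internal stats dataset for each lookup.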

Answers

  • Liev
    Liev Dataiker Alumni Posts: 176 ✭✭✭✭✭✭✭✭

    Hi @LoicM

    This is indeed an interesting question. I imagine what's happening is that those old datasets have not had their metrics recalculated since the upgrade.

    Can you please confirm?

    Thank you

  • LoicM
    LoicM Registered Posts: 5 ✭✭✭✭
    edited July 17

    Indeed @Liev, in most cases these metrics have not been calculated since the latest upgrade.

    Following your question, I tried recomputing them, first with the default:

    dataset.compute_metrics() # Will compute metrics setup on the dataset

    Which only recomputes the metrics that are already present.

    I then tried to specify the metric I wanted with the metric_ids argument:

    dataset.compute_metrics(metric_ids=["reporting:BUILD_START_DATE"])
    # Also tried in addition with all other metrics already present
    default_metrics = dataset.get_last_metric_values().get_all_ids()
    dataset.compute_metrics(
        metric_ids=default_metrics+["reporting:BUILD_START_DATE"])

    In both cases, the build start date was not made available, even though the computation raised no error.

    When computing only BUILD_START_DATE, only the metric "reporting:METRICS_COMPUTATION_DURATION" was updated, whereas when including the default metrics, these default metrics were also updated.

    From what I see in my workspace, rebuilding the dataset may solve the problem of missing metrics, but since I need this metric to know whether it makes sense to rebuild the datasets in the first place, this is a bit of a chicken-and-egg problem.

  • LoicM
    LoicM Registered Posts: 5 ✭✭✭✭

    Hey @Clément_Stenac

    Thanks for the advice, this seems to work for me!

    Having to pull the whole dataset may be a bit overkill, but I'll think about refactoring my code to get the info on all the datasets from this Internal stats dataset.
