Dataiku Python API - Get project timeline information (project editing info)
I have been trying to find a way to gather the information that the project timeline provides, i.e. the edit history. I would like to use the Python API for this. Could you please advise me on how I can do this?
Answers
-
Alexandru (Dataiker)
Hi @ArvinUbhi,
Git history is not available directly via the public Python APIs. You can view the edit history under Version Control in the UI for a project; this shows the history of edits to the project metadata.
The easiest way to retrieve this data and use it in Python code would be to create an internal commits database as explained here.
Then use that respective dataset as needed. Let me know if the internal commits dataset would work for your use case.
Thanks,
-
Hi @AlexT,
Thanks for your response.
Unfortunately, I'm running this across every project in the environment, as I am putting together a report on which projects have the most recent or most frequent activity, so setting up an internal commits dataset wouldn't work in this case (however, it is very handy to know, so thank you).
I have read that there is a runtime database that Dataiku maintains (which can also be written out to a PostgreSQL database) containing the timeline information for a project. I guess my question is: how can I access that via a Python script within a DSS environment?
-
Alexandru (Dataiker)
Hi @ArvinUbhi,
What DSS version are you on?
You are able to generate both the commits and human-readable history for all projects by leaving the project key blank, so you only need to build this dataset in a single project.
See the examples below:
This provides the equivalent tables to querying the runtime database directly, which is NOT advised. You can import these datasets into your Python notebook, recipe, etc., and manipulate them as you wish.
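For example, a minimal sketch of that kind of manipulation ("commits_all_projects" is just an assumed name for the all-projects internal commits dataset; substitute whatever you named yours):

# Minimal sketch: count commits per project from the all-projects internal commits dataset
# NOTE: "commits_all_projects" is an assumed dataset name, substitute your own
import dataiku

commits_df = dataiku.Dataset("commits_all_projects").get_dataframe()

# one row per commit, so the row count per project_key gives the commit activity per project
commit_counts = commits_df["project_key"].value_counts()
print(commit_counts.head(10))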
Let me know if I misunderstood anything and if the above can work for your use case.
-
Alexandru (Dataiker)
Sorry, I should have noted that what you are likely looking for is the "Objects states" dataset rather than the commits, as this contains the same information that appears in the project timeline.
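A minimal sketch of creating that dataset from the API, following the same pattern as the examples further down in this thread (the view identifier "OBJECT_STATES" and the dataset name are assumptions; check the view names offered when creating an "Internal stats" dataset in the UI of your DSS version):

# Minimal sketch: create an internal stats dataset exposing objects states for all projects
# NOTE: the view name "OBJECT_STATES" is an assumption; verify it against the UI options
import dataiku

client = dataiku.api_client()
project = client.get_project("INTERNAL_STATS")  # any project can host the dataset

params = {"view": "OBJECT_STATES"}  # no project key filter, so all projects are covered
project.create_dataset("object_states_all_projects", "StatsDB", params)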
-
I am on version 8.0.3.
That's very helpful. Is there a way I can access these directly from a Python script without building a Dataiku dataset, i.e. from a notebook or scenario?
-
Alexandru (Dataiker)
Not exactly; you will have to create the internal dataset in a project. For example, you can have a "monitoring" project where you create the dataset and then use it from a scenario or notebook.
# Example: load a DSS dataset as a pandas dataframe
import dataiku

mydataset = dataiku.Dataset("object_states_all_projects")
mydataset_df = mydataset.get_dataframe()
mydataset_df
If you prefer to use this as an SQL database, you can use a Sync recipe to sync the internal dataset to your database, then access it either as a dataset or using SQLExecutor2.
import dataiku
from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="my-sql-database")  # or dataset="dataset_name"
df = executor.query_to_df('SELECT * from "internal_objects_database_synced"')
Hope this helps; let me know if you have any other questions.
-
Thank you! That helps a lot. Can you please show me a script to access the scenario runs table and the commits table in the internal runtime database?
-
Alexandru (Dataiker)
Hi,
You can create the internal stats DB datasets directly from the Python API. Again, I would highly discourage accessing the runtime database directly, so using the internal stats datasets is preferred. To create all four types of datasets from the API you can use:
import dataiku
import dataikuapi
import pandas as pd, numpy as np

# retrieve dataset details
client = dataiku.api_client()
project_key = 'INTERNAL_STATS'
project = client.get_project('INTERNAL_STATS')

# Different types of stats DB

# Cluster tasks
params_defined = {"view": "CLUSTER_TASKS"}
dataset_type = "StatsDB"
dataset_name = "cluster_tasks_v1"
dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'connection', 'type': 'string'},
                             {'name': 'task_type', 'type': 'string'},
                             {'name': 'project_key', 'type': 'string'},
                             {'name': 'task_data', 'type': 'string'},
                             {'name': 'user', 'type': 'string'},
                             {'name': 'start_time', 'type': 'bigint'},
                             {'name': 'end_time', 'type': 'bigint'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Commits
params_defined = {"view": "COMMITS"}
dataset_type = "StatsDB"
dataset_name = "commits_v1"
dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'},
                             {'name': 'commit_id', 'type': 'string'},
                             {'name': 'author', 'type': 'string'},
                             {'name': 'timestamp', 'type': 'bigint'},
                             {'name': 'added_files', 'type': 'int'},
                             {'name': 'added_lines', 'type': 'int'},
                             {'name': 'removed_files', 'type': 'int'},
                             {'name': 'removed_lines', 'type': 'int'},
                             {'name': 'changed_files', 'type': 'int'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Flow actions
params_defined = {"view": "FLOW_ACTIONS"}
dataset_type = "StatsDB"
dataset_name = "flow_actions_v1"
dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'},
                             {'name': 'object_id', 'type': 'string'},
                             {'name': 'partition', 'type': 'string'},
                             {'name': 'job_project_key', 'type': 'string'},
                             {'name': 'job_id', 'type': 'string'},
                             {'name': 'activity_id', 'type': 'string'},
                             {'name': 'scenario_project_key', 'type': 'string'},
                             {'name': 'scenario_id', 'type': 'string'},
                             {'name': 'scenario_run_id', 'type': 'string'},
                             {'name': 'step_id', 'type': 'string'},
                             {'name': 'step_run_id', 'type': 'string'},
                             {'name': 'time_start', 'type': 'date'},
                             {'name': 'time_end', 'type': 'date'},
                             {'name': 'outcome', 'type': 'string'},
                             {'name': 'result', 'type': 'string'},
                             {'name': 'warnings_count', 'type': 'bigint'},
                             {'name': 'type', 'type': 'string'},
                             {'name': 'is_last', 'type': 'boolean'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Scenario runs
params_defined = {"view": "SCENARIO_RUNS"}
dataset_type = "StatsDB"
dataset_name = "scenario_runs_v1"
dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'scenario_project_key', 'type': 'string'},
                             {'name': 'scenario_id', 'type': 'string'},
                             {'name': 'scenario_run_id', 'type': 'string'},
                             {'name': 'time_start', 'type': 'date'},
                             {'name': 'time_end', 'type': 'date'},
                             {'name': 'outcome', 'type': 'string'},
                             {'name': 'warnings_count', 'type': 'bigint'},
                             {'name': 'scenario_name', 'type': 'string'},
                             {'name': 'trigger_name', 'type': 'string'},
                             {'name': 'scenario_run_as_user', 'type': 'string'},
                             {'name': 'run_as_user_identifier', 'type': 'string'},
                             {'name': 'run_as_user_via', 'type': 'string'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)
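Once these datasets exist, a minimal sketch of the per-project activity report discussed earlier (column names are taken from the commits_v1 schema above; the timestamp column is assumed to hold epoch milliseconds):

# Minimal sketch: rank projects by most recent and most frequent commit activity
# (column names come from the commits_v1 schema defined above; epoch-millisecond timestamps assumed)
import dataiku
import pandas as pd

commits_df = dataiku.Dataset("commits_v1").get_dataframe()
commits_df["commit_time"] = pd.to_datetime(commits_df["timestamp"], unit="ms")

activity = (
    commits_df.groupby("project_key")
    .agg(last_commit=("commit_time", "max"), commit_count=("commit_id", "count"))
    .sort_values("last_commit", ascending=False)
)
print(activity.head(10))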
If you ultimately want to access the data via SQL directly, you can create sync recipes programmatically using the example provided here: https://doc.dataiku.com/dss/latest/python-api/flow.html#creating-a-sync-recipe
e.g.:
# Create and run a Sync recipe
builder = project.new_recipe("sync")
builder = builder.with_input(dataset_name)
builder = builder.with_new_output(dataset_name + 'sql', "Postgres-Localhost")
recipe = builder.create()
job = recipe.run()