Dataiku Python API - Get project timeline information (project editing info)
I have been trying to find a way to gather the information that the project timeline provides, i.e. the edit history. I would like to use the Python API for this. Could you please advise me on how I can do this?
Answers
-
Alexandru (Dataiker)
Hi @ArvinUbhi,
Git history is not available directly via the public Python APIs. You can view the edit history under Version Control in the UI for a project; this shows the history of edits to the project metadata.
The easiest way to retrieve this data and use it in Python code would be to create an internal commits database as explained here.
Then use that respective dataset as needed. Let me know if the internal commits dataset would work for your use case.
Thanks,
-
Hi @AlexT,
Thanks for your response.
Unfortunately, I'm running this across every project in the environment, as I am putting together a report on which projects have the most recent or most frequent activity, so setting up an internal commits dataset wouldn't work in this case (however, it is very handy to know, so thank you).
I have read that there is a runtime database that Dataiku maintains (which can also be written out to a PostgreSQL database) containing the timeline information for a project. I guess my question is: how can I access that via a Python script within a DSS environment?
-
Alexandru (Dataiker)
Hi @ArvinUbhi,
What DSS version are you on?
You are able to generate both the commits and human-readable history for all projects by leaving the project key blank, so you only need to build this dataset in a single project.
See the examples below:
This provides the equivalent tables to querying the runtime database directly, which is NOT advised. You can import these datasets into your Python notebook, recipe, etc., and manipulate them as you wish.
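For example, a minimal sketch of that kind of manipulation ("commits_all_projects" is just an assumed name for the all-projects internal commits dataset; substitute whatever you named yours):

# Minimal sketch: count commits per project from the all-projects internal commits dataset
# NOTE: "commits_all_projects" is an assumed dataset name, substitute your own
import dataiku

commits_df = dataiku.Dataset("commits_all_projects").get_dataframe()

# one row per commit, so the row count per project_key gives the commit activity per project
commit_counts = commits_df["project_key"].value_counts()
print(commit_counts.head(10))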
Let me know if I misunderstood anything and if the above can work for your use case.
-
Alexandru (Dataiker)
Sorry, I should have noted that what you are likely looking for is the "Objects states" dataset rather than the commits, as this contains the same information that appears in the project timeline.
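A minimal sketch of creating that dataset from the API, following the same pattern as the examples further down in this thread (the view identifier "OBJECT_STATES" and the dataset name are assumptions; check the view names offered when creating an "Internal stats" dataset in the UI of your DSS version):

# Minimal sketch: create an internal stats dataset exposing objects states for all projects
# NOTE: the view name "OBJECT_STATES" is an assumption; verify it against the UI options
import dataiku

client = dataiku.api_client()
project = client.get_project("INTERNAL_STATS")  # any project can host the dataset

params = {"view": "OBJECT_STATES"}  # no project key filter, so all projects are covered
project.create_dataset("object_states_all_projects", "StatsDB", params)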
-
I am on version 8.0.3.
That's very helpful. Is there a way I can access these directly from a Python script without building a Dataiku dataset, i.e. from a notebook or scenario?
-
Alexandru (Dataiker)
Not exactly; you will have to create the internal dataset in a project. For example, you can have a "monitoring" project where you create the dataset and then use it from a scenario or notebook.
# Example: load a DSS dataset as a pandas dataframe
import dataiku

mydataset = dataiku.Dataset("object_states_all_projects")
mydataset_df = mydataset.get_dataframe()
mydataset_df
If you prefer to use this as an SQL database, you can use a Sync recipe to sync the internal dataset to your database, then access it either as a dataset or using SQLExecutor2.
import dataiku
from dataiku import SQLExecutor2

executor = SQLExecutor2(connection="my-sql-database")  # or dataset="dataset_name"
df = executor.query_to_df('SELECT * from "internal_objects_database_synced"')
Hope this helps; let me know if you have any other questions.
-
Thank you! That helps a lot. Can you please show me a script to access the scenario runs table and the commits table in the internal runtime database?
-
Alexandru (Dataiker)
Hi,
You can create the internal stats DB datasets directly from the Python API. Again, I would highly discourage accessing the runtime database directly, so using the internal stats datasets is preferred. To create all four types of datasets from the API you can use:
import dataiku
import dataikuapi
import pandas as pd, numpy as np

# retrieve dataset details
client = dataiku.api_client()
project_key = 'INTERNAL_STATS'
project = client.get_project('INTERNAL_STATS')

# Different types of stats DB

# Cluster tasks
params_defined = {"view": "CLUSTER_TASKS"}
dataset_type = "StatsDB"
dataset_name = "cluster_tasks_v1"
dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'connection', 'type': 'string'},
                             {'name': 'task_type', 'type': 'string'},
                             {'name': 'project_key', 'type': 'string'},
                             {'name': 'task_data', 'type': 'string'},
                             {'name': 'user', 'type': 'string'},
                             {'name': 'start_time', 'type': 'bigint'},
                             {'name': 'end_time', 'type': 'bigint'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Commits
params_defined = {"view": "COMMITS"}
dataset_type = "StatsDB"
dataset_name = "commits_v1"
dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'},
                             {'name': 'commit_id', 'type': 'string'},
                             {'name': 'author', 'type': 'string'},
                             {'name': 'timestamp', 'type': 'bigint'},
                             {'name': 'added_files', 'type': 'int'},
                             {'name': 'added_lines', 'type': 'int'},
                             {'name': 'removed_files', 'type': 'int'},
                             {'name': 'removed_lines', 'type': 'int'},
                             {'name': 'changed_files', 'type': 'int'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Flow actions
params_defined = {"view": "FLOW_ACTIONS"}
dataset_type = "StatsDB"
dataset_name = "flow_actions_v1"
dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'},
                             {'name': 'object_id', 'type': 'string'},
                             {'name': 'partition', 'type': 'string'},
                             {'name': 'job_project_key', 'type': 'string'},
                             {'name': 'job_id', 'type': 'string'},
                             {'name': 'activity_id', 'type': 'string'},
                             {'name': 'scenario_project_key', 'type': 'string'},
                             {'name': 'scenario_id', 'type': 'string'},
                             {'name': 'scenario_run_id', 'type': 'string'},
                             {'name': 'step_id', 'type': 'string'},
                             {'name': 'step_run_id', 'type': 'string'},
                             {'name': 'time_start', 'type': 'date'},
                             {'name': 'time_end', 'type': 'date'},
                             {'name': 'outcome', 'type': 'string'},
                             {'name': 'result', 'type': 'string'},
                             {'name': 'warnings_count', 'type': 'bigint'},
                             {'name': 'type', 'type': 'string'},
                             {'name': 'is_last', 'type': 'boolean'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)

# Scenario runs
params_defined = {"view": "SCENARIO_RUNS"}
dataset_type = "StatsDB"
dataset_name = "scenario_runs_v1"
dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
# internal stats datasets are created without a schema, so hard-coding the schema that would be created via the UI
dataset = dataikuapi.dss.dataset.DSSDataset(client, project_key, dataset_name)
schema_to_set = {'columns': [{'name': 'scenario_project_key', 'type': 'string'},
                             {'name': 'scenario_id', 'type': 'string'},
                             {'name': 'scenario_run_id', 'type': 'string'},
                             {'name': 'time_start', 'type': 'date'},
                             {'name': 'time_end', 'type': 'date'},
                             {'name': 'outcome', 'type': 'string'},
                             {'name': 'warnings_count', 'type': 'bigint'},
                             {'name': 'scenario_name', 'type': 'string'},
                             {'name': 'trigger_name', 'type': 'string'},
                             {'name': 'scenario_run_as_user', 'type': 'string'},
                             {'name': 'run_as_user_identifier', 'type': 'string'},
                             {'name': 'run_as_user_via', 'type': 'string'}],
                 'userModified': True}
dataset.set_schema(schema_to_set)
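Once these datasets exist, a minimal sketch of the per-project activity report discussed earlier (column names are taken from the commits_v1 schema above; the timestamp column is assumed to hold epoch milliseconds):

# Minimal sketch: rank projects by most recent and most frequent commit activity
# (column names come from the commits_v1 schema defined above; epoch-millisecond timestamps assumed)
import dataiku
import pandas as pd

commits_df = dataiku.Dataset("commits_v1").get_dataframe()
commits_df["commit_time"] = pd.to_datetime(commits_df["timestamp"], unit="ms")

activity = (
    commits_df.groupby("project_key")
    .agg(last_commit=("commit_time", "max"), commit_count=("commit_id", "count"))
    .sort_values("last_commit", ascending=False)
)
print(activity.head(10))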
If you ultimately want to access the data via SQL directly, you can create sync recipes programmatically using the example provided here: https://doc.dataiku.com/dss/latest/python-api/flow.html#creating-a-sync-recipe
e.g.:
# Create and run a Sync recipe
builder = project.new_recipe("sync")
builder = builder.with_input(dataset_name)
builder = builder.with_new_output(dataset_name + 'sql', "Postgres-Localhost")
recipe = builder.create()
job = recipe.run()