Dataiku Python API - Get project timeline information (project editing info)

ArvinUbhi
ArvinUbhi Dataiku DSS Core Designer, Registered Posts: 16 ✭✭✭✭

I have been trying to find a way to gather the information that the project timeline provides, i.e. the edit history. I would like to use the Python API for this. Could you please advise how I can do this?

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @ArvinUbhi,

    Git history is not available directly via the public Python APIs. You can view the edit history for a project under Version Control in the UI. This shows the history of edits to the project metadata.

    The easiest way to retrieve this data and use it in Python code would be to create an internal commits dataset, as explained here.

    Then use that dataset as needed. Let me know if the internal commits dataset would work for your use case.

    Thanks,

  • ArvinUbhi
    ArvinUbhi Dataiku DSS Core Designer, Registered Posts: 16 ✭✭✭✭

    Hi @AlexT,

    Thanks for your response.

    Unfortunately, I'm running this on every project in the environment, as I am putting together a report on which projects have the most recent or most frequent activity, so creating an internal commits dataset wouldn't work in this case (it is very handy to know about, though, so thank you).

    I have read that there's a runtime database that Dataiku maintains (which can also be written out to a Postgres DB) and which contains the timeline information for a project. I guess my question is: how can I access that via a Python script within a DSS environment?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @ArvinUbhi,

    What DSS version are you on?

    You are able to generate both the commits and human-readable history for all projects by leaving the project key blank, so you only need to build this dataset in a single project.

    See the examples below:

    [Screenshot 2021-10-19 at 17.44.50.png] [Screenshot 2021-10-19 at 17.45.16.png]

    This provides the same tables you would get by querying the runtime database directly, which is NOT advised. You can import these datasets into your Python notebook, recipe, etc., and manipulate them as you wish.
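
    As a rough illustration of that kind of manipulation, here is a minimal sketch that ranks projects by most recent activity and counts their commits. It works on plain row dictionaries (for example the output of `DataFrame.to_dict("records")`), and it assumes the `project_key` and `timestamp` columns of the commits view. Treating `timestamp` as epoch milliseconds is my assumption, and the sample rows are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def summarize_activity(rows):
    """Return per-project commit counts and latest commit time, most recent first."""
    counts = defaultdict(int)
    latest = {}
    for row in rows:
        key = row["project_key"]
        counts[key] += 1
        latest[key] = max(latest.get(key, 0), row["timestamp"])
    summary = [
        {
            "project_key": key,
            "commit_count": counts[key],
            # Assumption: timestamps are epoch milliseconds (UTC).
            "last_commit": datetime.fromtimestamp(latest[key] / 1000, tz=timezone.utc),
        }
        for key in counts
    ]
    return sorted(summary, key=lambda item: item["last_commit"], reverse=True)


# Made-up sample rows, for illustration only.
sample_rows = [
    {"project_key": "SALES", "timestamp": 1634600000000},
    {"project_key": "SALES", "timestamp": 1634650000000},
    {"project_key": "CHURN", "timestamp": 1634000000000},
]
activity = summarize_activity(sample_rows)
for item in activity:
    print(item["project_key"], item["commit_count"], item["last_commit"].isoformat())
```

    Sorting by `commit_count` instead of `last_commit` would give the "most frequent activity" view of the same data.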

    Let me know if I misunderstood anything and if the above can work for your use case.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Sorry, I should have noted that what you are likely looking for is the "Objects States" dataset rather than the commits one, as it holds the same information as the project timeline.

  • ArvinUbhi
    ArvinUbhi Dataiku DSS Core Designer, Registered Posts: 16 ✭✭✭✭

    I am on version 8.0.3.

    That's very helpful. Is there a way to access these directly from a Python script without building a Dataiku dataset, i.e. from a notebook or scenario?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17

    Not exactly, you will have to create the internal dataset in a project. For example, you can have a "monitoring" project where you create the dataset and then use it from Scenario/Notebook.

    # Example: load a DSS dataset as a Pandas dataframe
    import dataiku
    
    mydataset = dataiku.Dataset("object_states_all_projects")
    mydataset_df = mydataset.get_dataframe()
    mydataset_df

    If you prefer to use this as a SQL database, you can use a Sync recipe to sync the internal dataset to your database, then access it either as a dataset or using SQLExecutor2.

    import dataiku
    from dataiku import SQLExecutor2
    
    executor = SQLExecutor2(connection="my-sql-database") # or dataset="dataset_name"
    
    df = executor.query_to_df('SELECT * from "internal_objects_database_synced"')
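
    If the synced table holds the commits view, the aggregation can also be pushed down to SQL. A minimal sketch follows: the table name `commits_synced` is hypothetical, the column names are taken from the commits view, and the in-memory SQLite table with made-up rows is only there to dry-run the query outside DSS. In DSS the same string could be passed to `executor.query_to_df(query)`.

```python
import sqlite3


def build_activity_query(table_name):
    """Build a per-project activity query over a synced commits table."""
    # Table names cannot be bound as parameters, so quote it as an identifier.
    quoted = '"' + table_name.replace('"', '""') + '"'
    return (
        'SELECT "project_key", COUNT(*) AS commit_count, '
        'MAX("timestamp") AS last_commit '
        "FROM " + quoted + ' GROUP BY "project_key" '
        "ORDER BY last_commit DESC"
    )


query = build_activity_query("commits_synced")  # hypothetical table name

# Local dry run against an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "commits_synced" ("project_key" TEXT, "timestamp" INTEGER)')
conn.executemany(
    'INSERT INTO "commits_synced" VALUES (?, ?)',
    [("SALES", 1634600000000), ("SALES", 1634650000000), ("CHURN", 1634000000000)],
)
result = conn.execute(query).fetchall()
conn.close()
print(result)
```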
    

    Hope this helps. Let me know if you have any other questions.

  • ArvinUbhi
    ArvinUbhi Dataiku DSS Core Designer, Registered Posts: 16 ✭✭✭✭

    Thank you! That helps a lot. Can you please show me a script to access the scenarios table and the commits table in the internal runtime DB?

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    edited July 17

    Hi,

    You can create the internal StatsDB datasets directly from the Python API. Again, I would strongly discourage accessing the runtime database directly; using the internal stats datasets is preferred. To create all 4 types of datasets from the API, you can use:

    import dataiku
    import dataikuapi
    
    # Get a handle on the project that will hold the internal stats datasets
    client = dataiku.api_client()
    project_key = 'INTERNAL_STATS'
    project = client.get_project(project_key)
    
    # Different types of stats DB 
    
    #Cluster TASKS
    params_defined = { "view": "CLUSTER_TASKS"}
    dataset_type = "StatsDB"
    dataset_name = "cluster_tasks_v1"
    dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
    
    #internal stats datasets created without schema so hard coding to existing schema which would be created via UI
    dataset = dataikuapi.dss.dataset.DSSDataset(client,project_key,dataset_name)
    schema_to_set = {'columns': [{'name': 'connection', 'type': 'string'}, {'name': 'task_type', 'type': 'string'}, {'name': 'project_key', 'type': 'string'}, {'name': 'task_data', 'type': 'string'}, {'name': 'user', 'type': 'string'}, {'name': 'start_time', 'type': 'bigint'}, {'name': 'end_time', 'type': 'bigint'}], 'userModified': True}
    dataset.set_schema(schema_to_set)
    
    # COMMITS
    params_defined = { "view": "COMMITS"}
    dataset_type = "StatsDB"
    dataset_name = "commits_v1"
    dataset_create = project.create_dataset(dataset_name, dataset_type, params_defined)
    
    #internal stats datasets created without schema so hard coding to existing schema which would be created via UI
    dataset = dataikuapi.dss.dataset.DSSDataset(client,project_key,dataset_name)
    schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'}, {'name': 'commit_id', 'type': 'string'}, {'name': 'author', 'type': 'string'}, {'name': 'timestamp', 'type': 'bigint'}, {'name': 'added_files', 'type': 'int'}, {'name': 'added_lines', 'type': 'int'}, {'name': 'removed_files', 'type': 'int'}, {'name': 'removed_lines', 'type': 'int'}, {'name': 'changed_files', 'type': 'int'}], 'userModified': True}
    dataset.set_schema(schema_to_set)
    
    
    #FLOW ACTIONS 
    params_defined = { "view": "FLOW_ACTIONS"}
    dataset_type = "StatsDB"
    dataset_name = "flow_actions_v1"
    dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
    
    #internal stats datasets created without schema so hard coding to existing schema which would be created via UI
    dataset = dataikuapi.dss.dataset.DSSDataset(client,project_key,dataset_name)
    schema_to_set = {'columns': [{'name': 'project_key', 'type': 'string'}, {'name': 'object_id', 'type': 'string'}, {'name': 'partition', 'type': 'string'}, {'name': 'job_project_key', 'type': 'string'}, {'name': 'job_id', 'type': 'string'}, {'name': 'activity_id', 'type': 'string'}, {'name': 'scenario_project_key', 'type': 'string'}, {'name': 'scenario_id', 'type': 'string'}, {'name': 'scenario_run_id', 'type': 'string'}, {'name': 'step_id', 'type': 'string'}, {'name': 'step_run_id', 'type': 'string'}, {'name': 'time_start', 'type': 'date'}, {'name': 'time_end', 'type': 'date'}, {'name': 'outcome', 'type': 'string'}, {'name': 'result', 'type': 'string'}, {'name': 'warnings_count', 'type': 'bigint'}, {'name': 'type', 'type': 'string'}, {'name': 'is_last', 'type': 'boolean'}], 'userModified': True}
    dataset.set_schema(schema_to_set)
    
    
    #Scenarios 
    params_defined = { "view": "SCENARIO_RUNS"}
    dataset_type = "StatsDB"
    dataset_name = "scenario_runs_v1"
    dataset = project.create_dataset(dataset_name, dataset_type, params_defined)
    
    #internal stats datasets created without schema so hard coding to existing schema which would be created via UI
    dataset = dataikuapi.dss.dataset.DSSDataset(client,project_key,dataset_name)
    schema_to_set = {'columns': [{'name': 'scenario_project_key', 'type': 'string'}, {'name': 'scenario_id', 'type': 'string'}, {'name': 'scenario_run_id', 'type': 'string'}, {'name': 'time_start', 'type': 'date'}, {'name': 'time_end', 'type': 'date'}, {'name': 'outcome', 'type': 'string'}, {'name': 'warnings_count', 'type': 'bigint'}, {'name': 'scenario_name', 'type': 'string'}, {'name': 'trigger_name', 'type': 'string'}, {'name': 'scenario_run_as_user', 'type': 'string'}, {'name': 'run_as_user_identifier', 'type': 'string'}, {'name': 'run_as_user_via', 'type': 'string'}], 'userModified': True}
    dataset.set_schema(schema_to_set)
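
    The four blocks above follow the same pattern, so they can be folded into a loop. This is only a sketch of that idea: `create_stats_datasets` is a hypothetical helper, it relies on `project.create_dataset(...)` returning a dataset handle that exposes `set_schema`, and the stub classes and the truncated schema exist only so the helper can be dry-run outside DSS.

```python
STATS_VIEWS = {
    "cluster_tasks_v1": "CLUSTER_TASKS",
    "commits_v1": "COMMITS",
    "flow_actions_v1": "FLOW_ACTIONS",
    "scenario_runs_v1": "SCENARIO_RUNS",
}


def create_stats_datasets(project, schemas, views=STATS_VIEWS):
    """Create one StatsDB dataset per view and apply a schema where one is given."""
    created = {}
    for dataset_name, view in views.items():
        dataset = project.create_dataset(dataset_name, "StatsDB", {"view": view})
        if dataset_name in schemas:
            dataset.set_schema(schemas[dataset_name])
        created[dataset_name] = dataset
    return created


class _StubDataset:
    """Stand-in for a DSS dataset handle; used only for the dry run below."""

    def __init__(self, name):
        self.name = name
        self.schema = None

    def set_schema(self, schema):
        self.schema = schema


class _StubProject:
    """Stand-in for a DSS project handle; records the calls it receives."""

    def __init__(self):
        self.calls = []

    def create_dataset(self, name, dataset_type, params):
        self.calls.append((name, dataset_type, params))
        return _StubDataset(name)


stub_project = _StubProject()
# Truncated schema, for illustration only; use the full column lists in practice.
commits_schema = {"columns": [{"name": "project_key", "type": "string"}], "userModified": True}
created = create_stats_datasets(stub_project, {"commits_v1": commits_schema})
print(stub_project.calls)
```

    Inside DSS, the stub would be replaced by the real handle, e.g. `create_stats_datasets(client.get_project("INTERNAL_STATS"), schemas)`.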
    
    
    
    

    If you want to ultimately access via SQL directly you can create sync recipes programmatically using the example provided here: https://doc.dataiku.com/dss/latest/python-api/flow.html#creating-a-sync-recipe

    For example:

    #Create and Run Sync recipe
    builder = project.new_recipe("sync")
    builder = builder.with_input(dataset_name)
    builder = builder.with_new_output(dataset_name + 'sql', "Postgres-Localhost")
    recipe = builder.create()
    job = recipe.run()
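
    To sync several internal datasets in one go, the same calls can be wrapped in a loop. Again a sketch only: `sync_datasets_to_sql` and its `suffix` parameter are hypothetical, and the stub classes simply mimic the builder calls used above so the helper can be dry-run outside DSS.

```python
def sync_datasets_to_sql(project, dataset_names, connection, suffix="_sql"):
    """Create and run one Sync recipe per dataset; return the output dataset names."""
    outputs = []
    for name in dataset_names:
        output_name = name + suffix
        builder = project.new_recipe("sync")
        builder = builder.with_input(name)
        builder = builder.with_new_output(output_name, connection)
        recipe = builder.create()
        recipe.run()
        outputs.append(output_name)
    return outputs


class _StubRecipe:
    """Stand-in for a created recipe; records that it was run."""

    def __init__(self, log):
        self.log = log

    def run(self):
        self.log.append(("run",))


class _StubBuilder:
    """Stand-in for the recipe builder; records inputs and outputs."""

    def __init__(self, log):
        self.log = log

    def with_input(self, name):
        self.log.append(("input", name))
        return self

    def with_new_output(self, name, connection):
        self.log.append(("output", name, connection))
        return self

    def create(self):
        return _StubRecipe(self.log)


class _StubSyncProject:
    """Stand-in for a DSS project handle."""

    def __init__(self):
        self.log = []

    def new_recipe(self, recipe_type):
        self.log.append(("recipe", recipe_type))
        return _StubBuilder(self.log)


stub = _StubSyncProject()
synced = sync_datasets_to_sql(stub, ["commits_v1", "scenario_runs_v1"], "Postgres-Localhost")
print(synced)
```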