
Tasks history

Level 1

Hi all,

Is there an easy way to check the tasks that ran on a specific date in the past?

Thank you.

5 Replies
Neuron

@dcarva23jnj are you looking to know about a specific project? User? Dataset? All of the above? Can you describe more specifically what you are trying to accomplish?

Dataiku does have an audit trail that logs all user interactions with DSS to (rolling) log files. By default the files are stored in run/audit within the DSS default data directory. The log files contain JSON records that capture things like the user, activity, resources, time, etc. Because these are rolling log files, depending on how far back you want to go, the older files might no longer exist.

To actually leverage the information in these files, you can create a simple DSS project that reads the JSON file(s) and then filters the resulting dataset in DSS by date (and projectKey, user, or a variety of other fields) to see what was done on a particular day. This will be very granular and may be more than you were hoping for. Depending on how far you want to take it, you could also visualize the results with a dashboard or webapp.
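If you prefer to filter the audit records in code rather than in a DSS Flow, the same idea can be sketched in plain Python. This is only a sketch: the field names here (`timestamp` as epoch milliseconds, `authUser`, `msgType`) are assumptions illustrating the shape of JSON-per-line audit records, so check your own audit files for the exact schema before relying on them.

```python
import json
from datetime import datetime, timezone

def filter_audit_records(lines, day):
    """Keep JSON audit records whose timestamp falls on the given UTC date.

    Assumes each line is one JSON object with an epoch-millisecond
    'timestamp' field -- verify against your own audit files.
    """
    kept = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated/rotated lines
        ts = datetime.fromtimestamp(rec.get("timestamp", 0) / 1000, tz=timezone.utc)
        if ts.date() == day:
            kept.append(rec)
    return kept

# Fabricated example records (not real DSS output):
sample = [
    '{"timestamp": 1605313631924, "authUser": "itoledo", "msgType": "job-run"}',
    '{"timestamp": 1604000000000, "authUser": "someone", "msgType": "login"}',
]
day = datetime(2020, 11, 14, tzinfo=timezone.utc).date()
matches = filter_audit_records(sample, day)
print(len(matches))  # only the record from 2020-11-14 remains
```

In a real project you would read the lines from the files under run/audit instead of an in-memory list.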

Let me know if that helps at all.

Level 1
Author

Hi @tim-wright 

My need is an administration need.

I want the full picture of the tasks that ran, for example, in the previous week. Basically, I wanted to know whether Dataiku already provides something that analyzes the logs, or where in the API to get this information for the whole Dataiku instance, not just a specific job.


@dcarva23jnj I'm not aware of anything that Dataiku has out of the box to do this - particularly the analytics on top of the audit log data. Maybe someone else in the community has other knowledge that I do not.

You can use the DSS UI to see a recent history of the logs (see this link), but that is just a recent sample (at most 1000 events since the last DSS restart).

Apart from checking the default log location on DSS noted previously, an alternate and more robust way to persist the event log would be to send the DSS events to storage that is external to DSS (so you don't have to worry about older logs overwriting audit history on DSS). There are a variety of ways to do this in DSS. The DSS EventServer looks to be a relatively straightforward one (disclosure: I have no experience with it).

That would of course take care of persisting the data outside of DSS so it doesn't get overwritten, but it would still not analyze the data for you. The analysis could be done in DSS or another tool, depending on your specific requirements.


Hi @dcarva23jnj. There are in fact some tools for analysing the tasks, jobs, or scenarios that have run, available through the Python API, as well as for looking into the logs, all within DSS.

The problem, in my opinion, is that these methods still lack better documentation and more visibility for users in general. Just like @tim-wright, I was not aware of them until very recently.

Some examples of the capabilities available through the main DSSClient class:

import dataiku
client = dataiku.api_client()

To list the log files available:

client.list_logs()

To access the data of one log file from the list, say 'backend.log':

bck_log = client.get_log('backend.log')

The output is a dictionary with all entries in the file under the key 'tail' and then 'lines' (plus some metadata):

{'name': 'backend.log',
 'totalSize': 30740316,
 'lastModified': 1605317248000,
 'tail': {'totalLines': 0,
  'lines': ['[2020/11/13-22:26:26.391] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 1] Start call: /publicapi/projects/{projectKey}/datasets/ [GET] user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=IMGADMINISTRATIVEHOURS tags=]',
   '[2020/11/13-22:26:26.392] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 2] Done call: /publicapi/projects/{projectKey}/datasets/ [GET] time=2ms user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=IMGADMINISTRATIVEHOURS tags=]',
   '[2020/11/13-22:26:26.401] [qtp461591680-115003] [DEBUG] [dku.tracing]  - [ct: 0] Start call: /publicapi/projects/{projectKey}/datasets/ [GET] user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=INFLUXTESTER tags=]',...

In this way the files @tim-wright mentions, like the audit logs, can be accessed. Note, however, that the documentation says get_log() should return a string with the full log, but a dictionary is returned instead, and only the last 1000 lines are included.
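Given that dictionary shape (`'tail'` → `'lines'`), filtering the buffered lines down to a single day is straightforward, since each line starts with a `[YYYY/MM/DD-…]` timestamp as in the backend.log sample above. A minimal sketch (the two-line `fake` dictionary is fabricated to mimic that structure):

```python
def lines_for_day(log_dict, day_str):
    """Keep log lines from one day.

    log_dict is the dictionary returned by client.get_log();
    day_str is a date like '2020/11/13', matching the '[YYYY/MM/DD-'
    prefix of each log line.
    """
    lines = log_dict.get("tail", {}).get("lines", [])
    prefix = "[" + day_str + "-"
    return [ln for ln in lines if ln.startswith(prefix)]

# Fabricated example mimicking the get_log() output structure shown above:
fake = {"tail": {"lines": [
    "[2020/11/13-22:26:26.391] [DEBUG] Start call ...",
    "[2020/11/12-10:00:00.000] [DEBUG] Other day ...",
]}}
print(lines_for_day(fake, "2020/11/13"))  # keeps only the 2020/11/13 line
```

Keep in mind this only sees the last 1000 lines that get_log() returns, which is why the dataset import below is the more complete option.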

A workaround that works fairly well, and is easier than handling the files in Python, is to import the log files directly from the DSS instance: create a dataset from the filesystem_root, as in the two following screenshots (of course you need to change the paths according to your DSS instance location):

[Screenshot: import_logs_a.png]

[Screenshot: import_logs.png]

----

Now, for the tasks or jobs run in DSS projects, you can also get the information by querying the projects. First, list all the project keys in the DSS instance:

client.list_project_keys()

and then get the project data with:

p = client.get_project('GRIDMONITORING')

(I'm using here a project in my DSS instance)

Now you can list all the jobs run on the project (going as far back as your DSS instance's job log retention policy allows; in our case, the jobs run in the last 15 days):

p.list_jobs()

[{'stableState': True,
  'def': {'type': 'RECURSIVE_BUILD',
   'projectKey': 'GRIDMONITORING',
   'id': 'sched_build_2020-11-14T00-27-11.924',
   'name': 'sched_build',
   'initiator': 'itoledo',
   'triggeredFrom': 'SCHEDULER',
   'initiationTimestamp': 1605313631924,
   'mailNotification': False,
   'outputs': [{'targetDatasetProjectKey': 'GRIDMONITORING',
     'targetDataset': 'CiC65p77'}],
   'refreshHiveMetastore': False},
  'state': 'DONE',
  'warningsCount': 0,
  'startTime': 1605313635078,
  'endTime': 1605313635828,
  'scenarioProjectKey': 'GRIDMONITORING',
  'scenarioId': 'UpdateGrid',
  'stepId': 'build_0_true_d_measured_last',
  'scenarioRunId': '2020-11-13-21-25-46-939',
  'stepRunId': '2020-11-13-21-27-11-923',
  'kernelPid': 0},
 {'stableState': True,... },
 ...
]

This already gives you a lot of useful information: the number of jobs run, duration, finish status, produced outputs, etc. But if you would like to know more, you can also access the job logs. For example, taking the job id from the output shown above:

j1 = p.get_job('sched_build_2020-11-14T00-27-11.924')
j1.get_log()

And this returns a string object that you can further parse for information.
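To see how the pieces combine into a weekly overview, here is a sketch that summarizes a list of job dictionaries like the ones list_jobs() returns above. It only uses the 'startTime', 'endTime', and 'state' fields visible in that output (epoch milliseconds); the two example jobs are fabricated for illustration:

```python
def weekly_job_summary(jobs, now_ms):
    """Summarize jobs (shaped like project.list_jobs() output) from the last 7 days.

    Uses the epoch-millisecond 'startTime'/'endTime' and the 'state'
    fields shown in the list_jobs() output above.
    """
    week_ago = now_ms - 7 * 24 * 3600 * 1000
    recent = [j for j in jobs if j.get("startTime", 0) >= week_ago]
    return {
        "count": len(recent),
        "not_done": sum(1 for j in recent if j.get("state") != "DONE"),
        "total_seconds": sum((j["endTime"] - j["startTime"]) / 1000
                             for j in recent if "endTime" in j),
    }

# Fabricated jobs reusing the field names from the listing above:
jobs = [
    {"state": "DONE", "startTime": 1605313635078, "endTime": 1605313635828},
    {"state": "FAILED", "startTime": 1605000000000, "endTime": 1605000001000},
]
summary = weekly_job_summary(jobs, now_ms=1605313700000)
print(summary)
```

In practice you would loop over client.list_project_keys(), call get_project() and list_jobs() for each, and feed all the results through a function like this.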

So, with a little bit of Python magic, you can generate your own weekly reports on the tasks and jobs run in a given list of projects, or in all of them. More references:

Project object api (obtained with get_project()): https://doc.dataiku.com/dss/latest/python-api/projects.html#projects

Logs api (obtained with get_log()): https://doc.dataiku.com/dss/latest/python-api/client.html#dataikuapi.DSSClient.get_log

Jobs: https://doc.dataiku.com/dss/latest/python-api/jobs.html

Have a good weekend!

Ignacio

Level 1
Author

Hi @Ignacio_Toledo,

Thank you very much for the detailed explanation!

I will for sure give it a try.

Best regards,
