Tasks history

dcarva23jnj · November 2020

Hi all,

Is there an easy way to check the tasks that ran on a specific date in the past?

Thank you.

tim-wright · November 2020

@dcarva23jnj are you looking to know about a specific project? User? Dataset? All of the above? Can you describe more specifically what you are looking to accomplish?

Dataiku does have an audit trail that logs all user interactions with DSS to (rolling) log files. By default the files are stored in run/audit within the DSS default data directory. These log files contain JSON records that capture things like the user, activity, resources, time, etc. Because they are rolling log files, the files might no longer exist, depending on how far back you want to go.

To actually leverage the information within these files, you can create a simple DSS project that reads the json file(s) and then filters the dataset in DSS by date (and projectKey, User, or a variety of other fields) to see what was done on a particular day. This will be very granular and may be in excess of what you were hoping for. Depending on how crazy you want to get, you could probably visualize with a dashboard or webapp.
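If you would rather prototype the filter outside of a DSS flow first, the same idea can be sketched in plain Python. This is only a rough sketch under assumptions: it supposes one JSON record per line and an ISO-8601 string under a key called "timestamp". That key name, the helper name records_on_date and the sample records are all made up for illustration, so check them against what your own audit files actually contain.

```python
import json
from datetime import date, datetime


def records_on_date(lines, target_day, ts_key="timestamp"):
    """Return the parsed JSON records whose timestamp falls on target_day.

    ts_key is an assumed field name; adjust it to the real audit schema.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        raw = record.get(ts_key)
        if not isinstance(raw, str):
            continue
        try:
            stamp = datetime.fromisoformat(raw)
        except ValueError:
            continue  # timestamp is not in the assumed ISO-8601 form
        if stamp.date() == target_day:
            matches.append(record)
    return matches


# Hypothetical records, only meant to show the expected shape of the input.
sample = [
    '{"timestamp": "2020-11-13T22:26:26.391", "user": "itoledo"}',
    '{"timestamp": "2020-11-14T00:27:11.924", "user": "itoledo"}',
    "not json",
]
hits = records_on_date(sample, date(2020, 11, 13))
print(len(hits))
```

On a real file you would pass the open file object instead of the sample list, since iterating over a file yields its lines.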

Let me know if that helps at all.

dcarva23jnj · November 2020

Hi @tim-wright

My need is an Administration need.

I want to have the full picture of the tasks that ran in the previous week, for example. Basically, I wanted to know whether Dataiku already provides something that analyzes the logs, or where in the API I can get this information for the whole Dataiku instance and not only for a specific job.

tim-wright · November 2020

@dcarva23jnj I'm not aware of anything that Dataiku has out of the box to do this - particularly the analytics on top of the audit log data. Maybe someone else in the community has other knowledge that I do not.

You can use the DSS UI to see a recent history of the logs (see this link) but that is just a recent sample (of max 1000 events since last DSS restart).

Apart from checking the default log location on DSS noted previously, an alternate and more robust way to persist the event log would be to send the DSS events to storage that is external to DSS (and not worry about older logs overwriting audit history on DSS). There are a variety of ways to do this in DSS. The DSS EventServer looks to be a relatively straightforward way (disclosure: I have no experience with it).

That would of course take care of logging the data outside of DSS so they don't get overwritten, but would still not analyze the data for you. The analysis could be done in DSS or another tool depending on your specific requirements.

Ignacio_Toledo · November 2020

Hi @dcarva23jnj. There are in fact some tools, available through the Python API, for analysing the tasks, jobs or scenarios that were run, and also for looking into the logs, all within DSS.

The problem, in my opinion, is that those methods still lack good documentation and more visibility among users in general. Just like @tim-wright, I was not aware of them until very recently.

Some examples of the capabilities that are available through the main DSSClient Class:

import dataiku
client = dataiku.api_client()

To list the available log files:

client.list_logs()

To access the data of one log file from the list, let's say 'backend.log':

bck_log = client.get_log('backend.log')

The output is a dictionary with all entries in the file under the key 'tail' and then 'lines' (plus some metadata):

{'name': 'backend.log',
 'totalSize': 30740316,
 'lastModified': 1605317248000,
 'tail': {'totalLines': 0,
  'lines': ['[2020/11/13-22:26:26.391] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 1] Start call: /publicapi/projects/{projectKey}/datasets/ [GET] user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=IMGADMINISTRATIVEHOURS tags=]',
   '[2020/11/13-22:26:26.392] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 2] Done call: /publicapi/projects/{projectKey}/datasets/ [GET] time=2ms user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=IMGADMINISTRATIVEHOURS tags=]',
   '[2020/11/13-22:26:26.401] [qtp461591680-115003] [DEBUG] [dku.tracing]  - [ct: 0] Start call: /publicapi/projects/{projectKey}/datasets/ [GET] user=itoledo auth=<AC:user:itoledo via: ticket:jupyter:MAINTENANCE.Maintenance.ipynb> [projectKey=INFLUXTESTER tags=]',...

In this way the files @tim-wright mentions, like audit, can be accessed. However, the documentation says that get_log() should return a string with the full log, whereas a dictionary is returned instead and only the last 1000 lines are included.
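Within that 1000-line limit, the returned lines can still be filtered by date, because each entry starts with a timestamp in square brackets. A minimal sketch, assuming the dictionary layout and the timestamp format shown in the output above (the helper name lines_on_date is my own, and the sample lines are shortened or invented for the example):

```python
import re
from datetime import date, datetime

# Matches the leading "[2020/11/13-22:26:26.391]" stamp seen in backend.log.
LINE_STAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\]")


def lines_on_date(log, target_day):
    """Return the entries of log['tail']['lines'] stamped on target_day."""
    selected = []
    for line in log.get("tail", {}).get("lines", []):
        match = LINE_STAMP.match(line)
        if match is None:
            continue  # lines without a leading stamp are skipped
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        if stamp.date() == target_day:
            selected.append(line)
    return selected


# Stand-in for client.get_log('backend.log'); the second line is invented.
sample_log = {
    "name": "backend.log",
    "tail": {
        "lines": [
            "[2020/11/13-22:26:26.391] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 1] Start call",
            "[2020/11/12-08:00:00.000] [qtp461591680-114995] [DEBUG] [dku.tracing]  - [ct: 1] Start call",
            "a line without a timestamp",
        ]
    },
}
todays = lines_on_date(sample_log, date(2020, 11, 13))
print(len(todays))
```

On a live instance you would pass the result of client.get_log('backend.log') instead of sample_log.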

A workaround that works fairly well, and is better than working with files in Python, is to import the log files directly from the DSS instance by creating a dataset from the filesystem_root, as in the two screenshots below (of course you need to change the paths according to your DSS instance location):

[Screenshots not available: creating a dataset from the filesystem_root]

Now, for the tasks or jobs run in the DSS projects, you can also get the information by querying the projects. First, you can list all the project keys in the DSS instance:

client.list_project_keys()

and then get the project data with:

p = client.get_project('GRIDMONITORING')

(I'm using here a project in my DSS instance)

Now you can list all the jobs run on the project (as far back in time as your DSS instance's job log retention policy allows; in our case, the jobs run in the last 15 days):

p.list_jobs()

[{'stableState': True,
  'def': {'type': 'RECURSIVE_BUILD',
   'projectKey': 'GRIDMONITORING',
   'id': 'sched_build_2020-11-14T00-27-11.924',
   'name': 'sched_build',
   'initiator': 'itoledo',
   'triggeredFrom': 'SCHEDULER',
   'initiationTimestamp': 1605313631924,
   'mailNotification': False,
   'outputs': [{'targetDatasetProjectKey': 'GRIDMONITORING',
     'targetDataset': 'CiC65p77'}],
   'refreshHiveMetastore': False},
  'state': 'DONE',
  'warningsCount': 0,
  'startTime': 1605313635078,
  'endTime': 1605313635828,
  'scenarioProjectKey': 'GRIDMONITORING',
  'scenarioId': 'UpdateGrid',
  'stepId': 'build_0_true_d_measured_last',
  'scenarioRunId': '2020-11-13-21-25-46-939',
  'stepRunId': '2020-11-13-21-27-11-923',
  'kernelPid': 0},
 {'stableState': True,... },
 ...
]

This already gives you a lot of useful information, like the number of jobs run, their duration, final status, produced outputs, etc. If you want to know more, you can also access the job logs. For example, taking the job id from the output shown above:

j1 = p.get_job('sched_build_2020-11-14T00-27-11.924')
j1.get_log()

This returns a string object that you could further parse for information.
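Going back to the list_jobs() output, the 'startTime', 'endTime' and 'state' fields are enough to answer "what ran last week". A rough sketch, assuming those fields are epoch milliseconds as in the output above (the helper names jobs_between and summarize are my own, and the sample list is a trimmed copy of the job shown earlier):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def jobs_between(jobs, start, end):
    """Keep jobs whose 'startTime' (epoch milliseconds) lies in [start, end)."""
    start_ms = start.timestamp() * 1000
    end_ms = end.timestamp() * 1000
    return [j for j in jobs if start_ms <= j.get("startTime", 0) < end_ms]


def summarize(jobs):
    """Count jobs per state and add up their run time in seconds."""
    states = Counter(j.get("state", "UNKNOWN") for j in jobs)
    total_s = sum(
        (j["endTime"] - j["startTime"]) / 1000
        for j in jobs
        if j.get("endTime") and j.get("startTime")
    )
    return states, total_s


# Trimmed copy of the job from the list_jobs() output above.
sample_jobs = [
    {
        "def": {
            "id": "sched_build_2020-11-14T00-27-11.924",
            "projectKey": "GRIDMONITORING",
        },
        "state": "DONE",
        "startTime": 1605313635078,
        "endTime": 1605313635828,
    }
]

week_end = datetime(2020, 11, 15, tzinfo=timezone.utc)
week_start = week_end - timedelta(days=7)
recent = jobs_between(sample_jobs, week_start, week_end)
states, total_s = summarize(recent)
print(dict(states), total_s)
```

To cover the whole instance, you could build the input list by looping over client.list_project_keys() and collecting client.get_project(key).list_jobs() for each key.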

So, with a little bit of Python magic, you can generate your own weekly reports on the tasks and jobs run in a given list of projects or in all of them. More references:

Project object api (obtained with get_project()): https://doc.dataiku.com/dss/latest/python-api/projects.html#projects

Logs api (obtained with get_log()): https://doc.dataiku.com/dss/latest/python-api/client.html#dataikuapi.DSSClient.get_log

Jobs: https://doc.dataiku.com/dss/latest/python-api/jobs.html

Have a good weekend!

Ignacio

dcarva23jnj · November 2020

Hi @Ignacio_Toledo,

Thank you very much for the detailed explanation!

I will for sure give it a try.

Best regards,

Ayukesock · August 2022

Dear Sir/Madame,

With regard to open source, I know of these levels of engagement: use/reuse, contribution, championing and collaboration. I would like to know if Dataiku was built on top of an open source project. Secondly, I would also like to know which of these levels of engagement applies to Dataiku and how, whether contributing, championing or collaborating. I found it difficult to identify who the external contributors are.

CoreyS · August 2022

Hi @Ayukesock and welcome to the Dataiku Community. Dataiku was built to allow organizations to benefit from the innovations and dynamism of open source tools while also providing an out-of-the-box platform where everything (the people, the technologies, the languages, and the data) is already connected and given a user-friendly interface.

Think of Dataiku as a control room of the more than 30 open source tools that the platform integrates with. Most of our in-house plugins are open source (Apache License). As a team, we contribute to major projects including:

cardinal: Dataiku Lab is the author and maintainer of this Python package designed to perform and monitor active learning experiments, leveraging various query sampling methods and metrics.
CodeMirror: We actively sponsor this JavaScript code editor.
scikit-learn: We are part of the first consortium of corporate sponsors who support the development of this flagship machine learning library.

I'd welcome you to get started with our 14 day Free Trial or Install our Free Version to try it out for yourself. I'd also like to point you to this piece: Making Open Source a Sustainable Part of AI Strategy.
