User Login History

Carl7189 · January 2021

Hello everyone

I currently use Dataiku version 6.0.4 and I would like to know if there is any type of element or configuration that tells me how many users log into the application in the day and if I can check the history of how many users logged into Dataiku DSS in the month.

Stay tuned

Thanks a lot

Sergey · January 2021

Lol @Clément_Stenac
. Thanks for the pointer.

Yeah, so basically, @Carl7189
you can just create a managed folder pointing to DATA_DIR/run/audit log directory and create a dataset based on it. Then use prepare recipe to filter the data by /api/login calls and keep only API call and timestamp:

Screenshot 2021-01-08 at 11.21.33.png

from this point, you can group/filter by time to get what you were looking for. It was easier than we all thought.

tgb417 · January 2021

@Carl7189
,

In DSS V8.0.4 (the latest version) at a project level, you have the Activity Summary, Contributors, Project Activity Summary.jpg

Contributors,

Project Activity Contributors.jpg
and Punch card (not shown here).

This feature looks to have been added in V3.0.0

During the same version, There was a "Global usage Summary" added.

However, I'm not finding a lot of documentation on this feature.

cc: @CoreyS

Sergey · January 2021

Hi @Carl7189

User login information stored in audit log files. The recent activity is shown on the Administration>Security>Audit Trail page but that's will not suit your request (per day/month count).

You will need to grep recursively the records in DATA_DIR/run/audit directory by "/api/login" and return only the username and timestamp but not sure how you can count the number per day/month. Technically, you can make this grep call from python recipe and use pandas dataframe to save this as a table. After this, you can filter the timestamps by the day or by month.

Sarina · January 2021

Hi Carl,

In addition to the project-level metrics, you can also parse your audit.log file to retrieve the total number of active users in DSS.

Here is some starter code for how you could do this in a Python recipe, pulling all active users from your available audit.log file data. This example writes the output to the new dataset active_users, that has an active_date and user column .

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

from datetime import datetime, timedelta
import json 
import dataiku

f = open('../../../../../../run/audit/audit.log')
lines = f.readlines()

active_users_set = set() # date, user tuple

# once historical data is populated, a check can be run on on the timestamps to only process data for the previous day 
yesterday = datetime.today() - timedelta(days=1)
yesterday_formatted = yesterday.strftime('%Y-%m-%d')

# each line of the audit log 
for line in lines:
    entry = json.loads(line)
    # just pull the date field 
    audit_date = entry['timestamp'].split('T')[0]
    # for daily runs, can include the check: `if audit_date == yesterday_formatted:`
    if 'message' in entry and 'authUser' in entry['message']:
        user_date = (audit_date, entry['message']['authUser'])
        active_users_set.add(user_date)

array_of_entries = []
for entry in active_users_set:
   array_of_entries.append(list(entry))

active_users_array = list(array_of_entries) 
active_users_df = pd.DataFrame.from_records(active_users_array)
active_users_df.columns = ['active_date', 'user']

# append to dataset that stores all active users per day
au = dataiku.Dataset("active_users")
au.write_with_schema(active_users_df)

Keep in mind that your audit log may wrap more frequently than once a day, and that would need to be accounted for as well.

From here, you can aggregate the resultant dataset by day / week / month in order to get your aggregated user counts.

Thanks,
Sarina

tgb417 · January 2021

@SarinaS

In testing out the code segment. I got to the following line

f = open('../../../../../../run/audit/audit.log')

And I get the following.

IOErrorTraceback (most recent call last)
<ipython-input-3-dd8957f59dcb> in <module>()
----> 1 f = open('../../../../../../run/audit/audit.log')
      2 lines = f.readlines()

IOError: [Errno 2] No such file or directory: '../../../../../../run/audit/audit.log'

When I try.

f = open('../../../../run/audit/audit.log')

on a Macintosh Installation, only 4 directories up, I end up getting something to work. That get me back to my DSS Home directory.

Is there a variable that can be gotten to in Python run under DSS that will better produce the home directory location?

tgb417 · January 2021

@SarinaS

When running your demo code I'm running into a problem. That I don't understand.

# each line of the audit log
for line in lines:
    entry = json.loads(line)
    # just pull the date field
    audit_date = entry['timestamp'].split('T')[0]
    # for daily runs, can include the check: `if audit_date == yesterday_formatted:`
    if 'message' in entry and 'authUser' in entry['message']:
        user_date = (audit_date, entry['message']['authUser'])
        active_users_set.add(user_date)

This iterates about 2300 times then I get the following error.

TypeErrorTraceback (most recent call last)
<ipython-input-76-f01f734a6b86> in <module>()
      6     # for daily runs, can include the check: `if audit_date == yesterday_formatted:`
      7     if 'message' in entry and 'authUser' in entry['message']:
----> 8         user_date = (audit_date, entry['message']['authUser'])
      9         active_users_set.add(user_date)

TypeError: string indices must be integers

Something seems to be wrong with the data. A visual inspection of records at this point in the file does not show visually an obvious problem.

However, the problem seems to be with the statement

entry['message']['authUser']

When I pull this out of the user_data assignment making it user_data = (audit_date). Then I run to compilation. Hmmmm...

Thoughts?

Testing from Jupyter Notebook using DSS V8.0.4 on Macintosh. (The problems show up using either Python 2 or Python 3.

Sarina · January 2021

Hey Tom,

The relative path will depend on how the python code is run. As you point out, you could create a global variable that points to your data directory full path like so:

{"DATA_DIR": "/path/to/dss"}

And then you can use the variable to set the path to your logging files:

import dataiku 
client = dataiku.api_client()
vars = client.get_variables()

f = open(vars['DATA_DIR'] + '/run/audit/audit.log')

For the indices error that you are seeing, I would suggest adding a print statement in the for loop to verify the entry data. You may need to add another validity check before the user_date assignment.

Thanks,

Sarina

tgb417 · January 2021

@SarinaS

In a quick check, I could not find what was wrong with the data. It visually looked OK.

So, I don't know what additional check to put in place.

I'm going to let it go at least for tonight.

Clément_Stenac · January 2021

If only there was a software that could help you visually prepare and analyze data rather than having to write code for that

tgb417 · January 2021

@sergeyd
,

When thinking about operationalizing something like this. What is the best way to create this managed folder?

I found this link in the documentation to "Creating a managed folder".

In this documentation, there is a statement that says. "The default connection to create managed folders on is named "managed_folders". However when I look at setting up a managed folder, I'm not seeing a "managed_folders" option. (Or I'm looking in the wrong place.)

Setting up managed_folders.jpg

I also see this note that I do not completely understand.

Creating Managed Folder Note.jpg

Is this suggesting the creation of a managed connection for accessing files from multiple projects?

Sergey · January 2021

@Carl7189

Just in case you need a step-by-step procedure:

1) Create a new FS connection (Administration->Connections):

Screenshot 2021-01-08 at 19.39.03.png

2) Set the path to audit directory:

Screenshot 2021-01-08 at 19.39.34.png

3) Create a new dataset based on this connection (make sure to read all the files):

Screenshot 2021-01-08 at 19.42.38.png

4) Set the name and save it.

From this point, you can use prepare or any other recipes you want.

tgb417 · January 2021

@sergeyd

When it comes to the other setting on the connection. If one want to make this production ready.

Connection Usage Patterns.jpg

I'm wondering about the setting for the USAGE PARAMS...

I'd think that I want to set "Allow write" to unchecked from the point of view of this connection. The same thing with "Allow managed datasets" and "Allow managed folders".

These are just text files. For safety do we need to set Max nb of activities to one. Or can multiple processes have the files open at the same time?

What do you think?

tgb417 · January 2021

@sergeyd
,

So, I'm still on this quest.

When using DSS 8.0.4, I'm getting some warnings when trying to create the dataset in a project using the audit connection. Here is what I'm seeing.

creating audit connection.jpg

What is this warning about? Is this anything to be concerned about?

Sergey · January 2021

@tgb417
we have a check for the path while creating a connection. If it's just '/' we will show that message. This is just a precaution in case you really have the path set to '/' (root FS) meaning you can break your Linux if you write/delete something from it.

If you have this connection created from DATA_DIR/run/audit/, this '/' just means the root for audit directory (not "root" of the entire FS) so no worries here.

tgb417 · January 2021

@Clément_Stenac

This is very cool.

Is there any documentation for the meaning of the columns of these logs?

Ignacio_Toledo · January 2021

Hi everyone. @sergeyd
, why to create a new connection, and not just create a new dataset from the Filesystem?

Since a few months I've been monitoring the resource usage by the users, and I did create a Filesystem dataset with this configuration:

(The path is valid only for my particular machine, where the DSS_DATA_DIR is at /data/dataiku/dss_data, of course)

This have been working without problem, so I wonder if you recommend creating a new connection because of some security concerns?

Thanks! Ignacio.

Ignacio_Toledo · January 2021

By the way, these are the parser settings on the format/preview section when creating the dataset from the audit files... just in case they are useful for others when parsing:

Sergey · January 2021

@Ignacio_Toledo

Sure, you can use the default root filesystem connection but assuming it gives access to entire FS you need to be sure that users know what they are doing instead of putting random stuff into the directories they shouldn't put. Of course, root-owned directories and files will remain untouched but I wouldn't go that route.

Ignacio_Toledo · January 2021

OK, I just found a problem that one needs to take into account, related to the schema detection after setting the parser options, because the audit file, as json, is semi-structured.

When creating one of the snapshots I attached, I reloaded the "preview data" and this happened:

This happened because the audit.log file used for the preview did have some messages (keys) that previously were not detected in a different audit file.

So, one needs to be careful when setting the schema... I guess.

Do you have any idea or recommendation for this kind of situation @sergeyd
?

P.D.: Maybe I should create a new thread for this topic? I feel like we are taking over (and going further than) what was asked in the first place by @Carl7189

tgb417 · January 2021

@Ignacio_Toledo

Up to now it seems like we are mostly on track with the original question.

However, if we are going to get off onto a discussion of maintaining schema and managing connections maybe we should start a new thread. (Maybe put a link in this thread to that new thread in case folks want to follow.)

Carl7189 · January 2021

Thanks @sergeyd
I understand that this would be a manual process and not automated, right?

tgb417 · January 2021

@Carl7189
,

As with all data operations in Dataiku DSS you could create a scenario to automate many aspects of processing the data. That information could then be presented as a dashboard for the consumption of insights.

Manuel · January 2021

Hi all,

If you are a paying customer wanting to monitor user and project activity, please contact your Customer Success Manager. They will be able to provide a solution for this need.

Thank you.

User Login History

Best Answer

Answers

Categories

Setup Info

Tags