
CPU Utilization Projectwise

Solved!
sj0071992
Level 3

 Hi Team,

 

Could you please let me know how to see how much CPU is used by a project? Also, how can I read it from the logs?

 

Thanks in Advance

fsergot
Dataiker

Hello @sj0071992 ,

This is a vast topic, especially considering the many ways CPU can be consumed by DSS projects (directly on the DSS server, through elastic AI and Kubernetes, in SQL processing, ...).

We do have a built-in feature in DSS called Compute Resource Usage (CRU) that keeps track of CPU consumed with as much lineage as possible (when, which user, which project, which recipe, webapp or notebook...).

Here is the documentation on this topic -> https://doc.dataiku.com/dss/latest/operations/compute-resource-usage-reporting.html

I would encourage you to read that and activate it to see the data you have on your platform. Raw data is usually not enough, so you will need some processing within a DSS project to produce reports that suit your needs.

sj0071992
Level 3
Author

Hi,

 

Thanks for your response.

I have the logs. Could you please let me know which identifier represents CPU utilization? The logs are huge and I am unable to figure it out.

 

I found something like

[2021/08/26-14:38:01.769] [JEK-QVBCnR3u-log-1114828] [INFO] [dku.jobs.kernel]  - 2.897: [Full GC (Metadata GC Threshold) [PSYoungGen: 20579K->0K(611840K)] [ParOldGen: 16K->17992K(774656K)] 20595K->17992K(1386496K), [Metaspace: 21082K->21082K(1069056K)], 0.1439040 secs] [Times: user=1.97 sys=0.31, real=0.14 secs]

Am I on the right track?

 

Thanks in Advance

fsergot
Dataiker

Hello @sj0071992 ,

This is not the type of log I was mentioning. However, I am wondering which kind of setup you have: do you have a commercial licence or a community one? Do you use Kubernetes (especially Spark on Kubernetes) or SQL engines, or are all your projects run with DSS's own engine?

 

sj0071992
Level 3
Author

Hi,

 

We are using a commercial licence. As for the engine, we use In-Database, as our data sources are in Snowflake.

Thanks in Advance

fsergot
Dataiker

So what you were looking at are the logs of the local DSS server. You can probably extract some data from there, but it will be partial and you won't be able to link it to projects or users.

CRU is always activated, but processing its data requires some setup. To summarise:

  1. You need to configure your DSS instance to act as an Event server (see https://doc.dataiku.com/dss/latest/operations/audit-trail/eventserver.html)
  2. You need to configure the Event Server in Administration > Settings > Event server so that all retrieved data is stored in a file system (the 'Connection name' field must refer to an existing connection of your DSS server).

Once that is done, you can create a project in DSS and read from this connection. You will see the actual resource consumption along with its context (time, project, user, type of resource, quantity of resource used...).

Screenshot 2021-09-07 at 17.39.21.png
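Once the CRU records are readable, a per-project report is mostly a group-and-sum. The following is only a minimal sketch in plain Python, not an official Dataiku recipe. It assumes each record is a flat dictionary keyed by column name. The CPU column `clientEvent.computeResourceUsage.localProcess.cpuTotalMS` is one of the columns quoted later in this thread, but the project column name used here is an assumption and should be checked against your own data.

```python
from collections import defaultdict

# Column quoted later in this thread.
CPU_COL = "clientEvent.computeResourceUsage.localProcess.cpuTotalMS"
# ASSUMPTION: the name of the project column is a guess, verify it in your data.
PROJECT_COL = "clientEvent.computeResourceUsage.context.projectKey"


def cpu_ms_by_project(rows, project_col=PROJECT_COL, cpu_col=CPU_COL):
    """Sum CPU milliseconds per project, largest consumer first."""
    totals = defaultdict(int)
    for row in rows:
        cpu = row.get(cpu_col)
        if cpu in (None, ""):
            continue  # not a local-process line, or no CPU figure recorded
        project = row.get(project_col) or "UNKNOWN"
        totals[project] += int(cpu)
    # Sort so that the heaviest projects come first in the report.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {PROJECT_COL: "PROJ_A", CPU_COL: "100"},
    {PROJECT_COL: "PROJ_B", CPU_COL: "500"},
    {PROJECT_COL: "PROJ_A", CPU_COL: "200"},
    {PROJECT_COL: "PROJ_C"},
]
report = cpu_ms_by_project(sample_rows)
```

Inside DSS the same aggregation can be done with a Group recipe; the Python version is only meant to make the logic explicit.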

sj0071992
Level 3
Author

Hi,

 

I have a question here: how do I read the start time and end time? They look like string values, so how should I interpret them?

 

Thanks in Advance

fsergot
Dataiker

Hello,

For this question (my previous post was on an older one): those are epoch (UNIX) timestamps, so you can easily convert them to dates in a Prepare recipe, for example (using the convert UNIX timestamp processor).
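Outside of a Prepare recipe, the same conversion takes a few lines of Python. This is a minimal sketch. It assumes the values are epoch milliseconds stored as strings; if your values turn out to be in seconds, pass `unit="s"`. The function name and the sample value are illustrative only.

```python
from datetime import datetime, timezone


def epoch_to_utc(value, unit="ms"):
    """Convert an epoch timestamp (string or number) to a UTC datetime.

    ASSUMPTION: the default unit is milliseconds; check a few values
    against a known date before relying on it.
    """
    seconds = int(value) / 1000 if unit == "ms" else int(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Illustrative value, not taken from a real CRU record.
print(epoch_to_utc("1631029161000").isoformat())  # 2021-09-07T15:39:21+00:00
```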

 

sj0071992
Level 3
Author

Hi,

 

Thanks for your response.

This is really helpful. But can we automate this? I want to use this data in my Splunk dashboard.

 

Thanks in Advance

fsergot
Dataiker

Hello,

 

I am sorry I missed your question!

The answer depends on what you want to automate.

The original data of the CRU processing is stored in the file system defined in the DSS Event Server. You can feed Splunk from there directly. This means, however, that you will need to do all the processing in Splunk.

Another option is to build a DSS flow that suits your needs, add a final step to push the data to Splunk (using Splunk's own capabilities or the Dataiku Splunk plugin) and then create a scenario that builds the output dataset periodically. To do that properly, you will need to define the partitions properly (one per day, for example), run the scenario every day for the day before, and ensure you push only new data to Splunk.
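To illustrate the "every day for the day before" idea, here is a minimal sketch that computes the previous day's window and keeps only the records that ended inside it. It is plain Python, not DSS partitioning, and it assumes `endTime` holds epoch milliseconds in UTC. The helper names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

END_COL = "clientEvent.computeResourceUsage.endTime"


def previous_day_window_ms(today):
    """Return (start, end) of the day before `today`, as epoch ms in UTC."""
    midnight = datetime(today.year, today.month, today.day, tzinfo=timezone.utc)
    start = midnight - timedelta(days=1)
    return int(start.timestamp() * 1000), int(midnight.timestamp() * 1000)


def rows_for_previous_day(rows, today, end_col=END_COL):
    """Keep rows whose end time falls in [start, end) of the previous day."""
    start_ms, end_ms = previous_day_window_ms(today)
    return [r for r in rows if start_ms <= int(r[end_col]) < end_ms]


# Hypothetical records, for illustration only.
rows = [
    {END_COL: "1631029161000"},  # 2021-09-07 15:39:21 UTC
    {END_COL: "1631100000000"},  # 2021-09-08, outside the window
]
selected = rows_for_previous_day(rows, date(2021, 9, 8))
```

Using a half-open window means a record is never pushed twice when the job runs on consecutive days.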

Hope this helps

sj0071992
Level 3
Author

Thanks for your response, but my question is different.

In the screenshot of the logs that you shared, there are start time and end time columns, and they contain some kind of string. Could you please let me know how to read those columns?

fsergot
Dataiker

CRU data is part of the general audit log of DSS, and there are options to dispatch this log outside of DSS. You need to go through this documentation: https://doc.dataiku.com/dss/latest/operations/audit-trail/centralization-and-dispatch.html

In there, you'll see that you can dispatch the DSS audit logs to 3 types of systems: the Event Server (that's what we discussed before), a Kafka cluster, or log4j. All these options offer ways to integrate with Splunk:

  1. The Event Server pushes audit logs as files, so you can have Splunk read those files.
  2. There is a Kafka to Splunk connector.
  3. log4j is able to push its data to many outputs supported by Splunk.

Note that this adds some complexity to the setup, and Dataiku does not support the setup beyond the point of pushing the audit log out.

sj0071992
Level 3
Author

Hi @fsergot ,

 

I was just going through the log files and observed that the CPU usage is reported in milliseconds, while my requirement is to show the CPU utilization as a percentage.

That is, if a local process is running, how much CPU does it use out of the available CPU?

Could you please help me with that: which column should I consider, and what would the formula be?

 

For your reference, below are some of the CPU data columns I am getting in the log files:

 

CPU Utilization Columns

clientEvent.computeResourceUsage.localProcess.cpuCurrent

clientEvent.computeResourceUsage.localProcess.cpuUserTimeMS

clientEvent.computeResourceUsage.localProcess.cpuSystemTimeMS

clientEvent.computeResourceUsage.localProcess.cpuChildrenUserTimeMS

clientEvent.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS

clientEvent.computeResourceUsage.localProcess.cpuTotalMS

 

 

I am not able to figure out the relevant column and the formula to calculate the CPU utilized by a LOCAL_PROCESS in percentage. 

Need your help here!

 

Thanks in Advance

fsergot
Dataiker

Hello,

Let's not forget that CRU is meant to list what resources are consumed by each Dataiku-orchestrated process over its lifetime. What you want to achieve is to group process consumption over a given period of time, which is completely different.

Regarding LOCAL data, each line is the resources consumed by a process spawned by Dataiku, from its start to its end. In each line, you will find information such as:

  • the process id (clientEvent.computeResourceUsage.localProcess.pid)
  • when the process started (clientEvent.computeResourceUsage.startTime)
  • when the process ends (clientEvent.computeResourceUsage.endTime)

The details of the CPU consumption in the fields you mention (clientEvent.computeResourceUsage.localProcess.cpu*) are extracted from the OS, from /proc/[pid]/stat. You can find references on what those numbers mean and how to use them on the Internet.
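For reference, the relevant fields of /proc/[pid]/stat are documented in the Linux proc(5) manual page: `utime` (field 14), `stime` (15), `cutime` (16) and `cstime` (17), all expressed in clock ticks. The sketch below parses one such line and converts the ticks to milliseconds. It is not DSS code, and the idea that these four values correspond to cpuUserTimeMS, cpuSystemTimeMS, cpuChildrenUserTimeMS and cpuChildrenSystemTimeMS is an assumption based on the column names.

```python
import os


def cpu_times_ms(stat_line, ticks_per_second=None):
    """Extract utime, stime, cutime and cstime from a /proc/[pid]/stat line, in ms."""
    if ticks_per_second is None:
        # Clock ticks per second on this machine (commonly 100 on Linux).
        ticks_per_second = os.sysconf("SC_CLK_TCK")
    # The command name (field 2) may contain spaces or parentheses,
    # so split on the LAST ")" and count fields from field 3 onwards.
    rest = stat_line.rsplit(")", 1)[1].split()
    names = ("utime", "stime", "cutime", "cstime")
    # Fields 14..17 of the file are indexes 11..14 of `rest` (which starts at field 3).
    ticks = (int(v) for v in rest[11:15])
    return {name: t * 1000 // ticks_per_second for name, t in zip(names, ticks)}


# Hypothetical stat line, truncated after the fields used here.
sample = "1234 (my proc) S 1 1234 1234 0 -1 4194304 500 0 0 0 197 31 5 2 20 0"
times = cpu_times_ms(sample, ticks_per_second=100)
```

To try it on a live process, read the file with `open("/proc/self/stat").read()` and pass the result to `cpu_times_ms`.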

But again, this is the amount of CPU time that the process has used over its entire lifetime. There is no reliable way to sum that up across all processes for a given timeframe (like every second or minute). The goal is to build reports that highlight the most CPU-intensive processes over a long period of time and see if and how they can be optimized.

To get back to your first sentence, I would ask what the goal of tracking the CPU percentage is. Do not forget that CRU is not made for IT monitoring. For that, standard monitoring tools are far better suited to checking that your server is healthy.
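What can be derived from a single CRU line is an average over the process lifetime: CPU time divided by wall-clock duration. The sketch below shows that arithmetic. It assumes that startTime and endTime are epoch milliseconds and that cpuTotalMS covers the whole lifetime of the process; the core count is a value you supply yourself. This is a lifetime average per process, not the point-in-time utilization that a monitoring tool reports.

```python
def average_cores_used(cpu_total_ms, start_ms, end_ms):
    """Average number of CPU cores kept busy by one process over its lifetime."""
    wall_ms = int(end_ms) - int(start_ms)
    if wall_ms <= 0:
        return None  # zero-length or inconsistent record, nothing to compute
    return int(cpu_total_ms) / wall_ms


def percent_of_machine(avg_cores, machine_cores):
    """Express the lifetime average as a share of the machine's total cores."""
    return 100.0 * avg_cores / machine_cores


# Illustrative numbers: 90 s of CPU time over a 60 s lifetime on an 8-core machine.
avg = average_cores_used(cpu_total_ms=90_000, start_ms=0, end_ms=60_000)
share = percent_of_machine(avg, machine_cores=8)
print(avg, share)  # 1.5 18.75
```

A long-running process with a short burst of activity will show a low average here, which is one reason these figures cannot replace system-level monitoring.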

sj0071992
Level 3
Author

Hi @fsergot ,

 

Let me give you a quick background on what we want to achieve from CRU.

We have Dataiku installed on a single-node EC2 instance, with nothing else installed on it, and we have a Splunk dashboard tracking the CPU utilization of that instance.
Our CPU utilization is now almost 99%, which is affecting some processes, and we are not able to figure out which Dataiku projects are consuming a huge amount of CPU.

So we decided to use CRU to track CPU utilization project-wise, so that we can drill down to the projects that consume excessive CPU and optimize them.

 

I hope this clarifies what I need.

Thanks in Advance

fsergot
Dataiker

Hello @sj0071992 ,

I fear CRU can hardly help you with that. Leveraging Splunk and more detailed monitoring on the machine might be more efficient (e.g. listing the processes with their name and id when there is a spike in CPU consumption and cross-referencing that with the DSS logs).

However, I am not sure this topic can easily be worked on in a Community thread. I would suggest discussing this with your CSM and, if need be, opening a support ticket for further investigation.

sj0071992
Level 3
Author

Hi @fsergot ,

 

Thanks for the clarity, I think this is helpful.

Just a question: are the resource ids in the CRU logs generated by the Dataiku server or by the Linux machine?
I am asking so that I can use this information to identify the project in our Splunk CPU monitoring dashboard.

 

Thanks in Advance

fsergot
Dataiker

In the CRU lines for local process resources, the column clientEvent.computeResourceUsage.localProcess.pid contains the Linux process id. This is the one that you can match with any system-level monitoring tool.
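A minimal sketch of such a match: given a pid and a timestamp observed by a monitoring tool, find the CRU line whose pid is the same and whose lifetime contains that timestamp. The time check matters because Linux reuses process ids. The pid, startTime and endTime column names come from this thread. The sketch assumes the times are epoch milliseconds, and the project column name is an assumption.

```python
PID_COL = "clientEvent.computeResourceUsage.localProcess.pid"
START_COL = "clientEvent.computeResourceUsage.startTime"
END_COL = "clientEvent.computeResourceUsage.endTime"
# ASSUMPTION: hypothetical project column name, verify it in your data.
PROJECT_COL = "clientEvent.computeResourceUsage.context.projectKey"


def match_sample_to_cru(cru_rows, pid, sample_time_ms):
    """Return the CRU row for `pid` that was alive at `sample_time_ms`, or None."""
    for row in cru_rows:
        same_pid = int(row[PID_COL]) == int(pid)
        alive = int(row[START_COL]) <= int(sample_time_ms) <= int(row[END_COL])
        if same_pid and alive:
            return row
    return None


# Hypothetical CRU rows: the same pid reused by two different processes.
cru_rows = [
    {PID_COL: "4242", START_COL: "1000", END_COL: "2000", PROJECT_COL: "PROJ_A"},
    {PID_COL: "4242", START_COL: "5000", END_COL: "9000", PROJECT_COL: "PROJ_B"},
]
hit = match_sample_to_cru(cru_rows, pid=4242, sample_time_ms=6000)
```

In Splunk the equivalent would be a lookup or join on the pid field with a time-range condition.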
