CPU Utilization Projectwise

Solved!
sj0071992

 Hi Team,

 

Could you please let me know how to see how much CPU is utilized by a project? Also, how can I read it from the logs?

 

Thanks in Advance

19 Replies
fsergot
Dataiker

Hello @sj0071992 ,

This is a vast topic, especially considering the many ways CPU can be consumed by DSS projects (directly on the DSS server, through elastic AI & Kubernetes, in SQL processing, etc.).

We do have a built-in feature in DSS called Compute Resource Usage (CRU) that keeps track of the CPU consumed, with as much lineage as possible (when, which user, which project, which recipe, webapp or notebook, etc.).

Here is the documentation on this topic: https://doc.dataiku.com/dss/latest/operations/compute-resource-usage-reporting.html

I would encourage you to read that and activate it to see the data you have on your platform. Raw data is usually not enough, so you will need some processing within a DSS project to produce reports that suit your needs.

0 Kudos
sj0071992
Author

Hi,

 

Thanks for your response.

I have the logs. Could you please let me know the identifier that represents CPU utilization? The logs are huge and I am unable to figure it out.

 

I found something like

[2021/08/26-14:38:01.769] [JEK-QVBCnR3u-log-1114828] [INFO] [dku.jobs.kernel]  - 2.897: [Full GC (Metadata GC Threshold) [PSYoungGen: 20579K->0K(611840K)] [ParOldGen: 16K->17992K(774656K)] 20595K->17992K(1386496K), [Metaspace: 21082K->21082K(1069056K)], 0.1439040 secs] [Times: user=1.97 sys=0.31, real=0.14 secs]

Am I following the correct path?

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello @sj0071992 ,

This is not the type of log I was mentioning. However, I am wondering which kind of setup you have: do you have a commercial licence or a community one? Do you use Kubernetes (especially Spark on Kubernetes) or SQL engines, or are all your projects run using DSS's own engine?

 

0 Kudos
sj0071992
Author

Hi,

 

We are using a commercial licence, and regarding the engine we are using in-database (SQL) processing, as we have Snowflake data sources.

Thanks in Advance

0 Kudos
fsergot
Dataiker

So what you were looking at are the logs of the local DSS server. You can probably extract some data from there, but it will be partial and you won't be able to link it to projects/users.

CRU data requires some setup to be processed, although collection itself is always activated. To summarise:

  1. You need to configure your DSS instance to act as an Event Server (see https://doc.dataiku.com/dss/latest/operations/audit-trail/eventserver.html).
  2. You then need to configure the Event Server in Administration > Settings > Event server so that all the data it receives is stored on a file system (the field 'Connection name' needs to reference an existing connection of your DSS server).

Once that is done, you can create a project in DSS that reads from this connection, and you will see the actual resource consumption and the details of its context (time, project, user, type of resource, quantity of resource used, ...).

[Attachment: Screenshot 2021-09-07 at 17.39.21.png]
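To give an idea of the kind of report you can build on top of this, here is a minimal sketch of reading such a dataset from a Python recipe or notebook and summing CPU time per project. The dataset name and the projectKey column name are assumptions for illustration; only the cpuTotalMS column name is the one discussed later in this thread.

```python
# Minimal sketch (not an official DSS report): read the CRU events from a
# dataset created on the Event Server connection and sum CPU time per project.
# The dataset name and the projectKey column name are assumptions.
import dataiku

cru = dataiku.Dataset("cru_audit_events")  # hypothetical dataset name
df = cru.get_dataframe()

per_project = (
    df.groupby("clientEvent.computeResourceUsage.context.projectKey")  # assumed column name
      ["clientEvent.computeResourceUsage.localProcess.cpuTotalMS"]
      .sum()
      .sort_values(ascending=False)
)
print(per_project.head(10))
```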

0 Kudos
sj0071992
Author

Hi,

 

I have a question here: how do I read the start time and end time? They look like string values; how should I interpret them?

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello,

For this question (my previous post was on an older one): those are epoch timestamps, so you can convert them to date-times easily, for example in a Prepare recipe (using the 'Convert UNIX timestamp' processor).
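For reference, the same conversion outside of a Prepare recipe, as a minimal pandas sketch; it assumes the CRU events are already loaded into a DataFrame `df` and that the timestamps are epoch milliseconds:

```python
# Minimal sketch: convert the CRU start/end epoch timestamps to datetimes
# with pandas, assuming the values are epoch milliseconds.
import pandas as pd

start_col = "clientEvent.computeResourceUsage.startTime"
end_col = "clientEvent.computeResourceUsage.endTime"

df["start_dt"] = pd.to_datetime(df[start_col], unit="ms")
df["end_dt"] = pd.to_datetime(df[end_col], unit="ms")
df["duration_ms"] = df[end_col] - df[start_col]
```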

 

0 Kudos
sj0071992
Author

Hi,

 

Thanks for your response.

This is really helpful. But can we automate this? I want to use this data in my Splunk dashboard.

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello,

 

I am sorry I missed your question!

The answer depends on what you want to automate.

The original data of the CRU processing is stored on the file system connection defined in the DSS Event Server. You can source Splunk from there directly; this means, however, that you will need to do all the processing in Splunk.

Another option is to build a DSS Flow that suits your needs, add a final step to push the data to Splunk (using Splunk's own capabilities or the Dataiku Splunk plugin), and then create a scenario that builds the output dataset periodically. To do that properly, you will need to define the partitions properly (one per day, for example), run the scenario every day for the previous day, and ensure you push only new data to Splunk. A rough sketch of that last step is shown below.
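As an illustration of that final push step, here is a minimal sketch; it assumes the data to send is already in a pandas DataFrame `df`, that the timestamps are epoch milliseconds, and that Splunk is reached through its HTTP Event Collector (the URL and token below are placeholders, and this is not the Dataiku Splunk plugin).

```python
# Minimal sketch of "push only yesterday's CRU records to Splunk", e.g. as the
# last step of a daily scenario. The HEC URL/token are placeholders; the
# column name is the one used elsewhere in this thread.
import datetime as dt
import json

import pandas as pd
import requests

SPLUNK_HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # hypothetical
SPLUNK_HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # hypothetical

yesterday = (dt.datetime.utcnow() - dt.timedelta(days=1)).date()
start_ts = pd.to_datetime(df["clientEvent.computeResourceUsage.startTime"], unit="ms")
to_push = df[start_ts.dt.date == yesterday]

for record in to_push.to_dict(orient="records"):
    requests.post(
        SPLUNK_HEC_URL,
        headers={"Authorization": "Splunk " + SPLUNK_HEC_TOKEN},
        data=json.dumps({"event": record}, default=str),  # default=str handles non-JSON types
        timeout=10,
    )
```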

Hope this helps

0 Kudos
sj0071992
Author

Thanks for your response, but my question is something different.

In the screenshot you shared of the logs, we have start time and end time columns, and those look like some kind of string. Could you please let me know how to read those columns?

0 Kudos
fsergot
Dataiker

CRU data is part of the general audit log of DSS, and there are options to dispatch this log outside of DSS. You need to go through this documentation: https://doc.dataiku.com/dss/latest/operations/audit-trail/centralization-and-dispatch.html

In there, you'll see that you can dispatch DSS audit logs to 3 types of systems: the Event Server (that's what we discussed before), a Kafka cluster, or log4j. All these options offer ways to integrate with Splunk:

  1. The Event Server pushes audit logs as files, so you can have Splunk read those files (see the sketch below this post).
  2. There is a Kafka-to-Splunk connector.
  3. log4j is able to push its data to many outputs supported by Splunk.

Note that this adds some complexity to the setup, and Dataiku does not support the setup beyond the point of pushing the audit logs out.
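As an illustration of option 1, here is a minimal sketch of peeking at those files before configuring Splunk to monitor them. The directory layout and the newline-delimited JSON format are assumptions about the Event Server output, and the path is a placeholder.

```python
# Minimal sketch: scan the files written by the Event Server and print the
# CRU part of each event. Assumes newline-delimited JSON files under the
# connection path configured in the Event Server settings (placeholder below).
import glob
import json

AUDIT_DIR = "/data/dss_event_server/audit"  # hypothetical path

for path in glob.glob(AUDIT_DIR + "/**/*.log", recursive=True):
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            cru = event.get("clientEvent", {}).get("computeResourceUsage")
            if cru:
                local = cru.get("localProcess", {})
                print(path, local.get("pid"), local.get("cpuTotalMS"))
```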

0 Kudos
sj0071992
Author

Hi @fsergot ,

 

I was just going through the log files and observed that the CPU utilization is reported in milliseconds, while my requirement is to show the CPU utilization as a percentage.

That is, if a local process is running, how much CPU is it using out of the available CPU?

Could you please help me with that: which column should I consider, and what would the formula be?

 

For your reference, below are some CPU data columns I am getting in the log files:

 

CPU Utilization Columns

clientEvent.computeResourceUsage.localProcess.cpuCurrent

clientEvent.computeResourceUsage.localProcess.cpuUserTimeMS

clientEvent.computeResourceUsage.localProcess.cpuSystemTimeMS

clientEvent.computeResourceUsage.localProcess.cpuChildrenUserTimeMS

clientEvent.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS

clientEvent.computeResourceUsage.localProcess.cpuTotalMS

 

 

I am not able to figure out the relevant column and the formula to calculate the CPU utilized by a LOCAL_PROCESS as a percentage.

Need your help here!

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello,

Let's not forget that CRU is meant to list the resources consumed by each Dataiku-orchestrated process over its lifetime. What you want to achieve is to group process consumption over a given period of time, which is quite different.

Regarding LOCAL_PROCESS data, each line is the resource consumed by a process spawned by Dataiku, from its start to its end. In each line, you will find information such as:

  • the process id (clientEvent.computeResourceUsage.localProcess.pid)
  • when the process started (clientEvent.computeResourceUsage.startTime)
  • when the process ended (clientEvent.computeResourceUsage.endTime)

The details of the CPU consumption that you have in the fields you mention (clientEvent.computeResourceUsage.localProcess.cpu*) are extracted from the OS via /proc/[pid]/stat. You can find references on what those numbers are and how to use them on the Internet.
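For illustration, here is a minimal sketch of that extraction on Linux; the field positions and the clock-tick conversion are standard /proc behaviour, not something specific to DSS.

```python
# Minimal sketch: read the user/system CPU time of a process from
# /proc/<pid>/stat (fields 14 and 15, in clock ticks) and convert to ms,
# which is roughly where values like cpuUserTimeMS / cpuSystemTimeMS come from.
import os

def process_cpu_ms(pid):
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # Skip past the process name, which is wrapped in parentheses.
    rest = data.rsplit(")", 1)[1].split()
    ticks = os.sysconf("SC_CLK_TCK")
    utime_ms = int(rest[11]) * 1000.0 / ticks  # field 14: user time
    stime_ms = int(rest[12]) * 1000.0 / ticks  # field 15: system time
    return utime_ms, stime_ms

print(process_cpu_ms(os.getpid()))
```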

But again, this is the amount of CPU time that the process has used over its entire lifetime. There is no reliable way to sum that up across all processes for a given timeframe (like every second or minute). The goal of CRU is to build reports that highlight the most CPU-intensive processes over a long period of time and see if/how they can be optimized.

To get back to your first sentence, I would ask: what is the goal of tracking the CPU percentage? Do not forget that CRU is not made for IT monitoring; for that, standard monitoring tools are much better suited to checking whether your server is healthy.

0 Kudos
sj0071992
Author

Hi @fsergot ,

 

Let me give you a quick background on what we want to achieve from CRU.

We have Dataiku installed on a single-node EC2 instance, and we have a Splunk dashboard for tracking the CPU utilization of that EC2 instance; only Dataiku is installed on it.
What is happening now is that our CPU is utilized at almost 99%, due to which some processes are getting affected, and we are not able to figure out which Dataiku projects are consuming a huge amount of CPU.

So we decided to use CRU to track CPU utilization project-wise, so that we can dig into the projects that are consuming excessive CPU and do some optimization.

 

I hope this clarifies what I need.

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello @sj0071992 ,

CRU can hardly help you with that, I fear. Leveraging Splunk and more detailed monitoring on the machine might be more efficient (e.g. listing the processes with their names & ids when there is a spike in CPU consumption and crossing that with the DSS logs).

However, I am not sure this topic can easily be worked through in a Community thread. I would suggest discussing this with your CSM and, if need be, opening a support ticket for further investigation.

0 Kudos
sj0071992
Author

Hi @fsergot ,

 

Thanks for the clarity, I think this is helpful.

Just a question: the resource ids we have in the CRU logs, are those generated by the Dataiku server or by the Linux machine?
The reason I am asking is so that I can use this information to figure out the project details in the Splunk CPU monitoring dashboard.

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

In the CRU lines for local process resources, the column clientEvent.computeResourceUsage.localProcess.pid contains the Linux process id. This is the one you can match with any system-level monitoring tool.
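For example, here is a minimal sketch of crossing that pid with what the OS reports, assuming the psutil package is available and the process is still running (finished jobs can only be matched against historical monitoring data):

```python
# Minimal sketch: look up a pid taken from
# clientEvent.computeResourceUsage.localProcess.pid in the OS process table.
import psutil

pid = 12345  # hypothetical value read from a CRU record

if psutil.pid_exists(pid):
    proc = psutil.Process(pid)
    print(proc.pid, proc.name(), proc.cmdline(), proc.cpu_times())
else:
    print("Process %d has already exited; match the pid in historical monitoring data instead." % pid)
```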

0 Kudos
sj0071992
Author

Hi @fsergot ,

 

Could you please help me understand how to identify which activity is the most CPU intensive?

 

Is it that the job with the highest clientEvent.computeResourceUsage.localProcess.cpuTotalMS is the most CPU-intensive one, or do we have to consider some other column?

 

Thanks in Advance

0 Kudos
fsergot
Dataiker

Hello,

The way CRU works is that the various processes orchestrated by Dataiku send reports on a regular basis containing their resource consumption. Part of that is indeed the CPU, which has several levels of detail; the aggregated value is stored in cpuTotalMS.

Each job will usually generate 3 types of messages (indicated by the column clientEvent.msgType): 

  • compute-resource-usage-start -> sent at the start of the job (cpuTotalMS is usually very low or at 0 at this stage)
  • compute-resource-usage-update -> sent regularly as long as the job is running; cpuTotalMS increases as the job consumes resources
  • compute-resource-usage-complete -> sent when the job is finished; cpuTotalMS indicates the total CPU time consumed by the job over its lifetime

That being said, identifying the most "CPU intensive" job can have multiple meanings.

You can look at the jobs that consume the most cpuTotalMS (so only the lines with msgType = complete). But a higher value does not mean the job is overloading the machine's CPU at a given time; it also depends on the duration of the job.

To refine that, you can cross it with another metric, which is cpuTotalMS divided by the job lifetime (endTime - startTime), as an "average CPU consumption per ms".

On top of that, you can also use the metric cpuCurrent: it is computed as the average CPU consumption over the last measurement period (if the update is sent every minute: (cpuTotalMS(now) - cpuTotalMS(1 minute ago)) / 1000). It can serve as a (very rough) proxy for a CPU monitoring system.

Besides those 3 metrics, you can also go about this the other way around: if you have a precise time when your machine was overloaded, you can look at all the activities that were running at that given time and check cpuCurrent just for that period; it might help pinpoint the culprit. A sketch of these readings is shown below.
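To make those readings concrete, here is a minimal sketch, assuming the CRU events are flattened into a pandas DataFrame `df` with the column names used in this thread, and that startTime/endTime are epoch milliseconds:

```python
# Minimal sketch of the three readings above. Column names are the ones
# discussed in this thread; the ms unit of startTime/endTime is assumed.
import pandas as pd

CPU = "clientEvent.computeResourceUsage.localProcess.cpuTotalMS"
CURRENT = "clientEvent.computeResourceUsage.localProcess.cpuCurrent"
START = "clientEvent.computeResourceUsage.startTime"
END = "clientEvent.computeResourceUsage.endTime"

# 1. Total CPU time per job: keep only the final report of each job.
done = df[df["clientEvent.msgType"] == "compute-resource-usage-complete"].copy()
top_total = done.sort_values(CPU, ascending=False).head(20)

# 2. Average CPU consumption over the job lifetime, as a rough percentage of
#    one core: CPU ms consumed divided by wall-clock ms elapsed.
done["avg_cpu_pct_of_one_core"] = 100.0 * done[CPU] / (done[END] - done[START])

# 3. What was running at a known overload time: jobs whose lifetime spans the
#    spike; the update messages sent around that time carry the cpuCurrent view.
spike = pd.Timestamp("2021-08-26 14:38:00")  # hypothetical spike time
start_dt = pd.to_datetime(done[START], unit="ms")
end_dt = pd.to_datetime(done[END], unit="ms")
during_spike = done[(start_dt <= spike) & (end_dt >= spike)].sort_values(CURRENT, ascending=False)
```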

 

Again, what you are trying to achieve is not what CRU was made for, so this might not be 100% accurate nor easy to do. Analysing the machine itself when the issue occurs is also recommended: what processes are running, and what is effectively slowing the machine down (it might be CPU from DSS, but it might also be another process, or I/O, or RAM).

 

0 Kudos