Large Data Processing is very slow...

Khalid538 · September 2022

One of our data scientists is running jobs on some huge datasets. She complained earlier that they were very slow, so I increased backend.xmx to 12g. The jobs still run, but they take a long time to process.

In the job log I am seeing the messages below. Do they indicate an error?

5406.922: [GC (Allocation Failure) [PSYoungGen: 670569K->12608K(676864K)] 2011527K->1367487K(2075136K), 0.0228449 secs] [Times: user=0.41 sys=0.15, real=0.02 secs] 
5408.211: [GC (Allocation Failure) [PSYoungGen: 669504K->13950K(676864K)] 2024383K->1381110K(2075136K), 0.0185378 secs] [Times: user=0.40 sys=0.04, real=0.02 secs] 
[2022/09/20-14:49:28.786] [Thread-28] [INFO] [dku.format.csv]  - CSV Emitted 252200000 lines from file, 29 columns - interned: 252256275 MEM: 100.0%.

Operating system used: CentOS

Emma · September 2022

Hey @Khalid538,

The "[GC (Allocation Failure)" lines are garbage collection tasks... If there are many of them, it could indicate that the Dataiku instance is under heavy memory pressure, and the MEM: 100.0% does indicate an error.
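To make the GC lines from the original post easier to read, here is a minimal Python sketch that pulls out the heap usage and pause time from each line. The regular expression and the function name `parse_gc_line` are my own and only cover the exact line shape quoted above, so treat it as an illustration rather than a general GC log parser.

```python
import re

# Matches only the young-generation GC line shape quoted in the post above.
GC_LINE = re.compile(
    r"(?P<uptime>\d+\.\d+): \[GC \((?P<cause>[^)]+)\) "
    r"\[PSYoungGen: \d+K->\d+K\(\d+K\)\] "
    r"(?P<heap_before>\d+)K->(?P<heap_after>\d+)K\((?P<heap_cap>\d+)K\), "
    r"(?P<pause>\d+\.\d+) secs\]"
)


def parse_gc_line(line):
    """Return a dict of heap figures for one GC line, or None if it does not match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    heap_after = int(match.group("heap_after"))
    heap_cap = int(match.group("heap_cap"))
    return {
        "uptime_s": float(match.group("uptime")),
        "cause": match.group("cause"),
        "heap_after_kb": heap_after,
        "heap_capacity_kb": heap_cap,
        "heap_used_pct": round(100.0 * heap_after / heap_cap, 1),
        "pause_s": float(match.group("pause")),
    }


# The two GC lines copied from the job log in the original post.
sample = [
    "5406.922: [GC (Allocation Failure) [PSYoungGen: 670569K->12608K(676864K)] "
    "2011527K->1367487K(2075136K), 0.0228449 secs] "
    "[Times: user=0.41 sys=0.15, real=0.02 secs]",
    "5408.211: [GC (Allocation Failure) [PSYoungGen: 669504K->13950K(676864K)] "
    "2024383K->1381110K(2075136K), 0.0185378 secs] "
    "[Times: user=0.40 sys=0.04, real=0.02 secs]",
]

records = [parse_gc_line(line) for line in sample]
for rec in records:
    print(rec)
```

Two things stand out in the parsed values. Each pause is around 0.02 seconds, so a single collection is cheap; it is the frequency (roughly one per second here) and the heap staying mostly full afterwards that point to memory pressure. Also, the total heap capacity in these lines is 2075136K, about 2 GB rather than 12 GB, which may mean the process writing them is not the one the 12g setting applies to. That is only an inference from two log lines, so the job diagnosis is still the right thing to check.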

I recommend that you open a ticket with Dataiku support (support@dataiku.com) with logs and job diags attached, please!

To get a Job Diag:

From the job page, click on Actions > Download job diagnosis.
If the resulting file is too large for mail (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file.

Emma

Khalid538 · September 2022

Thanks Emma for the response.

I did open a support ticket with Dataiku, and the engineer suggested the following, but I am still not clear on how to resolve this issue.

Response:

Hi,

Your DS is pulling a massive amount of data (706M rows, 260 GB uncompressed), which is impossible to process in a Group recipe using the DSS engine.

Reading the data alone took around 4 hours. The rest of the time the job was doing nothing, as it had crashed because that data could not fit into memory.

The DS needs to rethink/redesign the flow for this Group recipe, for example by syncing the data into a relational DB and using the In-Database SQL engine:

https://doc.dataiku.com/dss/latest/preparation/engines.html#in-database-sql

Sergey,

Technical support engineer, Dataiku

Can anyone help me out?

Sergey · September 2022

@Khalid538

As mentioned in the support ticket you raised, pulling that amount of data into the DSS engine will cause OOM errors. Sync the data into a relational DB first and then run the Group recipe with the In-database engine (so the computation happens in the database). Alternatively, use the Spark engine to offload the computation.

Side note: when the DSS engine is used for Group recipes on local or cloud FS connections, all the data is copied into an internal H2 implementation that uses local disk space, so be careful with this, as you can easily run out of space.

The recommendation remains the same: In-database or Spark engine.
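To illustrate what "the computation will happen in the database" means, here is a minimal Python sketch using the standard-library `sqlite3` module as a stand-in for a real relational DB. It is not Dataiku code: the table `events`, its columns, and the function `grouped_totals` are hypothetical names chosen for the example. The point is that the GROUP BY runs inside the database and only the small aggregated result comes back to the client.

```python
import sqlite3


def grouped_totals(conn):
    """Run the aggregation inside the database and return only the grouped rows."""
    query = (
        "SELECT customer, COUNT(*) AS n_rows, SUM(amount) AS total "
        "FROM events GROUP BY customer ORDER BY customer"
    )
    return conn.execute(query).fetchall()


# In-memory SQLite database standing in for the synced relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10.0), ("b", 5.0), ("a", 2.5)],
)

rows = grouped_totals(conn)
print(rows)
conn.close()
```

With an in-database engine the client never holds the raw rows, so its memory use depends on the number of groups rather than on the 706M input rows. In DSS this is selected through the recipe's engine setting rather than written by hand.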

Sergey · September 2022

To add to my previous reply: increasing backend.xmx will not help either, as the crash is happening in the JEK (Job Execution Kernel), not in the backend.
