
Job Process Dies (Out of Memory)

jxh
Level 2

I am trying to process a dataset with about 6 million rows and 134 columns. The data is stored as a compressed CSV and uploaded directly to the Dataiku server. Since it doesn't make sense to read all of the data into memory at once, I am using a Python recipe to process it in chunks. However, this doesn't seem to be working: the job dies at the same point regardless of the chunk size.
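For readers following along, the chunked approach described above looks roughly like the sketch below. The actual recipe code was not posted, so this is only an illustration that uses the Python standard library rather than the Dataiku API, and it assumes a gzip-compressed, UTF-8 CSV with a header row. The names `iter_chunks` and `count_rows` and the default chunk size are hypothetical.

```python
import csv
import gzip
from itertools import islice


def iter_chunks(path, chunk_size=100_000):
    """Yield lists of row dicts from a gzip-compressed CSV, chunk_size rows at a time.

    Assumption: the file is gzip-compressed, UTF-8 encoded, and has a header row.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls at most chunk_size rows, so only one chunk is in memory.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows(path, chunk_size=100_000):
    """Example of per-chunk work: count rows without loading the whole file."""
    total = 0
    for chunk in iter_chunks(path, chunk_size):
        total += len(chunk)
    return total
```

Chunking like this only bounds memory on the Python side. The error reported in this post is a Java heap error, so it is raised outside the Python process, which would be consistent with the chunk size making no difference.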

The job process dies after the following:

[dku.format.csv] - CSV Emitted 2900000 lines from file, 134 columns - interned: 214379517 MEM: 60.203240867893435%

and throws a java.lang.OutOfMemoryError: Java heap space

Is there something I am doing incorrectly? Are there alternative ways to process a large dataset? I am new to Dataiku and was under the impression it was good at handling big data tasks.

3 Replies
tgb417
Neuron

@jxh 

Welcome to the Dataiku community. Given adequate resources, a Dataiku environment can handle large datasets. I’m aware of some organizations working with datasets of at least a billion records.

To help the community help you, it would be great if you could share with us a bit more about the server you are working from.  In particular:

  • Are you hosting the server yourself, or using Dataiku's cloud offering?
  • If you are hosting it yourself:
    • How much memory does your server have available?
    • What OS is your server running?
    • Is your server configured to use swap?

Looking forward to hearing more about your infrastructure, and welcome to the community.  

--Tom
jxh
Level 2
Author

Hi, the server is managed by an organization. It looks like I figured out the issue, though. It seems as though Dataiku has trouble reading large compressed CSVs that are uploaded directly to Dataiku. The memory issue was resolved when I switched to reading from an external database that holds the same dataset.

tgb417
Neuron

@jxh 

Glad you found your answer. I’m a strong proponent of using a SQL database with Dataiku DSS. This allows much of the data work to be done inside the database. I’ll often sync my zipped CSV files to a database and then do my analysis and modeling from there.
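As a rough sketch of that idea outside of DSS, the snippet below streams a compressed CSV into a database table in batches and then lets the database do the aggregation. It is not how DSS performs a sync; it uses SQLite from the Python standard library purely as a stand-in for whatever SQL database is available. The function names, table name, and batch size are hypothetical, and the file is assumed to be gzip-compressed with a header row.

```python
import csv
import gzip
import sqlite3
from itertools import islice


def load_csv_to_sqlite(csv_path, conn, table, batch_size=50_000):
    """Stream a gzip-compressed CSV into a SQLite table in batches.

    Assumption: header names are trusted, since they are interpolated
    into the SQL as quoted identifiers. Returns the number of rows loaded.
    """
    with gzip.open(csv_path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        loaded = 0
        while True:
            # Only one batch of rows is held in memory at a time.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany(insert_sql, batch)
            loaded += len(batch)
    conn.commit()
    return loaded


def rows_per_group(conn, table, column):
    """Run the aggregation inside the database instead of in Python."""
    query = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY "{column}"'
    )
    return conn.execute(query).fetchall()
```

Once the rows are in a table, only the aggregated result comes back to the client, so the full dataset never has to fit in the memory of the process running the analysis.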

 

--Tom