Job Process Dies (Out of Memory)

jxh · December 2021

I am trying to process a dataset with about 6 million rows and 134 columns. The data is stored as a compressed CSV and uploaded directly to the Dataiku server. Since it doesn't make sense to read all of the data into memory, I am using a Python recipe to process it in chunks. However, this doesn't seem to be working: the job dies on the same line regardless of the chunk size.

The job process dies after the following:

[dku.format.csv] - CSV Emitted 2900000 lines from file, 134 columns - interned: 214379517 MEM: 60.203240867893435%

and throws a java.lang.OutOfMemoryError: Java heap space

Is there something I am doing incorrectly? Are there alternative ways to process a large dataset? I am new to Dataiku and was under the impression it was good at handling big data tasks.
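
For reference, the chunked approach described above can be sketched roughly as follows. This is plain standard-library Python, not the Dataiku API, and the names `iter_chunks` and `count_rows` as well as the default chunk size are hypothetical, chosen only for illustration. It assumes a gzip-compressed, UTF-8 CSV with a header row. Note that it only bounds memory on the Python side; the `java.lang.OutOfMemoryError` in the log is raised by a Java process, so changing the Python chunk size may not affect it.

```python
import csv
import gzip
from itertools import islice


def iter_chunks(path, chunk_size=100_000):
    """Yield (header, rows) pairs from a gzip-compressed CSV.

    At most `chunk_size` data rows are held in memory at a time.
    Assumes the first line of the file is a header row.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:  # empty file: nothing to yield
            return
        while True:
            # islice pulls the next chunk_size rows without reading ahead
            rows = list(islice(reader, chunk_size))
            if not rows:
                break
            yield header, rows


def count_rows(path, chunk_size=100_000):
    """Example per-chunk computation: count data rows chunk by chunk."""
    return sum(len(rows) for _, rows in iter_chunks(path, chunk_size))
```

Any per-chunk work (filtering, aggregating, writing out) would replace the `len(rows)` call in `count_rows`.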

tgb417 · December 2021

@jxh

Welcome to the Dataiku community. Given the right resources, a Dataiku environment can handle large datasets. I’m aware of some organizations working with datasets of at least a billion records.

To help the community help you, it would be great if you could share with us a bit more about the server you are working from. In particular:

Are you hosting the server yourself, or using the Dataiku cloud offering?
If you are hosting it yourself:
- How much memory does your server have available?
- What OS is your server running?
- Is your server configured to use swap?

Looking forward to hearing more about your infrastructure, and welcome to the community.

jxh · December 2021

Hi, it is a server that is managed by an organization. It looks like I figured out the issue, though. Dataiku seems to have trouble reading large compressed CSVs that are uploaded directly to it. The memory issue appears to have been resolved when I switched to reading the same dataset from an external database.

tgb417 · December 2021

@jxh

Glad you found your answer. I’m a strong proponent of using a SQL database with Dataiku DSS. This allows much of the data work to be done inside the database. I’ll often sync my zipped CSV files to a database and then do my analysis and modeling there.
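
As a rough illustration of doing the data work inside the database, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for whatever SQL database is actually available. The table name `records`, its two columns, and the helper names `load_in_batches` and `totals_by_category` are assumptions made up for this example and are not part of Dataiku.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size=50_000):
    """Insert (category, amount) rows in bounded batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (category TEXT, amount REAL)"
    )
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO records VALUES (?, ?)", batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO records VALUES (?, ?)", batch)
    conn.commit()


def totals_by_category(conn):
    """Let the database do the aggregation; only the summary is returned."""
    cursor = conn.execute(
        "SELECT category, SUM(amount) FROM records "
        "GROUP BY category ORDER BY category"
    )
    return cursor.fetchall()
```

Once the rows are loaded, only the aggregated result comes back to Python, so the full dataset never has to fit in the recipe's memory.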
