I am trying to process a dataset with about 6 million rows and 134 columns. The data is stored as a compressed CSV and uploaded directly to the Dataiku server. Since it doesn't make sense to read all of the data into memory at once, I am using a Python recipe to process it in chunks. However, this doesn't seem to work: the job dies on the same line regardless of the chunk size.
The job process dies after the following:
[dku.format.csv] - CSV Emitted 2900000 lines from file, 134 columns - interned: 214379517 MEM: 60.203240867893435%
and throws a java.lang.OutOfMemoryError: Java heap space
Is there something I am doing incorrectly? Are there alternative ways to process a large dataset? I am new to Dataiku and was under the impression it was good at handling big data tasks.
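For readers following along, the chunked approach described in the question can be sketched roughly as below. This is a minimal, standard-library-only illustration and not the Dataiku recipe API: the function names `iter_chunks` and `count_rows_and_sum`, the gzip compression, and the UTF-8 encoding are all assumptions made for the example. Note also that the error above is raised by a Java process, so changing the chunk size on the Python side may not affect the component that is actually running out of heap.

```python
import csv
import gzip
from itertools import islice


def iter_chunks(path, chunk_size=100_000):
    """Yield lists of row dicts from a gzip-compressed CSV, chunk_size rows at a time.

    Hypothetical helper: assumes a gzip file with a header row, UTF-8 encoded.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls at most chunk_size rows without loading the rest.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows_and_sum(path, column, chunk_size=100_000):
    """Example per-chunk processing: count rows and sum one numeric column."""
    total_rows = 0
    total = 0.0
    for chunk in iter_chunks(path, chunk_size):
        total_rows += len(chunk)
        # Empty cells are skipped rather than treated as zero.
        total += sum(float(row[column]) for row in chunk if row[column] != "")
    return total_rows, total
```

Only one chunk is held in memory at a time, so peak memory in the Python process scales with `chunk_size` rather than with the size of the file.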
Welcome to the Dataiku community. Given the right resources, a Dataiku environment can handle large datasets. I'm aware of some organizations working with datasets of at least a billion records.
To help the community help you, it would be great if you could share with us a bit more about the server you are working from. In particular:
Looking forward to hearing more about your infrastructure, and welcome to the community.
Hi, it is a server managed by an organization. It looks like I figured out the issue, though. It seems that Dataiku has trouble reading large compressed CSVs that are uploaded directly to it. The memory issue appears to have been resolved when I switched to reading the same dataset from an external database.
Glad you found your answer. I'm a strong proponent of using a SQL database with Dataiku DSS. This allows much of the data work to be done inside the database. I'll often sync my zipped CSV files to a database and then do my analysis and other modeling.
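As a rough illustration of the "sync the CSV to a database, then work in SQL" idea, here is a minimal sketch using only the Python standard library. It is not how Dataiku's own sync works: SQLite stands in for whatever SQL database is available, and the function name `load_csv_gz_to_sqlite`, the gzip compression, and the untyped columns are assumptions made for the example.

```python
import csv
import gzip
import sqlite3
from itertools import islice


def load_csv_gz_to_sqlite(csv_path, conn, table, batch_size=50_000):
    """Stream a gzip-compressed CSV into a SQLite table in batches.

    Hypothetical helper: the table gets one untyped column per CSV header
    field, and every value is stored as text. Returns the number of rows
    inserted.
    """
    with gzip.open(csv_path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
        insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        inserted = 0
        while True:
            # Only batch_size rows are held in memory at any time.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany(insert_sql, batch)
            conn.commit()
            inserted += len(batch)
    return inserted
```

Once the rows are loaded, aggregation can be pushed into the database so that only the small result set comes back to Python. The file name, table name, and column names here are placeholders:

```python
conn = sqlite3.connect("analysis.db")
load_csv_gz_to_sqlite("data.csv.gz", conn, "sales")
query = 'SELECT region, SUM(CAST(amount AS REAL)) FROM sales GROUP BY region'
for region, total in conn.execute(query):
    print(region, total)
```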