Job Process Dies (Out of Memory)

jxh

I am trying to process a dataset with about 6 million rows and 134 columns. The data is stored as a compressed CSV uploaded directly to the Dataiku server. Since it doesn't make sense to read all the data into memory at once, I am using a Python recipe to process the data in chunks. However, this doesn't seem to be working: the job dies at the same point regardless of the chunk size I use.

The job process dies after the following:

[dku.format.csv] - CSV Emitted 2900000 lines from file, 134 columns - interned: 214379517 MEM: 60.203240867893435%

and throws a java.lang.OutOfMemoryError: Java heap space

Is there something I am doing incorrectly? Are there alternative ways to process a large dataset? I am new to Dataiku and was under the impression it was good at handling big data tasks.
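
In case it helps, the chunked pattern I am following is roughly the standard one from the Dataiku Python API. This is a simplified sketch: the dataset names and the per-chunk processing are placeholders, not my actual recipe.

    import dataiku

    # Placeholder dataset names -- replace with the actual input/output datasets
    input_ds = dataiku.Dataset("my_input")
    output_ds = dataiku.Dataset("my_output")

    CHUNK_ROWS = 100000  # rows per chunk; I have tried several values here

    writer = None
    try:
        # Stream the input dataset in chunks instead of loading it all at once
        for chunk_df in input_ds.iter_dataframes(chunksize=CHUNK_ROWS):
            processed_df = chunk_df  # per-chunk processing goes here
            if writer is None:
                # Set the output schema from the first processed chunk,
                # then open a streaming writer for the remaining chunks
                output_ds.write_schema_from_dataframe(processed_df)
                writer = output_ds.get_writer()
            writer.write_dataframe(processed_df)
    finally:
        if writer is not None:
            writer.close()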

Answers

  • tgb417

    @jxh

    Welcome to the Dataiku community. Given the right resources, a Dataiku environment can handle large datasets; I'm aware of organizations working with datasets of a billion records or more.

    To help the community help you, it would be great if you could share a bit more about the server you are working on. In particular:

    • Are you hosting the server yourself, or using Dataiku's cloud offering?
    • If you are hosting it yourself:
      • How much memory does your server have available?
      • What OS is your server running?
      • Is your server configured to use swap?

    Looking forward to hearing more about your infrastructure, and welcome to the community.

  • jxh

    Hi, it's a server managed by an organization. It looks like I figured out the issue, though. It seems Dataiku has trouble reading large compressed CSVs that are uploaded directly to it: the memory issue was resolved when I switched to reading the same dataset from an external database.

  • tgb417

    @jxh

    Glad you found your answer. I'm a strong proponent of using a SQL database with Dataiku DSS. This allows much of the data work to be done inside the database. I'll often sync my zipped CSV files to a database and then do my analysis and other modeling there, as in the sketch below.
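
    For example, once the data lives in a SQL dataset, an aggregation can be pushed down to the database from a Python recipe so that only the small aggregated result comes back as a dataframe. A minimal sketch, assuming a SQL-backed dataset; the dataset, table, and column names are placeholders:

        import dataiku
        from dataiku import SQLExecutor2

        # Placeholder names -- point these at your own SQL dataset and table
        ds = dataiku.Dataset("my_sql_dataset")
        executor = SQLExecutor2(dataset=ds)

        # The heavy lifting (full scan + aggregation) happens in the database;
        # only the aggregated rows are returned to Python.
        summary_df = executor.query_to_df("""
            SELECT some_column, COUNT(*) AS row_count
            FROM my_table
            GROUP BY some_column
        """)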
