Lots of files owned by Dataiku under /tmp

gnaldi62

We've got filesystem-full issues due to large files created by Dataiku under the /tmp directory (the machine runs Linux). How is this possible? Is there any process which writes under that directory? If so, how can we prevent it?

Thanks. Rgds.

Giuseppe

Best Answer

  • Ignacio_Toledo
    Answer ✓

    Hi @gnaldi62, have you checked the documentation under https://doc.dataiku.com/dss/latest/operations/disk-usage.html?

    I've been administering a DSS node that we use for sandboxing and prototyping, and all of the DSS temporary files are under $DATA_DIR/tmp, as described there.

    However, there are other processes that DSS can start as the dataiku user which will write to the root /tmp directory. For example, I can see some files and directories under /tmp coming from Python processes, but they are no older than two weeks and only take up 588 KB.

    Anyhow, I run the cleanup macros once a day via a scenario, removing anything older than 15 or 30 days, and that has kept our filesystem free of huge temporary and log files.

    Hope this helps!

Answers

  • gnaldi62

    Hi,

    Almost all the temp files are put into $DATA_DIR/tmp, but as you said, there are a few files generated directly under /tmp. Unfortunately some of them are huge (about 1 GB) and fill the root filesystem. Is there a way to avoid this at all? Txs. Rgds.

    Giuseppe

  • Ignacio_Toledo

    While we wait for someone with more expertise to show up, do you have any idea of what kind of files are being written? I did a

    ls -alhrt /tmp/* 

    to first check for any file or directory created by the dataiku user. Then I ran

    du -h --max-depth 1 /tmp 

    to find the heavy folders (I have none right now, though), and finally went into some of those directories to find out which process is creating them (that's how I found the Python-generated ones, for example).
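
    If it helps, here is a small Python sketch that does roughly the same as the two commands above: it lists the files under /tmp owned by the dataiku user, largest first (the user name and the top-20 cut-off are just assumptions for illustration):

    import pwd
    from pathlib import Path

    # List files under /tmp owned by a given user, largest first.
    TARGET_USER = "dataiku"          # assumption: the DSS service account
    uid = pwd.getpwnam(TARGET_USER).pw_uid

    hits = []
    for path in Path("/tmp").rglob("*"):
        try:
            st = path.stat()
        except OSError:              # skip unreadable or vanished entries
            continue
        if st.st_uid == uid and path.is_file():
            hits.append((st.st_size, path))

    for size, path in sorted(hits, reverse=True)[:20]:
        print(f"{size / 1024**2:8.1f} MB  {path}")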

    Did you find anything suspicious that way?

    Cheers!

  • gnaldi62

    Hi Ignacio,

    Yes, we've found something. There are some packages, notably Python packages, which either don't have the concept of the Java tmpdir setting or simply don't follow it. One example is the xlsxwriter package: when a new Excel file is produced, a temporary copy is saved into /tmp unless a different temporary directory is specified.

    We're looking into changing the code, even though we have to go through pandas, which acts as a wrapper and doesn't expose the same options as the original package.
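
    For what it's worth, recent pandas versions appear to let you pass xlsxwriter's constructor options through engine_kwargs, so a sketch like the following (placeholder paths, assuming pandas >= 1.3 with the xlsxwriter engine) might be enough to redirect that temporary copy:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})

    # Point xlsxwriter's tmpdir at a directory on a larger filesystem
    # instead of the default /tmp. Both paths below are placeholders.
    with pd.ExcelWriter(
        "/data/exports/report.xlsx",
        engine="xlsxwriter",
        engine_kwargs={"options": {"tmpdir": "/data/tmp"}},
    ) as writer:
        df.to_excel(writer, index=False)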

    There might be other packages which do the same, and you won't notice unless you dig deeper into the code.
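
    A more package-agnostic workaround (just a sketch, and the target path is an assumption) could be to redirect Python's default temporary directory at the top of the recipe, since most libraries that rely on the standard tempfile module pick that up:

    import os
    import tempfile

    # Assumes /data/tmp exists on a filesystem with enough free space.
    # Libraries that create their temp files via the tempfile module will
    # use this directory instead of /tmp; packages that hard-code /tmp
    # themselves won't be affected.
    os.environ["TMPDIR"] = "/data/tmp"   # seen by child processes
    tempfile.tempdir = "/data/tmp"       # used by this Python process

    print(tempfile.gettempdir())  # should now report /data/tmp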

    Rgds.

    Giuseppe

  • Ignacio_Toledo

    Interesting @gnaldi62, thanks for sharing!

    I was trying to find other examples related to the use of PySpark, but lately there hasn't been much use of PySpark recipes on our main node, so I couldn't find anything. I would keep an eye on that too, though.

    Cheers!

    Ignacio
