We've got filesystem-full issues due to large files created by Dataiku under the /tmp directory (the machine runs Linux). How is this possible? Is there any process that writes under that directory? If yes, how can we prevent this?
Hi @gnaldi62, have you checked the documentation under https://doc.dataiku.com/dss/latest/operations/disk-usage.html?
I've been administering a DSS node that we use for sandboxing and prototyping, and all of the DSS temporary files are under $DATA_DIR/tmp, as it says there.
However, there are other processes that DSS can start as the dataiku user which will write to the root /tmp directory; e.g. I can see some files and dirs under /tmp coming from Python processes. But they are no older than 2 weeks and only take up 588 KB.
Anyhow, I keep running the cleanup macros once a day with a scenario, removing anything older than either 15 or 30 days, and that has kept our filesystem free of huge temporary or log files.
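For what it's worth, the same age-based cleanup can be sketched in plain Python outside DSS (stdlib only; the path and the 15-day threshold are illustrative, not what the built-in macros actually do):

```python
import os
import time

def remove_older_than(root, max_age_days):
    """Delete regular files under root whose mtime is older than
    max_age_days. Returns the list of paths removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)
                    removed.append(path)
            except OSError:
                pass  # file vanished or permission denied; skip it
    return removed

# Example usage: remove_older_than("/tmp", 15)
```

Scheduling something like this from a scenario step (or plain cron) mirrors what the cleanup macros do for the data dir.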
Hope this helps!
almost all the temp files are put into $DATA_DIR/tmp, but as you said, there are a few files
generated directly under /tmp. Unfortunately some of them are huge (about 1 GB) and fill the
root filesystem. Is there a way to avoid this at all? Txs. Rgds.
While someone with more expertise shows up, do you have any idea of what kind of files are being written? I did a
ls -alhrt /tmp/*
to check first for any file or dir created by the dataiku user. Then I ran
du -h --max-depth 1 /tmp
to find the heavy folders (I have none right now, however), and finally went into some of these dirs to find out which process is creating them (that's how I found the Python-generated ones, for example).
Did you find anything suspicious that way?
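That triage can also be scripted. A small stdlib-only sketch that sums sizes per top-level entry under a directory, similar to `du -h --max-depth 1`:

```python
import os

def dir_size(path):
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                try:
                    total += os.path.getsize(fp)
                except OSError:
                    pass  # file disappeared between listing and stat
    return total

def heavy_entries(root):
    """Map each top-level entry under root to its cumulative size, largest first."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size(entry.path)
        elif entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
        else:
            sizes[entry.name] = 0
    return dict(sorted(sizes.items(), key=lambda kv: kv[1], reverse=True))

# Example usage: heavy_entries("/tmp")
```

Running it as the dataiku user (or root) over /tmp should point straight at the offending directories.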
yes, we've found something. There are some packages, namely Python packages, that don't have the concept of, or don't follow, a configured temporary-directory setting (such as the Java tmpdir setup). One example is the xlsxwriter package: when a new Excel file is produced, a temporary copy is saved into /tmp unless specified otherwise.
We're looking at changing the code, even though we have to go through pandas, which acts as a wrapper and doesn't expose the same options as the underlying package.
There may be other packages that do the same, and one cannot notice unless one digs deeper into the code.
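One general mitigation, assuming the package goes through Python's `tempfile` module (xlsxwriter does by default, and it also accepts a `tmpdir` option on its `Workbook` constructor, which recent pandas versions can reach via `ExcelWriter(..., engine="xlsxwriter", engine_kwargs={"options": {"tmpdir": ...}})`): point the process at a directory on a larger filesystem, either with the `TMPDIR` environment variable before the process starts, or with `tempfile.tempdir` inside the recipe. A stdlib-only sketch (the target path is illustrative; in practice something like $DATA_DIR/tmp):

```python
import os
import tempfile

# Redirect Python's tempfile machinery to a directory on a roomier
# filesystem. The path below is just an example location.
big_tmp = os.path.join(os.getcwd(), "scratch_tmp")
os.makedirs(big_tmp, exist_ok=True)
tempfile.tempdir = big_tmp  # affects NamedTemporaryFile, mkstemp, etc.

# Any library call that uses tempfile without an explicit dir= now
# lands under big_tmp instead of the root /tmp.
with tempfile.NamedTemporaryFile(suffix=".xlsx", delete=False) as tmp:
    scratch_path = tmp.name
```

This won't help packages that hard-code "/tmp", but it covers the common case of libraries that delegate to `tempfile`.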
Interesting @gnaldi62, thanks for sharing!
I was trying to find other examples related to the use of PySpark, but lately there hasn't been much use of PySpark recipes on our main node, so I couldn't find anything. I'd keep an eye on that too, though.