Lots of files owned by Dataiku under /tmp

Solved!
gnaldi62

We've got filesystem-full issues due to large files created by Dataiku under the /tmp directory (the machine runs Linux). How is this possible? Is there a process that writes under that directory? If so, how can we prevent it?

Thanks. Rgds.

Giuseppe

1 Solution
Ignacio_Toledo

Hi @gnaldi62, have you checked the documentation at https://doc.dataiku.com/dss/latest/operations/disk-usage.html ?

I've been administering a DSS node that we use for sandboxing and prototyping, and all of the DSS temporary files are under $DATA_DIR/tmp, as it says there.

However, there are other processes that DSS can start as the dataiku user, and those write to the system /tmp directory; for example, I can see some files and dirs under /tmp coming from Python processes. But they are no older than 2 weeks and only take 588 KB.

Anyhow, I run the cleanup macros once a day with a scenario, removing anything older than 15 or 30 days, and that has kept our filesystem free of huge temporary or log files.
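For anything that still lands in the system /tmp (which the DSS cleanup macros don't touch), a similar age-based sweep can be scripted with find. A sketch only; the dataiku account name and the 15-day threshold are assumptions to adjust for your setup:

```shell
# Dry run: list files under /tmp owned by the service account that have not
# been modified for more than 15 days. Errors (unknown user, unreadable
# dirs) are suppressed so the command degrades gracefully.
find /tmp -mindepth 1 -user dataiku -mtime +15 -print 2>/dev/null || true

# Once the listed files look right, repeat with -delete instead of -print.
```

Running the -print version first avoids deleting files some live process still depends on.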

Hope this helps! 


5 Replies
gnaldi62
Author

Hi,

almost all the temp files go into $DATA_DIR/tmp but, as you said, a few files are generated directly under /tmp. Unfortunately some of them are huge (about 1 GB) and fill the root filesystem. Is there a way to avoid this altogether? Txs. Rgds.

Giuseppe

Ignacio_Toledo

Until someone with more expertise shows up: do you have any idea what kind of files are being written? I ran

ls -alhrt /tmp/*

to check first for any file or dir created by the dataiku user. Then I ran

du -h --max-depth=1 /tmp

to find the heavy folders (I have none right now, however), and finally went into some of these dirs to find out which process was creating them (that's how I found the Python-generated ones, for example).

Did you find anything suspicious that way?
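The two steps above can also be combined into one sweep that goes straight to the biggest offenders. A sketch, assuming dataiku is the service-account name on your box:

```shell
# Largest files under /tmp owned by the DSS service account, biggest first.
# -printf emits "size<TAB>path"; errors (unknown user, unreadable dirs)
# are suppressed so the pipeline degrades gracefully.
find /tmp -user dataiku -type f -printf '%s\t%p\n' 2>/dev/null \
  | sort -rn | head -20
```

From the paths it usually becomes obvious which process created each file.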

Cheers!

 

gnaldi62
Author

Hi Ignacio,

yes, we've found something. Some packages, Python packages in particular, have no notion of (or don't follow) the tmpdir setup that DSS applies via the Java tmpdir setting. One example is the xlsxwriter package: when a new Excel file is produced, a temporary copy is saved into /tmp unless another directory is specified.

We're looking into changing our code, even though we go through Pandas, which acts as a wrapper and doesn't expose the same options as the underlying package.

There might be other packages that do the same, and one cannot notice without going deeper into the code.
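One generic workaround sketch (the directory name here is an assumption): libraries that go through Python's stdlib tempfile module, as XlsxWriter does when no explicit tmpdir option is passed, will follow a redirected default temp directory:

```python
import os
import tempfile

# Assumed target: a scratch directory on a filesystem with room to spare.
big_tmp = os.path.join(os.getcwd(), "dss_tmp")
os.makedirs(big_tmp, exist_ok=True)

# Redirect the stdlib default; libraries using tempfile pick this up.
tempfile.tempdir = big_tmp

# Any temporary file created from here on lands under big_tmp, not /tmp.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"scratch data")
    scratch_path = f.name

print(scratch_path.startswith(big_tmp))  # → True
```

For XlsxWriter specifically, its Workbook constructor also accepts a tmpdir option directly, and recent pandas versions can forward it with ExcelWriter(..., engine="xlsxwriter", engine_kwargs={"options": {"tmpdir": ...}}); check the versions you have installed before relying on that.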

Rgds.

 

Giuseppe

Ignacio_Toledo

Interesting @gnaldi62, thanks for sharing!

I was trying to find other examples related to the use of pyspark, but lately there hasn't been much use of pyspark recipes on our main node, so I can't find anything. I would keep an eye on that too.
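If pyspark does turn out to be writing big scratch files, Spark's shuffle/spill location is controlled by the spark.local.dir property. A sketch (the /data/tmp path is an assumption; use any directory on a large filesystem):

```properties
# spark-defaults.conf — keep Spark scratch space off the root filesystem.
spark.local.dir  /data/tmp/spark
```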

Cheers!

Ignacio
