ERROR: No space left on device when I execute a join

rafael_rosado97 · ‎08-23-2023

Hello, everyone.

I see the error when I make a join with three tables that are on S3. I can not use SQL engine because S3 is not compatible due to is not a SQL database, right? So I run it on DSS engine. The process take around 5 hours and it appers "ERROR: No space left on device when I execute a join".
If I run the join with DSS engine, the data is stored locally and for that reason the space is not enough?

How can I make the process faster?

Is it possible to fix this across partitions? That is, partition the data from each of the tables on a monthly basis and that dataiku automatically unites only the new information.
Is this possible?

Thank you very much!!

Operating system used: Linux

Turribeach · ‎08-23-2023

"If I run the join with DSS engine, the data is stored locally and for that reason the space is not enough?" => Correct.

"How can I make the process faster?" => Use a SQL backend or a distributed compute engine like Spark. Use a bigger server with fadter cores.

"Is it possible to fix this across partitions? That is, partition the data from each of the tables on a monthly basis and that dataiku automatically unites only the new information. Is this possible?" => Most likely no. You can't have a join work partially across partitions.

rafael_rosado97 · ‎08-23-2023

Thank you so much for your answer, @Turribeach .

Could Amazon Athena be used as SQL backend?

Turribeach · ‎08-23-2023

You can use Athena but there are some limitations, see:

https://doc.dataiku.com/dss/latest/connecting/sql/athena.html

rafael_rosado97 · ‎08-23-2023

Got it, @Turribeach

When DSS engine is used I have already understood that data is stored locally. But is it temporally or permanent? Because the output is saved on S3.

I am asking this question to see if increasing the disk memory would solve the problem. Is the memory used during processing freed? I am aware that it would be better to use a SQL backend, I just wanted to know this.

Turribeach · ‎08-23-2023

You say "disk memory". Disk and memory are different things. Disk is storage where you save files permanently. Memory is what your computer uses to store data temporarily in RAM. When you run a join using the DSS engine the DSS server has to first pull all the data into disk and then attempt to join it in memory. In other words it will use both disk and memory (and the CPU too). Having said that in both cases the disk space and memory used will be released when the job finishes, either successfully or not.

rafael_rosado97 · ‎08-26-2023

Yes. Sorry, I did not write correctly my ideas.

I have already understood everything and your answer was very useful.

Thank you very much, @Turribeach.

Sign up to take part

ERROR: No space left on device when I execute a join

ERROR: No space left on device when I execute a join

Setup info