I see the error when I join three tables that are stored on S3. I can't use the SQL engine because S3 is not a SQL database, right? So I run it on the DSS engine. The process takes around 5 hours and then fails with "ERROR: No space left on device" during the join.
If I run the join with the DSS engine, is the data stored locally, and is that why the space is not enough?
How can I make the process faster?
Is it possible to fix this with partitions? That is, partition the data from each of the tables on a monthly basis so that Dataiku automatically joins only the new data.
Is this possible?
Thank you very much!!
Operating system used: Linux
"If I run the join with DSS engine, the data is stored locally and for that reason the space is not enough?" => Correct.
"How can I make the process faster?" => Use a SQL backend or a distributed compute engine like Spark. Use a bigger server with fadter cores.
"Is it possible to fix this across partitions? That is, partition the data from each of the tables on a monthly basis and that dataiku automatically unites only the new information. Is this possible?" => Most likely no. You can't have a join work partially across partitions.
Got it, @Turribeach
When the DSS engine is used, I understand that the data is stored locally. But is that temporary or permanent? Because the output is saved on S3.
I am asking in order to see whether increasing the disk space would solve the problem. Is the storage used during processing freed afterwards? I am aware that it would be better to use a SQL backend; I just wanted to know this.
You say "disk memory". Disk and memory are different things. Disk is storage where you save files permanently. Memory is what your computer uses to store data temporarily in RAM. When you run a join using the DSS engine the DSS server has to first pull all the data into disk and then attempt to join it in memory. In other words it will use both disk and memory (and the CPU too). Having said that in both cases the disk space and memory used will be released when the job finishes, either successfully or not.