ERROR: No space left on device when I execute a join

rafael_rosado97
rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner

Hello, everyone.

I get this error when I join three tables that are stored on S3. I cannot use the SQL engine because S3 is not a SQL database, right? So I run the join on the DSS engine. The job takes around 5 hours and then fails with "ERROR: No space left on device".
If I run the join with DSS engine, the data is stored locally and for that reason the space is not enough?

How can I make the process faster?

Is it possible to fix this across partitions? That is, partition the data from each of the tables on a monthly basis and that dataiku automatically unites only the new information.
Is this possible?

Thank you very much!!


Operating system used: Linux

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,107 Neuron

    "If I run the join with DSS engine, the data is stored locally and for that reason the space is not enough?" => Correct.

    "How can I make the process faster?" => Use a SQL backend or a distributed compute engine like Spark. Use a bigger server with faster cores.

    "Is it possible to fix this across partitions? That is, partition the data from each of the tables on a monthly basis and that dataiku automatically unites only the new information. Is this possible?" => Most likely no. You can't have a join work partially across partitions.
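The "use a SQL backend" suggestion above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a real SQL backend, and the table and column names (orders, customers, products) are hypothetical. The point it shows is that the three-table join runs inside the database engine, and only the joined rows come back to the client.

```python
import sqlite3


def join_in_database(conn):
    """Run a three-table join inside the SQL engine and return the joined rows."""
    cur = conn.execute(
        """
        SELECT o.order_id, c.name, p.title
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        JOIN products p ON p.product_id = o.product_id
        ORDER BY o.order_id
        """
    )
    return cur.fetchall()


# Hypothetical sample data; an in-memory SQLite database stands in for the backend.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, product_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO products VALUES (10, 'Widget'), (20, 'Gadget');
    INSERT INTO orders VALUES (1, 1, 10), (2, 2, 20);
    """
)

rows = join_in_database(conn)
print(rows)
```

With a real SQL backend the same idea applies: the engine does the join, so the DSS server does not need to hold all three inputs on its own disk.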

  • rafael_rosado97
    rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner

    Thank you so much for your answer, @Turribeach.

    Could Amazon Athena be used as the SQL backend?

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,107 Neuron

    You can use Athena, but there are some limitations. See:

    https://doc.dataiku.com/dss/latest/connecting/sql/athena.html

  • rafael_rosado97
    rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner

    Got it, @Turribeach

    I now understand that when the DSS engine is used, the data is stored locally. But is it stored temporarily or permanently? I ask because the output is saved on S3.

    I am asking this to see whether increasing the disk memory would solve the problem. Is the memory used during processing freed afterwards? I am aware that it would be better to use a SQL backend; I just wanted to know this.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,107 Neuron

    You say "disk memory". Disk and memory are different things. Disk is storage where you save files permanently. Memory (RAM) is what your computer uses to hold data temporarily. When you run a join using the DSS engine, the DSS server first has to pull all the data onto disk and then attempt to join it in memory. In other words, it will use both disk and memory (and the CPU too). Having said that, in both cases the disk space and memory used will be released when the job finishes, whether it succeeds or not.
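Since the join needs local disk while it runs, one way to reason about "would a bigger disk help" is to compare free space with a rough estimate of the input size before launching the job. The following is a minimal sketch using only Python's standard library. The path to check and the required size are assumptions you would need to supply yourself; this is not a DSS API.

```python
import shutil


def has_enough_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage.free >= required_bytes


def format_gib(n_bytes: int) -> str:
    """Format a byte count as GiB with one decimal place."""
    return f"{n_bytes / 1024**3:.1f} GiB"


# Hypothetical usage: "." stands in for the directory the job writes to,
# and 50 GiB is an assumed estimate of the combined input size.
required = 50 * 1024**3
free = shutil.disk_usage(".").free
print(f"free: {format_gib(free)}, required: {format_gib(required)}")
print("enough space" if has_enough_space(".", required) else "not enough space")
```

Running a check like this on the filesystem the job actually writes to would show whether the inputs can fit at all before waiting several hours for the job to fail.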

  • rafael_rosado97
    rafael_rosado97 Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 61 Partner

    Yes. Sorry, I did not express my ideas correctly.

    I understand everything now, and your answer was very useful.

    Thank you very much, @Turribeach.
