Join recipe on large datasets causing an out-of-space issue

Ankur30
Ankur30 Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer Posts: 40 Partner

Hi @AlexT,

I am using a Join recipe to join two datasets, one of which has 7M records. It caches that dataset in memory and runs for a long time, and eventually I get an out-of-space error.

Could you please help me resolve this issue?

Regards,

Ankur.

Best Answer

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker
    Answer ✓

    A Python recipe can work. If you use pandas, you need enough RAM available for the recipe to hold in memory all the datasets you are joining plus the output dataset.
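
    As a rough illustration of the RAM point above, here is a minimal sketch that joins two small pandas DataFrames and adds up the memory held by both inputs and the output. The column names and toy data are made up for the example; in a DSS Python recipe the inputs would typically be loaded with `dataiku.Dataset("...").get_dataframe()` instead.

```python
import pandas as pd


def frame_mib(df):
    """Approximate in-memory size of a DataFrame, in MiB."""
    return df.memory_usage(deep=True).sum() / 1024**2


# Toy stand-ins for the two input datasets (hypothetical columns).
left = pd.DataFrame({"customer_id": [10, 11, 12], "amount": [9.5, 20.0, 5.0]})
right = pd.DataFrame({"customer_id": [10, 11], "country": ["FR", "IN"]})

# The join itself: every row of `left`, matched to `right` where possible.
joined = left.merge(right, on="customer_id", how="left")

# All three frames are alive at the same time, so their sizes add up.
total_mib = frame_mib(left) + frame_mib(right) + frame_mib(joined)
print(f"rows: {len(joined)}, approx. memory held: {total_mib:.4f} MiB")
```

    Measuring a sample of the real datasets this way and scaling up by row count gives a first estimate of whether the full join is likely to fit in the available RAM.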

Answers

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,209 Dataiker

    Hi @Ankur30,

    Joining large datasets with the DSS engine is not recommended. It can require a significant amount of temporary disk space in the DATADIR.

    You should use the SQL engine or Spark instead. You can sync your datasets to a SQL database and do the join with the SQL engine.
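
    To show what doing the join in the database means, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for a real SQL database. The table and column names are hypothetical. In DSS you would not normally write this by hand; the point is only that the database performs the join and stores the result, so the rows never need to be loaded into the recipe's memory.

```python
import sqlite3


def join_in_database(conn):
    """Run the join inside the database and return the output row count."""
    conn.execute(
        """
        CREATE TABLE joined AS
        SELECT o.order_id, o.customer_id, o.amount, c.country
        FROM orders AS o
        LEFT JOIN customers AS c ON c.customer_id = o.customer_id
        """
    )
    (row_count,) = conn.execute("SELECT COUNT(*) FROM joined").fetchone()
    return row_count


# In-memory SQLite database standing in for the SQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 9.5), (2, 11, 20.0), (3, 12, 5.0)],
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "FR"), (11, "IN")])

# Only the row count comes back to Python; the joined rows stay in the database.
joined_rows = join_in_database(conn)
print(joined_rows)
```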

    If that's not an option you should try:

    - clean up disk space (see https://doc.dataiku.com/dss/latest/operations/disk-usage.html), or

    - add disk space so that the join can succeed.

    The jobs will create temporary files under DATADIR/jobs/PROJECTID/RECIPENAME...
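
    If it helps to see how much room is left before re-running the join, the following sketch reports the free space on the filesystem holding the data directory and the current size of its `jobs` folder. It assumes the data directory path is available in the `DIP_HOME` environment variable and falls back to the system temp directory otherwise; adjust the path for your own instance.

```python
import os
import shutil
import tempfile


def free_space_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


def dir_size_bytes(path):
    """Total size of regular files under `path` (0 if it does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file disappeared or is unreadable; skip it
    return total


# Assumption: DIP_HOME points at the DSS data directory (DATADIR).
# The temp-dir fallback only exists so that the sketch runs anywhere.
datadir = os.environ.get("DIP_HOME", tempfile.gettempdir())
jobs_dir = os.path.join(datadir, "jobs")

print(f"free space: {free_space_gib(datadir):.1f} GiB")
print(f"jobs folder size: {dir_size_bytes(jobs_dir) / 1024**3:.3f} GiB")
```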

  • Ankur30
    Ankur30 Partner, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Dataiku DSS Adv Designer Posts: 40 Partner

    Hi @AlexT,

    Thank you for the response above, I will try that.

    Would using a Python recipe to join the two datasets also work in this case?

    Regards,

    Ankur.
