Heavy datasets to recipe

Feldunoob
Level 2

Hello,

I currently have the following :

- An SFTP server which contains the data that I want to analyze.
- Dataiku on another server, which has an SFTP connection to it.

I added 2 datasets (5 GB and 200 GB), both on the SFTP server, then created a join recipe (LEFT join) and launched it. It ends up looping with nothing coming out of it; the Dataiku server simply stays idle.

The log file in CSV is just a plain text file containing the output of the Dataiku job.

 

To solve this:
Should I upload the largest file directly into a Dataiku folder before running the recipe?
The data also seems to have remained on the server, since disk usage is unusually high. I launched the job that cleans up incomplete data inserted into Dataiku, but nothing changed.

 

Clément_Stenac

Hi,

The logs show that the recipe was still running, and was likely going OK.

However, performing a join on data of this size with the fallback DSS engine would likely require dozens of hours and multiple terabytes of disk space.

We would highly advise you to connect DSS to a SQL database, or to a Spark cluster in order to benefit from optimized join engines, as the DSS engine is mostly designed for small/medium data.
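
For illustration only (the database, table and column names below are placeholders), this is roughly the kind of LEFT JOIN that runs inside the database once both files are loaded into MySQL, instead of everything being streamed through the DSS engine; when the join recipe uses the in-database (SQL) engine, DSS should generate an equivalent query for you:

    # Hypothetical sketch: both datasets already loaded as MySQL tables.
    # "analytics_db", "small_table", "big_table", "id" and "extra_value" are placeholders.
    mysql -u dss_user -p analytics_db -e "
      CREATE TABLE joined AS
      SELECT s.*, b.extra_value
      FROM small_table AS s
      LEFT JOIN big_table AS b ON s.id = b.id;
    "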

Best,

Feldunoob
Level 2
Author

Actually the data was an SQLite database (.sqlite). As a student I am unable to connect to it, since getting the connector requires upgrading from the Lite version. I could convert the database to MySQL instead of CSV though... which is still allowed.
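
For reference, the conversion I have in mind is roughly the following (the file, table, database and user names are placeholders, and the target MySQL table has to be created with matching columns first):

    # Export one SQLite table to CSV, then bulk-load it into MySQL.
    # Requires local_infile to be enabled on the MySQL server.
    sqlite3 -header -csv data.sqlite "SELECT * FROM my_table;" > my_table.csv

    mysql --local-infile=1 -u student -p mydb -e "
      LOAD DATA LOCAL INFILE 'my_table.csv'
      INTO TABLE my_table
      FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
      IGNORE 1 LINES;
    "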

By small/medium data sizes, how many gigabytes do you mean?

Clément_Stenac

Hi,

There is no specific limit; it is more that it would be significantly slower. MySQL would likely be an adequate option here, though you would still need a pretty sizeable MySQL server.

Feldunoob
Level 2
Author

It seems I am stuck with Dataiku, because the df command shows that the old data which should have been deleted (dropped data) is not counted as free space.

So I launched the following:

- Administration => Maintenance => Scheduled tasks
- clear-upload-box job

But it didn't have any impact on the df result, which still shows 73% disk usage while the actual data usage is lower.
The virtual disk is in dynamic / thin mode, so the space should be marked as free automatically.
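
For reference, this is how I am comparing what df reports with what is actually stored in the Dataiku data directory (the data directory path below is just a placeholder for mine):

    # Overall filesystem usage vs. per-folder usage inside the DSS data dir.
    df -h /
    sudo du -sh /data/dataiku/* | sort -h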

It's quite strange anyway; it looks like the old 200 GB files weren't deleted, and I believe it has something to do with the build that I aborted twice while running the recipe.
So how can I properly clean up the unused data on MySQL?

Edit:
I identified the files with: du -a / | sort -n -r | head -n 20
and found out that they are stored in datadir/jobs as files named "compute-*" with "joined" in the name, so it seems the unused files weren't deleted properly.
What is the proper script to launch in order to clean up the files from aborted jobs?

Clément_Stenac

Hi,

Indeed, while Dataiku tries to clean up jobs during abort, if the abort fails to clean up in time, DSS will leave the files around.

You can manually remove the folder jobs/YOUR_PROJECT_KEY/YOUR_JOB_ID to remove the files. This will not cause any issue.
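
As a rough example only (the data directory path, project key and job ID below are placeholders; double-check with du which folders actually take the space before deleting anything, and make sure the job is no longer running):

    # Remove the leftover files of one aborted job.
    DSS_DATA_DIR=/data/dataiku   # placeholder path
    rm -rf "$DSS_DATA_DIR/jobs/PROJECT_KEY/JOB_ID"

    # Or sweep the job folders of one project that are older than 7 days:
    find "$DSS_DATA_DIR/jobs/PROJECT_KEY" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +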