I'm still on the learning curve. I already tested the enterprise license, so now I'm on the free version (practicing in my own home lab).
I have a huge (500+ GB) file that I split manually with the `split` command in a Linux terminal, which worked; that was necessary because the Dataiku server was running out of memory after a while. After that, I manually imported a dataset (a 3 GB one), then split the values inside to clean it according to my preferences, which produces a new dataset inside the virtual machine (the Dataiku server).
Then, from that clean dataset, I am exporting (simple file download), and there is a problem: it has been two hours, and ONLY 390 MB has been downloaded so far.
The goal is to test a simple SQLite database export, but I'm afraid it's going to take forever, and I can't test it anyway since it requires an enterprise license.
So here I am, asking if there is a better way to do this.
There is no urgency; it's all about testing.
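For reference, the chunking that the Unix `split` command did can be sketched in Python as well. This is only an illustration of the idea, not the exact command used above; the chunk size, file name, and output naming are assumptions:

```python
# Split a large text file into fixed-size line chunks, similar in
# spirit to `split -l 1000000 big.csv chunk_`.
# Chunk size and file naming here are illustrative assumptions.

def split_file(path, lines_per_chunk=1_000_000, prefix="chunk_"):
    chunk_idx = 0
    out = None
    with open(path, "r", encoding="utf-8", errors="replace") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every `lines_per_chunk` lines.
                if out:
                    out.close()
                out = open(f"{prefix}{chunk_idx:04d}.csv", "w", encoding="utf-8")
                chunk_idx += 1
            out.write(line)
    if out:
        out.close()
```

Because it streams line by line, memory use stays flat no matter how big the input file is.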
What is the purpose of the test? What are you trying to prove? It's a common pattern to throw the data at Dataiku and then export gigabytes of data out. Also, it sounds like you are using the free version and poor I/O technologies like file system datasets and SQLite. If you are looking to achieve high I/O, you should look at offloading your data to a layer that can do that, for instance Snowflake or GCP's BigQuery. You will be able to move gigabytes of data in seconds using those technologies.
The test is to find the fastest possible way to export the desired data with a free licence.
It is a two-column file export that will ultimately be ingested by an SQLite database with an indexed column.
Even though I am on NVMe SSDs, it still takes quite a long time. So I am looking for solutions.
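For what it's worth, that final ingestion step (two-column CSV into SQLite with an indexed column) is short with Python's standard library. The file name, table name, and column names below are assumptions, not anything from the actual project:

```python
import csv
import sqlite3

# Load a two-column CSV into SQLite and index the search column.
# "records", "key", and "value" are illustrative names.
def load_csv_to_sqlite(csv_path, db_path="export.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (key TEXT, value TEXT)")
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Skip malformed rows with fewer than two fields.
        rows = ((r[0], r[1]) for r in csv.reader(f) if len(r) >= 2)
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    # Build the index after the bulk insert: much faster than
    # inserting into an already-indexed table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON records (key)")
    conn.commit()
    conn.close()
```

Creating the index after the bulk load, rather than before, is the usual trick for keeping SQLite ingestion fast.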
The fact that you are using an NVMe drive doesn't really mean much. There are huge variances in performance between different NVMe drives, and there are different NVMe versions (1.0, 1.x, 2.0, etc.). Then you have PCIe Gen 2.0 and PCIe Gen 3.0, x2 and x4. In any case, you are never going to achieve high I/O using something like SQLite, which is a lightweight database.
I suspect you are probably running Dataiku in a desktop VM, probably on PC desktop / laptop hardware. If that's the case, then I see no point in your tests, as these are only relevant for Dataiku running in a more realistic scenario on server-grade hardware.
Finally, what is the point in pushing the data into Dataiku and then taking it back out? This is not an ideal pattern. Can you describe exactly what you are trying to achieve? Can you use cloud resources? If not, why not?
Well, I am trying to build a proper database, with indexed values, to get high-I/O searches. I am able to use anything that has a free version, as I am not working for an enterprise in this case. Cloud services are a possibility, as it is not confidential data.
Dataiku is part of the process to make sure the data is in the correct format, as there is a mix of separators (; and , and others) and some other bad data inside.
What do you suggest? Within reason, I can use trial versions of course.
What you haven't done is explain exactly what you are trying to achieve. What data are you trying to load? What's the purpose of the project? Why does it need to come out of Dataiku?
It is actually text lines that are separated: a simple CSV file with two columns. And in Dataiku, as I said, the cleanup deletes the lines that do not fit / are not correctly parsed.
Dataiku recognizes the parsed data correctly (for example: email, IP, etc.), and it processes it very fast; the only problem is the download, which takes forever. What is being downloaded is the same data, but well organised and clean; some lines are deleted because they are incorrect, which is fine.
All in all, I just want to speed up the download (the final result CSV file) or put the result straight into a database (for indexing purposes). I am not here to disclose personal content 😅
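That cleanup step (mixed `;` / `,` separators, drop lines that don't parse into two fields) can also be sketched outside Dataiku. The separators and the two-column expectation come from this thread; the exact validation rule is an assumption:

```python
import re

# Accept either ';' or ',' as a separator, since the raw file
# mixes both. The two-field requirement matches the described
# dataset; the rejection rule itself is an illustrative assumption.
SEPARATORS = re.compile(r"[;,]")

def clean_line(line):
    parts = SEPARATORS.split(line.strip())
    if len(parts) != 2 or not all(parts):
        return None  # malformed line -> drop it
    return parts[0].strip(), parts[1].strip()
```

A streaming loop over the raw file that writes only the non-`None` results would reproduce the "delete incorrect lines" behaviour described above.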
Nobody is asking you to disclose personal content, but you haven't said why you need to download the file; that makes no sense to me. Have you tried a sync recipe directly into a SQL database? Also, if you have the data cleaned in a file system dataset, why don't you load that file directly from the OS rather than download it from a browser (which is really slow)? The files Dataiku saves (typically out-s0.csv.gz) are actually TSVs (tab-delimited values), gzipped.
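Reading such a file straight from the server's file system, skipping the browser download entirely, is a few lines with the standard library. The gzipped-TSV layout is what was just described; the function name and path are assumptions:

```python
import csv
import gzip

# Dataiku file system datasets are typically stored as gzipped,
# tab-delimited files (e.g. out-s0.csv.gz); stream them directly
# from disk instead of downloading through a browser.
def read_dataiku_export(path):
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        yield from csv.reader(f, delimiter="\t")
```

Because it yields rows lazily, this can feed a database loader without ever holding the whole dataset in memory.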
So I guess I will take a look at Snowflake / GCP (Google Cloud) as you said, and offload into an Oracle database, but that means I will have to renew a Dataiku trial license to see if that fits my lab project.