How to load a large HTTP dataset in DSS?
UserBird
Dataiker, Alpha Tester Posts: 535
Hi,
I'm currently working with Dataiku deployed on a cluster.
I want to load an HDFS dataset from an HTTP file (a zip containing a csv). The Network dataset interface in DSS works without problems for this. The problem is when I want to load a large csv file over HTTP (8 GB): Dataiku can't detect it and shows me the preview of the json instead.
Is it possible to load such a large file directly in DSS, or do I need to make a recipe for this big file?
Is there another solution?
Answers
-
I'm sorry, I'm not sure I understand your problem exactly.
What do you call an HTTP file?
Is it a file that is currently on a remote server? In that case you can connect to it to create a dataset, then sync it to HDFS.
Or is it a file that is currently on your local computer? In that case I guess the upload does not work because of its size, and you should try scp'ing the file directly onto the server, then creating a filesystem dataset pointing to that file.
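Since you mention HTTP, you could also fetch the file from the DSS server itself instead of going through the browser. A minimal sketch, assuming the requests package is available and that the URL and target path below (both placeholders) are adapted to your environment:

# Fetch the large remote archive onto the DSS server in 1 MB chunks,
# so an 8 GB file never has to fit in memory or pass through the browser.
import requests

url = "http://files.data.gouv.fr/sirene/sirene.zip"  # placeholder file name
target = "/data/incoming/sirene.zip"                 # placeholder: any path the DSS user can read

with requests.get(url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with open(target, "wb") as out:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            out.write(chunk)

You can then create a filesystem dataset pointing to that path and sync it to HDFS.
-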
Hi cperdigou,
My csv file is stored online on a remote server (inside a zip file: http://files.data.gouv.fr/sirene/). When I use the Network dataset interface in DSS I have no problem connecting to a smaller csv (100 MB). But with the 8 GB file I can't connect to it to create a dataset because it's too big.
Dataiku is installed on a VM with a Hadoop cluster. Maybe I can use the power of Hadoop to load this big data file into Dataiku? For example, would something like the sketch below be a reasonable approach?
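A rough sketch of what I have in mind, as a DSS Python recipe that streams the zip into a managed folder (the folder name "sirene_raw" and the exact zip file name are placeholders, and I'm assuming the managed folder API's get_writer for chunked writes):

# Stream the remote zip into a DSS managed folder, chunk by chunk,
# so the 8 GB archive never has to fit in memory.
import dataiku
import requests

url = "http://files.data.gouv.fr/sirene/sirene.zip"  # placeholder file name
folder = dataiku.Folder("sirene_raw")  # placeholder managed folder name

with requests.get(url, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    with folder.get_writer("sirene.zip") as writer:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
            writer.write(chunk)

The managed folder could then be built on HDFS, and the csv extracted from the zip in a later step.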