Pig scripts directly with AWS S3 inputs
Romain_NIO
Hi,
I recently installed Dataiku DSS Community Edition to test DSS on a Hadoop cluster (AWS EMR). All our datasets will be stored in S3 buckets.
When I import my dataset into DSS, the Pig recipe is disabled: "pig not available : dataset is not on HDFS".
Is it planned in future releases to allow DSS users to execute Pig scripts with S3 inputs (using Pig's native S3 loaders)?
At the moment, my workaround is to use a "sync" recipe to copy datasets from S3 to HDFS.
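For reference, here is roughly the kind of Pig script I would like to run directly against S3 with the native loaders (just a sketch; the bucket name, paths and schema below are placeholders, and on EMR the s3:// filesystem should already be visible to Pig):

    -- Load a CSV dataset straight from S3 with the built-in PigStorage loader
    -- (bucket, path and column list are placeholders)
    raw = LOAD 's3://my-bucket/input/dataset/*.csv'
          USING PigStorage(',')
          AS (id:chararray, label:chararray, value:double);

    -- Drop rows with a missing value, then write the result back to S3
    filtered = FILTER raw BY value IS NOT NULL;
    STORE filtered INTO 's3://my-bucket/output/dataset/' USING PigStorage(',');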
Thanks
Romain.
Best Answer
Hi Romain,
Sure, that makes sense. In your case, what is the format of the files stored in S3?
Answers
Hi Florian,
Thank you for your answer.
Our datasets will be in different formats depending on the project; I expect JSON, CSV, or Avro.
One question about DSS and data cleaning: is there a "default" DSS format? I mean, when I sync a MySQL database or a CSV file to HDFS (or when applying a "prepare" recipe), is there a default format applied, like Avro or something?
Hi Romain,
When you create a new HDFS dataset, it is CSV by default. However, you can change this in the dataset's format settings.
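To illustrate the sync workaround end to end: once the data has been synced to an HDFS dataset stored as CSV, a plain Pig script can read it with the standard PigStorage loader. This is only a sketch; the HDFS path, separator and schema are placeholders and will depend on your HDFS connection and dataset settings:

    -- Read the synced CSV dataset from its HDFS directory
    -- (path and column list are placeholders)
    synced = LOAD 'hdfs:///user/dataiku/managed_datasets/MYPROJECT/my_dataset'
             USING PigStorage(',')
             AS (id:chararray, label:chararray, value:double);

    -- Small sanity check: count rows per label
    grouped = GROUP synced BY label;
    counts  = FOREACH grouped GENERATE group AS label, COUNT(synced) AS n;
    DUMP counts;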