
Pig scripts directly with AWS S3 inputs

Hi,

I recently installed Dataiku DSS Community Edition to test DSS on a Hadoop cluster (AWS EMR). All datasets will be stored in S3 buckets.

When I import my dataset into DSS, Pig is disabled, with the message "pig not available : dataset is not on HDFS".

Is it planned for a future release to allow DSS users to run Pig scripts with S3 inputs (using Pig's native S3 loaders)?
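
To illustrate what I mean, here is a minimal sketch of the kind of script that plain Pig on EMR can run against S3 outside of DSS. The bucket name, paths, and schema are made up for the example, and it assumes comma-separated input; it is not something DSS generates.

```pig
-- Hypothetical bucket, paths, and schema, for illustration only.
-- PigStorage(',') assumes the input is comma-separated text.
raw = LOAD 's3://my-bucket/input/events.csv' USING PigStorage(',')
      AS (user_id:chararray, event:chararray, ts:long);

-- Count rows per event type.
by_event = GROUP raw BY event;
counts = FOREACH by_event GENERATE group AS event, COUNT(raw) AS n;

-- Write the result back to S3 rather than HDFS.
STORE counts INTO 's3://my-bucket/output/event_counts' USING PigStorage(',');
```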

At the moment, my workaround is to use the "sync" recipe to copy datasets from S3 to HDFS.

Thanks 🙂
Romain.
Dataiker
Hi Romain,

Sure, that makes sense. In your case, what is the format of the files stored in S3?
Author
Hi Florian,

Thank you for your answer.

Our datasets will be in different formats depending on the project. I assume they will be JSON, CSV or Avro.

One question about DSS and data cleaning: is there a "default" DSS format? I mean, when I sync a MySQL database or a CSV file to HDFS (or apply a "prepare" recipe), is a default format such as Avro applied to the output?
Dataiker
Hi Romain,

When you create a new HDFS dataset, its format is CSV by default. However, you can change this in the format section of the dataset settings.