Pig scripts directly with AWS S3 inputs

Romain_NIO · 08-15-2015

Hi,

I recently installed Dataiku DSS Community Edition to test DSS on a Hadoop cluster (AWS EMR). All datasets will be stored in S3 buckets.

When I import my dataset into DSS, Pig is disabled: "pig not available : dataset is not on HDFS"

Is it planned for a future release to let DSS users run Pig scripts on S3 inputs (with Pig's native S3 loaders)?

At the moment, my workaround is to use the "sync" module to copy datasets from S3 to HDFS.

Thanks 🙂
Romain.

FlorianD · 08-17-2015

Hi Romain,

Sure, that makes sense. In your case, what is the format of the files stored in S3?

Romain_NIO · 08-18-2015

Hi Florian,

Thank you for your answer.

Our datasets will be in different formats depending on the project; I assume JSON, CSV, or Avro.

One question about DSS and data cleaning: is there a "default" DSS format? I mean, when I sync a MySQL database or a CSV file to HDFS (or apply a "prepare" recipe), is a default format such as Avro applied?

UserBird · 08-18-2015

Hi Romain,

When you create a new HDFS dataset, the default format is CSV. However, you can change this setting in the dataset's format section.
