Pig scripts directly with AWS S3 inputs

Romain_NIO Registered Posts: 12 ✭✭✭✭

I recently installed Dataiku DSS Community Edition to test DSS on a Hadoop cluster (AWS EMR). All datasets will be stored in S3 buckets.

When I import my dataset into DSS, the "Pig" recipe is disabled: "pig not available : dataset is not on HDFS"

Is it planned in future releases to allow DSS users to execute Pig scripts with S3 inputs (using Pig's native S3 loaders)?

At the moment, my workaround is to use the "Sync" recipe to copy datasets from S3 to HDFS.
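For reference, outside DSS a Pig script on EMR can read S3 directly with the built-in loaders, since EMRFS exposes buckets under `s3://` paths. A minimal sketch (the bucket name, file, and schema here are hypothetical placeholders):

```pig
-- Load a CSV file straight from S3 with Pig's built-in PigStorage loader.
-- 's3://my-bucket/events.csv' and the field list are illustrative only.
raw = LOAD 's3://my-bucket/events.csv' USING PigStorage(',')
      AS (id:int, name:chararray);

-- Simple transformation, then write the result back to S3.
filtered = FILTER raw BY id IS NOT NULL;
STORE filtered INTO 's3://my-bucket/output' USING PigStorage(',');
```

This is what "pig native S3 loaders" refers to: no explicit HDFS copy step is needed when Pig is run directly on the cluster.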

Thanks :)

Best Answer


  • Romain_NIO
    Romain_NIO Registered Posts: 12 ✭✭✭✭
    Hi Florian,

    Thank you for your answer.

Our datasets will be in different formats depending on the project; I assume they will be JSON, CSV, or Avro.

One question about DSS and data cleaning: is there a "default" DSS format? I mean, when I sync a MySQL database or a CSV file to HDFS (or when applying a "Prepare" recipe), is a default format applied, like Avro or something?
  • UserBird
    UserBird Dataiker, Alpha Tester Posts: 535 Dataiker
    Hi Romain,

When you create a new HDFS dataset, its format is CSV by default. However, you can change this in the dataset's format settings.