I recently installed Dataiku DSS Community Edition to test DSS on a Hadoop cluster (AWS EMR). All datasets will be stored in S3 buckets.

When I import my dataset into DSS, "pig" is disabled: "pig not available : dataset is not on HDFS"

Is it planned for future releases to allow DSS users to execute Pig scripts with S3 inputs (using Pig's native S3 loaders)?

At the moment, my workaround is to use the "sync" module to copy datasets from S3 to HDFS.

Thanks :)

    Hi Florian,

    Thank you for your answer.

    Our datasets will be in different formats depending on the project; I assume they will be JSON, CSV, or Avro.

    One question about DSS and data cleaning: is there a "default" DSS format? I mean, when I sync a MySQL database to HDFS or a CSV to HDFS (or when applying a "prepare" recipe), is a default format applied, such as Avro?

    Hi Romain,

    When you create a new HDFS dataset, the default format is CSV. However, you can change this in the dataset's format settings.
