Can I limit the size of my data volume input

Robel
Robel Registered Posts: 3 ✭✭✭✭

Hello,

I have attached my EMR with Dataiku for data processing. I have big data stored in S3. I want to test the performance of the EMR cluster as my data size increases. Is there a way to break down my dataset from the Dataiku side so that I can test my EMR with different data sizes?

Thank you,

Robel.

Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    Hi,

    I think you're looking for the sampling recipe. You could create 3 sampling recipes from your original dataset extracting (for example) 1%, 5% and 30% of your input dataset (and outputing them to S3 which is the prefered source for EMR), and use any of these sampled datasets for your EMR tests.

Setup Info
    Tags
      Help me…