Can I limit the size of my data volume input?

Robel

Hello,

I have attached my EMR cluster to Dataiku for data processing. I have a large dataset stored in S3, and I want to test how the EMR cluster performs as the data size increases. Is there a way to break down my dataset from the Dataiku side so that I can test EMR with different data sizes?

Thank you,

Robel.

1 Solution
Clément_Stenac

Hi,

I think you're looking for the sampling recipe. You could create 3 sampling recipes from your original dataset, extracting (for example) 1%, 5% and 30% of your input dataset and outputting them to S3 (which is the preferred source for EMR), then use any of these sampled datasets for your EMR tests.
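
To make the idea concrete, here is a minimal sketch in plain Python of what sampling at those three ratios amounts to. It is only an illustration, not Dataiku's API: the helper `sample_fraction`, the fixed seed, and the list of integers standing in for dataset rows are all assumptions made for the example. In practice the sampling recipe does this for you and writes the result to S3.

```python
import random


def sample_fraction(rows, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of `rows`.

    Hypothetical helper for illustration only; it is not part of Dataiku.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one row for non-empty input, never more than is available.
    k = min(len(rows), max(1, round(len(rows) * fraction)))
    # A seeded generator makes the same sample come back on every run.
    return random.Random(seed).sample(rows, k)


# Stand-in for the records of the original dataset (assumption).
rows = list(range(1000))

# One sample per ratio suggested above: 1%, 5% and 30%.
samples = {pct: sample_fraction(rows, pct / 100) for pct in (1, 5, 30)}

for pct, subset in samples.items():
    print(f"{pct}% sample: {len(subset)} rows")
```

Fixing the seed keeps each sample stable between runs, which helps when comparing EMR timings across data sizes.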