Can I limit the size of my data volume input?

Hello,

I have attached my EMR cluster to Dataiku for data processing. My data is large and stored in S3. I want to test how the EMR cluster performs as my data size increases. Is there a way to split my dataset on the Dataiku side so that I can test EMR with different data sizes?

 

Thank you,

Robel.

1 Reply
Dataiker

Hi,

I think you're looking for the sampling recipe. You could create three sampling recipes from your original dataset, extracting (for example) 1%, 5% and 30% of your input dataset, and output each of them to S3, which is the preferred source for EMR. You can then use any of these sampled datasets for your EMR tests.
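
To illustrate what random sampling at those ratios does, here is a minimal sketch in plain Python. It is not the Dataiku sampling recipe and does not use any Dataiku API. The `sample_rows` helper, the `rows` stand-in data and the seed value are all hypothetical, chosen only for illustration.

```python
import random


def sample_rows(rows, ratio, seed=0):
    """Keep each row independently with probability `ratio` (Bernoulli sampling)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [row for row in rows if rng.random() < ratio]


# Hypothetical stand-in for the real input dataset.
rows = [{"id": i} for i in range(100_000)]

# One sample per test size, mirroring the 1%, 5% and 30% example above.
samples = {ratio: sample_rows(rows, ratio, seed=42) for ratio in (0.01, 0.05, 0.30)}

for ratio, sample in samples.items():
    print(f"{ratio:.0%}: {len(sample)} rows")
```

The sample sizes are approximate rather than exact, since each row is kept or dropped independently. Because the same seed is reused for every ratio in this sketch, each smaller sample is a subset of the larger ones, which keeps the test datasets comparable as the size grows.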