I have attached my EMR cluster to Dataiku for data processing. I have big data stored in S3, and I want to test the performance of the EMR cluster as my data size increases. Is there a way to break my dataset down from the Dataiku side so that I can test my EMR cluster with different data sizes?
I think you're looking for the sampling recipe. You could create three sampling recipes from your original dataset, extracting (for example) 1%, 5%, and 30% of the input, writing each output back to S3 (the preferred storage for EMR), and then run your EMR tests against whichever sampled dataset you need.
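If you'd rather script it than create three separate sampling recipes in the UI, the same idea can be sketched in a Python recipe. The snippet below is a minimal illustration using pandas on a toy in-memory dataset; in a real Dataiku recipe you would read the input with the `dataiku` package instead, and for data that doesn't fit in memory you'd do the sampling in Spark on EMR. All names here are hypothetical.

```python
import pandas as pd
import numpy as np

# Toy stand-in for the full dataset. In a Dataiku Python recipe this
# would come from the input dataset instead (name is hypothetical):
#   full = dataiku.Dataset("my_input_dataset").get_dataframe()
full = pd.DataFrame({"value": np.arange(100_000)})

# Extract 1%, 5% and 30% random samples, mirroring three sampling recipes.
fractions = [0.01, 0.05, 0.30]
samples = {f: full.sample(frac=f, random_state=42) for f in fractions}

# Each sample would then be written to its own output dataset on S3,
# e.g. dataiku.Dataset("my_sample_1pct").write_with_schema(samples[0.01])
for f, s in samples.items():
    print(f"{f:.0%} sample: {len(s)} rows")
```

A fixed `random_state` keeps the samples reproducible, so repeated EMR benchmark runs see the same data at each size.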