Questions on Dataiku, EKS, and EMR Serverless for Efficient Data Processing
Hi Dataiku Community,
I hope you're all doing well. I wanted to reach out with some questions regarding our current implementation, where we are utilizing EMR Serverless with TBs of data flowing between Snowflake and S3.
- Avoiding Intermediate Datasets: We are evaluating an EKS setup as an alternative to EMR Serverless. How does Dataiku handle complex join recipes on EKS, and does it require more intermediate datasets?
- Scaling Considerations: In our current setup, we use 70 nodes of r5.8x machines. Can Dataiku efficiently manage and orchestrate heavy data loads on EKS? Moreover, is there a cost comparison available between the two setups?
- Job Queuing and Node Management: How does Dataiku provide queuing of jobs, node management, and orchestration for scenarios with heavy data loads?
- EKS Startup Time vs. EMR Serverless: We've noticed that creating an EKS cluster takes significantly longer than starting an EMR Serverless application. Does Dataiku have a feature similar to the warmed instances that EMR Serverless offers?
- Dynamic Cluster Creation and Destruction: For use case 1, where we need to create and destroy clusters based on usage scenarios, how can this be efficiently managed within Dataiku DSS?
- Partitioning Column Handling: In use case 2, suppose we have a source table with 5 columns and we want to partition the output dataset based on a date column (e.g., part_date). Is there a way to keep the part_date in the output dataset without using the Prepare Recipe?
- Binary Data Type Handling: In use case 3, one of our tables has a column of type Binary. However, DSS seems to convert it to a string by default. Is it possible to treat binary columns as-is in Dataiku?
- Resource Management Across AWS Accounts: Currently, our EKS and Dataiku instances reside in different AWS accounts. When we delete a cluster in Dataiku, it seems to leave resources orphaned in the AWS account. Could you provide suggestions on how we can tightly integrate Dataiku to manage deletions and avoid orphaned resources?
- Support for EMR Serverless: Is there any information available about direct support for EMR Serverless in the near future? We are keen to know about upcoming features and integrations.
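To make the partitioning question (use case 2) concrete, here is a minimal plain-Python sketch of the behaviour being asked for: rows are grouped by `part_date`, and the column also stays in every output row. This is not the Dataiku API; the function name `partition_rows` and the sample columns are hypothetical and only illustrate the desired result.

```python
from collections import defaultdict


def partition_rows(rows, partition_col="part_date"):
    """Group rows by partition_col while keeping that column in every row."""
    partitions = defaultdict(list)
    for row in rows:
        # Copy the row unchanged so the partition column stays in the data
        # instead of existing only as the partition key.
        partitions[row[partition_col]].append(dict(row))
    return dict(partitions)


# Hypothetical 5-column source table.
rows = [
    {"id": 1, "customer": "a", "amount": 10.0, "status": "ok", "part_date": "2024-01-01"},
    {"id": 2, "customer": "b", "amount": 7.5, "status": "ok", "part_date": "2024-01-01"},
    {"id": 3, "customer": "c", "amount": 3.2, "status": "late", "part_date": "2024-01-02"},
]

parts = partition_rows(rows)
for key in sorted(parts):
    print(key, len(parts[key]), sorted(parts[key][0].keys()))
```

Each partition still carries all 5 columns, including `part_date`, which is the outcome we would like to get without adding a Prepare recipe.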
I appreciate any insights or advice from the community. Your experiences and recommendations will be invaluable in guiding our decisions. Thank you in advance!
Answers
Turribeach
This question is too long and too broad to be answered directly. I think you should engage with Dataiku Professional Services, as the depth and breadth required to answer all your points is quite substantial, and the decisions you make based on this information will have a big impact on your architecture and the outcomes it provides. In other words, these are not the sort of questions you want to leave to people in a community forum, no matter how good the quality of the forum is, which in this case is quite high.
I would say that in general you shouldn't be looking to replace your ETL/DWH/Data Lake/Big Data store with Dataiku. If you aim for that, you will most likely fail. Dataiku is an excellent end-to-end ML platform, but it's not a silver bullet for everything else. If you have something that already works, leave all of that complexity outside Dataiku and just bring the data, in the most ready state possible, into Dataiku to be used for Machine Learning. That's where you are going to get the most value out of Dataiku.