Seeking Guidance on Utilizing Dataiku for Data Integration and ETL Processes
I am reaching out to seek guidance and advice on utilizing Dataiku for data integration and ETL (Extract, Transform, Load) processes. As a member of this vibrant community, I am eager to learn from your experiences and expertise in working with Dataiku.
I have recently started using Dataiku as a data integration and ETL tool for my organization. While I have a basic understanding of the platform, I am looking to expand my knowledge and discover best practices from those who have already delved into its intricacies.
Specific Questions:
- How can Dataiku be effectively used for data management tasks? What are the key features and functionalities I should be aware of?
- What are some recommended strategies for ETL processes within Dataiku? Are there any pitfalls or challenges to watch out for?
- Are there any third-party plugins or extensions that can enhance Dataiku's capabilities for data integration and ETL?
- What are some real-world use cases or success stories that demonstrate the power and versatility of Dataiku in this domain?
- Are there any particular resources, tutorials, or online communities you would recommend for further learning about Dataiku's data integration and ETL capabilities?
Operating system used: Windows
Answers
Turribeach:
My general advice will be: don't do it.
It is a well-established fact that most ML projects require a considerable amount of data engineering. This is so well understood that a general rule of thumb, the 80/20 split between data preparation and actual modelling, has been described and discussed many times:
https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity
https://towardsdatascience.com/the-80-20-challenge-7b8bfb643947
https://www.datagym.ai/the-80-20-data-science-dilemma/
It is natural, then, that for Dataiku to be an effective ML platform it has to be good at data engineering. But the fact that it is good at data engineering doesn't mean it is meant to replace traditional ETL/DI tools. In fact, I would argue that using it that way leads to solutions that are far from optimal. Dataiku is extremely good at rapid prototyping of data pipelines and allows both coders and non-coders to quickly develop complex solutions. But it can also become a burden if used incorrectly. Here are the reasons why Dataiku is not a good standalone DI/ETL solution:
- The Data Scientist license is way more expensive than a comparable DI/ETL license. This is because you are meant to be building machine learning models in Dataiku, not just doing ETL
- Visual recipes provide a step-by-step data transformation flow, but they can easily become too verbose, making the flow overly complex and harder to maintain and support
- There is a fine balance between code recipes and visual recipes, and most people get this wrong (see the sketch after this list)
- While persisting each transformation step is a great idea while you are developing your data transformation flow, it is also extremely inefficient, both in terms of storage and performance
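
To make the code/visual balance concrete, here is a minimal sketch of a Python code recipe that collapses several would-be visual steps (a filter, a date parse, and an aggregation) into a single recipe with one persisted output. The dataset and column names (`orders_raw`, `orders_daily_agg`, `customer_id`, etc.) are hypothetical, but the `dataiku.Dataset` read/write calls are the standard Dataiku Python API:

```python
import dataiku
import pandas as pd

# Read the input dataset into a pandas DataFrame
orders = dataiku.Dataset("orders_raw").get_dataframe()

# Transformations that might otherwise be three separate visual
# recipes, each persisting its own intermediate dataset
orders = orders.dropna(subset=["customer_id"])          # cleanup step
orders["order_date"] = pd.to_datetime(orders["order_date"])
daily = (orders
         .groupby([orders["order_date"].dt.date, "customer_id"])
         .agg(total_amount=("amount", "sum"),
              order_count=("order_id", "count"))
         .reset_index())

# Persist a single output instead of one dataset per step
dataiku.Dataset("orders_daily_agg").write_with_schema(daily)
```

Consolidating steps like this keeps the flow readable and avoids storing every intermediate result, which speaks directly to the storage and performance point above; the trade-off is that non-coders can no longer inspect each step visually, which is exactly the balance that is easy to get wrong.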