Thanks again for joining us on the Dataiku Community and for your interest in our event this week!
This thread is your chance to ask questions, engage in friendly debate, or just find out more about our Chief Technology Officer, @Clément_Stenac. He'll be responding to questions over the course of this week, from now until Friday, January 17th.
If you're unsure about how an Ask Me Anything (AMA) works, check out this article on Ask Me Anything Guidelines in our Community Resources section for some background and a sense of expectations.
Also, please note that even though it's 'Ask Me *Anything*', we ask that you respect the Dataiku Community Guidelines and keep in mind that while you can *ask* anything, and he will answer, he may not be able to share all the details at this time.
To participate, simply hit reply and craft your question. It can be helpful to use the @ mention functionality so that Clément receives a community notification, but we'll all be keeping an eye out as well, so not to worry if you don't tag him. (But it's good practice!)
Let the questions begin!
Michael Grayson - Community Manager
Clément Stenac is the CTO and a Co-Founder of Dataiku.
He is particularly interested in big data problems and highly constrained systems, but can also often be found replying to users on Answers and Intercom to help them get the most out of their Dataiku experience.
I don't use the external IDE integrations. For simple edits, I stay in the recipe editor, and for more advanced modifications, I use the integrated Jupyter notebooks. This gives me quick iterative execution and, more importantly, auto-completion.
To develop plugins, I also sometimes keep a local clone and use the Git integration to sync it with DSS. In that case, I use Sublime Text to edit locally.
It's also because I'm a big fan of printf-debugging and never use debuggers. Many developers like debuggers, and for them, an external IDE is a must.
That being said, I usually spend more time developing DSS itself than developing data projects in DSS 🙂 To develop DSS, I use Sublime Text and Eclipse (and still no debuggers).
Without a doubt, the addition of managed Kubernetes clusters and managed Spark on Kubernetes.
We see Kubernetes as the main platform for Elastic AI, providing elasticity and cost-efficiency to all kinds of workloads. It brings under a single fully-managed infrastructure both the in-memory workloads (Python and R recipes, notebooks, and in-memory machine learning) and the distributed workloads (Spark recipes, Spark notebooks, Spark-powered machine learning).
Coupled with the managed offerings from the main cloud vendors, this makes Dataiku + Kubernetes a cohesive platform for Elastic AI.
In the coming months, Dataiku will keep expanding the reach of its Kubernetes offering, making it even more scalable and turnkey, and providing more tools for managing cloud costs.
When do technologies like Spark and Kubernetes start to become worthwhile? Is there a scale of data or processing that you need to reach to make either worthwhile? Are there any simple guidelines to help me understand when I should use them?
There are quite a few, but I would say that the biggest one is the idea that DSS is a closed platform, or a platform only for "clicker" users and not for "hardcore coders".
Dataiku strongly believes that the most successful data projects come from collaboration between diverse profiles, and strives to be inclusive of all of them. And this includes the coder data scientist persona.
Dataiku provides full liberty for coder users. Whenever Dataiku can help them, it does. Whenever it cannot, it stays out of the way.
Some highlights of this:
* When you run Python or R code in Dataiku, you are "just" running Python or R code. Dataiku does not meddle with your code; you get the "real" thing. While we add optional APIs, you are never forced to use them. Generally speaking, anything that you can do "outside of Dataiku", you can do "inside Dataiku", but with optional added niceties like direct read/write access to your data, automatic versioning, automatic production deployment, the ability to expose your work to business users, ...
* You can create multiple code environments, with both Python 2 and Python 3, and you can install whichever packages you want. You get the "real" pip, conda, or R package management systems, so again, you can by and large install and use any package that you would use in any other context.
* Dataiku allows seamless transitions between coding and visual capabilities. While we give you full coding power, we strongly believe that we can help you save a lot of time by using visual recipes for less interesting things like data preparation, while focusing your code on the highest-value tasks.
* We work with any Git repository, not just GitHub.
* Each time you write code in DSS, it ends up as simple files within the DSS data directory. There is no obfuscation or anything around your code; it's always available.
* You can import code libraries, both for Python and R, from any Git repository and use them in any project.
* The plugin capabilities in DSS allow any coder to make their work available and reusable by non-coding users in Dataiku, making your work more impactful with your colleagues.
* Thanks to the Git integration capabilities, you can write large portions of your code outside of Dataiku, in your favorite IDE. Additionally, Dataiku provides deep integration with some of the most popular IDEs to help you stay in your favorite workflow: RStudio, PyCharm, VS Code, or Sublime Text.