Thanks again for joining us on the Dataiku Community and your interest in our event this week!
This thread is your chance to ask questions, engage in friendly debate or just find out more about our Chief Technology Officer, @Clément_Stenac. He’ll be responding to questions over the course of this week, from now until Friday January 17th.
If you’re unsure about how an Ask Me Anything (AMA) works, check out this article on Ask Me Anything Guidelines in our Community Resources section for some background and a sense of expectations.
Also, please note that even though it’s ‘Ask Me *Anything*’, we ask that you respect the Dataiku Community Guidelines. And while you can *ask* anything and he will answer, he may not be able to provide all details at this time.
To participate, simply hit reply and craft your question. It can be helpful to use the @ mention functionality so that Clément receives a community notification, but we’ll all be keeping an eye out as well, so not to worry if you don’t tag him. (But it’s good practice!)
Let the questions begin!
Michael Grayson - Community Manager
Clément Stenac is the CTO and a Co-Founder of Dataiku.
He is particularly interested in big data problems, and highly constrained systems, but can also often be found replying to Answers and Intercom users to assist them in getting the most out of their Dataiku experience.
I'd say Java. It's not very fancy, it is fairly verbose and yes it is often boring to write.
However, there are tons of advantages to the language. First, of course, it's really fast, thanks to a virtual machine that is probably the best on earth, with more than 20 years of fine-tuning. But what I like most is this boring and not-fancy aspect.
In Java, writing code is not the most neat and intellectually rewarding task. You don't get the warm-glow feeling of writing something in a neater, more poetic or more balanced way.
However, reading code in Java is extremely easy. There is no magic, no hidden behavior, no wondering "gee, what kind of variable is this", no worrying about "will this operator be overloaded?" or "will this innocuous looking thing actually perform a hidden network call?".
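To illustrate the kind of hidden behavior meant here, below is a minimal Python sketch of an overloaded operator with a side effect. The class name `RemoteCounter` and its `log` attribute are hypothetical, made up purely for this example; the appended log entry stands in for something heavier, such as a network call.

```python
class RemoteCounter:
    """Hypothetical class: looks like a number, but '+' does hidden extra work."""

    def __init__(self, value, log=None):
        self.value = value
        # Records the hidden side effects so the example can display them.
        self.log = log if log is not None else []

    def __add__(self, other):
        # A reader of `a + b` cannot see this side effect at the call site.
        # In real code this could just as well be a network call.
        self.log.append(f"side effect while adding {other}")
        return RemoteCounter(self.value + other, self.log)


counter = RemoteCounter(1)
result = counter + 2  # reads like plain arithmetic
print(result.value)
print(result.log)
```

Nothing at the line `counter + 2` hints that anything beyond an addition happens, which is exactly the kind of surprise that Java's lack of operator overloading rules out.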
Let's face it, we actually spend more time reading code than writing code. And even if fancy languages have great IDE support, you're not always reading code in nice conditions with your IDE well set up and performing type inference and cross-navigation for you. You may be reading a PR on GitHub, or frantically pulling up the code on your mobile phone while debugging a production issue on a train.
You want more people than you to be able to read and debug your code (else, you'll do all the debugging yourself 🙂 ). In my opinion, simplicity, absolute lack of surprise and readability trump the potential productivity gains of more advanced languages.
And my experience is that even if you can have reasonable compromises, more advanced languages tend to "reward" more complicated code that becomes less readable. It's a hard discipline. So while I do enjoy the magic of programming in Scala, I still prefer Java.
Of course, the answer would be very different if I was a data scientist (it would be more about snakes than a language whose name is a single letter). But I am not 🙂
There are quite a few, but I would say that the biggest one is the idea that DSS is a closed platform, or a platform only for "clicker" users, not for "hardcore coders".
Dataiku strongly believes that the most successful data projects come from collaboration between diverse profiles and strives to be inclusive of all these profiles. And this includes the coder data scientist persona.
Dataiku provides full liberty for coder users. Whenever Dataiku can help them, it does. Whenever it cannot, it stays out of the way.
Some of the highlights of this could be:
* When you run Python or R code in Dataiku, you are "just" running Python or R code. Dataiku does not meddle with your code, you get the "real" thing. While we add optional APIs, you are never forced into them. Generally speaking, anything that you can do "outside of Dataiku", you can do the exact same thing "inside Dataiku", but with optional added niceties like direct read/write access to your data, automatic versioning, automatic production deployment, ability to expose your work to business users, ...
* You can create multiple code environments, with both Python 2 and Python 3 and you can install whichever package you want. You get the "real" pip, conda or R packages management systems, so again, you can by and large install and use any package that you would use in any other context
* Dataiku allows seamless transition between coding and visual capabilities. While we give you full coding power, we strongly believe that we can help you save a lot of time by using visual recipes for less-interesting things like data preparation, while focusing code on the most added-value tasks
* We work with any Git repository, not just GitHub or others
* Each time you write code in DSS, it ends up as simple files within the DSS datadir. There is no obfuscation or anything around your code, it's always available.
* You can import code libraries, both for Python and R from any Git repository, and use them in any project
* The plugin capabilities in DSS allow any coder to make their work available and reusable by non-coding users in Dataiku, making your work more impactful with your colleagues
* Thanks to the Git integration capabilities, you can write large portions of your code outside of Dataiku, in your favorite IDE. Additionally, Dataiku provides deep integration with some of the most popular IDEs to help you stay in your favorite workflow: RStudio, PyCharm, VS Code or Sublime Text
Quite without a doubt, the addition of managed Kubernetes clusters and managed Spark on Kubernetes.
We see Kubernetes as the main platform for Elastic AI, providing elasticity and cost-efficiency to all kinds of workloads. It brings under a single fully-managed infrastructure both the in-memory workloads (Python and R recipes and notebooks and in-memory machine learning) and the distributed workloads (Spark recipes, Spark notebooks, Spark-powered machine learning).
Coupled with the managed offerings from the main cloud vendors, this makes Dataiku + Kubernetes a cohesive platform for Elastic AI.
In the coming months, Dataiku will keep expanding the reach of its Kubernetes offering, make it even more scalable and turn-key, and provide more tools for managing cloud costs.
When do technologies like Spark and Kubernetes start to become worthwhile? Is there a scale of data or processing that you need to reach to make either worthwhile? Are there any simple guidelines to help me understand when I should use them?
Technologies like Spark are invaluable in order to reach scalability, both in terms of data and overall computation load. However, they do come with significant entry barriers (even though Dataiku is working hard at reducing them).
In our experience, while there is no fixed threshold, you'll likely need to reach multiple millions of records to process before you need to switch away from simple DSS engine processing.
Even if not the most fashionable, a very simple PostgreSQL database (you can get managed options very easily) will allow you to scale to significant levels of data and processing before you need to go into distributed options like Spark. You will often be able to handle dozens of gigabytes on such a non-distributed infrastructure.
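As a rough way to picture this guidance, here is a minimal Python sketch of a sizing rule of thumb. The function name `suggest_engine` and both numeric thresholds are assumptions chosen only to stand in for "multiple millions of records" and "dozens of gigabytes"; they are not Dataiku recommendations, and real workloads will vary.

```python
# Illustrative stand-ins for the orders of magnitude mentioned above.
# These exact values are assumptions, not official thresholds.
MANY_MILLIONS_OF_RECORDS = 5_000_000
DOZENS_OF_GIGABYTES = 50 * 10**9  # in bytes


def suggest_engine(n_records, avg_record_bytes):
    """Return a rough engine suggestion from record count and record size."""
    total_bytes = n_records * avg_record_bytes
    if n_records < MANY_MILLIONS_OF_RECORDS:
        # Small enough for simple, non-distributed processing.
        return "built-in engine"
    if total_bytes <= DOZENS_OF_GIGABYTES:
        # Larger, but still within reach of a single SQL database.
        return "SQL database (e.g. PostgreSQL)"
    # Beyond that, distributed options become worth their entry cost.
    return "distributed engine (e.g. Spark)"


print(suggest_engine(100_000, 200))
print(suggest_engine(50_000_000, 200))
print(suggest_engine(2_000_000_000, 500))
```

The point of the sketch is the ordering of the options, not the specific numbers: the simple engine first, then a plain SQL database, and distributed processing only once both are outgrown.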
I don't use the external IDE integrations. For simple edits, I stay in the recipe editor, and for more advanced modifications, I use the built-in Jupyter integration. This gives me quick iterative execution, and more importantly auto-completion.
To develop plugins, I also sometimes have a local clone and use Git integration to sync it with DSS. In this case, I use Sublime Text to edit locally.
It's also because I'm a big fan of printf-debugging and I never use debuggers. Many developers like debuggers, and for them, an external IDE is a must.
That being said, I usually spend more time developing DSS itself than developing data projects in DSS 🙂 To develop DSS, I use Sublime Text and Eclipse (and still no debuggers).