A community thrives on the strength of its members, so let's take a look at some of them, shall we? From time to time we will highlight a prominent member of the Community and share their story and DSS accomplishments!
Meet Mani - known here as @neomatrix369. We sat down with him and had a chat about his story with Dataiku DSS.
1) How did you find Dataiku and get started with DSS?
I first heard about Dataiku around 2018, from someone who was using it, and I decided to download it and play with it. I only really started working with it in 2019, and more so in 2020. At that time I was also learning about various AI/ML/DL topics, particularly about data, so data-related tools became my primary focus. I started gathering information about them on this page, which led to creating a dedicated page for Dataiku. That in turn led me to attend a number of events and talks organised by the Dataiku Community team, including the EGG London conference in 2019.
I was using DSS regularly, as the Free/Community edition is elaborate enough that I could do many of my analyses and experiments in it and get quick results.
This also helped while I was learning about data topics and diving deep into things like Data Preparation, Data Cleaning, Feature Engineering, and the like. DSS offers many features that help a Data Scientist/Machine Learning Engineer accomplish these tasks easily.
I wanted to learn how to do these things and to understand the topics in depth. I also recall that I was preparing for a talk on data at the same time, which you can find here. So my quest for practical knowledge led to looking for tools, learning them, using them and writing about them on my GitHub repo.
One of the other reasons I got interested in DSS is that it is Java/JVM based and supports multiple programming languages when creating notebooks or plugins.
2) What's your favorite DSS feature?
I like the summarized visualizations (also the Visual Exploratory Data Analysis section) shown when a dataset is loaded. You get two views of it, one at the column level and the other at the table level. I have seen that in the latest version (version 7.0) this has been expanded into a separate section with many more visualizations. But I have a few more favorites in DSS which I often look at for another perspective on the dataset I’m using:
- The Lab section is great: it helps create quick models for validation purposes
- AutoML wizard is also very useful
- Post model training analysis sections
3) Tell us about your projects!
I’m working on multiple projects but to name a few:
I recently developed an NLP library called NLP Profiler, which will eventually become part of the Better NLP library. What is it? Think of pandas’ describe(), but for analysing a text column. By default, describe() summarises only the numeric columns of a dataset with descriptive statistics, and I couldn’t find anything that generates the same kind of profile for text data. So I went ahead and wrote one. At the moment it does only basic analysis and isn’t yet equipped to handle data at scale, and there are many other small features I’d like to add. It is a work in progress, but I have already used it on multiple occasions, one of them being a Kaggle task; see my kernel.
Below are a couple of private DSS projects I have been working on while competing in online DS/ML competitions like the ones below:
DSS helped me create submissions and compare them with the ones I had created manually or with the help of a colleague working with me on this competition. We found that, for this specific dataset, our results didn’t differ much, but DSS offered a more systematic way to load the data, process it, set up the model, train it, and generate the submission dataset really quickly. It was great to see how, on a low-spec laptop, DSS would still run through everything seamlessly, without straining the machine the way the same workflow did in a Jupyter notebook (which isn’t really meant for that anyway).
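To make the "describe() for text" idea concrete, here is a minimal sketch in plain pandas. It is not NLP Profiler's actual API: the function name `describe_text`, the four features it computes, and the sample sentences are all assumptions chosen for illustration.

```python
import pandas as pd


def describe_text(column: pd.Series) -> pd.DataFrame:
    """Summarise simple per-row text features, in the spirit of describe().

    Hypothetical helper for illustration only; not part of NLP Profiler.
    """
    # Treat missing values as empty strings so every row yields a number.
    text = column.fillna("").astype(str)

    # Derive one numeric feature per column for each row of text.
    features = pd.DataFrame({
        "characters": text.str.len(),
        "words": text.str.split().str.len(),
        "digits": text.str.count(r"\d"),
        "uppercase_words": text.apply(
            lambda s: sum(word.isupper() for word in s.split())
        ),
    })

    # The features are numeric, so the usual describe() applies to them.
    return features.describe()


# Made-up sample data, including one missing value.
reviews = pd.Series([
    "Great tool",
    "DSS runs on my old laptop",
    "Version 7 added more charts",
    None,
])

summary = describe_text(reviews)
print(summary)
```

The design choice is simply to turn each text row into a handful of numbers first; once that is done, the standard descriptive statistics (count, mean, quartiles and so on) come for free.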
March 2020: Liverpool Ion Switching
This project involves time-series data. I ended up creating a simple ensemble model mainly composed of tree-based models.
I’m also involved in two other projects:
I also gave two talks back to back, at the end of June and the beginning of July:
These talks cover the bigger picture of the AI/ML/DL world I live in, my journey, and my perspectives. They also show how I focus on tool development, problem-solving, learning, and better practices and techniques for Software Engineers, Data Scientists and Machine Learning Engineers alike.
Do you have a Dataiku story? Share how you came to use Dataiku DSS or an interesting goal you accomplished with it!