Hi Dataiku Community! This is Kenji and I am here to answer your questions about Dataiku DSS 8.0, which we recently launched with a number of new features to help improve your data journey.
You may have a lot of questions about this release (and future ones), so I will do my best between now and September 11th to answer all of your questions and hear all of your feedback.
Before getting started, check out the DSS 8.0 release notes if you're not already familiar with them - and if you're new to AMAs, please review the Ask Me Anything Guidelines and the Dataiku Community Guidelines.
To participate, simply hit reply and craft your question. Be sure to tag me, @kenjil, so I can be notified of your post. I’ll be keeping an eye out as well, so not to worry if you forget to tag me (but it’s good practice!).
Let the questions begin!
As VP of Product at Dataiku, Kenji oversees the product roadmap and the user experience of the Dataiku DSS Enterprise AI Platform. He holds a PhD in pure mathematics from the University of Paris VII and directed documentary films before switching to data science and product management.
Thanks for being willing to take our questions.
This new feature looks promising for hiding large parts of the complexity of certain projects or potentially for gathering up and sharing parts of flows with less knowledgeable DSS users.
In fooling around with zones, I'm wondering at what point in my flows I should start creating new zones.
I also see that when I create my first zone, a default zone is created for anything that I did not choose to put into the newly created zone.
What are the inputs/outputs from zones? It looks like it might be a dataset.
Zones can be added or removed at any point. Working with them is a personal choice. They were designed to help manage large flows, but I see our data science people using them on virtually all projects, regardless of size. They can be useful to document the stages of a pipeline, or to delimit a specific experiment or personal working space. For example, you can add descriptions and tags for zones in the right-hand details panel, just like other flow objects.
If you have an existing flow without zones, use the Flow Zones View mode from the dropdown in the bottom left corner of the screen. This has a Hide Zones / Show Zones toggle - I suggest hiding the zones while refactoring an existing project.
You can add lots of empty zones into a flow, but they are automatically laid out according to the relationships between them. And when they are empty, they will just line up along the bottom of the screen in an effectively random order.
Yes, when you add your first zone you will be given a Default Zone. You can't delete this, but you can rename it. The Default Zone will be deleted when the last explicitly created zone is deleted. Personally I use it as my left-most zone, containing all my input datasets. As you say, anything not assigned explicitly to another zone is placed here.
The inputs and outputs of zones are stores, rather than recipes - the same items you can share between projects. It can be useful to think of zones in a flow as analogous to a set of linked projects in a project folder. In particular, notice the Share to Flow Zone option on the right-click context menu of a dataset. This allows you to pin a reference to a dataset into another zone. For example, imagine you have a massive flow and you want to do a little bit of experimentation. You want to use three existing datasets as the input data for the experiment. You can share the three datasets into your experimental zone. Then you can maximise that zone and you will see your three datasets ready for you to use.
I've been able to create a few zones in my projects. It seems like a bit of work for smaller projects.
Can you say a bit more about the value of flow zones:
What is the significance of outputs of zones being stores rather than recipes? I have not yet used linked projects. Can you say a bit more about this? Why is this important? I feel like I'm missing something here.
Hi Tom, we have just announced new tutorials for Dataiku 8 features at: https://community.dataiku.com/t5/Academy-Discussions/New-tutorials-for-Dataiku-8-features/m-p/9865, including material for Flow Zones.
Regarding applications, the docs say "business users can create their own instances of the application, set parameters, upload data, run the applications, and directly obtain results." - what level of user licence is required to do this? Can an explorer do all of the above, or will their experience with an application be limited in any way due to the scope of their licence?
Similarly, are there any considerations around permissions to a project or its datasets before packaging an app, or does packaging it up automatically make everything available to any Explorer who is permissioned into the underlying project?
Thanks for this, can't wait to get my hands on V8!
Concerning user profiles: Designers can both create and execute applications. Explorers can execute applications.
Concerning permissions: You can control the group of users that can execute the applications (and they get access to the data in the application instance).
The new feature about "Tag Categories" looks very interesting in terms of improving governance within DSS, especially in crowded nodes. Do you have any concrete examples of how we could use them? I've gone through the documentation notes, but they are not completely clear to me.
Tag categories greatly improve the tagging capabilities of DSS by enforcing some constraints.
You can configure DSS so that it prompts users to tag certain objects (Datasets, Recipes, and Notebooks, for example), and then accepts only a limited set of tags for this category.
For example, a DSS administrator can create a category “team” with the names of the various teams in the company to monitor their use of DSS. This will ensure that all projects can be tagged with one (or more) teams and avoid misspelled team names.
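The constraint a tag category enforces can be sketched in plain Python - this is a conceptual illustration only, not Dataiku code, and the category and team names are hypothetical:

```python
# Conceptual sketch of a tag category: the category defines a fixed set of
# allowed values, so a misspelled team name is rejected instead of silently
# becoming a new free-form tag.

ALLOWED_TAGS = {
    "team": {"marketing", "finance", "data-science"},  # hypothetical team names
}

def validate_tag(category: str, value: str) -> bool:
    """Accept a tag only if its value belongs to the allowed set for its category."""
    return value in ALLOWED_TAGS.get(category, set())

print(validate_tag("team", "finance"))  # True: a known team
print(validate_tag("team", "finanse"))  # False: misspelling is rejected
```

This is exactly the kind of mistake that free-form tagging allows and tag categories prevent: without the allowed-values list, "finanse" would simply have created a second, divergent tag.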
You can check out this link for a tutorial on tags + tag categories: New tutorials for Dataiku 8 features
I was particularly excited to see the new application as recipe feature. I am looking forward to being able to explore that feature once we have upgraded to v8. In the meantime, though, I have a few questions about this feature.
When sharing an application as recipe, two options are given - copying the project and creating a plugin. To be able to select the application as recipe from the new recipes menu, I'd think that one would need to create a plugin. Is that right? If so, how would one use it via the copy-the-project option?
Does an application as recipe require an input? I can think of a number of use cases for our situation where we could standardize the pulling of a set of data (eligible members, utilization counted in a certain way, etc.). Typically we use SQL Script recipes without dataset inputs to combine data from multiple data warehouse SQL tables to produce these results (otherwise we'd be creating datasets pointing to existing tables for easily a dozen or more tables). So the application as recipe wouldn't have any inputs in these uses.
How are the inputs used? Are they synced to the corresponding input dataset in the recipe? Or do they replace the inputs in the recipe?
I assume it's OK if the input datasets have compatible schemas rather than equivalent schemas? For example, if a dataset has 20 columns but only 3 are used in the recipe, it would be OK to use that dataset as an input as long as those 3 columns were present.
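The distinction drawn here between compatible and equivalent schemas can be sketched as a simple column-subset check - a conceptual illustration only, with hypothetical column names, not Dataiku's actual schema-matching logic:

```python
def schema_is_compatible(required_columns, dataset_columns):
    """A dataset is compatible with a recipe if it contains every column the
    recipe actually uses; extra columns in the dataset are simply ignored."""
    return set(required_columns) <= set(dataset_columns)

# The recipe uses 3 columns; the candidate dataset has many more (a few shown).
recipe_needs = ["member_id", "claim_date", "amount"]  # hypothetical names
wide_dataset = ["member_id", "claim_date", "amount", "region", "plan_type"]

print(schema_is_compatible(recipe_needs, wide_dataset))                 # True: extras ignored
print(schema_is_compatible(recipe_needs, ["member_id", "amount"]))      # False: claim_date missing
```

Under this reading, "equivalent" would demand the two column sets be identical, while "compatible" only demands the required columns be present.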