Is your project ready for collaboration in 2020? One of the top three trends in data analytics that we’ll see more of in 2020 is a convergence of data science and business intelligence teams (visit the Dataiku blog to find out more). This level of collaboration requires pre-planning.
Dataiku DSS has built-in tools to help you on your path to easy and snag-free collaboration. We surveyed some Dataiku insiders to create this collection of less obvious collaboration tips. In this article, you'll learn about the following best practices and more:
Project names are less restrictive than dataset and recipe names, so it is a good idea to be explicit, e.g., Data Ingestion instead of p001_data_ingestion. An informative project name lets others know what the project is about. You can add a description that includes the intent of the project, the name of the person who created it, the creation date, and even a version. You can also set the project status, such as “draft”, and add tags.
Each project has a name and a unique project ID. Project names can be changed, but project IDs cannot.
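Because the project key is permanent while the display name can change, some teams derive the key from the intended name up front. A minimal sketch, where the key-derivation convention and the commented-out `dataikuapi` call are illustrative assumptions rather than DSS requirements:

```python
import re

def project_key_from_name(name):
    """Derive an immutable project key from a display name.

    Hypothetical convention: alphanumerics only, words joined by
    underscores, uppercased (e.g. "Data Ingestion" -> "DATA_INGESTION").
    """
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").upper()

# Creating the project through the public API client (requires a running
# DSS instance; the host URL and API key below are placeholders):
# import dataikuapi
# client = dataikuapi.DSSClient("http://localhost:11200", "my-api-key")
# client.create_project(project_key_from_name("Data Ingestion"),
#                       "Data Ingestion", "admin")
```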
Some storage connections require dataset names that follow specific rules, yet you still want those who use your project to be able to understand the purpose of each dataset and recipe in a Flow. Therefore, a naming convention is one of the most important best practices. When your datasets and recipes follow a naming convention, you can more easily recover your previous work, share your work with others, and understand quickly what your colleagues are working on.
Adopt a naming convention that meets the data connection requirements while maintaining readability. Dataset and recipe names should be short and self-explanatory, indicating the element's job in the Flow.
By default, Dataiku DSS creates names by appending the name of the operation to the input’s name. While this keeps things simple at first, names compound with each chained operation, making objects in your Flow hard to read.
The following guidelines maintain naming compatibility with storage connections such as SQL dialects, HDFS, and Python data frame columns:
You can adopt prefixes and suffixes for your datasets, e.g., foo_t for a dataset in a SQL database, and foo_hdfs for an HDFS dataset.
Name your dataset based on its purpose in your project and how it is different from other datasets. For example, use foo_raw and foo_clean to help collaborators understand that the two datasets represent raw input data and clean output data. Apply these same best practices when naming columns, notebooks, and recipes.
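A convention like the one above can be enforced with a small check. This sketch assumes a hypothetical team convention (lowercase snake_case plus a stage suffix); adapt the pattern and suffix list to your own rules:

```python
import re

# Hypothetical team convention: lowercase snake_case, ending in a stage
# suffix that tells collaborators the dataset's role in the Flow.
STAGE_SUFFIXES = ("_raw", "_clean", "_joined", "_scored")

def is_valid_dataset_name(name):
    """Check a dataset name against the convention sketched above."""
    # SQL/HDFS-safe: lowercase letters, digits, and underscores only,
    # starting with a letter.
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        return False
    return name.endswith(STAGE_SUFFIXES)
```

For example, `is_valid_dataset_name("orders_raw")` passes, while `Orders-Raw` fails on both the character set and the missing stage suffix.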
Renaming datasets is not an officially supported operation. Since many elements in DSS and the physical data locations reference dataset names, inconsistencies can occur.
It is best to name a dataset correctly when you create it, and to avoid renaming it later. If you must rename a dataset, use the Danger zone: select the dataset, then open the Advanced tab under Settings.
Imagine a colleague handing off their project to you. What descriptions would you want to see? Giving collaborators a brief description in strategic locations helps them remember why the project was created so they can be more productive.
Try adding comments and descriptions to these strategic locations in your project:
Tag each element in your Flow. Tags help everyone identify, at a glance, the role of each part of the Flow. You can also tag elements with the name of the person responsible for them, and even assign a color to each tag to convey meaning, e.g., red to indicate urgency.
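If you manage tags programmatically, a small helper keeps them duplicate-free. The `add_tags` helper below is illustrative; the commented-out `dataikuapi` calls assume the metadata dict carries tags under a `"tags"` key, so verify against your instance's API before relying on them:

```python
def add_tags(metadata, new_tags):
    """Merge tags into a DSS object's metadata dict without duplicates.

    `metadata` mirrors the dict returned by the API client's
    get_metadata() (assumed here to keep tags under a "tags" key).
    """
    existing = metadata.get("tags", [])
    metadata["tags"] = existing + [t for t in new_tags if t not in existing]
    return metadata

# Applying it through the public API client (placeholders, unverified):
# dataset = client.get_project("DATA_INGESTION").get_dataset("orders_raw")
# md = dataset.get_metadata()
# dataset.set_metadata(add_tags(md, ["ingestion", "owner:jane"]))
```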
Suggestions for applying tags
Invite collaborators to add to the project’s wiki. A wiki lets team members know why the project was created and exactly how the data was prepared. A wiki contains one or more articles and displays on the project’s home page. You can write articles in Markdown and HTML/CSS, or start from a pre-formatted template. In addition, if you reference project objects within the wiki, collaborators can open them with a single click.
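To keep wiki articles consistent across projects, some teams generate a Markdown skeleton and paste it into a new article. The section headings below are a suggested structure, not something DSS requires:

```python
def wiki_article_skeleton(purpose, owner, steps):
    """Build a Markdown skeleton for a project wiki article.

    The headings are a suggested structure covering why the project was
    created and how the data was prepared.
    """
    lines = [
        "# Project overview",
        "",
        "## Purpose",
        purpose,
        "",
        "## Owner",
        owner,
        "",
        "## How the data was prepared",
    ]
    # Numbered list of preparation steps.
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    return "\n".join(lines)
```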
Using the Discussions tool helps all project users and viewers collaborate quickly about anything in the project Flow. You can start a discussion on any Dataiku DSS object and follow it by visiting that object, checking your centralized Dataiku DSS inbox, or enabling email notifications. To start a discussion, click the Discussion icon.
Use dashboards to share findings with your team, including collaborators who have read-only access to your project. The project’s insights page lists all insights, including charts, web apps, and model reports. With a few clicks, you can publish insights to a dashboard.
If you find yourself repeatedly writing similar portions of code, consider writing a code sample. Any team member can then click to add the code snippet, saving time for all collaborators.
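For instance, column-name cleanup is the kind of boilerplate many teams re-type in every notebook and recipe, which makes it a good candidate for a shared code sample. A hypothetical example:

```python
import re

def normalize_columns(names):
    """Rewrite column names as snake_case (e.g. "Order ID" -> "order_id").

    In a DSS notebook you might apply this to a pandas DataFrame with
    df.columns = normalize_columns(df.columns); the helper itself works
    on plain strings and has no dependencies.
    """
    return [re.sub(r"[^a-z0-9]+", "_", n.lower()).strip("_") for n in names]
```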
Dataiku’s code integration lets you accomplish virtually anything within the platform through custom code, and plugins let you extend the Dataiku GUI by sharing that code. A plugin can include recipes, datasets, partitioned datasets, web apps, and machine learning algorithms.
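To make this concrete, here is an illustrative recipe descriptor (`recipe.json`) of the kind a plugin bundles alongside its Python code. The field names follow Dataiku's plugin component format as commonly documented, but check the plugin developer documentation for your DSS version before relying on them:

```json
{
  "meta": {
    "label": "Deduplicate rows",
    "description": "Removes duplicate rows from the input dataset"
  },
  "kind": "PYTHON",
  "inputRoles": [{"name": "input", "arity": "UNARY", "required": true}],
  "outputRoles": [{"name": "output", "arity": "UNARY", "required": true}],
  "params": []
}
```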
Now that you've learned about how to set your project up for collaboration success, you can visit the following resources to find out more about collaborative data science using DSS!