Is your project ready for collaboration in 2020? One of the top three trends in data analytics that we’ll see more of in 2020 is a convergence of data science and business intelligence teams (visit the Dataiku blog to find out more). This level of collaboration requires pre-planning.
Dataiku DSS has built-in tools to help you on your path to easy and snag-free collaboration. We surveyed some Dataiku insiders to create this collection of less obvious collaboration tips. In this article, you'll learn about the following best practices and more:
Where to add comments and descriptions so that collaborators can find them
Why you should avoid renaming a dataset
Editing the column schema to add column descriptions and comments
Creative ways to use tags
Letting users navigate the Flow by clicking links in a wiki
Turning custom code into plugins that non-coders can use
Write informative project names
Project names are less restrictive than names for datasets and recipes, so it is a good idea to be explicit, e.g., Data Ingestion instead of p001_data_ingestion. An informative project name lets others know what the project is about. You can add a description that includes the intent of the project, the name of the owner who created it, the date it was created, and even a version. You can also set the project status, such as “draft”, and add tags.
Each project has a name and a unique project ID. Project names can be changed, but project IDs cannot.
Adopt a dataset naming convention
Some storage connections require dataset names that follow specific rules, yet you still want those who use your project to be able to understand the purpose of each dataset and recipe in a Flow. Therefore, a naming convention is one of the most important best practices. When your datasets and recipes follow a naming convention, you can more easily recover your previous work, share your work with others, and understand quickly what your colleagues are working on.
Adopt a naming convention that meets the data connection requirements while still maintaining readability. Datasets and recipes should be self-explanatory and short. The name should indicate the element's job in the Flow.
By default, Dataiku DSS creates names by appending the name of the operation to the input’s name. While this keeps things simple, the names grow longer with each step and soon make objects in your Flow hard to read.
Suggested naming convention
The following guidelines maintain naming compatibility with storage connections such as SQL dialects, HDFS, and Python data frame columns:
Use only alphanumeric characters and the underscore (“_”) character
Use only lowercase letters
Do not use spaces
Do not begin with a number
Use prefixes and suffixes (optional)
You can adopt prefixes and suffixes for your datasets, e.g., foo_t for a dataset in a SQL database, and foo_hdfs for an HDFS dataset.
Name your dataset based on its purpose in your project and how it is different from other datasets. For example, use foo_raw and foo_clean to help collaborators understand that the two datasets represent raw input data and clean output data. Apply these same best practices when naming columns, notebooks, and recipes.
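The convention above is easy to check mechanically. As a sketch (this helper is not part of DSS; it simply encodes the rules listed in this section), a short regular expression can validate candidate names before you create a dataset:

```python
import re

# Encodes the suggested convention: lowercase alphanumerics and
# underscores only, and the first character must not be a digit.
VALID_NAME = re.compile(r"^[a-z_][a-z0-9_]*$")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` follows the suggested naming convention."""
    return bool(VALID_NAME.match(name))

print(is_valid_dataset_name("foo_raw"))   # True
print(is_valid_dataset_name("Foo Raw"))   # False: uppercase and a space
print(is_valid_dataset_name("1_ingest"))  # False: begins with a number
```

The same check works for the suffixed variants such as foo_t or foo_hdfs, since underscores are allowed anywhere after the first character.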
Renaming a Dataset
Renaming datasets is not an officially supported operation. Since many elements in DSS and the physical data locations reference dataset names, inconsistencies can occur.
It is best to name a dataset correctly when you create it, and to avoid renaming it afterward. If you must rename a dataset, use the danger zone: select the dataset, then visit the Advanced tab under Settings.
Strategically add comments and descriptions to document your project
Imagine a colleague handing off their project to you. What descriptions would you want to see? Giving collaborators a brief description in strategic locations helps them remember why the project was created so they can be more productive.
Try adding comments and descriptions to these strategic locations in your project:
A description on the project homepage. You can add links to datasets, recipes, or any element of the project.
A description in the “summary” tab of a dataset or recipe.
Edit column details to add a short comment or description.
Add comments in the code of your custom recipes. Explain what you intend to do and what the code will be used for.
Tag each element in your Flow. This helps everyone identify, at a glance, the role of each part of the Flow. You can also tag elements with the name of the person responsible for them. You can even give each tag a color to convey meaning, e.g., red to indicate urgent.
Suggestions for applying tags
Thematic tags: tag branches dedicated to specific tasks (e.g., “insights”, “preprocessing”), and tag inputs as “sources”
Scheduling: tag the parts of your Flow that are scheduled to run automatically, so everyone knows that changing them will affect a production workflow
Progress status: “work in progress”, “done”, and “in production”
Attention: tag a collaborator to draw their attention to that part of the Flow
Create a wiki article (or two)
Invite collaborators to add to the project’s wiki. A wiki can let team members know why the project was created and exactly how the data was prepared. A wiki contains one or more articles and displays on the project’s home page. You can use Markdown and HTML/CSS, and even start with a pre-formatted template. In addition, by linking to project objects from the wiki, collaborators can jump to any of them with a single click.
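As a starting point, a short wiki article in Markdown might look like the sketch below (the project and dataset names are invented for illustration):

```markdown
# Customer Churn Pipeline

## Purpose
Why the project exists, who owns it, and when it was created.

## Data sources
- `customers_raw` – CRM export, refreshed nightly
- `orders_raw` – e-commerce database extract

## How the data is prepared
1. Ingest the raw sources
2. Clean and join them into `customers_clean`
3. Build features and train the model
```

Even a skeleton like this gives collaborators a single place to check before touching the Flow.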
Start a discussion
Using the Discussions tool can help all project users and viewers quickly collaborate on anything in the project Flow. You can start a discussion on any Dataiku DSS object, and follow it by visiting that object, by checking your centralized Dataiku DSS inbox, or by receiving email notifications. To start a discussion, click the Discussion icon.
Share findings through dashboards
Use dashboards to share findings with your team, including collaborators who have read-only access to your project. On your project insights page, you can see all insights, including charts, web apps, and model reports. With a few clicks, you can publish insights to a dashboard.
Write and share code samples
If you find yourself repeatedly writing similar portions of code, consider saving it as a code sample. Any team member can then insert the snippet with a click, saving time for all collaborators.
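A good candidate for a code sample is a small, general-purpose helper that every notebook or recipe tends to re-implement. For example (a hypothetical snippet, shown here only to illustrate the kind of code worth sharing), a helper that normalizes column names to match the naming convention from earlier in this article:

```python
import re

def normalize_columns(columns):
    """Normalize column names to the suggested convention:
    lowercase, with runs of spaces/punctuation collapsed to "_"."""
    normalized = []
    for col in columns:
        col = col.strip().lower()
        col = re.sub(r"[^a-z0-9]+", "_", col)  # collapse other chars to "_"
        normalized.append(col.strip("_"))      # drop leading/trailing "_"
    return normalized

print(normalize_columns(["Customer ID", "Order Date", "total ($)"]))
# ['customer_id', 'order_date', 'total']
```

Once saved as a code sample, any collaborator can drop it into a notebook or code recipe instead of rewriting it from scratch.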
Turn custom code into plugins
Dataiku’s integration of code allows you to accomplish anything within the platform through custom code. Plugins allow you to extend the Dataiku GUI by sharing your custom code. Any of the following can be included in a plugin: recipes, datasets, partitioned datasets, web apps, and machine learning algorithms.
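At a high level, a plugin is a folder of components, each pairing a declarative descriptor with the code it wraps. The sketch below is simplified and the file names shown should be checked against the Dataiku plugin documentation, but it conveys the idea of turning a shared script into a reusable GUI component:

```
my-plugin/
├── plugin.json          # plugin metadata (id, version, description)
├── python-lib/          # shared Python code importable by components
└── custom-recipes/
    └── my-recipe/
        ├── recipe.json  # declares inputs, outputs, and UI parameters
        └── recipe.py    # the code collaborators previously copy-pasted
```

The descriptor is what lets non-coders configure the component from the DSS interface, while the code itself stays in one maintained place.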
Now that you've learned about how to set your project up for collaboration success, you can visit the following resources to find out more about collaborative data science using DSS!