Snow Fox Data - Building a Free Plugin to Efficiently Catalog and View Data Lineage
Team members: Ryan Moore & Tony Olson
Country: United States
Organization: Snow Fox Data
Snow Fox Data is a premier data strategy, data science, and analytics solutions provider. Headquartered in Wisconsin and serving customers worldwide, we provide a vast landscape of knowledge that supports your success through data-driven decision-making. A passionate team of data architects, data scientists, data engineers, and data analysts, Snow Fox Data empowers you to make clearer decisions through clever data solutions.
At Snow Fox Data, we work with numerous customers who use Dataiku in their Data Science and Analytics practice. Many of these organizations and analytics groups have not yet invested in an enterprise data cataloging or data lineage tool, as these tools are often cost-prohibitive.
As part of the productionalization process for these customers, we have often seen them build "homegrown" data cataloging solutions, typically a combination of spreadsheets, Dataiku, and their preferred visualization tool. These "homegrown" solutions are labor-intensive to maintain and are not integrated into the workflow of the developers who are hands-on with the Dataiku projects.
Additionally, our clients struggle with data lineage. They create numerous downstream datasets in Dataiku, and we often hear them ask, "Where did that column come from?" Without visibility into upstream data lineage, our clients lose trust in the data and, ultimately, in the solution's business outcomes.
Because of this cataloging and lineage challenge, Snow Fox Data has created a free Dataiku plugin called THREAD™. THREAD™ is a lightweight catalog and lineage tool that integrates directly with Dataiku and its datasets. It provides a single place to document data connected to Dataiku and to consume the catalog's contents easily and efficiently in day-to-day business use.
THREAD™ is implemented as a Dataiku web app plugin that is easy to install and can securely scan all or part of a Dataiku node to support lineage views and documentation. The indexes and metadata that THREAD™ generates are saved as Dataiku datasets in a project flow, making it easy to export them for use in third-party visualization tools such as Power BI or Tableau.
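To make the idea of a lineage index concrete, here is a minimal sketch of how upstream lineage could be derived from a flat table of flow edges. It is not THREAD™'s actual implementation: the recipe and dataset names, the `RECIPES` structure, and the `build_edges` and `upstream_of` helpers are all hypothetical, and the flow is described in plain Python rather than read from a Dataiku node.

```python
from collections import defaultdict

# Hypothetical flow description: each recipe lists its input and output datasets.
# These names are illustrative only; they do not come from THREAD or a Dataiku API.
RECIPES = [
    {"name": "prepare_orders", "inputs": ["orders_raw"], "outputs": ["orders_clean"]},
    {"name": "join_customers", "inputs": ["orders_clean", "customers"], "outputs": ["orders_enriched"]},
    {"name": "aggregate_sales", "inputs": ["orders_enriched"], "outputs": ["sales_by_region"]},
]


def build_edges(recipes):
    """Flatten recipes into (upstream, recipe, downstream) rows, one per input/output pair."""
    return [
        (src, recipe["name"], dst)
        for recipe in recipes
        for src in recipe["inputs"]
        for dst in recipe["outputs"]
    ]


def upstream_of(dataset, edges):
    """Return every dataset that feeds `dataset`, directly or indirectly."""
    parents = defaultdict(set)
    for src, _recipe, dst in edges:
        parents[dst].add(src)

    seen, stack = set(), [dataset]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


edges = build_edges(RECIPES)
print(sorted(upstream_of("sales_by_region", edges)))
```

Because the edges form a flat table of rows, an index of this shape could be stored as an ordinary dataset and exported to a reporting tool, which is the property the paragraph above describes.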
Use Case Stage: In Production
THREAD™ has already been deployed on hundreds of projects at multiple clients shared by Snow Fox Data and Dataiku. Here are some areas of business value THREAD™ users have obtained:
Fewer clicks and time saved by having the data definition available at the time and place the information is needed.
More and better insights during exploratory data analysis through better-documented columns.
This all leads to faster solution building and data enrichment through documentation and improved data understanding.
Clear measurement of governance through KPIs showing the percent of columns documented in any dataset.
Creates a repository for data documentation.
Easier to keep definitions up to date.
Allows definitions to be easily auditable (exportable).
Natively integrated with Dataiku permissions that limit editing data definitions to those with access.
Creates easy transparency for data analysts, data engineers, data scientists, and business leaders to see:
What data was used in a project (data catalog).
Where it was used (upstream/downstream data lineage).
How that data is defined throughout the project (data dictionary).
Builds a common language between the business and analysts.
Training & Onboarding Efficiencies
Helps new team members learn company-specific jargon and abbreviations faster.
Streamlines onboarding and training by keeping all individuals in Dataiku instead of a myriad of spreadsheets and code documentation.
Saves Money and Labor
Saves analytics leaders $200k+ in purchasing, implementing, and supporting an enterprise-grade data catalog and data lineage tool for their Dataiku environment.
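The governance KPI mentioned in the list above, the percent of columns documented in a dataset, can be sketched in a few lines. This is an illustration under stated assumptions, not THREAD™'s actual calculation: the `documented_share` helper and the sample schema are hypothetical, and a column is assumed to count as documented when its `comment` field is non-blank.

```python
def documented_share(columns):
    """Percent of columns with a non-blank description; 0.0 for an empty schema."""
    if not columns:
        return 0.0
    # Assumption: a column is "documented" when its comment is a non-blank string.
    documented = sum(1 for col in columns if (col.get("comment") or "").strip())
    return round(100.0 * documented / len(columns), 1)


# Hypothetical schema: a list of column records, each with an optional description.
SCHEMA = [
    {"name": "order_id", "comment": "Unique order identifier"},
    {"name": "cust_id", "comment": "Customer key from the CRM"},
    {"name": "amt", "comment": "   "},  # whitespace only, treated as undocumented
    {"name": "rgn"},                    # no comment at all
]

print(documented_share(SCHEMA))  # 2 of 4 columns documented -> 50.0
```

Treating whitespace-only comments as undocumented is a design choice made here so that placeholder entries do not inflate the KPI.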
Value Brought by Dataiku:
THREAD™ is built on top of Dataiku! All the value THREAD™ creates is an extension of, and made possible by, Dataiku.
Dataiku's flexible and extensible platform allows the community to contribute and share solutions easily across organizations and industries. The ability to write custom plugins and integrate with the Python API provides the capability to achieve exceptional business value through custom integrations.
The native security integration removes governance concerns about building application solutions on top of Dataiku and thus speeds up the innovation process.