Added on June 11, 2024 2:34PM
As an organization, we utilize three distinct DSS instances (design, automation, and self-service) to manage our data analytics and ML workflows.
A significant portion of our critical data products consists of datasets created through ETL processes that adhere to specific data models. These datasets are shared across the self-service and design instances. However, a critical challenge we face is the lack of metadata synchronization with the production node. The metadata of datasets generated in the automation instance (datasets which are stored and shared in a data warehouse) is inaccessible to the self-service and design instances.
We request the introduction of a feature that enables metadata synchronization (and lineage within Dataiku projects) between datasets shared across distinct DSS instances.
A feasible method for implementing this feature could involve centralizing the data catalogues in a single instance. This approach is similar to the functionality currently available in the Govern and Deployer Dataiku instances.
Implementing metadata synchronization across DSS instances would significantly enhance our data management capabilities, ensuring seamless integration and consistency of data products across the organization. This feature would streamline our workflows and improve overall productivity and data governance. Furthermore, it would be useful to any other organization handling some of their ETL processes in Dataiku.
I see it's been a while since this was discussed, but I'm curious—has anyone found a good way to automate metadata sync across multiple DSS instances? I imagine custom APIs or some scripting might work, but I'd love to hear if anyone has a setup that works well in practice. Also, how do you handle conflicts when metadata updates happen in different instances at the same time?
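On the conflict question, one simple policy is last-writer-wins: each instance records an update timestamp alongside its metadata, and the newer version is kept. A minimal sketch of that idea; the `updated_at` field name is illustrative, not anything Dataiku provides:

```python
# Last-writer-wins conflict resolution for metadata dicts coming from
# two different DSS instances. Assumes each dict carries an
# "updated_at" epoch timestamp (an illustrative convention, not a
# built-in Dataiku field).

def resolve_conflict(a, b):
    """Return the metadata dict with the newer 'updated_at' timestamp."""
    return a if a.get("updated_at", 0) >= b.get("updated_at", 0) else b
```

Last-writer-wins is crude (a stale instance can silently overwrite a field-level change), but it is deterministic and easy to reason about; anything finer-grained would need per-field merging.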
You could try using an external database to store and sync metadata across instances. A central metadata repository with scheduled sync jobs might help keep everything consistent and up to date.
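To make the central-repository idea concrete, here is a minimal sketch where each instance pushes its dataset metadata as JSON rows keyed by (project, dataset), and other instances pull them. SQLite stands in for the shared external database; the schema and field names are assumptions for illustration:

```python
import json
import sqlite3
import time

# Sketch of a central metadata repository shared by several DSS
# instances. Each instance periodically pushes dataset metadata
# (tags, descriptions, last-build time, ...) as a JSON row; the most
# recent push for a given (project, dataset) wins.

def init_repo(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dataset_metadata (
               instance TEXT, project TEXT, dataset TEXT,
               metadata TEXT, updated_at REAL,
               PRIMARY KEY (project, dataset))"""
    )

def push_metadata(conn, instance, project, dataset, metadata):
    # INSERT OR REPLACE: the latest push overwrites the previous row.
    conn.execute(
        "INSERT OR REPLACE INTO dataset_metadata VALUES (?, ?, ?, ?, ?)",
        (instance, project, dataset, json.dumps(metadata), time.time()),
    )

def pull_metadata(conn, project, dataset):
    row = conn.execute(
        "SELECT metadata FROM dataset_metadata WHERE project = ? AND dataset = ?",
        (project, dataset),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

In practice the automation node would run the push on a schedule, and the self-service and design nodes would pull on demand; a real deployment would use the existing data warehouse rather than SQLite.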
Yes, but that would be like using an external data catalog solution. Unless you are talking about the metadata that Dataiku uses internally, so the Data Catalogs for each DSS instance remain the same, and the lineage is also shown in the Dataiku interface.
In that sense, @Stemmkoli's curiosity goes in the right direction. If the Dataiku API somehow allowed creating scripts to sync the Dataiku metadata between instances, that would be great.
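The public Python client (`dataikuapi`) does expose dataset metadata, so a partial sync is scriptable today. A hedged sketch, assuming matching project/dataset names on both instances; the hosts and API keys are placeholders, and note that `get_metadata()`/`set_metadata()` cover tags, checklists, and custom fields, but not the computed lineage shown in the interface:

```python
# Sketch of cross-instance metadata sync with Dataiku's public Python
# client (the dataiku-api-client package). Hosts, keys, and the choice
# of fields to copy are assumptions; lineage itself is computed by each
# instance and cannot be pushed this way.

def merge_dataset_metadata(source_meta, target_meta):
    """Copy catalog-relevant fields from the source metadata dict onto
    a copy of the target metadata dict."""
    merged = dict(target_meta)
    for key in ("tags", "custom", "checklists"):
        if key in source_meta:
            merged[key] = source_meta[key]
    return merged

def sync_dataset(src_host, src_key, dst_host, dst_key, project, dataset):
    import dataikuapi  # requires: pip install dataiku-api-client
    src = (dataikuapi.DSSClient(src_host, src_key)
           .get_project(project).get_dataset(dataset))
    dst = (dataikuapi.DSSClient(dst_host, dst_key)
           .get_project(project).get_dataset(dataset))
    dst.set_metadata(merge_dataset_metadata(src.get_metadata(),
                                            dst.get_metadata()))
```

This would cover tags and descriptions but not the Flow-level lineage, which is exactly the gap the product idea is about.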
Wouldn't a project deployment sync your datasets' metadata? I don't see the extra need to synchronize this without a deployment. Or am I missing something here?
OK, I see that this is something missing from the product idea description. The idea is that in the self-service instance, the user can interact with the data products made in the automation node, using the same new lineage features that Dataiku has been introducing.
This is still not clear enough (sorry, the language barrier is interfering with my ideas), so let me give another example. If I go to the automation node (where many ETLs run, and schemas and tables are created), I can create a Data Collection using Dataiku's capabilities. The Data Collection in the automation node will have all the information we have added to the tables, such as Dataiku comments, stewards, last build, etc. Also, inspecting the lineage in the Dataiku interface works out of the box.
However, the users in the self-service node are not allowed to work in the automation node. So, we need to build a new Data Collection in the self-service node's data catalog to try to match what is visible in the automation node, but we lose all the lineage features, the Dataiku comments, the stewards, the last build date, the data quality checks, etc.
If we ran the ETLs in the self-service node instead of the automation node, we wouldn't have this problem, but that would defeat the purpose of having an automation node.
Please let me know if it is still unclear.
I ran into a similar issue before, and one thing that helped was setting up a regular metadata sync using APIs. If you're trying to maintain consistency across multiple DSS instances, automating the sync with a scheduled job can reduce mismatches. It's kind of like how top DAM solutions for photographers maintain metadata integrity across different platforms, ensuring files stay organized no matter where they're accessed.
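To sketch the "scheduled job" part: a small staleness check decides which datasets are due for a re-sync, and the actual sync call is plugged in as a function. In practice this would run from cron or a DSS scenario; the one-hour interval and the dict-of-timestamps shape are assumptions:

```python
import time

# Sketch of a scheduled metadata sync driver. `datasets` maps dataset
# name -> epoch timestamp of its last sync (None if never synced);
# `sync_one` is whatever function actually pushes the metadata.
# The interval is an illustrative choice, not a Dataiku default.

SYNC_INTERVAL_S = 3600  # re-sync at most once per hour

def due_for_sync(last_synced_at, now=None, interval=SYNC_INTERVAL_S):
    now = time.time() if now is None else now
    return last_synced_at is None or (now - last_synced_at) >= interval

def run_pending(datasets, sync_one, now=None):
    """Call sync_one(name) for each dataset whose metadata is stale,
    and return the list of dataset names that were synced."""
    synced = []
    for name, last_synced_at in datasets.items():
        if due_for_sync(last_synced_at, now=now):
            sync_one(name)
            synced.append(name)
    return synced
```

Keeping the scheduling logic separate from the sync call makes it easy to swap in the API-based sync discussed above, or to test the driver without touching any instance.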
Thank you, everyone, for your comments. I definitely gave a confusing description of the product idea by not being specific about which metadata I meant.
I was not talking about the metadata that can be stored in the target databases (like table and column descriptions), but about Dataiku's own metadata associated with each dataset, which provides the information for the DSS interfaces available under "Data Catalog", "Data Collection", and "Column Lineage".
Here is a link to a video where I try to show what I'm talking about:
Notice how the "Data Collection" I create in the automation node, which includes Dataiku metadata such as the last build, dataset comments, data quality, and lineage, is lost when I then create a "Data Collection" in the self-service node.
Of course, this is not a bug; I'm connecting to the new datasets by importing the prepared tables from our PostgreSQL data warehouse. But I don't see how I can easily communicate or synchronize the Dataiku metadata about the parent flow, last build time, data quality criteria, etc.