Doing a bit of Dataiku DSS Housekeeping

tgb417 · September 2021

All,

My DSS is looking a bit messy right now...

I think that I need a bit of cleanup.

Context:

I'm working on a DSS instance that is coming up on its one-year anniversary. In the catalog, we have about 1500 items created by a mix of folks currently working on the project and those who have rolled off of the project over time.

I'm thinking about doing a bit of cleanup.

I think that the Catalog should be helpful in taking these steps. However, I am not clear about the best approach.

Question:

How do you do general upkeep on your DSS Instance?
How do you archive projects, recipes, and Insights that are no longer in active use?
Is there a way to drop dataset content without dropping the project structure, so that if the project is needed again it could simply be refreshed?
Are folks using "Archive Folders" to get old unused projects off of the home screen?
How do you know which objects are safe to delete, particularly when there may be cross-project dependencies that were not set up through the Exposed objects feature?
What other general maintenance things do you do to keep your DSS instances clean, particularly when it comes to design nodes?

--Tom

Manuel · September 2021

Hi,

There is a mix of product features and best practices that you can use.

PRODUCT FEATURES:

Macros: Dataiku includes a series of macros to perform maintenance tasks, such as deleting jobs and temporary files.
Exposed objects: The exposed objects functionality, specifically the sharing of datasets between projects, reduces redundancy and storage use within the DSS instance and the connected data platforms.
If many projects are likely to use the same raw data and transformation recipes, then it is recommended that these steps be part of a single preparation project, with the resulting dataset shared with downstream projects.
Pipelines: pipelines (Spark or SQL) enable you to execute long data pipelines with multiple steps in one go, without always having to write the intermediate data and re-read it at the next step.

BEST PRACTICES:

Purge intermediate datasets: 1) virtualise them with pipelines, or 2) use a specific connection for intermediate datasets and drop the datasets at agreed intervals, or 3) tag the datasets to be purged and create a macro that purges all datasets with that tag.
Remove old projects: Tag the projects to be removed. Create a macro to archive and delete the tagged projects.
Manual cleanup: Besides the cleanup macros, there is also a list of recommended manual cleanups to manage disk usage.
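
As a rough illustration of the tag-based approach above, here is a minimal Python sketch rather than a finished macro. It assumes a client object from the Dataiku public API (the dataikuapi package) and relies on list_project_keys, get_project, list_datasets, get_dataset, clear, get_metadata, export_to_file and delete behaving as I understand them; please check them against the API reference for your DSS version. The tag names "to-purge" and "to-archive" and the function names are made up for the example. Clearing a dataset removes its data but leaves the dataset and the Flow in place, so the project can be rebuilt later.

```python
import os


def find_tagged_datasets(client, tag):
    """Return (project_key, dataset_name) pairs for datasets carrying `tag`."""
    matches = []
    for project_key in client.list_project_keys():
        project = client.get_project(project_key)
        for item in project.list_datasets():
            # Assumption: each list item is dict-like with "name" and "tags".
            if tag in item.get("tags", []):
                matches.append((project_key, item["name"]))
    return matches


def purge_tagged_datasets(client, tag="to-purge", dry_run=True):
    """Clear the data of tagged datasets; definitions and Flow are kept."""
    matches = find_tagged_datasets(client, tag)
    if not dry_run:
        for project_key, name in matches:
            client.get_project(project_key).get_dataset(name).clear()
    return matches


def archive_tagged_projects(client, tag="to-archive", archive_dir=".",
                            dry_run=True):
    """Export each tagged project to a zip file, then delete it."""
    archived = []
    for project_key in client.list_project_keys():
        project = client.get_project(project_key)
        # Assumption: project metadata is a dict with a "tags" list.
        if tag not in project.get_metadata().get("tags", []):
            continue
        if not dry_run:
            # Export first, so the project is only deleted once archived.
            path = os.path.join(archive_dir, project_key + ".zip")
            project.export_to_file(path)
            project.delete()
        archived.append(project_key)
    return archived
```

Both functions default to dry_run=True, so a first run only reports what would be touched. Outside DSS, the client would typically be created with dataikuapi.DSSClient(host, api_key); inside a DSS notebook or macro, dataiku.api_client() should return an equivalent object. Neither function looks at cross-project dependencies, so review the dry-run list before switching dry_run off.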

I hope this helps.

Best regards

tgb417 · September 2021

@Manuel,

Thanks for the feedback; these suggestions are helpful.

I’m also super interested in what others are actually doing with their DSS infrastructure. If you are running a DSS instance, I’d like to invite you to share your best practices, tips, and tricks for keeping your DSS infrastructure running smoothly.

--Tom
