Doing a bit of Dataiku DSS housekeeping

tgb417

All,

My DSS is looking a bit messy right now...

[Screenshot: DSS-Cleanup1-350.png]

I think that I need a bit of cleanup.

Context:

I'm working on a DSS instance that is coming up on its one-year anniversary. In the catalog, we have about 1,500 items created by a mix of folks currently working on the project and folks who have rolled off the project over time.

I'm thinking about doing a bit of cleanup.

I think that the Catalog should be helpful in taking these steps. However, I am not clear about the best approach.

Question:

  • How do you do general upkeep on your DSS instance?
  • How do you archive projects, recipes, and insights that are no longer in active use?
  • Is there a way to drop dataset content without dropping the project structure, so that if the project is needed again it can simply be refreshed?
  • Are folks using "Archive Folders" to get old, unused projects off the home screen?
  • How do you know which objects are safe to delete, particularly when there may be cross-project dependencies that were not set up through the Exposed objects feature?
  • What other general maintenance do you do to keep your DSS instances clean, particularly when it comes to design nodes?

--Tom


Answers

  • Manuel

    Hi,

    There is a mix of product features and best practices that you can use.

    PRODUCT FEATURES:

    • Macros: Dataiku includes a series of macros to perform maintenance tasks, such as deleting jobs and temporary files.
    • Exposed objects: The exposed objects functionality, specifically the sharing of datasets between projects, minimizes redundancy and saves space on the DSS instance and the connected data platforms.

      If many projects are likely to use the same raw data and transformation recipes, it is recommended to put those steps in a single preparation project and share the resulting dataset with downstream projects (see the short sketch after this list).

    • Pipelines: pipelines (Spark or SQL) enable you to execute long data pipelines with multiple steps in one go, without always having to write the intermediate data and re-read it at the next step.
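
    For example, once a dataset has been exposed from the preparation project, downstream projects can read it with the "PROJECTKEY.dataset_name" syntax. A minimal sketch, assuming a preparation project with key PREP that exposes a dataset named customers_prepared (both names are placeholders):

    ```python
    # Runs inside a Python recipe or notebook in the downstream project.
    # Assumes the dataset "customers_prepared" from project "PREP" has been
    # exposed to this project via Exposed objects (names are illustrative).
    import dataiku

    # Shared datasets are referenced as "PROJECTKEY.dataset_name"
    shared = dataiku.Dataset("PREP.customers_prepared")
    df = shared.get_dataframe()  # read it like any local dataset
    ```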

    BEST PRACTICES:

    • Purge intermediate datasets: 1) virtualise them with pipelines, 2) use a dedicated connection for intermediate datasets and drop those datasets at agreed intervals, or 3) tag the datasets to be purged and create a macro that purges all datasets with that tag (a minimal API sketch follows this list).
    • Remove old projects: Tag the projects to be removed, then create a macro to archive and delete the tagged projects.
    • Manual cleanup: Besides the cleanup macros, there is also a list of recommended manual cleanups for managing disk usage.
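
    As a rough illustration of purge option 3 and the project-removal practice above, here is a minimal sketch using the public Python API (dataikuapi). The tag names ("to_purge", "to_archive"), the host, the API key, and the backup path are placeholders, and the exact client calls may vary slightly between DSS versions:

    ```python
    # Tag-driven cleanup sketch (illustrative, not a turnkey macro).
    import dataikuapi

    # Placeholder host and API key
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

    for project_key in client.list_project_keys():
        project = client.get_project(project_key)

        # 1) Drop the data of datasets tagged "to_purge", keeping the
        #    datasets themselves (and the flow) in place.
        for ds in project.list_datasets():  # items behave like dicts
            dataset = project.get_dataset(ds["name"])
            if "to_purge" in dataset.get_metadata().get("tags", []):
                dataset.clear()  # clears the data, keeps schema and flow

        # 2) Archive, then delete, projects tagged "to_archive".
        if "to_archive" in project.get_metadata().get("tags", []):
            project.export_to_file("/backups/%s.zip" % project_key)  # restorable copy
            project.delete()
    ```

    In practice you would wrap something like this in a custom macro or a scenario Python step and run it on an agreed schedule.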

    I hope this helps.

    Best regards

  • tgb417

    @Manuel,

    Thanks for the feedback; these suggestions are helpful.

    I’m also super interested in what others are actually doing with their DSS infrastructure. If you are running a DSS instance, I’d like to invite you to share your best practices and tips and tricks for keeping your DSS infrastructure running smoothly.

    —Tom
