Checksum of file in managed folder

stanjer
stanjer Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 2

Hello community,

I have a use case where I want to upload a file to a DSS managed folder via the Python API. The question came up whether it is possible to get the checksum of the file in the managed folder via the API. After the upload, I want to verify that the file in DSS is still identical to the file I uploaded.

Is it possible to do this via the API, or would I have to connect to the server directly over SSH? Are there any other ideas for achieving this?

Thank you for any input!

Best Answer

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,601 Neuron
    edited July 17 Answer ✓

    @stanjer,

    I am working with files on a locally mounted volume. Here is the script that I use to compute MD5 checksums from a Macintosh computer, run in a Shell recipe. Other OS variants may need slightly different scripts.


    The output of this script is set up as follows:

    • It is written to a specific Dataiku dataset connected to the recipe
    • "Auto-infer output schema" is checked
    • "Treat First Line as Header" is not checked

    You could put in an echo statement to output a header.

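    # Volume name under /Volumes (macOS); adjust for your mount point.
    MOUNTVOLUME="test"
    # Print an MD5 checksum for every regular file on the volume.
    find "/Volumes/$MOUNTVOLUME" -type f -exec md5 {} ';'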

    I then use a visual prepare recipe with a few steps to extract the useful data.
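    For anyone who would rather do that extraction in code, here is a minimal Python sketch of roughly what those steps pull out. It assumes the usual macOS `md5` output format, `MD5 (path) = digest`; the name `parse_md5_line` is hypothetical and not part of any Dataiku API.

```python
import re

# macOS `md5` prints one line per file, for example:
#   MD5 (/Volumes/test/a.txt) = d41d8cd98f00b204e9800998ecf8427e
# The greedy path group keeps file names that contain parentheses intact.
MD5_LINE = re.compile(r"^MD5 \((?P<path>.*)\) = (?P<digest>[0-9a-f]{32})$")


def parse_md5_line(line):
    """Return (path, digest) for one line of macOS md5 output, or None."""
    match = MD5_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group("path"), match.group("digest")
```

    Lines that do not match the expected format return `None`, so they can be filtered out or logged instead of silently producing bad rows.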

    Note that this method would only work with some kind of locally mounted file system.

    This method is also considerably faster than many other methods I've tried. I was able to checksum ~450,000 files in about 8-9 hours on a slowish connection.

Answers

  • tgb417

    @stanjer

    Welcome to the Dataiku Community.

    You might find the following thread helpful.

    https://community.dataiku.com/t5/Using-Dataiku/quot-Finger-Printing-quot-files-in-a-Managed-Folder/m-p/21914

    On a project I ran a few years ago, I found that computing checksums via the Shell recipe was several times faster than other methods. How well this works will depend on the size of your file(s) and on whether your use case needs to go through Dataiku connections.
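    If the Shell recipe is not an option, one of the slower alternatives is to read the file back and hash it in Python, then compare that with the checksum of the local copy. Below is a minimal sketch of the comparison only. The helper name `md5_of_stream` is hypothetical, and the two `io.BytesIO` objects are stand-ins: how you obtain a readable stream for the file in the managed folder (for example, through the download call of the Python API client) is an assumption left to your setup.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary file-like object in chunks so large files fit in memory."""
    digest = hashlib.md5()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-ins (assumptions): replace with open(local_path, "rb") and with the
# stream read back from the managed folder after the upload.
local_copy = io.BytesIO(b"example payload")
remote_copy = io.BytesIO(b"example payload")

checksums_match = md5_of_stream(local_copy) == md5_of_stream(remote_copy)
```

    Note that this downloads the whole file again, so for many or very large files it will likely be slower than hashing on the server side.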

  • stanjer

    Hi Tom,

    Thank you for the reply, and sorry that it took me a little longer to get back to you. Your comment was indeed very helpful for us. Although we have put this topic in the backlog for now, the Shell recipe is an interesting solution that I didn't have on my radar until now.

    Kind regards

    Jan

  • tgb417

    @stanjer

    Glad this helped at least a little bit.
