Checksum of file in managed folder

Solved!
stanjer
Level 1

Hello community,

I have a use case where I want to upload a file to a DSS managed folder via the Python API. The question came up whether there is a way to get the checksum of the file in the managed folder via the API. I want to verify after the upload that the file in DSS is indeed still the same file that I uploaded.

Is it possible to do this via the API, or would I have to connect to the server directly via SSH? Any other ideas on how to achieve this?

Thank you for any input!

 

1 Solution
tgb417

@stanjer ,

I am working with files on a locally mounted volume. Here is the script I use to compute MD5 checksums from a Macintosh computer. Other OS variants may need slightly different commands.


The output of this script is set up so that:

  • It writes to a specific Dataiku dataset connected to the recipe
  • "Auto-infer output schema" is checked
  • "Treat First Line as Header" is not checked.

You could put in an echo statement to output a header. 

MOUNTVOLUME="test"
find "/Volumes/$MOUNTVOLUME" -type f -exec md5 {} ';'

I then use a visual prepare recipe with a few steps to extract the useful data.
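For readers who prefer code over a visual recipe, the same extraction can be sketched in Python. This is a hypothetical helper, assuming the macOS `md5` output format `MD5 (<path>) = <hash>`:

```python
import re

# macOS `md5` prints lines of the form: MD5 (/Volumes/test/file.csv) = <32 hex chars>
MD5_LINE = re.compile(r"^MD5 \((?P<path>.+)\) = (?P<hash>[0-9a-f]{32})$")

def parse_md5_output(lines):
    """Extract (path, hash) pairs from macOS `md5` output lines."""
    rows = []
    for line in lines:
        match = MD5_LINE.match(line.strip())
        if match:
            rows.append((match.group("path"), match.group("hash")))
    return rows
```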

Note that this method would only work with some kind of locally mounted file system.

This method is also considerably faster than many others I've tried: I was able to process ~450,000 files in about 8-9 hours over a slowish connection.

--Tom


4 Replies
tgb417

@stanjer 

Welcome to the Dataiku Community.

You might find the following threads a bit helpful.

https://community.dataiku.com/t5/Using-Dataiku/quot-Finger-Printing-quot-files-in-a-Managed-Folder/m...

On a project I ran a few years ago, I found that computing checksums via a Shell recipe was several times faster than other methods. It will depend on the size of your file(s) and whether you need to use Dataiku connections in your use case.

--Tom
stanjer
Level 1
Author

Hi Tom,

Thank you for the reply, and sorry that it took me a little longer to get back. Your comment was indeed very helpful for us. Although we have put this topic in the backlog for now, the Shell recipe is an interesting solution I didn't have on my radar until now.

Kind regards

Jan

tgb417

@stanjer 

Glad this helped at least a little bit.

--Tom