"Finger Printing" files in a Managed Folder

tgb417

I have a managed folder out on an SFTP Dataiku connection with lots of files (hundreds of thousands to millions of files).

I'm able to open the connection and get basic file details 😀

#...
import dataiku

input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()

#...

# Collect the details (size, modification time, etc.) for each path
path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...

However, the information that .get_path_details(path=path) returns this way is not sufficient to guarantee the uniqueness of the files. I want to determine uniqueness from the actual file contents, not the file's location.

I'm wondering if anyone has used a Dataiku Managed Folder to do a cyclic redundancy check (CRC) against the non-seekable file-like objects that these calls produce. (I understand that this will be slow.)
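
For illustration, this is roughly what I have in mind (a sketch only; "myfile" stands in for a real path, and the 64 KB chunk size is an arbitrary choice):

import zlib
import dataiku

input_folder = dataiku.Folder("AAAAAAAA")

# Update the CRC32 chunk by chunk, since the download stream is not
# seekable and the file may be too large to hold in memory.
crc = 0
with input_folder.get_download_stream("myfile") as stream:  # "myfile" is a placeholder
    while True:
        chunk = stream.read(65536)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)
print(format(crc & 0xFFFFFFFF, "08x"))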

But, does anyone know how to do a CRC or other "Finger Printing" check against files in Dataiku Managed Folder?

--Tom


Operating system used: Mac OS 10.15.7

3 Replies
Clément_Stenac
Dataiker

Hi,

You'll want to compute a digest on the fly from the stream, like this:

import hashlib
import dataiku

folder = dataiku.Folder("AAAAAAAA")  # handle on the managed folder from the question

digest = hashlib.md5()
with folder.get_download_stream("myfile") as stream:
    # Read in 4 KB blocks so the whole file never has to fit in memory
    while True:
        block = stream.read(4096)
        if len(block) == 0:
            break
        digest.update(block)
print(digest.hexdigest())
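
To fingerprint every file, the same loop can be wrapped in a helper and combined with the path listing from the question (a sketch; files with identical contents will share a digest):

import hashlib
import dataiku

def md5_of_path(folder, path, block_size=4096):
    """Stream one file from the managed folder and return its MD5 hex digest."""
    digest = hashlib.md5()
    with folder.get_download_stream(path) as stream:
        while True:
            block = stream.read(block_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

folder = dataiku.Folder("AAAAAAAA")
fingerprints = {path: md5_of_path(folder, path)
                for path in folder.list_paths_in_partition()}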
tgb417
Author

@Clément_Stenac 

Thanks for the feedback. I've given this a try; so far, so good.

--Tom
tgb417
Author

Although this is slow, as expected, it is working fairly well. Thank you!

I ran into an odd bug with a file name whose last character is a newline character \n. (I never knew this was a thing.) But this is messy, real-world data…

The first time I ran this, the step failed on this strange file after about 15 hours of processing. The get_download_stream function could not find the file, although the list_paths_in_partition() function of the Dataiku library did list it.

I ended up adding a try/except block: the try opens just before the with statement and closes after the while loop, and the except clause simply returns a blank string (sketched below).
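
Roughly (a sketch of what I ended up with; the helper name is mine, and the blank-string fallback is the behavior described above):

import hashlib

def safe_md5(folder, path, block_size=4096):
    """Return the MD5 hex digest of a file's contents, or "" if it can't be read."""
    digest = hashlib.md5()
    try:
        with folder.get_download_stream(path) as stream:
            while True:
                block = stream.read(block_size)
                if not block:
                    break
                digest.update(block)
    except Exception:
        # e.g. the file whose name ends in "\n" that get_download_stream can't find
        return ""
    return digest.hexdigest()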

Can a process like this be made to run as multiple processes in parallel inside of DSS? Is there some example code for doing this?
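
One direction I'm considering is a thread pool inside a single Python recipe, since the work is I/O-bound. A sketch only, assuming the safe_md5 helper above; I haven't verified how dataiku.Folder behaves with concurrent download streams, and the worker count is arbitrary:

from concurrent.futures import ThreadPoolExecutor
import dataiku

folder = dataiku.Folder("AAAAAAAA")
paths = folder.list_paths_in_partition()

# Hashing here is I/O-bound (network reads from SFTP), so threads can
# overlap the waits even under the GIL; 8 workers is a guess to tune.
with ThreadPoolExecutor(max_workers=8) as pool:
    digests = list(pool.map(lambda p: safe_md5(folder, p), paths))

fingerprints = dict(zip(paths, digests))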

Happy Holidays…

--Tom