"Finger Printing" files in a Managed Folder
I have a managed folder out on an SFTP Dataiku connection with lots of files (hundreds of thousands to millions).
I'm able to open the connection and get basic file details.
#...
import dataiku

input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()
#...
path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...
However, the information that .get_path_details(path=path) returns is not sufficient to guarantee the uniqueness of the files. I don't want to use the location of a file to determine uniqueness, but rather its actual contents.
I'm wondering if anyone has used a Dataiku Managed Folder to do a cyclic redundancy check (CRC) against the non-seekable file-like objects that these calls produce. (I understand that this will be slow.)
But does anyone know how to do a CRC or other "fingerprinting" check against files in a Dataiku Managed Folder?
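For concreteness, here is the rough shape I have in mind: an untested sketch using zlib.crc32 from the Python standard library, reading the non-seekable stream in blocks:

import zlib
import dataiku

input_folder = dataiku.Folder("AAAAAAAA")

def crc32_of_path(path):
    # Fold each sequential block of the stream into a running CRC-32;
    # no seeking is needed.
    crc = 0
    with input_folder.get_download_stream(path) as stream:
        while True:
            block = stream.read(4096)
            if not block:
                break
            crc = zlib.crc32(block, crc)
    return format(crc, "08x")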
--Tom
Operating system used: Mac OS 10.15.7
Best Answer
Hi,
You'll want to compute a digest on the fly from the stream, like this:
import hashlib

digest = hashlib.md5()
with folder.get_download_stream("myfile") as stream:
    while True:
        block = stream.read(4096)
        if len(block) == 0:
            break
        digest.update(block)
print(digest.hexdigest())
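To tie this back to the listing loop from the question, here is a minimal sketch (the md5_of_path helper name is mine, and "AAAAAAAA" is just the placeholder folder id from above) that fingerprints every file and collects a path-to-digest mapping:

import hashlib
import dataiku

folder = dataiku.Folder("AAAAAAAA")

def md5_of_path(path):
    # Stream the file in 4 KB blocks so the whole file never has to fit in memory.
    digest = hashlib.md5()
    with folder.get_download_stream(path) as stream:
        while True:
            block = stream.read(4096)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

# Identical hexdigests indicate identical contents (barring an MD5 collision).
fingerprints = {path: md5_of_path(path) for path in folder.list_paths_in_partition()}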
Answers
tgb417
Although this is slow, as expected, it is working fairly well. Thank you!
I ran into an odd bug with a file name whose last character is a newline character \n. (I never knew this was a thing.) But this is messy real-world data…
The first time I ran this, after about 15 hours of processing, the step failed on this strange file. The get_download_stream function could not find the file (although the list_paths_in_partition() function of the Dataiku library did find it).
I ended up adding a try/except block: the try starts before the with statement and ends after the while loop, and the except just returns a blank string.
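For anyone else who hits this, my workaround looks roughly like the following (a reconstruction, not the exact code from the job):

import hashlib

def md5_of_path_safe(folder, path):
    try:
        digest = hashlib.md5()
        with folder.get_download_stream(path) as stream:
            while True:
                block = stream.read(4096)
                if not block:
                    break
                digest.update(block)
    except Exception:
        # Files with pathological names (e.g. a trailing "\n") can fail in
        # get_download_stream even though list_paths_in_partition() reports
        # them; return a blank fingerprint instead of aborting the whole job.
        return ""
    return digest.hexdigest()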
Can a process like this be made to run as multiple processes in parallel inside of DSS? Is there some example code for doing this?
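Something like the following is what I'm imagining: an untested sketch using concurrent.futures, since the work is I/O-bound. It reuses the md5_of_path_safe helper above, and I have not verified that concurrent get_download_stream calls against a single Folder handle are supported:

from concurrent.futures import ThreadPoolExecutor
import dataiku

folder = dataiku.Folder("AAAAAAAA")
paths = folder.list_paths_in_partition()

# Hash several files at once; each task opens its own download stream.
with ThreadPoolExecutor(max_workers=8) as pool:
    fingerprints = dict(zip(paths, pool.map(lambda p: md5_of_path_safe(folder, p), paths)))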
Happy Holidays…