I have a managed folder on an SFTP Dataiku connection with lots of files (hundreds of thousands to millions).
I'm able to open the connection and get basic file details.
#...
input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()
#...
path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...
However, the .get_path_details(path=path) information I get this way is not sufficient to guarantee the uniqueness of the files. I'm not interested in the location of the file to determine uniqueness, but the actual file contents.
I'm wondering if anyone has used a Dataiku Managed Folder to run a cyclic redundancy check (CRC) against the non-seekable file-like objects that these calls produce. (I understand that this will be slow.)
Does anyone know how to do a CRC or other "fingerprinting" check against files in a Dataiku Managed Folder?
--Tom
Operating system used: Mac OS 10.15.7
Hi,
You'll want to compute a digest on the fly from the stream, as such:
import hashlib
digest = hashlib.md5()
with folder.get_download_stream("myfile") as stream:
    while True:
        block = stream.read(4096)
        if len(block) == 0:
            break
        digest.update(block)
print(digest.hexdigest())
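To reuse the loop above across many files, it could be wrapped in a small helper. This is only a sketch: the name `stream_digest`, the `algorithm` parameter, and the 1 MiB default block size are illustrative choices, not part of the Dataiku API. It accepts any readable binary stream, so an in-memory `io.BytesIO` stands in here for the object returned by `get_download_stream`.

```python
import hashlib
import io


def stream_digest(stream, algorithm="md5", block_size=1024 * 1024):
    """Hash a readable binary stream block by block, without seeking.

    `stream_digest` is a hypothetical helper name; `algorithm` is any
    name accepted by hashlib.new (for example "md5" or "sha256").
    """
    digest = hashlib.new(algorithm)
    while True:
        block = stream.read(block_size)
        if not block:  # an empty read means end of stream
            break
        digest.update(block)
    return digest.hexdigest()


# Stand-in for a managed-folder download stream (assumption for the demo).
sample = io.BytesIO(b"hello world")
print(stream_digest(sample))
```

With a managed folder, the call would presumably look like `stream_digest(stream)` inside the `with folder.get_download_stream(path) as stream:` block. Two files with the same digest are very likely identical in content; a stronger algorithm such as "sha256" may be preferable if accidental or deliberate collisions are a concern.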
Although this is slow, as expected, it is working fairly well. Thank You!
I ran into an odd bug with a file name whose last character is a newline character (\n). (I never knew this was a thing.) But this is messy real-world data...
The first time I ran this, the step failed on this strange file after about 15 hours of processing. The get_download_stream function could not find the file, although the list_paths_in_partition() function of the Dataiku library did find it.
I ended up adding a try/except block. The try starts before the with statement and ends after the while loop; the except just returns a blank string.
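A minimal sketch of that workaround, for anyone hitting the same problem. The function name `safe_digest` and the `open_stream` parameter are hypothetical; `open_stream` is assumed to be any callable that takes a path and returns a context-managed binary stream, such as `input_folder.get_download_stream`.

```python
import hashlib


def safe_digest(open_stream, path, block_size=4096):
    """Return the md5 hex digest of the file at `path`, or "" on failure.

    `open_stream` is a hypothetical parameter: a callable that opens
    `path` and returns a readable binary stream usable in a with block.
    """
    digest = hashlib.md5()
    try:
        with open_stream(path) as stream:
            while True:
                block = stream.read(block_size)
                if len(block) == 0:
                    break
                digest.update(block)
    except Exception as exc:
        # Report the unreadable path instead of aborting a long run.
        print("Could not hash %r: %s" % (path, exc))
        return ""
    return digest.hexdigest()
```

Called as `safe_digest(input_folder.get_download_stream, path)`, a file that cannot be opened produces a blank string rather than stopping the whole job. Printing the path with `%r` makes a trailing `\n` in the file name visible in the log.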
Can a process like this be made to run as multiple processes in parallel inside of DSS? Is there some example code for doing this?
Happy Holidays...
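One possible direction, offered only as a sketch: since the work is mostly waiting on SFTP reads, a thread pool from the Python standard library may be enough. The names `hash_in_parallel`, `hash_one`, and `max_workers` are illustrative. Whether a Dataiku folder handle can safely be used from several threads at once, and how many simultaneous connections the SFTP server tolerates, are assumptions that would need to be checked.

```python
from concurrent.futures import ThreadPoolExecutor


def hash_in_parallel(paths, hash_one, max_workers=8):
    """Apply `hash_one` to every path using a pool of worker threads.

    `hash_one` is a hypothetical callable taking a path and returning
    its digest; it should handle its own errors, because an exception
    raised inside it would propagate out of pool.map.
    Returns a dict mapping each path to its result.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(hash_one, paths))
    return dict(zip(paths, results))
```

Here `hash_one` would be a function that opens the download stream for one path and returns its digest, with the try/except inside it so that one unreadable file does not cancel the rest.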