"Finger Printing" files in a Managed Folder

tgb417
edited July 16 in Using Dataiku

I have a managed folder on an SFTP Dataiku connection that contains a very large number of files (hundreds of thousands to millions).

I'm able to open the connection and get basic file details.

import dataiku

#...

input_folder = dataiku.Folder("AAAAAAAA")
paths = input_folder.list_paths_in_partition()

#...

# Collect the basic metadata for every file in the folder
path_details = []
for path in paths:
    path_details.append(input_folder.get_path_details(path=path))
#...

However, the information returned by .get_path_details(path=path) is not sufficient to guarantee that the files are unique. I want to determine uniqueness from the actual file contents, not from the location of the file.

I'm wondering if anyone has used a Dataiku managed folder to run a cyclic redundancy check (CRC) against the non-seekable file-like objects that these calls produce. (I understand that this will be slow.)

Does anyone know how to do a CRC or other "Finger Printing" check against files in a Dataiku managed folder?

--Tom


Operating system used: Mac OS 10.15.7

Best Answer

  • Clément_Stenac
    edited July 17 Answer ✓

    Hi,

    You'll want to compute a digest on the fly from the stream, like this:

    import hashlib
    digest = hashlib.md5()
    with folder.get_download_stream("myfile") as stream:
        while True:
            block = stream.read(4096)
            if len(block) == 0:
                break
            digest.update(block)
    print(digest.hexdigest())
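
The same loop can be wrapped in a small helper that accepts any binary file-like object, which makes it easy to change the hash algorithm or the block size. This is only a minimal sketch: `stream_digest` and its parameters are hypothetical names, not part of the Dataiku API, and an in-memory stream stands in for the object returned by `folder.get_download_stream(...)`. MD5 is generally adequate for spotting accidental duplicates; if deliberate collisions are a concern, `algorithm="sha256"` is the safer choice.

```python
import hashlib
import io


def stream_digest(stream, algorithm="md5", block_size=65536):
    """Return the hex digest of everything readable from a binary stream.

    Only read() is called, so the stream does not need to be seekable.
    """
    digest = hashlib.new(algorithm)
    while True:
        block = stream.read(block_size)
        if not block:  # an empty bytes object means end of stream
            break
        digest.update(block)
    return digest.hexdigest()


# Stand-in for folder.get_download_stream(path): an in-memory stream.
print(stream_digest(io.BytesIO(b"hello world")))
```

Two files with the same digest can then be treated as having identical contents, regardless of where they sit in the folder.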

Answers

  • tgb417

    @Clément_Stenac

    Thanks for the feedback. I’ve given this a try; so far, so good.

    —Tom

  • tgb417

    Although this is slow, as expected, it is working fairly well. Thank you!

    I ran into an odd bug with a file whose name has a newline character (\n) as its last character. (I never knew this was a thing.) But this is messy real-world data…

    The first time I ran this, the step failed on this strange file after about 15 hours of processing. The get_download_stream function could not find the file, although the list_paths_in_partition() function of the Dataiku library did list it.

    I ended up adding a try/except block. The try starts before the with statement and ends after the while loop; the except clause just returns a blank string.
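
The try/except wrapper described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was actually run: `safe_digest`, `open_stream`, `fake_files`, and `fake_open` are hypothetical names, and a dictionary of in-memory files stands in for the managed folder. In DSS, `open_stream` would presumably be the folder's `get_download_stream` method. Catching `Exception` instead of using a bare `except:` keeps Ctrl-C and interpreter shutdown working.

```python
import hashlib
import io


def safe_digest(open_stream, path, block_size=4096):
    """Hash one file; return "" if the file cannot be opened or read.

    open_stream is any callable returning a binary stream that can be
    used as a context manager (an assumption made for this sketch).
    """
    digest = hashlib.md5()
    try:
        with open_stream(path) as stream:
            while True:
                block = stream.read(block_size)
                if not block:
                    break
                digest.update(block)
    except Exception as exc:
        # %r shows the path with escapes, so a trailing "\n" is visible.
        print("Could not hash %r: %s" % (path, exc))
        return ""
    return digest.hexdigest()


# Hypothetical stand-in for a managed folder: a dict of in-memory files.
fake_files = {"a.txt": b"hello"}


def fake_open(path):
    return io.BytesIO(fake_files[path])  # raises KeyError for unknown paths


print(safe_digest(fake_open, "a.txt"))
print(repr(safe_digest(fake_open, "odd_name\n")))
```

Logging the failing path before returning the blank string makes it possible to find the problem files afterwards instead of silently skipping them.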

    Can a process like this be made to run as multiple processes in parallel inside DSS? Is there any example code for doing this?
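
One possible approach, offered only as a sketch, is a thread pool from Python's standard library: the work is dominated by waiting on the SFTP download, so threads can overlap those waits. `digest_one`, `digest_many`, `max_workers`, and `fake_files` are hypothetical names. Whether a Dataiku folder object and the underlying SFTP connection tolerate several concurrent download streams is an assumption that would need to be verified in DSS before relying on this.

```python
import hashlib
import io
from concurrent.futures import ThreadPoolExecutor


def digest_one(open_stream, path, block_size=65536):
    """Return (path, hex digest), or (path, "") if the file cannot be read."""
    digest = hashlib.md5()
    try:
        with open_stream(path) as stream:
            # iter() with a sentinel calls read() until it returns b"".
            for block in iter(lambda: stream.read(block_size), b""):
                digest.update(block)
    except Exception:
        return path, ""
    return path, digest.hexdigest()


def digest_many(open_stream, paths, max_workers=8):
    """Hash many files concurrently and return a {path: digest} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda p: digest_one(open_stream, p), paths))


# Hypothetical stand-in for a managed folder: a dict of in-memory files.
fake_files = {"a": b"same", "b": b"same", "c": b"different"}
results = digest_many(lambda p: io.BytesIO(fake_files[p]), list(fake_files))
print(results["a"] == results["b"], results["a"] == results["c"])
```

In a DSS recipe the call would presumably look like `digest_many(input_folder.get_download_stream, paths)`. Keeping `max_workers` small is a cautious starting point, since an SFTP server may limit the number of simultaneous sessions.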

    Happy Holidays…
