Is there a limit to a directory structure that .list_paths_in_partition() can traverse

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

All,

I've been working on a project with a fairly deep data structure stored in a remotely mounted DSS "managed folder".

In one of the directories, I've got ~140,000 files. In another directory, I've got ~170,000 files.

In a smaller directory, Python code like this seems to correctly pull all of my file paths.

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

input_folder = dataiku.Folder("MYDIRECTORY")
paths = input_folder.list_paths_in_partition()

However, when working with these larger datasets:

  1. Duplicates are showing up in the resulting data.
    1. For the smaller dataset, ~37,000 records appear more than once.
    2. For the larger dataset, ~57,000 records appear more than once.
  2. The file counts are not accurate.
    1. For the smaller dataset, .list_paths_in_partition() reports ~137,000 records.
    2. For the larger dataset, .list_paths_in_partition() reports ~157,000 records.

It feels like I've overflowed a ~100,000 record data structure in some way.
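To quantify the duplication, a quick check with `collections.Counter` over the returned list works; the `paths` list here is a small hypothetical sample standing in for the real result of `list_paths_in_partition()`:

```python
from collections import Counter

# Hypothetical stand-in for the list returned by
# input_folder.list_paths_in_partition()
paths = ["/a/one.csv", "/a/two.csv", "/b/one.csv", "/a/one.csv"]

counts = Counter(paths)
duplicates = {p: n for p, n in counts.items() if n > 1}

print(len(paths))   # total records returned
print(len(counts))  # unique paths
print(duplicates)   # paths reported more than once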

I'm working with:

  • Dataiku DSS: 9.0.5
  • I've connected to the remote files via either SFTP or an SMB-mounted local file system. Both showed similar problems.
  • The file system was originally created on Macintosh.
  • In early evaluations, I've not found symbolic links that might be causing some kind of directory-structure looping.

Questions:

  • Has anyone else run into this same problem?
  • If so, what approach have you used to get around it?
  • If you have used .list_paths_in_partition() from the Dataiku library, what is the biggest directory structure you have traversed?

--Tom


Operating system used: Mac OS


Best Answers

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭
    Answer ✓

    Hi Tom,

    I generally use "find", "stat" and/or "du", depending on what info I need from a set of files. Using "ls" is tempting, but according to this article that might not be a great idea. To get output directly into a dataset, just create one and select it at the top of the page (next to "pipe out", a dropdown will be populated with the datasets you create for this). Here is an example of that.

    Code example with output to a text file:

    find /home/jurre/dataiku/managed_folders/ -type f -printf '%p\n' >$DKU_OUTPUT_0_FOLDER_PATH/filelist.txt

    produces a text file with the file names of everything inside the managed_folders directory, recursively.

    Example with output directed to a dataset:

    find /home/jurre/dataiku/managed_folders/* -type f -exec stat --printf='%n\t%s\n' {} +

    produces a dataset with filename and size in tab-separated format, so DSS picks it up as two columns. Some other options for "stat":

    • %n File name,
    • %U User name of owner,
    • %G Group name of owner,
    • %s Total size, in bytes,
    • %x Time of last access,
    • %y Time of last modification,
    • %z Time of last change

    Tested on : Ubuntu 18.04, Bash 4.4

    EDIT: added links to the command reference and included version info on what this was tested.
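For anyone who prefers to stay in Python, a rough equivalent of the `find … -type f` commands above can be sketched with `os.walk`; the path in the usage comment is a placeholder, not a real location:

```python
import os

def list_files(root):
    """Recursively collect file paths under root, like `find root -type f`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Usage (placeholder path):
# for path in list_files("/home/jurre/dataiku/managed_folders"):
#     print(path)
```

Note that `os.walk` does not follow symbolic links by default, which sidesteps the directory-loop concern mentioned earlier in the thread.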

  • tgb417
    Answer ✓

    I'm moving to shell recipes because of:

    • Fewer errors
    • Much faster run times.

    To get the file details, I'm using find and stat. (This is on macOS, so the syntax is a bit Mac-centric.)

     echo 'Date_Birth\tDate_Mod\tDate_Access\tSize\tFull_Path'
    find $DKU_INPUT_0_FOLDER_PATH -type f -exec stat -t "%FT%T%z" -f "%SB%t%Sm%t%Sa%t%z%t%N" {} ';'

    This does take a visual recipe to extract filename and extension columns; however, it's quite straightforward.

    To fingerprint the files I'm using

    find $DKU_INPUT_0_FOLDER_PATH -type f -exec md5 {} ';'

    This takes a bit of string extraction with regex; however, it's fairly straightforward. There are still a number of challenges with the Python library. If you had to do this because you are working with an SFTP data source, it could be done, albeit slowly and with a number of conditions on the kinds of files you can open. I've got a ticket open with support. More to follow, maybe.
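If you'd rather fingerprint from Python than from the shell, a sketch with the standard `hashlib` module works; reading in chunks keeps memory bounded for large files. This is a generic alternative, not the exact recipe above:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel yields chunks until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Combined with a recursive file listing, this gives filename/digest pairs directly, with no regex extraction step needed.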

    I want to thank @Jurre
    for their support. Your suggestions seem to be spot-on for Linux. (I'm also including the syntax I'm using for macOS, which is a bit different.)

Answers

  • tgb417

    I'm going to open a support ticket as well. See what comes from that.

    --Tom

  • Jurre

    Hi Tom,

    When my projects have extensive file sets to work on, like the one you describe, I generally use shell scripts for file-related work. I've never experienced any inaccuracy after working with files this way, but I still check that both within DSS and outside. It must be pointed out that I am quite familiar with shell and less so with Python (working on that; your other posts here are very informative!). I hope this gets resolved quickly!

  • tgb417

    I'm trying to pull file details like last modification date, name, size, and path, and get them into a DSS dataset. Likely several million records. In shell, what would be your approach to extracting that kind of information and getting it into a database dataset?

    Thanks for any thoughts you can share.

    --Tom

  • tgb417

    So, after an evening of debugging, here is what I've discovered:

    • The problem does not seem to be with
      input_folder.list_paths_in_partition()

      This command does seem to produce the expected unique values from the file system, as evaluated from a Python notebook.

    • The problem seems to occur when I try to move the data out of the Python dataframe in DSS into the managed PostgreSQL database. Something strange is happening at 100,000 records.

    • When I moved the project to a design node running Dataiku DSS 10.0.0, I don't appear to have the same problem. That's good news!
    • However, I now have an open question about my trusted old node running DSS v9.0.5.
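Until the root cause is clear, one workaround I'd consider is writing in bounded batches rather than one giant insert, so no single write crosses whatever threshold sits near 100,000 records. The `chunks()` helper below is a generic Python sketch, not a Dataiku API:

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# e.g. write 50,000 rows at a time instead of 120,000 at once
batches = list(chunks(list(range(120_000)), 50_000))
print([len(b) for b in batches])  # [50000, 50000, 20000]
```

Each batch would then be appended to the output dataset in its own write, with a batch size kept safely under 100,000.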

    More if and when I learn more.

    --Tom

  • tgb417

    @Jurre
    ,

    Thanks for this detailed response.

    I'm having some difficulty reproducing your results on macOS at the moment. This may work better on a Linux computer. Can you share, for others who might come upon this post, which Linux distribution and shell you are using, in case they want to reproduce your process?

    As you can see below, I've had some success using DSS v10. So, I think I'm going to pursue that course for now. I'll come back to the idea of doing this from the shell if I run into further problems.

    --Tom

  • Jurre

    Sorry to hear that, Tom, and great that DSS v10 does the trick! I will update the shell post with version info and links to man pages for the functions I mentioned, for future reference.

  • tgb417

    So, I'm still finding challenges with:

    input_folder.list_paths_in_partition()

    If a file name contains file-path special characters such as:

    \ backslash

    * asterisk

    then list_paths_in_partition() fails. (I know these are somewhat unusual. However, I'm dealing with files in the wild, and they come in all sorts of shapes and sizes.)

    Also, I have a few very large directories with between 16,000 and 60,000 files in a single directory. SFTP can see them just fine.

    However, those directories will also cause list_paths_in_partition() to fail.

    Has anyone else run into these same or similar problems? Does anyone have a more robust way to manage a large number of files on a DSS connection?
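In the meantime, one way to flag offending names before listing is a simple filter; the character set below is my guess at what trips the API, not a documented list:

```python
SUSPECT_CHARS = set("\\*?[]")  # guessed troublemakers, not a documented list

def suspicious_paths(paths):
    """Return paths containing characters that may break path listing."""
    return [p for p in paths if any(c in SUSPECT_CHARS for c in p)]

print(suspicious_paths(["ok.txt", "bad*name.txt", "back\\slash.txt"]))
# ['bad*name.txt', 'back\\slash.txt']
```

Files flagged this way could be renamed or quarantined before pointing DSS at the folder.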

    --Tom

  • tgb417

    @Jurre

    This is going to be host-OS specific (in this case, for my Mac).

    However, something like this:

    find . -type f -exec md5 "{}" + > checklist.chk

    is producing partial results like

    MD5 (./VQGAN + CLIP.pdf) = 2eec798c50d23c4afef7860b5aa811ab
    MD5 (./data-architecture-basics.pdf) = 24192ec0f11a3aebd4ccf10464bb488e
    MD5 (./lullabye.mid) = aec4b0d5bc0b171cebf661c2a755822c

    Which is a really good start.
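Those `MD5 (file) = hash` lines can be split into columns with a small regex; this is a sketch matching the BSD `md5` output format shown above:

```python
import re

# Matches BSD `md5` output lines: MD5 (<path>) = <32 hex chars>
MD5_LINE = re.compile(r"^MD5 \((?P<path>.*)\) = (?P<digest>[0-9a-f]{32})$")

line = "MD5 (./lullabye.mid) = aec4b0d5bc0b171cebf661c2a755822c"
m = MD5_LINE.match(line)
if m:
    print(m.group("path"), m.group("digest"))
```

Applied line by line to checklist.chk, this yields a two-column path/digest table that DSS can ingest.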

    How do you set up your input file systems?

    I'm not clear on which steps to use. I've been playing with connections, folders, and file systems, and I'm not clear which is best for DSS. How are you setting up your connections for a shell recipe?

    --Tom

  • Jurre

    Hi Tom,

    The file set I worked on was mounted locally, and I have local admin rights. I don't know if the how-to below is the right way, but it worked for me:

    • in DSS create a new folder, besides giving it a name leave everything else default
    • open that folder and go to "settings", top right of the screen next to "actions"
    • change the value of "Read from" to "filesystem_root" and delete the variables mentioned in "Path"
    • click "Browse" and browse to the desired folder
    • hit "Save"

    If you need to write to this folder, check in Administration > Connections > filesystem_root, under "usage params", whether "allow write" is checked (thank you @KeijiY
    for your tip here!)

    It should be noted that that checkbox is there for a reason: serious damage can potentially be done to both DSS and the whole system, so please handle with care.

    Possibly a better solution, if you do not need to write to this folder, is to access specific locations from within the script itself without allowing write actions on filesystem_root. Just create a new folder in managed folders to receive results from the script, check that you have the proper permissions, and adjust the paths mentioned in the script. You can set the newly made folder as both input and output without actually using it as an input in the script. It looks funny in the flow.
