Is there a limit to a directory structure that .list_paths_in_partition() can traverse?

Solved!
tgb417

All,

I've been working on a project with a fairly deep directory structure stored in a remotely mounted DSS "managed folder".

In one of the directories, I've got ~140,000 files. In another directory, I've got ~170,000 files.

In a smaller directory, Python code like this seems to correctly pull all of my file paths:

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

input_folder = dataiku.Folder("MYDIRECTORY")
paths = input_folder.list_paths_in_partition()

However, when working with these larger datasets:

  1. Duplicates are showing up in the resulting data.
    1. For the smaller dataset, I've got ~37,000 records that appear more than once.
    2. For the larger dataset, I've got ~57,000 records that appear more than once.
  2. The file counts are not accurate.
    1. For the smaller dataset, .list_paths_in_partition() reports ~137,000 records.
    2. For the larger dataset, .list_paths_in_partition() reports ~157,000 records.

It feels like I've overflowed a ~100,000 record data structure in some way.
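For comparison, a baseline count of what is actually on disk can be taken from a terminal on the machine where the share is mounted (the path below is only a placeholder for the mount point):

# count the files on the mounted share, recursively (placeholder path)
find /Volumes/my_share/my_folder -type f | wc -l

In the smaller case that should come back around 140,000, against the ~137,000 paths the API returns.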

I'm working with:

  • Dataiku DSS: 9.0.5
  • I've connected to the remote files via either SFTP or an SMB-mounted local filesystem; both showed similar problems.
  • The file system was originally created on Macintosh.
  • In early evaluation, I've not found any symbolic links that might be causing some kind of directory-structure looping.

Questions:

  • Has anyone else run into this same problem?
  • If you have run into it, what approach have you used to work around it?
  • If you have used .list_paths_in_partition() from the Dataiku library, what is the largest directory structure you have traversed?

--Tom


Operating system used: Mac OS

11 Replies
tgb417
Author

I'm going to open a support ticket as well.  See what comes from that.

--Tom

Jurre
Level 5

Hi Tom, 

When my projects have extensive filesets to work on, like the one you describe, I generally use shell scripts for file-related work. I have never experienced any inaccuracy after working with files this way, but I still check that, both within DSS and outside of it. It must be pointed out that I am quite familiar with shell and less so with Python (working on that; your other posts here are very informative!). I hope this gets resolved quickly!

tgb417
Author

I'm trying to pull file details like last-modified date, name, size, and path, and get them into a DSS dataset, likely several million records. In shell, what would be your approach to extracting that kind of information and getting it into a database-backed dataset?

Thanks for any thoughts you can share.

--Tom

Jurre
Level 5

Hi Tom, 

I generally use "find", "stat" and/or "du", depending on what info I need from a set of files. Using "ls" is tempting, but according to this article that might not be a great idea. To get output directly into a dataset, just create one and select it at the top of the page (next to "pipe out", a dropdown will be populated with the datasets you make for this). Here is an example.

Code example with output to a text file:

 

find /home/jurre/dataiku/managed_folders/ -type f -printf '%p\n' >$DKU_OUTPUT_0_FOLDER_PATH/filelist.txt

 

This produces a text file with the filenames of everything inside the managed_folders directory, recursively.

Example with output directed to a dataset:

 

find /home/jurre/dataiku/managed_folders/* -type f -exec stat --printf='%n\t%s\n' {} +

 

This produces filename and size in tab-separated format, so it gets picked up by DSS as two columns. Some other options for "stat" (a combined example follows the list):

  • %n  file name
  • %U  user name of owner
  • %G  group name of owner
  • %s  total size, in bytes
  • %x  time of last access
  • %y  time of last modification
  • %z  time of last change
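Several of those fields can go into one format string; for instance, a sketch (untested, GNU stat as above) that adds owner and modification time to the filename-and-size example:

find /home/jurre/dataiku/managed_folders/ -type f -exec stat --printf='%n\t%s\t%U\t%y\n' {} +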

Tested on : Ubuntu 18.04, Bash 4.4

EDIT: added links to the command references and included version info for the setup this was tested on.

 

tgb417
Author

@Jurre ,

Thanks for this detailed response.

I'm having some difficulty reproducing your results on Mac OS at the moment; this may work better on a Linux computer. Could you share, for others who might come upon this post, which Linux distribution and shell you are using, in case they want to reproduce your process?

As you'll see below, I've had some success using DSS v10, so I think I'm going to pursue that course for now. I'll come back to the idea of doing this from the shell if I run into further problems.

--Tom

Jurre
Level 5

Sorry to hear that, Tom, and great that DSS v10 does the trick! I will update the shell post with version info and links to the man pages for the functions I mentioned, for future reference.

tgb417
Author

@Jurre 

This is going to be host-OS specific (in this case, for my Mac).

However, something like this

find . -type f -exec md5 {} + > checklist.chk

is producing partial results like:

MD5 (./VQGAN + CLIP.pdf) = 2eec798c50d23c4afef7860b5aa811ab
MD5 (./data-architecture-basics.pdf) = 24192ec0f11a3aebd4ccf10464bb488e
MD5 (./lullabye.mid) = aec4b0d5bc0b171cebf661c2a755822c

Which is a really good start.

How do you set up your input file systems?

I'm not clear on what steps to use. I've been playing with connections, folders, and file systems, and I'm not clear which is best for DSS. How are you setting up your connections for a shell recipe?

--Tom

Jurre
Level 5

Hi Tom, 

The fileset I worked on was mounted locally, and I have local admin rights. I don't know if the how-to below is the right way, but it worked for me:

  • In DSS, create a new folder; besides giving it a name, leave everything else at the default.
  • Open that folder and go to "Settings", at the top right of the screen next to "Actions".
  • Change the "Read from" value to "filesystem_root" and delete the variables mentioned in "Path".
  • Click "Browse" and browse to the desired folder.
  • Hit "Save".

If you need to write to this folder, check in Administration > Connections > filesystem_root, under "Usage params", whether "Allow write" is checked (thank you @KeijiY for your tip here!).

It should be noted that that checkbox is there for a reason: potentially serious damage can be done to both DSS and the whole system, so please handle it with care.

Possibly a better solution, if you do not need to write to that folder, is to access specific locations from within the script itself, without allowing write actions on filesystem_root. Just create a new folder in managed_folders to receive results from the scripting, check that you have the proper permissions, and adjust the paths mentioned in the script. You can set the newly made folder as both input and output without actually using it as an input in the script; it looks funny in the flow 🙂
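A minimal sketch of that setup (the location to scan is a placeholder; $DKU_OUTPUT_0_FOLDER_PATH resolves to the managed folder set as the recipe's output):

# read directly from an explicit path, with no write access on filesystem_root needed,
# and drop the result into the managed folder configured as the recipe output
find /data/location_to_scan -type f -printf '%p\n' > "$DKU_OUTPUT_0_FOLDER_PATH/filelist.txt"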

 

tgb417
Author

So, after an evening of debugging, here is what I've discovered:

  • The problem does not seem to be with
    input_folder.list_paths_in_partition()

    This command does seem to produce the expected unique values from the file system when evaluated from a Python notebook.

  • The problem seems to occur when I try to move the data out of the Python dataframe in DSS into the managed PostgreSQL database. Something strange is happening at 100,000 records.

  • When I moved the project to a design node running Dataiku DSS 10.0.0, I don't appear to have the same problem. That's good news!
  • However, I now have an open question about my old, trusted node running DSS v9.0.5.

More if and when I learn more.

--Tom

tgb417
Author

So, I'm still finding challenges with:

input_folder.list_paths_in_partition()

If the file name has any file-path special characters actually in it, such as:

\ (backslash)

* (asterisk)

then list_paths_in_partition() fails. (I know these are somewhat unusual; however, I'm dealing with files in the wild, and they come in all sorts of shapes and sizes.)

Also, I have a few very large directories, with between 16,000 and 60,000 files in a single directory. SFTP can see them just fine.

However, those directories will also cause list_paths_in_partition() to fail.

Has anyone else run into these same or similar problems? Does anyone have a more robust way to manage a large number of files on a DSS connection?

--Tom

 

tgb417
Author

I'm moving to shell recipes because of:

  • Fewer errors
  • Much faster run times

To get the file details, I'm using find and stat. (This is on OS X, so the syntax is a bit Mac-centric.)

 echo 'Date_Birth\tDate_Mod\tDate_Access\tSize\tFull_Path'
find $DKU_INPUT_0_FOLDER_PATH -type f -exec stat -t "%FT%T%z" -f "%SB%t%Sm%t%Sa%t%z%t%N" {} ';'

This does take a visual recipe to extract the filename and extension columns; however, it is quite straightforward.
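One possible speed-up, since -exec ... ';' launches a separate stat for every file: batch the calls, either with -exec ... + or with xargs (a sketch using the same BSD stat flags as above):

# null-delimited paths keep odd filenames intact; xargs -0 batches the stat calls
find "$DKU_INPUT_0_FOLDER_PATH" -type f -print0 | xargs -0 stat -t "%FT%T%z" -f "%SB%t%Sm%t%Sa%t%z%t%N"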

To fingerprint the files, I'm using:

find $DKU_INPUT_0_FOLDER_PATH -type f -exec md5 {} ';'

This takes a bit of string extraction with regex; however, it is fairly straightforward (one possible shortcut is sketched below). There are still a number of challenges with the Python library. If you had to use it because you are working with an SFTP data source, it could be done, but slowly and with a number of conditions on the kinds of files you can open. I've got a ticket open with support. More to follow, maybe.
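One possible shortcut for the string extraction (an untested suggestion): BSD md5 has a -r flag that flips the output to hash-then-path, which is a bit easier to split than the "MD5 (path) = hash" format.

# -r reverses the output format to: <hash> <path>
find "$DKU_INPUT_0_FOLDER_PATH" -type f -exec md5 -r {} ';'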

I want to thank @Jurre for their support. Your suggestions seem to be spot on for Linux. (I'm also including the syntax I'm using for Mac OS, which is a bit different.)

--Tom
