Added on November 28, 2021 3:46AM
Likes: 0
Replies: 11
All,
I've been working on a project with a fairly deep data structure stored in a remotely mounted DSS "managed folder".
In one of the directories, I've got ~140,000 files. In another directory, I've got ~170,000 files.
In a smaller directory, Python code like this seems to correctly pull all of my file paths:
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Point at the managed folder and list every file path it contains
input_folder = dataiku.Folder("MYDIRECTORY")
paths = input_folder.list_paths_in_partition()
However, when working with these larger directories, it feels like I've overflowed a ~100,000-record data structure in some way.
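For reference, the sanity check I've been running from a notebook is roughly this (it just counts and de-duplicates what list_paths_in_partition() returns, using the same folder handle as above):

# Rough sanity check on the listing, run from a DSS Python notebook
import dataiku

input_folder = dataiku.Folder("MYDIRECTORY")
paths = input_folder.list_paths_in_partition()

print("paths returned:", len(paths))
print("unique paths:  ", len(set(paths)))  # the two counts should match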
I'm working with:
Questions:
--Tom
Operating system used: Mac OS
Hi Tom,
I generally use "find", "stat" and/or "du", depending on what info I need from a set of files. Using "ls" is tempting, but according to this article that might not be a great idea. To get output directly into a dataset, just create one and select it at the top of the page (next to "pipe out", a dropdown will be populated with the datasets you create for this). Here is an example for that.
Code example with output to a text file:
find /home/jurre/dataiku/managed_folders/ -type f -printf '%p\n' >$DKU_OUTPUT_0_FOLDER_PATH/filelist.txt
produces a text file with the filenames of everything inside the managed_folders directory, recursively.
Example with output directed to a dataset:
find /home/jurre/dataiku/managed_folders/* -type f -exec stat --printf='%n\t%s\n' {} +
produces a dataset with filename and size in tab-separated format, so it gets picked up by DSS as two columns. Some other options for "stat":
Tested on: Ubuntu 18.04, Bash 4.4
EDIT: added links to the command reference and included version info on what this was tested with.
I'm moving to Shell Recipes because of:
To get the file details, I'm using find and stat. (This is on OSX, so the syntax is a bit Mac-centric.)
echo 'Date_Birth\tDate_Mod\tDate_Access\tSize\tFull_Path'
find $DKU_INPUT_0_FOLDER_PATH -type f -exec stat -t "%FT%T%z" -f "%SB%t%Sm%t%Sa%t%z%t%N" {} ';'
This does take a visual recipe to extract the filename and extension columns; however, it's quite straightforward.
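(If you'd rather do that step in Python than in a visual recipe, a rough sketch on the Full_Path column from the echo header above would be something like this; the sample paths are made up.)

# Sketch: derive filename and extension columns from Full_Path with pandas
import os
import pandas as pd

df = pd.DataFrame({"Full_Path": ["/data/reports/summary.pdf", "/data/notes/readme"]})
df["Filename"] = df["Full_Path"].apply(os.path.basename)
df["Extension"] = df["Filename"].apply(lambda name: os.path.splitext(name)[1].lstrip("."))
print(df)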
To fingerprint the files I'm using
find $DKU_INPUT_0_FOLDER_PATH -type f -exec md5 {} ';'
This takes a bit of string extraction with regex; however, it's fairly straightforward. There are still a number of challenges with the Python library. If you had to do this because you are working with an SFTP data source, it could be done, although slowly and with a number of conditions on the kinds of files you can open. I've got a ticket open with support. More to follow, maybe.
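Coming back to the regex step: the extraction is along these lines (a sketch; the input is the "MD5 (path) = hash" format that macOS md5 prints):

# Sketch: pull path and checksum out of macOS `md5` output lines with a regex
import re
import pandas as pd

pattern = re.compile(r"^MD5 \((?P<path>.*)\) = (?P<md5>[0-9a-f]{32})$")

lines = ["MD5 (./lullabye.mid) = aec4b0d5bc0b171cebf661c2a755822c"]
rows = [m.groupdict() for m in map(pattern.match, lines) if m]
df = pd.DataFrame(rows, columns=["path", "md5"])
print(df)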
I want to thank @Jurre for their support. Your suggestions seem to be spot on for Linux. (I'm also including the syntax I'm using for Mac OS, which is a bit different.)
I'm going to open a support ticket as well. See what comes from that.
--Tom
Hi Tom,
When my projects have extensive filesets to work on, like the one you describe, I generally use shell scripts for file-related work. I've never experienced any inaccuracy after working with files this way, but I still check that both within DSS and outside. It must be pointed out that I am quite familiar with shell and less so with Python (working on that; your other posts here are very informative!). I hope this gets resolved quickly!
I'm trying to pull file details like last modified date, name, size, and path, and get them into a DSS dataset. Likely several million records. In shell, what would be your approach to extracting that kind of information and getting it into a database dataset?
Thanks for any thoughts you can share.
--Tom
So, after an evening of debugging, here is what I've discovered:
input_folder.list_paths_in_partition()
This command does seem to be producing the expected unique values from the file system, as evaluated from a Python notebook.
The problem seems to occur when I try to move the data out of the pandas DataFrame in DSS into the managed PostgreSQL database. Something strange is happening at 100,000 records.
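For reference, the move that misbehaves is essentially this (a sketch; "file_paths" is a stand-in name for the output dataset on the managed PostgreSQL connection):

# Sketch: push the listing into a DSS dataset backed by PostgreSQL
import dataiku
import pandas as pd

input_folder = dataiku.Folder("MYDIRECTORY")
paths = input_folder.list_paths_in_partition()
df = pd.DataFrame({"path": paths})

output_dataset = dataiku.Dataset("file_paths")   # stand-in dataset name
output_dataset.write_with_schema(df)             # schema inferred from the DataFrame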
More if and when I learn more.
--Tom
@Jurre, thanks for this detailed response.
I'm having some difficulty reproducing your results on Mac OS at the moment. This may work better on a Linux computer. For others who might come upon this post, can you share which Linux distribution and shell you are using, in case they want to reproduce your process?
You'll see below that I've had some success using DSS V10. So, I think I'm going to pursue that course for now. I'll come back to the idea of doing this from the shell if I run into further problems.
--Tom
Sorry to hear that, Tom, and great that DSS v10 does the trick! I will update the shell post with version info and links to the man pages for the functions I mentioned, for future reference.
So, I'm still finding challenges with:
input_folder.list_paths_in_partition()
If the file name has any file-path special characters, like:
\ backslash
* asterisk
actually in the file name, list_paths_in_partition() fails. (I know that these are somewhat unusual. However, I'm dealing with files in the wild, and they come in all sorts of shapes and sizes.)
Also, I have a few very large directories, with between 16,000 and 60,000 files in a single directory. SFTP can see them just fine.
However, those directories will also cause list_paths_in_partition() to fail.
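As a stopgap, I've been experimenting with walking the mounted path directly with the standard library (a sketch; get_path() only applies when the managed folder lives on a filesystem DSS can see locally, so this won't help for a pure SFTP connection):

# Sketch: list files by walking the folder's mount point instead of
# list_paths_in_partition(); only works for filesystem-backed folders.
import os
import dataiku

input_folder = dataiku.Folder("MYDIRECTORY")
root = input_folder.get_path()  # local path of the managed folder

paths = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        # keep paths relative to the folder root
        paths.append(os.path.relpath(os.path.join(dirpath, name), root))

print(len(paths))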
Has anyone else run into these same or similar problems? Does anyone have a more robust way to manage a large number of files on a DSS connection?
--Tom
This is going to be host-OS specific (in this case, for my Mac).
However, something like this:
find . -type f -exec md5 "{}" + > checklist.chk
is producing partial results like:
MD5 (./VQGAN + CLIP.pdf) = 2eec798c50d23c4afef7860b5aa811ab
MD5 (./data-architecture-basics.pdf) = 24192ec0f11a3aebd4ccf10464bb488e
MD5 (./lullabye.mid) = aec4b0d5bc0b171cebf661c2a755822c
Which is a really good start.
How do you set up your input file systems?
I'm not clear about what steps to use. I've been playing with a connection, folders, and file systems, and I'm not clear which is best for DSS. How are you setting up your connections into a shell recipe?
--Tom
Hi Tom,
The fileset I worked on was mounted locally and I have local admin rights. I don't know if the how-to mentioned below is the right way, but it worked for me:
If you need to write to this folder, check in Administration > Connections > filesystem_root under "usage params" whether "allow write" is checked (thank you @KeijiY for your tip here!).
It should be noted that this checkbox is there for a reason; potentially serious damage can be done to both DSS and the whole system, so please handle with care.
Possibly a better solution, if you do not need to write to this folder, is to access specific locations from within the script itself without allowing write actions on filesystem_root. Just create a new folder in Managed Folders to receive results from scripting, check that you have the proper permissions, and adjust the paths mentioned in the script. You can set the newly made folder as both input and output, without actually using it as an input in the script. It looks funny in the flow.