Sharepoint connection and file parsing

shoareau
Level 1

I have successfully established a SharePoint connection to my SharePoint.

Now I want to use a Python script to browse this SharePoint, list the available files, and apply another Python script to one of them.

 

When I create a Python recipe, it generates a "Dataset" header like:

InputDLISSharedpoint = dataiku.Dataset("InputDLISSharedpoint")
InputDLISSharedpoint_df = InputDLISSharedpoint.get_dataframe()

 

How can I browse this SharePoint to list all the files?

 

 

Thank you

 

9 Replies
AlexB
Dataiker

Hi !

In your current setup, you have used a custom dataset to access SharePoint. With this approach, the files contained in the SharePoint directory are already selected and opened, and the Python recipe that follows has access only to their content.

To access the file list, you instead need to access your SharePoint directory as a managed folder, and then use the corresponding Python API, which can be found here

To do that, go to your Dataiku project flow > +Dataset > Folder, and pick Sharepoint online / shared document in the "Store into" box.

create_folder.png

At first, an error message will appear because the settings have not been selected yet. Go to Settings > Type of authentication and pick your preset. Next, use the Browse option to go to the SharePoint folder of interest. Once it is selected, press Save. In the View tab, you should now see the content of your SharePoint directory.

From this folder, you can create a python recipe:

flow.png


The recipe's template code will now use the dataiku.Folder class. Calling list_paths_in_partition() on the folder object should give you the list of files in the SharePoint folder.
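For readers who cannot see the screenshots, here is a minimal sketch of the listing step. It only relies on list_paths_in_partition(), which is mentioned above; the helper name list_folder_files and its extensions parameter are hypothetical, added for illustration. The helper takes the folder object as an argument so that it can be tried outside DSS as well.

```python
def list_folder_files(folder, extensions=None):
    """Return the sorted file paths of a folder-like object.

    `folder` is assumed to expose list_paths_in_partition(), as
    dataiku.Folder does. `extensions` is an optional, hypothetical
    filter such as [".bin"]; matching is case-insensitive.
    """
    paths = folder.list_paths_in_partition()
    if extensions:
        wanted = tuple(ext.lower() for ext in extensions)
        paths = [p for p in paths if p.lower().endswith(wanted)]
    return sorted(paths)


# Inside a DSS Python recipe, usage would look roughly like this
# (the folder ID is a placeholder, take it from your recipe template):
#
#   import dataiku
#   folder = dataiku.Folder("YOUR_FOLDER_ID")
#   print(list_folder_files(folder, extensions=[".bin"]))
```

The paths returned are relative to the folder root, not absolute paths on a disk.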

code.png


logs.png

Hope this helps,

Alex

 

shoareau
Level 1
Author

Thank you for the answer, it helps!

 

A follow-up point has come up.

I thought I could then directly read one specific file, based on the available file list.

But it is not so obvious, as it is not a managed folder.

So suppose the file list includes 2 binary files.

I need to get the absolute path of those files to be able to read them.

I have a specific Python library that processes this kind of file; I just have to pass the file location as an argument. Let's say the library function is called ProcessBinFile().

But in the case of a folder related to a SharePoint, what is the best way to handle it?

- Should I use the get_download_stream API?

- Should I copy the file from this SharePoint to a managed folder, and use it more easily from there?

Right now, I have access to the file list, but I cannot handle the files (whether to open them or copy them...), due to this path issue I think.

 

 

AlexB
Dataiker

Indeed, in this context you have to use get_download_stream to open the file.

It is actually a good thing to access your files this way, since it makes your code work regardless of the underlying method used to access the actual storage (local or not).
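If the processing library only accepts a file location, one possible approach is to copy the stream into a temporary local file and pass that path along. This is only a sketch under assumptions: the helper name process_via_temp_file is hypothetical, and process_fn stands in for the ProcessBinFile() function mentioned in the question, whose real signature is unknown here.

```python
import os
import shutil
import tempfile


def process_via_temp_file(folder, path, process_fn):
    """Download `path` from a folder-like object into a temporary local
    file, call process_fn(local_path), then delete the temporary file.

    `folder` is assumed to expose get_download_stream(path), as
    dataiku.Folder does. `process_fn` is a placeholder for a library
    function that needs a real file location.
    """
    # Keep the original extension in case the library inspects it
    suffix = os.path.splitext(path)[1]
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        # Copy the remote content chunk by chunk; the temporary file is
        # closed when the with block exits, so it can be reopened by name
        with tmp, folder.get_download_stream(path) as stream:
            shutil.copyfileobj(stream, tmp)
        return process_fn(tmp.name)
    finally:
        os.remove(tmp.name)


# Inside a DSS Python recipe, usage would look roughly like this
# (folder ID and file path are placeholders):
#
#   import dataiku
#   folder = dataiku.Folder("YOUR_FOLDER_ID")
#   result = process_via_temp_file(folder, "/file1.bin", ProcessBinFile)
```

The temporary file is removed even if the processing function raises, so nothing is left behind on the DSS host.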

 

shoareau
Level 1
Author

Thank you Alex!

And if I want to copy a file available in this SharePoint folder to another managed folder, what is recommended?

 

 

Regards

 

AlexB
Dataiker

If it is one file, always with the same name, you can open it as a dataset (just like you did in your initial flow), and then use the "Export to folder" recipe, which lets you save the data, in various file formats, in a managed folder.

If it is something more complex (multiple files, some processing to do in between), you can have a recipe that reads from a SharePoint folder and writes into a local folder, using get_download_stream + read / get_writer + write. For instance, code like this would copy all the files present in the input folder (be it a SharePoint folder or anything else) into your target folder (which can be local):

# -*- coding: utf-8 -*-
import dataiku

# Read recipe inputs: the SharePoint-backed folder
InputDLISSharedpoint = dataiku.Folder("RIsZUUUk")

# List every file path available in the input folder
paths = InputDLISSharedpoint.list_paths_in_partition()
print("File list: {}".format(paths))

# Write recipe outputs: the target folder (can be local)
target = dataiku.Folder("wCWCzPXy")

# Copy each file: read it fully from the input folder,
# then write it to the target folder under the same path
for path in paths:
    with InputDLISSharedpoint.get_download_stream(path) as input_file:
        data = input_file.read()
    with target.get_writer(path) as output_file:
        output_file.write(data)
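
The loop above reads each file fully into memory before writing it. For large files, a possible variant is to hand the download stream directly to the target folder. The sketch below assumes the target exposes upload_stream(path, file_like), as dataiku.Folder does; the helper name copy_folder_streaming is hypothetical.

```python
def copy_folder_streaming(source, target):
    """Copy every file from `source` to `target` under the same path,
    passing the download stream straight to the target folder.

    Both arguments are assumed to behave like dataiku.Folder objects:
    `source` needs list_paths_in_partition() and get_download_stream(),
    `target` needs upload_stream(path, file_like).
    Returns the list of copied paths.
    """
    copied = []
    for path in source.list_paths_in_partition():
        with source.get_download_stream(path) as stream:
            target.upload_stream(path, stream)
        copied.append(path)
    return copied


# Inside a DSS Python recipe, usage would look roughly like this
# (folder IDs are placeholders):
#
#   import dataiku
#   copy_folder_streaming(dataiku.Folder("INPUT_ID"), dataiku.Folder("OUTPUT_ID"))
```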

 

UserKp
Level 3

If I have multiple files in the SharePoint folder, like df1, df2, df3 and so on, can I create a combined dataset from this folder? Is it possible?

I want to read all the files from this SharePoint folder and then stack all the dataframes; df1, df2, etc. have the same column names.

Turribeach

Please create a new thread when you ask a new question. If all the files have the same structure, you can use the Files in Folder dataset, which will load and stack them in a single go:

https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214

 

UserKp
Level 3

If the data is refreshed in some of the files in SharePoint, will it auto-refresh after I use that stacked dataset in a Python recipe?

Turribeach

Please post a new thread. 
