SharePoint connection and file parsing

shoareau Partner, Dataiku DSS Core Designer, Registered Posts: 8 Partner

I have successfully established a SharePoint connection to my SharePoint site.

But now I want to use a Python script to browse this SharePoint site, list the available files, and apply another Python script to one of them.

When I create a Python recipe, it generates a Dataset header like:

InputDLISSharedpoint = dataiku.Dataset("InputDLISSharedpoint")
InputDLISSharedpoint_df = InputDLISSharedpoint.get_dataframe()

How can I browse this SharePoint site to list all the files?

Thank you

Answers

  • AlexB Dataiker, Registered Posts: 68 Dataiker

    Hi !

    In your current setup, you used a custom dataset to access SharePoint. With this approach, the files contained in the SharePoint directory are already selected and opened, and the following Python recipe only has access to their parsed content.

    To access the file list, you instead need to access your SharePoint directory as a managed folder, and then use the corresponding Python API, which can be found here.

    To do that, go to your Dataiku project flow > +Dataset > Folder, and pick Sharepoint online / shared document in the "Store into" box.

    create_folder.png

    At first, an error message will appear because the settings are not yet configured. Go to Settings > Type of authentication and pick your preset. Next, use the Browse option to navigate to the SharePoint folder of interest. Once it is selected, press Save. In the View tab, you should now see the content of your SharePoint directory.

    From this folder, you can create a python recipe:

    flow.png

    The recipe's template code will now use the dataiku.Folder API. Calling list_paths_in_partition() will give you the list of files in the SharePoint folder.

    code.png

    logs.png
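    As a minimal sketch of that step, a small helper can also filter the listed paths by extension. This helper is hypothetical (not part of the Dataiku API); `folder` stands for any `dataiku.Folder` handle, which only needs its list_paths_in_partition() method here:

```python
def list_files_with_extension(folder, extension):
    """Return the paths in a dataiku.Folder-like handle whose names end
    with the given extension (e.g. ".dlis"), case-insensitively.
    Hypothetical helper: `folder` only needs list_paths_in_partition()."""
    return [path for path in folder.list_paths_in_partition()
            if path.lower().endswith(extension.lower())]
```

    In a recipe you would pass `dataiku.Folder("your_folder_id")` as `folder`.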

    Hope this helps,

    Alex

  • shoareau Partner, Dataiku DSS Core Designer, Registered Posts: 8 Partner

    Thank you for the answer, it helps!

    A complementary point has come up.

    I thought I could then, based on the available file list, directly read one specific file.

    But it's not so obvious, as it is not a local managed folder.

    So suppose the file list includes 2 binary files.

    I need the absolute path of those files to be able to read them.

    I have a specific Python library for processing these files; I just have to pass the file location as an argument. Let's say this library is called ProcessBinFile().

    But in the case of a folder backed by SharePoint, what is the best way to handle it?

    - Should I use the get_download_stream API?

    - Should I copy the file from the SharePoint to a managed folder, and use it more easily from there?

    Right now, I have access to the file list, but I cannot handle the files (whether opening or copying them) due to this path issue, I think.

  • AlexB Dataiker, Registered Posts: 68 Dataiker

    Indeed, in this context you have to use get_download_stream to open the file.

    It is actually a good thing to access your files this way, since it makes your code work regardless of the underlying method used to access the actual storage (local or not).

  • shoareau Partner, Dataiku DSS Core Designer, Registered Posts: 8 Partner

    Thank you Alex!

    And when I want to copy a file from this SharePoint folder to another managed folder, what is recommended?

    Regards

  • AlexB Dataiker, Registered Posts: 68 Dataiker
    edited July 17

    If it is one file, always with the same name, you can open it as a dataset (just like you did in your initial flow), and then use the "Export to folder" recipe, which lets you save the data, in various file formats, into a managed folder.

    If it is something more complex (multiple files, some processing to do in between), you can have a recipe reading from a SharePoint folder and writing into a local folder, using get_download_stream + read and get_writer + write. For instance, code like this would copy all the files present in the input folder (be it a SharePoint folder or anything else) into your target folder (which can be local):

    # -*- coding: utf-8 -*-
    import dataiku
    
    # Read recipe inputs
    InputDLISSharedpoint = dataiku.Folder("RIsZUUUk")
    
    # List every file path in the input folder
    paths = InputDLISSharedpoint.list_paths_in_partition()
    print("File list: {}".format(paths))
    
    # Write recipe outputs
    target = dataiku.Folder("wCWCzPXy")
    
    # Copy each file: read the full content from the input folder,
    # then write it under the same path in the target folder
    for path in paths:
        with InputDLISSharedpoint.get_download_stream(path) as input_file:
            data = input_file.read()
        with target.get_writer(path) as output_file:
            output_file.write(data)
    
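    For large binaries, a streamed variant of that copy avoids holding each whole file in memory. A sketch; the function name is made up, and the two folder arguments stand for `dataiku.Folder` handles (anything with `get_download_stream` and `get_writer`):

```python
import shutil

def copy_file_streamed(src_folder, dst_folder, path):
    """Copy one file between two folder handles chunk by chunk, using
    shutil.copyfileobj instead of reading the whole file at once.
    Hypothetical helper around the dataiku.Folder stream APIs."""
    with src_folder.get_download_stream(path) as src:
        with dst_folder.get_writer(path) as dst:
            shutil.copyfileobj(src, dst)
```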

  • UserKp Registered Posts: 20

    If I have multiple files in the SharePoint folder, like df1, df2, df3 and so on, can I create a combined dataset from this folder? Is that possible?

    I want to read all the files from this SharePoint folder and then stack all the dataframes; df1, df2, etc. have the same column names.

  • Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron

    Please create a new thread when you ask a new question. You can use the Files in Folder dataset if all the files have the same structure; it will load them and stack them in one single go:

    https://community.dataiku.com/t5/Using-Dataiku/Using-the-quot-Files-in-folder-quot-dataset/m-p/33214
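    If you do want to stack the files in Python rather than with a Files in Folder dataset, a sketch along these lines would work. The function is hypothetical; `folder` is any object with `get_download_stream` (such as a `dataiku.Folder`), and the files are assumed to be CSVs with identical columns:

```python
import pandas as pd

def stack_csvs_from_folder(folder, paths):
    """Read each CSV from a folder handle and concatenate them into one
    DataFrame; assumes every file has the same column names.
    Hypothetical helper."""
    frames = []
    for path in paths:
        with folder.get_download_stream(path) as stream:
            frames.append(pd.read_csv(stream))
    return pd.concat(frames, ignore_index=True)
```

    In a recipe, `paths` would come from `folder.list_paths_in_partition()`.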

  • UserKp Registered Posts: 20

    If the data in some of the SharePoint files is refreshed, will the stacked dataset auto-refresh when I use it in a Python recipe?

  • Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,088 Neuron

    Please post a new thread.
