Upload excel file monthly

MARU
MARU Registered Posts: 9 ✭✭✭

Hi

I have an excel file that comes in weekly and I saved it on my desktop. I want to automate this where Dataiku picks up this file automatically when I get a new one. I have looked everywhere but unable to find any information on it. I'm a beginner and this would be super helpful to understand.

Answers

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker
    edited July 17

    Hi,

    Because it is on your desktop, and I assume DSS is hosted on some server somewhere, you won't be able to achieve this if you want to do it from the DSS (i mean, almost anything is possible with the right hardware/software setup, but I don't think it will make sense to do it this way in this case). Not DSS should be picking it. You should push it to DSS using the external DSS API. Please check our API documentation and check an example I created for your reference below. Please note you would need to install "pandas==1.0.5" and "openpyxl" and, of course, "dataiku" packages before you run the script. Also, the below script will create a dataset from your file, so most likely, you would need to modify it to suit your needs.

    import dataiku
    import pandas as pd
    
    dataiku.set_remote_dss("http://localhost:19700/", "8MXFDEN9TUEI03DRM3RRNHW33EXJC47W") # update with your host and api key
    
    client = dataiku.api_client()
    dataiku.set_default_project_key("NOTEBOOKS")
    
    project = client.get_default_project()
    
    file_path = "/path/to/file.xlsx" # add path to your file
    
    df = pd.read_excel(file_path, engine='openpyxl') #use temporary file
    
    new_dataset_name = "pushed_file2" # here specify a new dataset name
    builder = project.new_managed_dataset(new_dataset_name)
    builder.with_store_into("filesystem_managed") # here specify the connection name you are want to use
    dataset = builder.create()
    output_dataset = dataiku.Dataset(new_dataset_name)
    output_dataset.write_with_schema(df)

    Otherwise, if your file will reside somewhere externally, like in the S3 bucket, it is also possible to automate it from the DSS.

    I hope this helps.

    Best.

  • MARU
    MARU Registered Posts: 9 ✭✭✭

    Sorry, I'm completely confused on this and don't know where to start. I looked up the links that you provided for API but "simply confused". My excel file is on a desktop and my recipe is in Dataiku. I just want the new excel file to pick up each time. Is there a simpler method..........?

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker

    Ok, there won't be any other method, as the DSS simply has no direct access to your Desktop. The easiest method I can think of is if you have NAS drive or mount some FTP location or some other NFS drive to your computer and store your file there, you will be able to set up DSS to access the file and use the scenario to rebuild your dataset created from the file automatically.

    With that said, maybe you just need to step back and see if you can stop using excel and connect your data sources directly to DSS.

    If you are using excel for data processing, you don't really need excel at all and much more can be done in DSS itself. You may find the below recourses useful in the above is case:

    Excel to Dataiku Quickstart (link to KB page ) and the Excel to Dataiku Playbook.

  • Jurre
    Jurre Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts, Registered, Dataiku DSS Developer, Neuron 2022 Posts: 114 ✭✭✭✭✭✭✭

    @VitaliyD
    is there a specific reason for using openpyxl here? Pandas can handle that out of the box i would think..

  • VitaliyD
    VitaliyD Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 102 Dataiker

    @Jurre
    If the engine is not specified, by default, Pandas will use xlrd (at least in the version 1.0.5 I tested the script with), where the latest version that supports .xlsx files is 1.20, so if you have the latest xlrd package installed, you will get an error. If the .xls file is used, there is no need to specify the engine. Openpyxl is recommended package for .xlsx files(reference). Hence as I used .xlsx in the example, I specified the openpyxl as the engine to use.

    Best.

Setup Info
    Tags
      Help me…