Want to Stop Rebuilding "Expensive" Parts of your Flow? Explicit Builds are the Answer!READ MORE

Upload excel file monthly

MARU
Level 2
Upload excel file monthly

Hi

I have an excel file that comes in weekly and I saved it on my desktop.  I want to automate this where Dataiku picks up this file automatically when I get a new one.  I have looked everywhere but unable to find any information on it.  I'm a beginner and this would be super helpful to understand.

0 Kudos
5 Replies
VitaliyD
Dataiker
Dataiker

Hi,

Because it is on your desktop, and I assume DSS is hosted on some server somewhere, you won't be able to achieve this if you want to do it from the DSS (i mean, almost anything is possible with the right hardware/software setup, but I don't think it will make sense to do it this way in this case). Not DSS should be picking it. You should push it to DSS using the external DSS API. Please check our API documentation and check an example I created for your reference below. Please note you would need to install "pandas==1.0.5" and "openpyxl" and, of course, "dataiku" packages before you run the script. Also, the below script will create a dataset from your file, so most likely, you would need to modify it to suit your needs.

import dataiku
import pandas as pd

dataiku.set_remote_dss("http://localhost:19700/", "8MXFDEN9TUEI03DRM3RRNHW33EXJC47W") # update with your host and api key

client = dataiku.api_client()
dataiku.set_default_project_key("NOTEBOOKS")

project = client.get_default_project()

file_path = "/path/to/file.xlsx" # add path to your file

df = pd.read_excel(file_path, engine='openpyxl') #use temporary file

new_dataset_name = "pushed_file2" # here specify a new dataset name
builder = project.new_managed_dataset(new_dataset_name)
builder.with_store_into("filesystem_managed") # here specify the connection name you are want to use
dataset = builder.create()
output_dataset = dataiku.Dataset(new_dataset_name)
output_dataset.write_with_schema(df)

 

Otherwise, if your file will reside somewhere externally, like in the S3 bucket, it is also possible to automate it from the DSS.

I hope this helps.

Best.

0 Kudos
MARU
Level 2
Author

Sorry, I'm completely confused on this and don't know where to start.  I looked up the links that you provided for API but "simply confused".  My excel file is on a desktop and my recipe is in Dataiku.  I just want the new excel file to pick up each time.  Is there a simpler method..........?

0 Kudos
VitaliyD
Dataiker
Dataiker

Ok, there won't be any other method, as the DSS simply has no direct access to your Desktop. The easiest method I can think of is if you have NAS drive or mount some FTP location or some other NFS drive to your computer and store your file there, you will be able to set up DSS to access the file and use the scenario to rebuild your dataset created from the file automatically.

With that said, maybe you just need to step back and see if you can stop using excel and connect your data sources directly to DSS.

If you are using excel for data processing, you don't really need excel at all and much more can be done in DSS itself. You may find the below recourses useful in the above is case:

Excel to Dataiku Quickstart (link to KB page ) and the Excel to Dataiku Playbook.

Jurre
Neuron
Neuron

@VitaliyD  is there a specific reason for using openpyxl here?  Pandas can handle that out of the box i would think.. 

0 Kudos
VitaliyD
Dataiker
Dataiker

@Jurre If the engine is not specified, by default, Pandas will use xlrd (at least in the version 1.0.5 I tested the script with), where the latest version that supports .xlsx files is 1.20, so if you have the latest xlrd package installed, you will get an error. If the .xls file is used, there is no need to specify the engine. Openpyxl is recommended package for .xlsx files(reference). Hence as I used .xlsx in the example, I specified the openpyxl as the engine to use.

Best.