Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I have an excel file that comes in weekly and I saved it on my desktop. I want to automate this where Dataiku picks up this file automatically when I get a new one. I have looked everywhere but unable to find any information on it. I'm a beginner and this would be super helpful to understand.
Because it is on your desktop, and I assume DSS is hosted on some server somewhere, you won't be able to achieve this if you want to do it from the DSS (i mean, almost anything is possible with the right hardware/software setup, but I don't think it will make sense to do it this way in this case). Not DSS should be picking it. You should push it to DSS using the external DSS API. Please check our API documentation and check an example I created for your reference below. Please note you would need to install "pandas==1.0.5" and "openpyxl" and, of course, "dataiku" packages before you run the script. Also, the below script will create a dataset from your file, so most likely, you would need to modify it to suit your needs.
import dataiku import pandas as pd dataiku.set_remote_dss("http://localhost:19700/", "8MXFDEN9TUEI03DRM3RRNHW33EXJC47W") # update with your host and api key client = dataiku.api_client() dataiku.set_default_project_key("NOTEBOOKS") project = client.get_default_project() file_path = "/path/to/file.xlsx" # add path to your file df = pd.read_excel(file_path, engine='openpyxl') #use temporary file new_dataset_name = "pushed_file2" # here specify a new dataset name builder = project.new_managed_dataset(new_dataset_name) builder.with_store_into("filesystem_managed") # here specify the connection name you are want to use dataset = builder.create() output_dataset = dataiku.Dataset(new_dataset_name) output_dataset.write_with_schema(df)
Otherwise, if your file will reside somewhere externally, like in the S3 bucket, it is also possible to automate it from the DSS.
I hope this helps.
Sorry, I'm completely confused on this and don't know where to start. I looked up the links that you provided for API but "simply confused". My excel file is on a desktop and my recipe is in Dataiku. I just want the new excel file to pick up each time. Is there a simpler method..........?
Ok, there won't be any other method, as the DSS simply has no direct access to your Desktop. The easiest method I can think of is if you have NAS drive or mount some FTP location or some other NFS drive to your computer and store your file there, you will be able to set up DSS to access the file and use the scenario to rebuild your dataset created from the file automatically.
With that said, maybe you just need to step back and see if you can stop using excel and connect your data sources directly to DSS.
If you are using excel for data processing, you don't really need excel at all and much more can be done in DSS itself. You may find the below recourses useful in the above is case:
@Jurre If the engine is not specified, by default, Pandas will use xlrd (at least in the version 1.0.5 I tested the script with), where the latest version that supports .xlsx files is 1.20, so if you have the latest xlrd package installed, you will get an error. If the .xls file is used, there is no need to specify the engine. Openpyxl is recommended package for .xlsx files(reference). Hence as I used .xlsx in the example, I specified the openpyxl as the engine to use.