Export a pandas dataframe directly to a local file
Hello everybody,
I am building a Jupyter Notebook in a Dataiku project, and I would like to know if it is possible to export a pandas DataFrame directly to my local computer.
I saw discussions in the forum that explained how to export pandas DataFrames into a Dataiku managed folder, but I would like to go one step further.
Ideally, I would like to automate it (through a scenario or something like that).
Thank you in advance for your help. Have a good day.
Best regards,
Jean-Luc.
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,901 Neuron
It's always best to explain what you are trying to do, since there could be many ways of achieving it. Based on your requirements, it makes no sense to export anything from a Jupyter Notebook if you want to automate this, and you wouldn't run it from a Jupyter Notebook in Dataiku either. Having said that, it's perfectly fine to experiment in a Jupyter Notebook while you get the code and output working. While you are developing, you can use the trick I shared to inspect the output df and make sure it has what you want.
Sooner or later, though, you will need to convert your Jupyter Notebook into a Python recipe. You can do that by clicking the CREATE RECIPE button (*). Then define an output dataset and write to it. The best way to automate your recipe run is to use a Scenario. The best way to export the dataset output and share it with your team will probably be by email: add a Mail Reporter in your scenario settings and attach the dataset as an Excel file. If you want the reporter to run every time the scenario runs successfully, change the Run Condition to: outcome == 'SUCCESS'
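Finally, to your questions:
- Where is the HTML library located (I can see the HTML keyword in your function)?
- There is no separate HTML library to install. HTML is a class from IPython.display (import it with: from IPython.display import HTML). The code generates an HTML <a href> tag, which is then displayed in the Notebook.
- What do "data" and the colon mean in the URL defined inside your function?
- This is part of the data URI scheme, which is supported by HTML <a href> tags. The URL starts with "data:", followed by the media type (text/csv), the encoding (base64) and the encoded file content.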
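As a small illustration of the data URI scheme mentioned above, here is a sketch that builds such a URI by hand and decodes it again. The helper names make_data_uri and read_data_uri are made up for this example, and only the Python standard library is used.

```python
import base64


def make_data_uri(text, media_type="text/csv"):
    """Build a base64 data URI: data:<media type>;base64,<payload>."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "data:{};base64,{}".format(media_type, payload)


def read_data_uri(uri):
    """Split a base64 data URI back into (media type, decoded text)."""
    header, payload = uri.split(",", 1)
    media_type = header[len("data:"):].split(";")[0]
    return media_type, base64.b64decode(payload).decode("utf-8")


# Hypothetical CSV content, standing in for the output of df.to_csv()
csv_text = "name,count\nprojects,3\n"
uri = make_data_uri(csv_text)
print(uri)
```

The browser does the same decoding when you click the link: everything after the comma is the file content, so no request goes back to the server.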
(*) Note that you can continue to edit your Python Recipe in a Jupyter Notebook if you prefer, but for automated runs the code needs to be in a Recipe. Make sure you always save any changes made in the Jupyter Notebook back to the Recipe.
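The "define an output dataset and write to it" step could look roughly like the sketch below. It assumes the standard dataiku package available inside DSS recipes and its Dataset.write_with_schema method. The function name write_output, the dataset_cls parameter and the dataset name "project_inventory" are hypothetical and only there for illustration.

```python
def write_output(df, output_name, dataset_cls=None):
    """Write a DataFrame to a Dataiku output dataset and return the handle.

    dataset_cls is a hypothetical hook so the function can be tried
    outside DSS; inside a recipe, leave it as None.
    """
    if dataset_cls is None:
        # Imported lazily: the dataiku package only exists inside DSS.
        import dataiku
        dataset_cls = dataiku.Dataset
    output = dataset_cls(output_name)
    # Writes the rows and sets the dataset schema from the DataFrame columns.
    output.write_with_schema(df)
    return output


# Inside the Python recipe (the dataset name is an assumption):
# write_output(df, "project_inventory")
```

The output dataset has to be declared as an output of the recipe first. Once it is, the Scenario and its Mail Reporter can pick it up without any extra code.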
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,901 Neuron
Your Jupyter Notebook running on a Dataiku server will not have access to your local computer, unless of course you set up some sort of network share or you are running Dataiku on your own machine. So, assuming you are using Dataiku on a server, the best you can do is export to a file and download it from the Notebook itself:
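import base64

from IPython.display import HTML  # renders an HTML string in the Notebook

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    # Serialize the DataFrame to CSV text and base64-encode it
    csv = df.to_csv()
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    # Embed the payload in a data URI so the browser saves it as a file
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

create_download_link(df)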
This piece of code creates a download link in your Notebook that saves the df DataFrame as a CSV file.
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,901 Neuron
Also, it would be best if you explained exactly what you are trying to achieve, as we might be able to suggest better ways of doing it. Downloading a file to your local machine is clearly not the goal, but a means of achieving it.
-
Hello Turribeach,
Thank you for your reply. I have just two questions:
- Where is the HTML library located (I can see the HTML keyword in your function)?
- What do "data" and the colon mean in the URL defined inside your function?
My goal is to inventory all the Dataiku projects we use in order to identify the datasets, recipes, and scenarios, and then check whether, for example, some datasets are unused, so that we can clean up or optimize the projects.
This information is then collected into a pandas DataFrame, which I would like to export and share with my team, and use for further investigation if necessary.
So my initial idea was to use the Dataiku API to achieve this, through a Dataiku notebook (I found a model in the tutorials).
Note that this export does not need to be run very frequently. I am currently in a development phase, and I am looking for the best way to achieve this.
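For readers with the same goal, a minimal sketch of such an inventory is shown below. It assumes the Dataiku public API client methods list_project_keys, get_project, list_datasets, list_recipes and list_scenarios, which should be checked against your DSS version. The function name inventory_rows and the column names are hypothetical.

```python
def inventory_rows(client):
    """Return one dict per project with counts of its main objects.

    client is expected to be a Dataiku API client, for example the
    object returned by dataiku.api_client() inside DSS.
    """
    rows = []
    for project_key in client.list_project_keys():
        project = client.get_project(project_key)
        rows.append({
            "project": project_key,
            "datasets": len(project.list_datasets()),
            "recipes": len(project.list_recipes()),
            "scenarios": len(project.list_scenarios()),
        })
    return rows


# Inside a DSS notebook or recipe (not runnable outside DSS):
# import dataiku
# import pandas as pd
# df = pd.DataFrame(inventory_rows(dataiku.api_client()))
```

Counting objects is only a starting point. Finding unused datasets would need extra logic, which is not covered here.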
Any help will be greatly appreciated.
Best regards,
Jean-Luc