Creating dataset from Pandas
Hello nice people
I am pulling data from a REST API in a Jupyter notebook in DSS and doing a lot of processing on the pandas dataframe I create.
I would like to save the dataframe as a dataset that I can later explore within the project I am working in.
I am trying something like:
    if not results.empty:
        output_data = dataiku.Dataset(instrument_name + "_" + event_name + "_" + timestr)
        output_data.write_dataframe(results)
But I always run into this error:
Unable to fetch schema for PROJ1.participant_screening_20210623-233350: dataset does not exist: PROJ1.participant_screening_20210623-233350
I tried some other alternatives to write the dataframe to a dataset, but DSS seems to look for a schema with the project name (PROJ1)?
Is there an easy way to get the dataframes into DSS datasets?
PS: this is a test instance. I am not using a database for intermediate datasets but writing to disk for testing purposes.
Thanks
Rad
Answers
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 321 Neuron
Hi @RadAniba,
Following is an example of how to create a new dataset in Python and then write a dataframe to it.
    import dataiku

    dataset_name = 'TEST'

    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    # Create a SQL dataset (you can create other types by specifying
    # different parameters for the with_store_into method)
    # Documentation here: https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#programmatic-creation-and-setup-managed-datasets
    # Note that documentation shows project.new_managed_dataset which is incorrect
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into("NZ_DSWRK")
    builder.create()

    # Write dataframe to dataset
    dataiku.Dataset(dataset_name).write_with_schema(df)
The dataset will show in the UI as not built. You can right click on the dataset and choose "mark as built" to fix this.
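If several dataframes need to be saved (for example one per instrument and event, as in the question), the steps above can be wrapped in a small function. This is only a minimal sketch: the function name `write_frames` and its parameters are hypothetical, and it uses only the calls shown in the example. The project handle and the dataset opener are passed in as arguments so the function itself does not depend on the notebook environment.

```python
def write_frames(project, frames, connection, open_dataset):
    """Create one managed dataset per dataframe and write the data into it.

    project      -- project handle, obtained as in the example above
    frames       -- dict mapping dataset name -> pandas dataframe
    connection   -- name of an existing DSS connection (assumption: it is
                    already configured on the instance)
    open_dataset -- callable returning a dataset handle for a given name;
                    in a DSS notebook this would be dataiku.Dataset
    """
    written = []
    for dataset_name, df in frames.items():
        # Create the (empty) managed dataset first, as in the example above
        builder = project.new_managed_dataset_creation_helper(dataset_name)
        builder.with_store_into(connection)
        builder.create()
        # Then write the dataframe, letting DSS take the schema from it
        open_dataset(dataset_name).write_with_schema(df)
        written.append(dataset_name)
    return written
```

In a DSS notebook this would be called with the `project` handle from the example and `dataiku.Dataset` as `open_dataset`, e.g. `write_frames(project, {"TEST": df}, "NZ_DSWRK", dataiku.Dataset)`.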
Note that this example uses both the "external" API (dataikuapi) to create the dataset and the internal API (dataiku) to write the dataframe to the dataset. More on the differences here.
Hope this helps.
Marlan
-
Thank you @Marlan, this is really helpful.
Would it be possible to create a dataset without the SQL method? If I understand correctly, this is based on a Postgres connection to store intermediate files, but what if I wanted to test by writing a dataset to disk instead?
Rad
-
Marlan
Hi @RadAniba,
You can write to any type of SQL database you have a connection set up for. The example I gave included a connection for a Netezza database. Not sure what you mean about storing intermediate tables. We write both intermediate and final output data to Netezza.
To create a file dataset, use the filesystem folders connection in the "with_store_into" method, e.g., builder.with_store_into('filesystem_folders')
Note also that you can pass "overwrite=True" to the create method.
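Putting the two points above together with the snippet from the original question, a file-based version might look like the sketch below. The helper names `build_dataset_name` and `create_file_dataset` are hypothetical, the timestamp format is an assumption inferred from the name in the error message, and the connection name 'filesystem_folders' is the one mentioned above and may differ on your instance.

```python
import time


def build_dataset_name(instrument_name, event_name, when=None):
    """Build a name like participant_screening_20210623-233350.

    The timestamp format is an assumption inferred from the error message
    in the question.
    """
    if when is None:
        when = time.localtime()
    timestr = time.strftime("%Y%m%d-%H%M%S", when)
    return instrument_name + "_" + event_name + "_" + timestr


def create_file_dataset(project, dataset_name, connection="filesystem_folders"):
    """Create a managed dataset on a filesystem connection.

    The connection name is taken from the reply above and is an assumption
    about your instance.
    """
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into(connection)
    # overwrite=True replaces an existing dataset of the same name
    builder.create(overwrite=True)
    return dataset_name


# Usage in a DSS notebook (not run here, needs the dataiku package):
#   if not results.empty:
#       name = build_dataset_name(instrument_name, event_name)
#       create_file_dataset(project, name)
#       dataiku.Dataset(name).write_with_schema(results)
```

Creating the dataset before writing addresses the "dataset does not exist" error from the question, since `dataiku.Dataset(name)` by itself only returns a handle.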
Marlan
-
Fantastic,
thank you for clarifying, this helped very much