Hello nice people
I am pulling data from a REST API in a Jupyter notebook in DSS and doing a lot of processing on the pandas dataframe I create.
I would like to save the dataframe as a dataset that I can later explore within the project I am working in.
I am trying something like:
    if not results.empty:
        output_data = dataiku.Dataset(instrument_name + "_" + event_name + "_" + timestr)
        output_data.write_dataframe(results)
But I always run into this problem:
Unable to fetch schema for PROJ1.participant_screening_20210623-233350: dataset does not exist: PROJ1.participant_screening_20210623-233350
I tried some other ways to write the dataframe to a dataset, but DSS seems to look for a schema with the project name (PROJ1)?
Is there an easy way to get dataframes into DSS datasets?
PS: this is a test instance. I am not using a database for intermediate datasets, just writing to disk for testing purposes.
Following is an example of how to create a new dataset in Python and then write a dataframe to it.
    import dataiku

    dataset_name = 'TEST'

    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    # Create a SQL dataset (you can create other types by specifying different parameters for the with_store_into method)
    # Documentation here: https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#programmatic-creation-and-setup-managed-datasets
    # Note that documentation shows project.new_managed_dataset which is incorrect
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into("NZ_DSWRK")
    builder.create()

    # Write dataframe to dataset
    dataiku.Dataset(dataset_name).write_with_schema(df)
The dataset will show in the UI as not built. You can right click on the dataset and choose "mark as built" to fix this.
Note that this example uses both the "external" API (dataikuapi) to create the dataset and the internal API (dataiku) to write the dataframe to the dataset. More on the differences here.
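To tie this back to the original question, the create-then-write steps could be wrapped in a small helper, roughly as sketched below. Treat it as a sketch under assumptions: build_dataset_name and save_dataframe are made-up names, the project handle and dataiku.Dataset are passed in as arguments so the function can be tried outside DSS, and the timestamp uses an underscore instead of a hyphen only as a cautious choice for names that may end up as SQL table names. As far as I know already_exists() is available on the creation helper, but check it against your DSS version.

```python
import time


def build_dataset_name(instrument_name, event_name, when=None):
    """Build a name such as participant_screening_20210623_233350.

    `when` is an optional time.struct_time; the current local time is used
    if it is omitted. (Hypothetical helper, not part of the dataiku API.)
    """
    stamp = time.strftime("%Y%m%d_%H%M%S", when or time.localtime())
    return f"{instrument_name}_{event_name}_{stamp}"


def save_dataframe(project, dataset_factory, dataset_name, df, connection):
    """Create the managed dataset if it is missing, then write df to it.

    project         -- project handle, e.g. from dataiku.api_client().get_project(...)
    dataset_factory -- normally dataiku.Dataset (passed in, not imported here)
    connection      -- name of an existing DSS connection, e.g. "NZ_DSWRK"

    Returns True if data was written, False if the dataframe was empty.
    (Hypothetical helper, not part of the dataiku API.)
    """
    if df.empty:
        return False

    builder = project.new_managed_dataset_creation_helper(dataset_name)
    if not builder.already_exists():
        builder.with_store_into(connection)
        builder.create()

    # The dataset now exists, so the schema lookup should no longer fail.
    dataset_factory(dataset_name).write_with_schema(df)
    return True


# Inside a DSS notebook the call would look roughly like this:
#
#   import dataiku
#   project = dataiku.api_client().get_project(dataiku.default_project_key())
#   name = build_dataset_name(instrument_name, event_name)
#   save_dataframe(project, dataiku.Dataset, name, results, "NZ_DSWRK")
```

The point is that dataiku.Dataset(name) only gives you a handle to a dataset; it does not create one, which is why the schema fetch fails for a name that does not exist yet.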
Hope this helps.
Thank you @Marlan this is really helpful
Would it be possible to create a dataset without the SQL method? If I understand correctly, this relies on a Postgres connection to store intermediate files, but what if I wanted to test by writing a dataset to disk instead?
You can write to any type of SQL database you have a connection set up for. The example I gave included a connection for a Netezza database. Not sure what you mean about storing intermediate tables. We write both intermediate and final output data to Netezza.
To create a file dataset, use the filesystem folders connection in the "with_store_into" method, e.g., builder.with_store_into('filesystem_folders')
Note also that you can pass "overwrite=True" to the create method.
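Putting those two notes together, a file-based version might look like the sketch below. Same caveats as any untested snippet: write_to_file_dataset is a made-up name, the project handle and dataiku.Dataset are passed in as arguments, and 'filesystem_folders' is simply the connection name mentioned above, so substitute whatever filesystem connection exists on your instance.

```python
def write_to_file_dataset(project, dataset_factory, dataset_name, df,
                          connection="filesystem_folders"):
    """Create (or re-create) a file-based managed dataset and write df to it.

    project         -- project handle, e.g. from dataiku.api_client().get_project(...)
    dataset_factory -- normally dataiku.Dataset (passed in, not imported here)
    connection      -- assumed name of a filesystem connection on your instance

    (Hypothetical helper, not part of the dataiku API.)
    """
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into(connection)
    # overwrite=True is meant to replace an existing dataset of the same name
    # instead of raising an error because it already exists.
    builder.create(overwrite=True)

    dataset_factory(dataset_name).write_with_schema(df)
    return dataset_name


# Inside a DSS notebook the call would look roughly like this:
#
#   import dataiku
#   project = dataiku.api_client().get_project(dataiku.default_project_key())
#   write_to_file_dataset(project, dataiku.Dataset, "TEST", df)
```

With overwrite=True the same notebook cell can be re-run while testing without having to delete the dataset by hand first.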