Creating dataset from Pandas

RadAniba
Level 2
Hello nice people

I am pulling data from a REST API in a Jupyter notebook in DSS and doing a lot of processing on the resulting pandas dataframe.

I would like to save the dataframe as a dataset that I can later explore within the project I am working in.

I am trying something like:

if not results.empty:
    output_data = dataiku.Dataset(instrument_name + "_" + event_name + "_" + timestr)
    output_data.write_dataframe(results)

But I always run into this error:

Unable to fetch schema for PROJ1.participant_screening_20210623-233350: dataset does not exist: PROJ1.participant_screening_20210623-233350

I tried some other ways of writing the dataframe to a dataset, but DSS seems to look for a schema with the project name (PROJ1)?

Is there an easy way to get dataframes into DSS datasets?

PS: this is a test instance. I am not using a database for intermediate datasets but writing to disk for testing purposes.

Thanks

Rad

4 Replies
Marlan

Hi @RadAniba,

Following is an example of how to create a new dataset in Python and then write a dataframe to it.

import dataiku

dataset_name = 'TEST'

# Get a handle to the current project
client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# Create a SQL dataset (you can create other types by specifying different parameters for the with_store_into method)
# Documentation here: https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#programmatic-creation-and-setup-managed-datasets
# Note that documentation shows project.new_managed_dataset which is incorrect
builder = project.new_managed_dataset_creation_helper(dataset_name)
builder.with_store_into("NZ_DSWRK")
builder.create() 

# Write dataframe to dataset (df is the pandas dataframe you want to save)
dataiku.Dataset(dataset_name).write_with_schema(df)

The dataset will show in the UI as not built. You can right click on the dataset and choose "mark as built" to fix this.

Note that this example uses both the "external" API (dataikuapi) to create the dataset and the internal API (dataiku) to write the dataframe to it. More on the differences here.
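If the notebook is run more than once, creating a dataset that already exists may fail. The following is a minimal sketch of a create-if-missing helper, assuming the creation helper's already_exists() method is available in your DSS version. The function name ensure_managed_dataset is hypothetical, and the project handle is passed in rather than looked up inside the function.

```python
def ensure_managed_dataset(project, dataset_name, connection):
    """Create a managed dataset only if it does not exist yet.

    `project` is assumed to be a dataikuapi project handle, e.g.
    dataiku.api_client().get_project(dataiku.default_project_key()).
    Returns True if the dataset was created, False if it was already there.
    """
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    if builder.already_exists():
        return False
    builder.with_store_into(connection)
    builder.create()
    return True


# Usage inside a DSS notebook (dataset and connection names are placeholders):
# import dataiku
# project = dataiku.api_client().get_project(dataiku.default_project_key())
# if not results.empty:
#     ensure_managed_dataset(project, "TEST", "NZ_DSWRK")
#     dataiku.Dataset("TEST").write_with_schema(results)
```

Passing the project handle as an argument keeps the helper independent of how the notebook obtains its API client.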

Hope this helps.

Marlan

RadAniba
Level 2
Author

Thank you @Marlan this is really helpful

Would it be possible to create a dataset without the SQL method? If I understand correctly, this relies on a Postgres connection to store intermediate files, but what if I wanted to test by writing a dataset to disk instead?

Rad

Marlan

Hi @RadAniba,

You can write to any type of SQL database you have a connection set up for. The example I gave included a connection for a Netezza database. Not sure what you mean about storing intermediate tables. We write both intermediate and final output data to Netezza.

To create a file dataset, use the filesystem folders connection in the "with_store_into" method, e.g., builder.with_store_into('filesystem_folders')

Note also that you can pass "overwrite=True" to the create method.
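Putting the two notes above together, a file-based dataset could be created roughly as follows. This is only a sketch: the function name create_file_dataset is hypothetical, and it assumes a connection called 'filesystem_folders' exists on your instance (check the connection name in your own setup).

```python
def create_file_dataset(project, dataset_name, connection="filesystem_folders"):
    """Create (or overwrite) a managed dataset stored on a filesystem connection.

    `project` is assumed to be a dataikuapi project handle; `connection`
    defaults to 'filesystem_folders', which may be named differently
    on your instance.
    """
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into(connection)
    # overwrite=True replaces an existing dataset of the same name
    builder.create(overwrite=True)
    return dataset_name


# Usage inside a DSS notebook (df is the pandas dataframe to save):
# import dataiku
# project = dataiku.api_client().get_project(dataiku.default_project_key())
# name = create_file_dataset(project, "TEST")
# dataiku.Dataset(name).write_with_schema(df)
```

Be careful with overwrite=True on anything other than a test instance, since it replaces an existing dataset without asking.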

Marlan

RadAniba
Level 2
Author

Fantastic, thank you for clarifying, this helped very much.
