Creating dataset from Pandas

RadAniba Registered Posts: 6 ✭✭✭✭
edited July 16 in Using Dataiku

Hello nice people

I am pulling data from a REST API in a Jupyter notebook in DSS and doing a lot of processing on the pandas dataframe I am creating.

I would like to save the dataframe as a dataset that I can later explore within the project I am working in.

I am trying something like:

if not results.empty:
    output_data = dataiku.Dataset(instrument_name + "_" + event_name + "_" + timestr)
    output_data.write_dataframe(results)

But I always run into this error:

Unable to fetch schema for PROJ1.participant_screening_20210623-233350: dataset does not exist: PROJ1.participant_screening_20210623-233350

I tried some other alternatives to write the dataframe into a dataset, but DSS seems to look for a schema with the project name (PROJ1)?

Is there an easy way to get the dataframes into DSS datasets?

PS: this is a test instance, I am not using a database for intermediate datasets but writing to disk for testing purposes.

Thanks

Rad

Answers

  • Marlan
    Marlan Neuron Posts: 319
    edited July 17

    Hi @RadAniba,

    Following is an example of how to create a new dataset in Python and then write a dataframe to it.

    import dataiku
    
    dataset_name = 'TEST'
    
    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    
    # Create a SQL dataset (you can create other types by specifying different parameters for the with_store_into method)
    # Documentation here: https://doc.dataiku.com/dss/latest/python-api/datasets-other.html#programmatic-creation-and-setup-managed-datasets
    # Note that documentation shows project.new_managed_dataset which is incorrect
    builder = project.new_managed_dataset_creation_helper(dataset_name)
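    # "NZ_DSWRK" is the name of a SQL (Netezza) connection on this instance; substitute one of your own connections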
    builder.with_store_into("NZ_DSWRK")
    builder.create() 
    
    # Write dataframe to dataset
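    # 'df' here is the pandas DataFrame you want to save (e.g., your 'results' dataframe)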
    dataiku.Dataset(dataset_name).write_with_schema(df)

    The dataset will show in the UI as not built. You can right-click the dataset and choose "Mark as built" to fix this.

    Note that this example uses both the "external" API (dataikuapi) to create the dataset and the internal API (dataiku) to write the dataframe to it. The differences between the two are covered in more detail in the Dataiku documentation.
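    As a quick illustration (just a sketch of the example above, annotated with which package each object comes from), the split looks like this:

    import dataiku  # "internal" API: reading and writing datasets from notebooks and recipes

    # dataiku.api_client() returns a client from the external/public API (dataikuapi)
    client = dataiku.api_client()                                 # dataikuapi DSSClient
    project = client.get_project(dataiku.default_project_key())   # dataikuapi project handle, used to create the dataset
    dataset = dataiku.Dataset(dataset_name)                       # internal-API handle, used to write the dataframe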

    Hope this helps.

    Marlan

  • RadAniba
    RadAniba Registered Posts: 6 ✭✭✭✭

    Thank you @Marlan, this is really helpful.

    Would it be possible to create a dataset without the SQL method? If I understand correctly, this is based on a Postgres connection to store intermediate files, but what if I wanted to test by writing a dataset to disk instead?

    Rad

  • Marlan
    Marlan Neuron Posts: 319

    Hi @RadAniba,

    You can write to any type of SQL database you have a connection set up for. The example I gave used a connection to a Netezza database. I'm not sure what you mean about storing intermediate tables; we write both intermediate and final output data to Netezza.

    To create a file dataset, use the filesystem folders connection in the "with_store_into" method, e.g., builder.with_store_into('filesystem_folders')

    Note also that you can pass "overwrite=True" to the create method.
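    For example, here is a minimal sketch of the same pattern writing to disk (assuming a filesystem connection named 'filesystem_folders' as above; the connection name on your instance may differ, and 'TEST_FS' is just a placeholder dataset name):

    import dataiku

    dataset_name = 'TEST_FS'

    # Get a handle to the current project
    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    # Create a filesystem-backed managed dataset, overwriting any existing definition
    builder = project.new_managed_dataset_creation_helper(dataset_name)
    builder.with_store_into('filesystem_folders')
    builder.create(overwrite=True)

    # Write the dataframe (e.g., the 'results' dataframe from your notebook) to the new dataset
    dataiku.Dataset(dataset_name).write_with_schema(results)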

    Marlan

  • RadAniba
    RadAniba Registered Posts: 6 ✭✭✭✭

    Fantastic,

    thank you for clarifying, this helped very much
