Automating BigQuery Dataset Creation via Python Notebook

Hello everyone,
I am currently working on a project where I want to automate the transfer of data from a database to BigQuery using a Python notebook in Dataiku. I have many tables involved and I aim to automate the entire process because I don't want to create each dataset manually.
I found information on dataset creation in Dataiku here: Documentation Dataiku - Dataset Creation. However, I am facing difficulties in defining the BigQuery dataset and the table name where the data should be placed.
If anyone has done something similar or has advice on how to specify the dataset and table name in the code, it would be incredibly helpful. Any code examples or additional references would also be greatly appreciated!
Thank you in advance for your help,
Guillaume
import dataiku
client = dataiku.api_client()
project_key = dataiku.default_project_key()
project = client.get_project(project_key)
# Create the managed dataset on the target connection
builder = project.new_managed_dataset("test1")
builder.with_store_into(connection="XXX")
dataset = builder.create(overwrite=True)
Operating system used: Windows 10
Best Answer
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
Here is some sample code changing some Dataset settings:
import dataiku
client = dataiku.api_client()
project = client.get_project("project")
dataset = project.get_dataset("dataset")
dataset_settings = dataset.get_settings()
# Edit the raw settings dict in place, then save to persist the change
dataset_settings.get_raw()['flowOptions']['rebuildBehavior'] = 'WRITE_PROTECT'
dataset_settings.save()
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
I would imagine you set these using get_creation_settings() and get_settings(). The easiest way to see how they should be set is to create a new managed dataset manually, inspect its settings via the API, and then replicate those settings for the datasets you create programmatically.
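As a rough sketch of the "inspect, then replicate" idea: copy the params block from the manually created dataset's raw settings and override only the table location. The helper name clone_params is hypothetical, and the 'schema' and 'table' keys are an assumption about how SQL-type datasets store their location, so check them against what get_raw() returns for your own BigQuery dataset.

```python
import copy


def clone_params(template_raw, target_raw, table, schema=None):
    """Copy 'params' from a template's raw settings into target_raw.

    Both arguments are plain dicts, as returned by get_settings().get_raw().
    The 'table' and 'schema' key names are assumptions; verify them against
    a dataset you created manually.
    """
    # Deep copy so the template's settings are never modified
    params = copy.deepcopy(template_raw.get("params", {}))
    params["table"] = table
    if schema is not None:
        params["schema"] = schema  # assumed to map to the BigQuery dataset
    target_raw["params"] = params
    return target_raw


# Possible usage inside DSS (not executed here; names are placeholders):
#   template = project.get_dataset("TEST_GJA").get_settings().get_raw()
#   settings = project.get_dataset("test1").get_settings()
#   clone_params(template, settings.get_raw(), table="test1", schema="my_bq_dataset")
#   settings.save()
```

Because get_raw() hands back the settings dict itself, editing it in place and then calling save() should be enough to persist the change.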
-
Hello,
Thanks for your suggestion. I attempted to follow your advice by inspecting the settings of a manually created dataset via the API. However, I encountered an error using the get_settings() method:
import dataiku
mydataset = dataiku.Dataset("TEST_GJA")
df = mydataset.get_dataframe()
mydataset.get_settings()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[35], line 1
----> 1 mydataset.get_settings()
AttributeError: 'Dataset' object has no attribute 'get_settings'
I manually created a dataset as suggested but am struggling to retrieve its settings programmatically.
Thx
Guillaume
-
Hello,
Thanks to the guidance I received earlier, I've managed to access the settings of a dataset using the Dataiku API. However, I'm currently facing challenges in defining the 'schema' parameter for an existing dataset. Here’s the code I’ve been using:
import dataiku
from dataiku import api_client
client = api_client()
project = client.get_project("PYTHONSANDBOX")
dataset = project.get_dataset("TEST_GJA")
settings = dataset.get_settings()
I understand that modifying dataset schemas generally involves using the settings object. However, I'm unsure how to set the 'schema' property correctly for an existing dataset. If anyone has experience or examples of how to update the schema via the API, your assistance would be greatly valued.
Thank you in advance for your help!
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron
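dataiku.Dataset() is a different class and has different methods: dataiku.Dataset vs dataikuapi.dss.dataset.DSSDataset (the class returned by project.get_dataset()). get_settings() is only available on the latter.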
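A minimal sketch of setting the table location through the project-level handle rather than dataiku.Dataset(). The function name set_table_location is hypothetical, and the 'schema' and 'table' keys under 'params' are assumptions that should be confirmed by inspecting a manually created BigQuery dataset. The project argument is expected to be what client.get_project(...) returns.

```python
def set_table_location(project, dataset_name, table, schema):
    """Point an existing dataset at a given schema/table and save it.

    `project` is a project handle from dataiku.api_client().get_project(...).
    The 'schema' and 'table' key names are assumptions, not confirmed here.
    """
    # project.get_dataset() returns the API handle that exposes get_settings()
    dataset = project.get_dataset(dataset_name)
    settings = dataset.get_settings()

    # get_raw() returns the settings dict itself, so edits are made in place
    params = settings.get_raw().setdefault("params", {})
    params["schema"] = schema  # assumed to map to the BigQuery dataset
    params["table"] = table

    settings.save()
    return params


# Possible usage inside DSS (not executed here; names are placeholders):
#   import dataiku
#   project = dataiku.api_client().get_project("PYTHONSANDBOX")
#   set_table_location(project, "TEST_GJA", table="TEST_GJA", schema="my_bq_dataset")
```

To cover many tables, the same function could be called in a loop right after each managed dataset is created.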
-
Thanks so much! Your code worked perfectly for setting the parameters, including the schema. I really appreciate your help!