
Modify PostgreSQL table schema

KT
Level 1

Hi,

I want to rename the columns of a very wide table (500+ columns and 1 million rows).

Below are the two approaches I have attempted; both failed with the exact same error message:

" Invalid argument

An invalid argument has been encountered : Column name not matching between query and schema (index, expected index_changed) SELECT * FROM "target_df" ORDER BY "Date" ASC NULLS FIRST LIMIT 10000. Please check dataset schema."
1. I followed the instructions in this link and created the code below:

import dataiku

# connect to the DSS API and get a handle on the current project
client = dataiku.api_client()
project = client.get_default_project()

datasets = project.list_datasets()

# Returns a list of DSSDatasetListItem
for dataset in datasets:
        # Quick access to main information in the dataset list item
        print("Name: %s" % dataset.name)
        print("Type: %s" % dataset.type)
        print("Connection: %s" % dataset.connection)
        print("Tags: %s" % dataset.tags) # Returns a list of strings

        # You can also use the list item as a dict of all available dataset information
        print("Raw: %s" % dataset)

# identify the desired dataset and load its settings
dataset = project.get_dataset("target_df")
settings = dataset.get_settings()

# print out the current column schema
for column in settings.schema_columns:
        print("Have column name=%s type=%s" % (column["name"], column["type"]))


curre_schema = settings.schema_columns

## make changes to the first column
curre_schema[0]['name'] = 'index_changed'
print(settings.schema_columns) #printed "index_changed"

#save new schema back into the dataset
settings.save()

When I went and checked the data table in another window, I got the error mentioned above. Based on the error message, the schema was modified, but I lost the entire dataset in doing this.

2. I followed this link as well:

# -*- coding: utf-8 -*-
import dataiku
import dataikuapi
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
target_dt = dataiku.Dataset("target_df")
target_dt_df = target_dt.get_dataframe()

#check column names in target_dt_df
target_dt_df.columns # first column is "index"

#change column name for the first column
target_dt_df.rename({'index':'index_changed'}, axis = 1, inplace = True)

# check to ensure the rename has been applied
target_dt_df.columns ## first column is now "index_changed"

#reduced dataframe size as I only care to change the schema
target_dt_df = target_dt_df.iloc[1:3]  

#reflect the latest schema back to the original "target_df"
target_dt.write_schema_from_dataframe(target_dt_df)


I am really confused about why I am only able to change the schema locally and why the change does not apply back to the PostgreSQL copy (which is what I really want).

 

Any help and insight is very much appreciated.

 

Thank you so much for your time and thoughts.

Regards,

KT

4 Replies
fchataigner2
Dataiker

Hi,

In DSS, a PostgreSQL dataset is merely a "view" of the actual table in the database. Changing column names in the dataset, whether in the UI or via the Python API, only changes DSS's definition of the dataset; the changes are not magically translated into column modifications in the database.

If your dataset is the output of some recipe in DSS, i.e. if DSS manages the dataset, then modifying the recipe that produces the dataset so that it uses the new column names, and re-running it, should be all that's needed.

If your dataset is an input to your flow, then you need to change the column names in the database and re-sync the schema in the dataset's Settings > Connection tab. If you need to script the changes, you can leverage query_to_df() to issue "ALTER TABLE ... RENAME COLUMN ..." statements against the database (despite its name, the method can run statements that return no data; just remember to add a COMMIT as a post-query, since PostgreSQL has transactional DDL).
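For illustration, here is a minimal sketch of that approach. It is a sketch only: it assumes the connection is named "postgres" and that the underlying table and column really are named "target_df" and "index"; adjust all names to your setup.

from dataiku import SQLExecutor2

# run the DDL directly against the PostgreSQL connection backing the dataset
executor = SQLExecutor2(connection="postgres")  # replace with your connection name

# query_to_df() can run statements that return no rows; the COMMIT post-query
# is needed because PostgreSQL DDL is transactional
executor.query_to_df(
    'ALTER TABLE "target_df" RENAME COLUMN "index" TO "index_changed"',
    post_queries=["COMMIT"]
)

Once the rename has gone through, re-sync the dataset's schema from its Settings > Connection tab (or update the column name in the dataset schema) so that DSS and the database agree again.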

KT
Level 1
Author

Thank you very much for your insight. I assure you that I am not looking for anything magical here; perhaps I was misguided by the Dataiku documentation.

If one follows the literal instructions given in the Dataiku documentation, it is never mentioned that we are merely changing the view.

It states that "the schema of the dataframe is used to modify the schema of the output dataset". Perhaps this is something I would insist that Dataiku document more clearly, to properly express the true behavior of these functions. I was able to modify the schema, but I lost all the data due to this mismatch. If I lose the entire dataset after modifying the schema, what is the practical use case that justifies this?

[Attachment: write_with_schema_documentation.jpg — screenshot of the write_with_schema documentation]

Regarding your proposal, the first option is still not practical: it is a very large data table, and it does not make sense to load the entire table into the local system just to modify the table schema. I am simply interested in a solution where I can change the table schema of a very large dataset without loading the entire table.

I am very thankful that you pointed out query_to_df(). I will have to take another day out of my work to give this a try, as there is very limited documentation and there are few examples for me to follow in what is currently available from Dataiku. Wish me luck!

Humbly,

CoreyS
Community Manager

@KT thank you for your feedback, and we apologize for any frustration that may have been caused. I reached out to you directly with some more info; hopefully that will help you begin to find a solution. Thanks again for your reply and for utilizing the Dataiku Community.

SarinaS
Dataiker

Hi @KT,

To follow up on the query_to_df() suggestion, here's an example of executing an ALTER TABLE statement with it. I agree that it would be helpful to have additional examples in the documentation for more use cases.

I have a tiny table compared to your 500 columns, but just for a brief example, here's my original Postgres table: 

sarina=# \d+ "MY_TABLE"
                   Table "public.MY_TABLE"
  Column  | Type | Collation | Nullable | Default | Storage  | Stats target | Description 
----------+------+-----------+----------+---------+----------+--------------+-------------
 new_name | text |           |          |         | extended |              | 
 age      | text |           |          |         | extended |              | 

 

In a DSS Python notebook, I do the following, where my database connection is named "postgres": 

[Attachment: Screen Shot 2021-05-04 at 4.20.16 PM.png — notebook cell running the ALTER TABLE ... RENAME COLUMN statements with query_to_df()]
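The cell itself is not reproduced here, but based on the connection name and the before/after listings it was presumably along these lines (a sketch, not the exact code from the screenshot):

from dataiku import SQLExecutor2

# execute DDL on the "postgres" connection mentioned above
executor = SQLExecutor2(connection="postgres")

# rename both columns; the COMMIT post-query makes the transactional DDL stick
for old_name, new_name in [("new_name", "better_name"), ("age", "better_age")]:
    executor.query_to_df(
        'ALTER TABLE "MY_TABLE" RENAME COLUMN "%s" TO "%s"' % (old_name, new_name),
        post_queries=["COMMIT"]
    )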

Now in my database, if I check the table again: 

sarina=# \d+ "MY_TABLE"
                     Table "public.DKU_HAIKU_STARTER_1_name_agenew_copy"
   Column    | Type | Collation | Nullable | Default | Storage  | Stats target | Description 
-------------+------+-----------+----------+---------+----------+--------------+-------------
 better_name | text |           |          |         | extended |              | 
 better_age  | text |           |          |         | extended |              | 

 
If you could use any other help executing this for your use case, please let us know where you get stuck and we can take a look! 

Thanks,
Sarina
