Change column names with Plugin

gblack686
Level 4

I have my final dataset, but I want to change the column names based on a naming convention at the end of the code. Right now I'm using

output_dataset.write_schema(columns, dropAndCreate=True)

There are no errors in the code, but when I go to explore the data, I get 'the relation does not exist'.

Going to Settings > 'Test Connection' and 'Create Table Now' works fine, but the table then has 0 records with the correct column names.

Is there a better function to do this?

Alex_Combessie
Dataiker Alumni

Hi,

Are you using the write_dataframe method after write_schema? https://doc.dataiku.com/dss/latest/python-api/datasets.html#dataiku.Dataset.write_dataframe
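For illustration, a minimal sketch of that pattern (dataset and column names below are placeholders for your own) could look like this:

import dataiku

output_dataset = dataiku.Dataset("my_output")  # placeholder name

# Build the dataframe you want to write, with columns renamed to your convention
df = dataiku.Dataset("my_input").get_dataframe()
df = df.rename(columns={"old_name": "new_name"})

# write_with_schema() sets the output schema from the dataframe and writes the rows;
# equivalently, call write_schema(...) followed by write_dataframe(df)
output_dataset.write_with_schema(df)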

Hope it helps,

Alex

gblack686
Level 4
Author

The query has the potential of returning hundreds of millions of rows. Knowing this, I was avoiding loading into DataFrames (that requires loading the data through DSS memory, right?).

Alex_Combessie
Dataiker Alumni

Hi,

If you do not want to go through pandas.DataFrame() then you can use another part of the Dataiku API:

- SQLExecutor2: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.SQLExecutor2.exec_recipe_fra...

- HiveExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.HiveExecutor.exec_recipe_fra...

- ImpalaExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.ImpalaExecutor.exec_recipe_f...

This is assuming your input and output datasets are in SQL or HDFS connections. Otherwise, you will be required to go through the "regular" dataiku API using pandas.DataFrame. Note that you can do chunked reading and writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
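For example, a minimal sketch with SQLExecutor2 (dataset, table, and column names below are placeholders for your own) could look like this:

import dataiku
from dataiku import SQLExecutor2

input_dataset = dataiku.Dataset("my_input")
output_dataset = dataiku.Dataset("my_output")

executor = SQLExecutor2(dataset=input_dataset)

# Rename columns with SQL aliases, so the data never goes through DSS memory;
# "my_input_table" stands for the SQL table backing the input dataset
query = """
    SELECT
        old_col_1 AS new_col_1,
        old_col_2 AS new_col_2
    FROM my_input_table
"""

executor.exec_recipe_fragment(output_dataset, query)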

Hope it helps,

Alex

omri17
Level 2

Hi Alex,
from your referenced docs, I saw a difference between SQLExecutor2 and HiveExecutor (using DSS 7.0; I didn't find HiveExecutor2, if it exists...).

How can I define an output dataset for HiveExecutor.exec_recipe_fragment()? It seems to be missing the 'output_dataset' argument.


from the docs:

HiveExecutor.exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)

SQLExecutor2.exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)

thanks,

Omri

Alex_Combessie
Dataiker Alumni

Hi,

Apologies for the typo, there is no HiveExecutor2, just HiveExecutor. I have amended my previous post to avoid any confusion.

For the HiveExecutor.exec_recipe_fragment method, no need to specify the output dataset as argument. It will automatically guess it from the configuration of the recipe. You can write a SELECT ... statement in the query parameter, and it will behave like a Hive recipe.
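As an illustration, a minimal sketch inside such a recipe (dataset, table, and column names are placeholders for your own) could look like this:

import dataiku
from dataiku import HiveExecutor

input_dataset = dataiku.Dataset("my_input")

executor = HiveExecutor(dataset=input_dataset)

# The output dataset is inferred from the recipe configuration, so only the
# SELECT statement is passed; "my_input" stands for the Hive table of the input
executor.exec_recipe_fragment("SELECT old_col AS new_col FROM my_input")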

Hope it helps,

Alex

omri17
Level 2

Hi Alex,

Thanks a lot - it partially worked.

My original problem (prior to your first answer) was when I tried a Python recipe that writes 2 (or more) output datasets, one (or more) of them using HiveExecutor.exec_recipe_fragment.

When I tried it I got an error:

"Oops: an unexpected error occurred
Error in Python process: At line 27: <class 'Exception'>: Query failed: b'Query has multiple targets but no insert statement'
Please see our options for getting help"

When I changed my Python recipe to have only 1 output dataset, it worked (thanks!).

So I guess my new question is: how can I write 2 (or more) output datasets, one (or more) of them using HiveExecutor.exec_recipe_fragment?

I tried to pass the 'dataset' argument when I built the executor, without any luck. Something like:

executor = HiveExecutor(dataset='dataset_name')

executor.exec_recipe_fragment(query)

but I still got the same error.

Can I do it (write 2 output datasets)?

Thanks,

Omri

Alex_Combessie
Dataiker Alumni

Hi,

As of today, HiveExecutor.exec_recipe_fragment supports only one output dataset. I have logged the request to handle multiple output datasets in our development backlog.

In the meantime, I would advise several alternatives:

- Use HiveExecutor.query_to_df and then dataiku.Dataset.write_with_schema (see the sketch after this list)

- Use ImpalaExecutor (assuming you have Impala)

- Use PySpark: https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
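
For the first alternative, a minimal sketch (dataset names and the query are placeholders, and keep in mind this path loads the query result into memory as a pandas DataFrame) could look like this:

import dataiku
from dataiku import HiveExecutor

executor = HiveExecutor(dataset=dataiku.Dataset("my_input"))

# Run the query and get the result as a pandas DataFrame
df = executor.query_to_df("SELECT old_col AS new_col FROM my_input")

# Write it to the second output dataset, setting its schema from the dataframe
dataiku.Dataset("my_second_output").write_with_schema(df)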

Hope it helps,

Alex

omri17
Level 2

Thanks,
We will check it out.

 

Omri
