Change column names with Plugin

gblack686
Level 4

I have my final dataset, but I want to change the column names based on a naming convention at the end of the code. Right now I'm using

output_dataset.write_schema(columns, dropAndCreate=True)

There are no errors in the code, but when I go to explore the data, I get 'the relation does not exist'.

Going to Settings > 'Test Connection' and 'Create Table Now' works fine, but the table then has 0 records with the correct column names.

Is there a better function to do this?

Alex_Combessie
Dataiker Alumni

Hi,

Are you using the write_dataframe method after write_schema? https://doc.dataiku.com/dss/latest/python-api/datasets.html#dataiku.Dataset.write_dataframe
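For illustration, a minimal sketch of that pattern (dataset and column names below are placeholders for your own) could look like this:

import dataiku

output_dataset = dataiku.Dataset("my_output")  # placeholder name

# Build the dataframe you want to write, with columns renamed to your convention
df = dataiku.Dataset("my_input").get_dataframe()
df = df.rename(columns={"old_name": "new_name"})

# write_with_schema() sets the output schema from the dataframe and writes the rows;
# equivalently, call write_schema(...) followed by write_dataframe(df)
output_dataset.write_with_schema(df)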

Hope it helps,

Alex

gblack686
Level 4
Author

The query has the potential of returning hundreds of millions of rows. Knowing this, I was avoiding loading into DataFrames (that requires loading the data through DSS memory, right?).

Alex_Combessie
Dataiker Alumni

Hi,

If you do not want to go through pandas.DataFrame() then you can use another part of the Dataiku API:

- SQLExecutor2: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.SQLExecutor2.exec_recipe_fra...

- HiveExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.HiveExecutor.exec_recipe_fra...

- ImpalaExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.ImpalaExecutor.exec_recipe_f...

This is assuming your input and output datasets are in SQL or HDFS connections. Otherwise, you will be required to go through the "regular" dataiku API using pandas.DataFrame. Note that you can do chunked reading and writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
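For example, a minimal sketch with SQLExecutor2 (dataset, table, and column names below are placeholders for your own) could look like this:

import dataiku
from dataiku import SQLExecutor2

input_dataset = dataiku.Dataset("my_input")
output_dataset = dataiku.Dataset("my_output")

executor = SQLExecutor2(dataset=input_dataset)

# Rename columns with SQL aliases, so the data never goes through DSS memory;
# "my_input_table" stands for the SQL table backing the input dataset
query = """
    SELECT
        old_col_1 AS new_col_1,
        old_col_2 AS new_col_2
    FROM my_input_table
"""

executor.exec_recipe_fragment(output_dataset, query)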

Hope it helps,

Alex

omri17
Level 2

Hi Alex,
from your referenced docs, I saw a difference between SQLExecutor2 and HiveExecutor (using DSS 7.0; I didn't find HiveExecutor2, if it exists...).

How can I define an output dataset for HiveExecutor.exec_recipe_fragment()? It seems to be missing the 'output_dataset' argument.


from the docs:

HiveExecutor.exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)

SQLExecutor2.exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)

thanks,

Omri

Alex_Combessie
Dataiker Alumni

Hi,

Apologies for the typo, there is no HiveExecutor2, just HiveExecutor. I have amended my previous post to avoid any confusion.

For the HiveExecutor.exec_recipe_fragment method, no need to specify the output dataset as argument. It will automatically guess it from the configuration of the recipe. You can write a SELECT ... statement in the query parameter, and it will behave like a Hive recipe.
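As an illustration, a minimal sketch inside such a recipe (dataset, table, and column names are placeholders for your own) could look like this:

import dataiku
from dataiku import HiveExecutor

input_dataset = dataiku.Dataset("my_input")

executor = HiveExecutor(dataset=input_dataset)

# The output dataset is inferred from the recipe configuration, so only the
# SELECT statement is passed; "my_input" stands for the Hive table of the input
executor.exec_recipe_fragment("SELECT old_col AS new_col FROM my_input")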

Hope it helps,

Alex

omri17
Level 2

Hi Alex,

Thanks a lot - it partially worked.

My original problem (prior to your first answer) was when I tried a Python recipe that writes 2 (or more) output datasets, one (or more) of them using HiveExecutor.exec_recipe_fragment.

When I tried it I got an error:

"Oops: an unexpected error occurred
Error in Python process: At line 27: <class 'Exception'>: Query failed: b'Query has multiple targets but no insert statement'
Please see our options for getting help"

When I changed my Python recipe to have only 1 output dataset, it worked (thanks!).

So I guess my new question is: how can I write 2 (or more) output datasets, one (or more) of them using HiveExecutor.exec_recipe_fragment?

I tried to pass the 'dataset' argument when I built the executor, without any luck. Something like:

executor = HiveExecutor(dataset='dataset_name')

executor.exec_recipe_fragment(query)

but I still got the same error.

Can I do it (write 2 output datasets)?

Thanks,

Omri

Alex_Combessie
Dataiker Alumni

Hi,

As of today, HiveExecutor.exec_recipe_fragment supports only one output dataset. I have logged the request to handle multiple output datasets in our development backlog.

In the meantime, I would advise several alternatives:

- Use HiveExecutor.query_to_df and then dataiku.Dataset.write_with_schema (see the sketch after this list)

- Use ImpalaExecutor (assuming you have Impala)

- Use PySpark: https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
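
For the first alternative, a minimal sketch (dataset names and the query are placeholders, and keep in mind this path loads the query result into memory as a pandas DataFrame) could look like this:

import dataiku
from dataiku import HiveExecutor

executor = HiveExecutor(dataset=dataiku.Dataset("my_input"))

# Run the query and get the result as a pandas DataFrame
df = executor.query_to_df("SELECT old_col AS new_col FROM my_input")

# Write it to the second output dataset, setting its schema from the dataframe
dataiku.Dataset("my_second_output").write_with_schema(df)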

Hope it helps,

Alex

omri17
Level 2

Thanks,
We will check it out.

 

Omri
