Change column names with Plugin
I have my final dataset, but I want to change the column names based on a naming convention at the end of the code. Right now I'm using:
output_dataset.write_schema(columns, dropAndCreate=True)
There are no errors in the code, but when I go to explore the data I get 'the relation does not exist'.
After Settings > 'Test Connection' and 'Create Table Now', it works fine... but the table has 0 records with the correct column names.
Is there a better function to do this?
Answers
-
Hi,
Are you using the write_dataframe method after write_schema? https://doc.dataiku.com/dss/latest/python-api/datasets.html#dataiku.Dataset.write_dataframe
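For reference, a minimal sketch of that pattern could look like this (the dataset and column names below are placeholders for illustration):
import dataiku

input_ds = dataiku.Dataset("my_input")
output_ds = dataiku.Dataset("my_output")

df = input_ds.get_dataframe()
df = df.rename(columns={"OLD_COL": "new_col"})  # apply your naming convention

# write_schema() alone only (re)creates an empty table; the rows still have to be written,
# e.g. with write_dataframe() right after it, or with both steps combined in one call:
output_ds.write_with_schema(df)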
Hope it helps,
Alex
-
The query has the potential to return hundreds of millions of rows. Knowing this, I was avoiding loading into DataFrames (that requires loading the data through DSS memory, right?).
-
Hi,
If you do not want to go through pandas.DataFrame(), then you can use another part of the Dataiku API:
- SQLExecutor2: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.SQLExecutor2.exec_recipe_fragment
- HiveExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.HiveExecutor.exec_recipe_fragment
- ImpalaExecutor: https://doc.dataiku.com/dss/latest/python-api/sql.html#dataiku.core.sql.ImpalaExecutor.exec_recipe_fragment
This is assuming your input and output datasets are in SQL or HDFS connections. Otherwise, you will be required to go through the "regular" dataiku API using pandas.DataFrame. Note that you can do chunked reading and writing: https://doc.dataiku.com/dss/latest/python-api/datasets.html#chunked-reading-and-writing-with-pandas
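For instance, with a SQL output dataset, a minimal sketch using SQLExecutor2 could look like this (dataset, table and column names are placeholders):
from dataiku import Dataset
from dataiku.core.sql import SQLExecutor2

# To be run inside a Python recipe whose output is a SQL dataset
output_dataset = Dataset("my_output")

renaming_query = """
SELECT old_col_1 AS new_col_1,
       old_col_2 AS new_col_2
FROM my_input_table
"""

executor = SQLExecutor2(dataset="my_input")
# The SELECT is executed in-database and its result fills the output dataset,
# so the rows never go through the DSS Python process.
executor.exec_recipe_fragment(output_dataset, renaming_query)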
Hope it helps,
Alex
-
Hi Alex,
From your referenced docs, I saw a difference between SQLExecutor2 and HiveExecutor (using DSS 7.0; I didn't find a HiveExecutor2, if it exists...). How can I define an output dataset for HiveExecutor.exec_recipe_fragment()? It seems to be missing the 'output_dataset' argument.
From the docs:
HiveExecutor.exec_recipe_fragment(query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False, metastore_handling=None, extra_conf={}, add_dku_udf=False)
SQLExecutor2.exec_recipe_fragment(output_dataset, query, pre_queries=[], post_queries=[], overwrite_output_schema=True, drop_partitioned_on_schema_mismatch=False)
thanks,
Omri
-
Hi,
Apologies for the typo, there is no HiveExecutor2, just HiveExecutor. I have amended my previous post to avoid any confusion.
For the HiveExecutor.exec_recipe_fragment method, there is no need to specify the output dataset as an argument. It will automatically guess it from the configuration of the recipe. You can write a SELECT ... statement in the query parameter, and it will behave like a Hive recipe.
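In other words, something like this (placeholder names, inside a Python recipe with a single HDFS output dataset):
from dataiku.core.sql import HiveExecutor

renaming_query = """
SELECT old_col_1 AS new_col_1,
       old_col_2 AS new_col_2
FROM my_input_table
"""

executor = HiveExecutor(dataset="my_input_dataset")
# No output dataset argument: it is inferred from the recipe configuration,
# and the SELECT behaves like a Hive recipe.
executor.exec_recipe_fragment(renaming_query)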
Hope it helps,
Alex
-
Hi Alex,
Thanks a lot - it works partially.
My original problem (prior to your first answer) was that I tried a Python recipe writing 2 (or more) output datasets, with one (or more) of them using HiveExecutor.exec_recipe_fragment.
When I tried it, I got an error:
"Oops: an unexpected error occurred
Error in Python process: At line 27: <class 'Exception'>: Query failed: b'Query has multiple targets but no insert statement'
Please see our options for getting help"
When I changed my Python recipe to have only 1 output dataset, it really worked (thanks!).
So I guess my new question is: how can I write 2 (or more) output datasets, with one (or more) of them using HiveExecutor.exec_recipe_fragment?
I tried to pass the 'dataset' argument when I built the executor, without any luck. Something like:
executor = HiveExecutor(dataset='dataset_name')
executor.exec_recipe_fragment(query)
but I still got the same error.
Can I do it (write 2 output datasets)?
Thanks,
Omri
-
Hi,
As of today, HiveExecutor.exec_recipe_fragment supports only one output dataset. I have logged the request to handle multiple output datasets in our development backlog.
In the meantime, I would advise several alternatives:
- Use HiveExecutor.query_to_df and then dataiku.Dataset.write_with_schema
- Use ImpalaExecutor (assuming you have Impala)
- Use PySpark: https://doc.dataiku.com/dss/latest/code_recipes/pyspark.html
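For the first alternative, a minimal sketch (placeholder names) could be:
from dataiku import Dataset
from dataiku.core.sql import HiveExecutor

executor = HiveExecutor(dataset="my_input_dataset")

# query_to_df pulls the result into a pandas DataFrame in memory,
# so it is only practical when the result fits in memory.
df = executor.query_to_df("SELECT old_col AS new_col FROM my_input_table")

Dataset("my_second_output").write_with_schema(df)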
Hope it helps,
Alex
-
Thanks,
We will check it out.
Omri