Problem when trying to create a new dataset in project with a Hive table (from Python API)

esteban23 Registered Posts: 4 ✭✭✭✭
edited July 16 in Using Dataiku

Hi! I'd like to import a Hive table into a project (from a notebook) using Dataiku's Python API. The idea is to replicate the process done through the UI (which is successful, as you can see in the picture below):


After doing it through the UI, the table appears as a dataset on the project's 'Datasets' page (this is what I need).

However, when I try to do the same process on a notebook I get an error. I have tried two approaches:

1) First approach:

import dataiku

client = dataiku.api_client()
project = client.get_project('MYPROJECT')

# Declare which Hive table(s) to import
import_definition = project.init_tables_import()
import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")

# Prepare and run the import
prepared_import = import_definition.prepare()
future = prepared_import.execute()

import_result = future.wait_for_result()

Gives the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-52-432f67d71601> in <module>
      2 import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")
----> 4 prepared_import = import_definition.prepare()
      5 future = prepared_import.execute()

~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/ in prepare(self)
   1328         future = self.client.get_future(ret["jobId"])
-> 1329         future.wait_for_result()
   1330         return TablesPreparedImport(self.client, self.project_key, future.get_result())

~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/ in wait_for_result(self)
     73         Wait and get the future result
     74         """
---> 75         if self.state.get('hasResult', False):
     76             return self.result_wrapper(self.state.get('result', None))
     77         if self.state is None or not self.state.get('hasResult', False) or self.state_is_peek:

AttributeError: 'NoneType' object has no attribute 'get'
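The AttributeError means the future's `state` attribute is still `None` when `wait_for_result()` dereferences it; the `None` check on the following line comes too late. A minimal, self-contained reproduction of the pattern (stub classes for illustration, not the real dataikuapi code), plus the kind of guard that avoids it:

```python
class BuggyFuture:
    """Stub reproducing the failure pattern; NOT the real dataikuapi class."""
    def __init__(self, state=None):
        self.state = state  # stays None until a state has been fetched

    def wait_for_result(self):
        # Dereferences state before checking it for None -> AttributeError
        if self.state.get('hasResult', False):
            return self.state.get('result')


class GuardedFuture(BuggyFuture):
    """Same stub with the None check moved before the dereference."""
    def wait_for_result(self):
        if self.state is not None and self.state.get('hasResult', False):
            return self.state.get('result')
        return None


try:
    BuggyFuture().wait_for_result()
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'get'

print(GuardedFuture({'hasResult': True, 'result': 42}).wait_for_result())  # 42
```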

2) Second approach:

from dataiku.core.sql import SQLExecutor2

# Build the dataset where the result of the query will be stored
builder = project.new_managed_dataset_creation_helper("temp_dataset")
builder.with_store_into("hdfs_connection", format_option_id="PARQUET_HIVE")
dataset = builder.create()

executor = SQLExecutor2(connection="referenciales")
executor.exec_recipe_fragment(dataset, "select * from sbl_tipo_identificacion", overwrite_output_schema=True)

When trying this, the following error is printed:

TypeError                                 Traceback (most recent call last)
<ipython-input-33-13217d6dde60> in <module>
      3 #output_dataset = dataiku.Dataset("temp_outputDataset2")
----> 4 SQLExecutor2.exec_recipe_fragment(output_dataset, streamed_query)

~/Dataiku_install/dataiku-dss-8.0.4/python/dataiku/core/ in exec_recipe_fragment(output_dataset, query, pre_queries, post_queries, overwrite_output_schema, drop_partitioned_on_schema_mismatch)
    181             data={
    182                 "outputDataset": output_dataset.full_name,
--> 183                 "activityId" : spec["currentActivityId"],
    184                 "query" : query,
    185                 "preQueries" : json.dumps(pre_queries),

TypeError: 'NoneType' object is not subscriptable

Does anyone know why these errors occur, or perhaps some other methods to try? Thanks!


  • fchataigner2 Dataiker Posts: 355 Dataiker


    For the first issue, you should update to 9.0.4 to get the fix for this bug. The second issue comes from using exec_recipe_fragment, which can only be used in a recipe (as the name implies). Also, even if it had worked, your code would have extracted the table from Hive and reloaded it into another dataset, effectively duplicating the data; this is probably not what you're looking for.
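    If the goal in a notebook is just to read the table rather than register it as a dataset, `SQLExecutor2.query_to_df` runs the query and returns a pandas DataFrame without needing a recipe. A minimal sketch (the connection name is taken from the post and assumed to be a valid SQL/Hive connection; only the query-building helper runs outside DSS):

    ```python
    def select_all(database, table):
        # Build the fully-qualified SELECT used below (assumes simple identifiers)
        return 'SELECT * FROM `{}`.`{}`'.format(database, table)

    # In a DSS notebook (not run here), this reads the result into pandas
    # instead of writing a dataset, so it works outside a recipe:
    # from dataiku.core.sql import SQLExecutor2
    # executor = SQLExecutor2(connection="referenciales")  # connection name from the post
    # df = executor.query_to_df(select_all("referenciales", "sbl_tipo_identificacion"))
    ```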

  • esteban23 Registered Posts: 4 ✭✭✭✭

    Hi! Got it. Thanks.

    However, it's very unlikely that I'll get to use version 9: I'm not the admin, so I can't run the update. Do you know any other way to solve this problem?
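
    One possible workaround on 8.x (a sketch, not verified against 8.0.4): copy the definition of a dataset that was already imported through the UI with `get_definition()`, swap in the target table, and register it with `project.create_dataset()`. The param keys `database` and `table` below are assumptions; inspect the `get_definition()` output on your own instance to confirm the actual keys:

    ```python
    import copy

    def hive_table_params(template_params, database, table):
        """Build params for a new Hive dataset by copying a UI-created
        template's connection settings and swapping in the target table.
        The "database"/"table" key names are assumptions; verify them
        against get_definition() output on your instance."""
        params = copy.deepcopy(template_params)
        params["database"] = database
        params["table"] = table
        return params

    # Usage against a live DSS instance (not run here):
    # import dataikuapi
    # client = dataikuapi.DSSClient("https://DSS_HOST:PORT", "API_KEY")  # hypothetical host/key
    # project = client.get_project("MYPROJECT")
    # template = project.get_dataset("dataset_imported_via_ui").get_definition()
    # params = hive_table_params(template["params"], "referenciales", "sbl_tipo_identificacion")
    # project.create_dataset("sbl_tipo_identificacion", template["type"], params=params,
    #                        formatType=template.get("formatType"),
    #                        formatParams=template.get("formatParams", {}))
    ```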
