Hi! I'd like to import a Hive table into a project from a notebook, using Dataiku's Python API. The idea is to replicate the process done through the UI (which is successful, as you can see in the picture below):
After importing it through the UI, the table appears as a dataset on the project's 'Dataset' page (this is what I need).
However, when I try to do the same thing from a notebook, I get an error. I have tried two approaches:
1) First approach:
import dataiku
client = dataiku.api_client()
project = client.get_project('MYPROJECT')
import_definition = project.init_tables_import()
import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")
prepared_import = import_definition.prepare()
future = prepared_import.execute()
import_result = future.wait_for_result()
This gives the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-52-432f67d71601> in <module>
2 import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")
3
----> 4 prepared_import = import_definition.prepare()
5 future = prepared_import.execute()
6
~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/project.py in prepare(self)
1327
1328 future = self.client.get_future(ret["jobId"])
-> 1329 future.wait_for_result()
1330 return TablesPreparedImport(self.client, self.project_key, future.get_result())
1331
~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/future.py in wait_for_result(self)
73 Wait and get the future result
74 """
---> 75 if self.state.get('hasResult', False):
76 return self.result_wrapper(self.state.get('result', None))
77 if self.state is None or not self.state.get('hasResult', False) or self.state_is_peek:
AttributeError: 'NoneType' object has no attribute 'get'
2) Second approach:
from dataiku.core.sql import SQLExecutor2
# Build the dataset where the result of the query will be stored.
builder = project.new_managed_dataset_creation_helper("temp_dataset")
builder.with_store_into("hdfs_connection", format_option_id="PARQUET_HIVE")
dataset = builder.create()
executor = SQLExecutor2(connection="referenciales")
executor.exec_recipe_fragment(temp_dataset, "select * from sbl_tipo_identificacion", overwrite_output_schema=True)
When trying this, the following error is printed:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-13217d6dde60> in <module>
2
3 #output_dataset = dataiku.Dataset("temp_outputDataset2")
----> 4 SQLExecutor2.exec_recipe_fragment(output_dataset, streamed_query)
~/Dataiku_install/dataiku-dss-8.0.4/python/dataiku/core/sql.py in exec_recipe_fragment(output_dataset, query, pre_queries, post_queries, overwrite_output_schema, drop_partitioned_on_schema_mismatch)
181 data={
182 "outputDataset": output_dataset.full_name,
--> 183 "activityId" : spec["currentActivityId"],
184 "query" : query,
185 "preQueries" : json.dumps(pre_queries),
TypeError: 'NoneType' object is not subscriptable
Does anyone know why these errors occur, or are there other methods I could try? Thanks!
Hi,
For the first issue, you should update to 9.0.4 to get the fix for this bug. The second issue comes from using exec_recipe_fragment, which can only be used in a recipe (as the name implies). Also, even if it had worked, your code would have extracted the table from Hive and reloaded it into another dataset, effectively duplicating the data; this is probably not what you're looking for.
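To illustrate the second point, here is a minimal sketch in plain Python (no Dataiku calls) of what the traceback suggests is happening: the failing line reads spec["currentActivityId"], and outside a recipe that spec is presumably None, so subscripting it raises the TypeError. The names current_activity_spec, build_payload and describe_failure are hypothetical stand-ins invented for this illustration; they are not part of the dataiku package, and the real internals may differ.

```python
# Hypothetical stand-ins for illustration only; none of these are Dataiku functions.

def current_activity_spec(inside_recipe):
    """Pretend lookup of the recipe activity context.

    Assumption: a running recipe has such a context, a notebook does not.
    """
    if inside_recipe:
        return {"currentActivityId": "activity_1"}
    return None


def build_payload(spec, query):
    # Mirrors the failing line from the traceback: spec["currentActivityId"].
    return {"activityId": spec["currentActivityId"], "query": query}


def describe_failure(inside_recipe):
    """Return "ok" if the payload can be built, else the error text."""
    spec = current_activity_spec(inside_recipe)
    try:
        build_payload(spec, "select * from sbl_tipo_identificacion")
        return "ok"
    except TypeError as err:
        return "TypeError: {}".format(err)


notebook_result = describe_failure(inside_recipe=False)
recipe_result = describe_failure(inside_recipe=True)
print(notebook_result)
print(recipe_result)
```

Under these assumptions, the notebook case fails with the same kind of TypeError as in the question, while the recipe case succeeds. This only reproduces the error mechanism; it does not make exec_recipe_fragment usable from a notebook.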
Hi! Got it. Thanks.
However, it's very unlikely that I'll get to use version 9, since I'm not the admin and can't run the update. Do you know of any other way to solve this problem?