Problem when trying to create a new dataset in project with a Hive table (from Python API)

esteban23 Registered Posts: 4 ✭✭✭✭
edited July 16 in Using Dataiku

Hi! I'd like to import a Hive table into a project (from a notebook) using Dataiku's Python API. The idea is to replicate the process done through the UI (which is successful, as you can see in the picture below):


After doing it through the UI, the table appears as a dataset on the project's 'Datasets' page (this is what I need).

However, when I try to do the same process on a notebook I get an error. I have tried two approaches:

1) First approach:

import dataiku

client = dataiku.api_client()
project = client.get_project('MYPROJECT')

# Declare which Hive table(s) to import
import_definition = project.init_tables_import()
import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")

# Prepare and run the import
prepared_import = import_definition.prepare()
future = prepared_import.execute()

import_result = future.wait_for_result()

Gives the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-52-432f67d71601> in <module>
      2 import_definition.add_hive_table("referenciales", "sbl_tipo_identificacion")
----> 4 prepared_import = import_definition.prepare()
      5 future = prepared_import.execute()

~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/ in prepare(self)
   1328         future = self.client.get_future(ret["jobId"])
-> 1329         future.wait_for_result()
   1330         return TablesPreparedImport(self.client, self.project_key, future.get_result())

~/Dataiku_install/dataiku-dss-8.0.4/python/dataikuapi/dss/ in wait_for_result(self)
     73         Wait and get the future result
     74         """
---> 75         if self.state.get('hasResult', False):
     76             return self.result_wrapper(self.state.get('result', None))
     77         if self.state is None or not self.state.get('hasResult', False) or self.state_is_peek:

AttributeError: 'NoneType' object has no attribute 'get'
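The AttributeError means the future's `state` attribute is still `None` when `wait_for_result()` dereferences it; the `None` check on the following line comes too late. A minimal, self-contained reproduction of the pattern (stub classes for illustration, not the real dataikuapi code), plus the kind of guard that avoids it:

```python
class BuggyFuture:
    """Stub reproducing the failure pattern; NOT the real dataikuapi class."""
    def __init__(self, state=None):
        self.state = state  # stays None until a state has been fetched

    def wait_for_result(self):
        # Dereferences state before checking it for None -> AttributeError
        if self.state.get('hasResult', False):
            return self.state.get('result')


class GuardedFuture(BuggyFuture):
    """Same stub with the None check moved before the dereference."""
    def wait_for_result(self):
        if self.state is not None and self.state.get('hasResult', False):
            return self.state.get('result')
        return None


try:
    BuggyFuture().wait_for_result()
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'get'

print(GuardedFuture({'hasResult': True, 'result': 42}).wait_for_result())  # 42
```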

2) Second approach:

from dataiku.core.sql import SQLExecutor2

# Build the dataset where the result of the query will be stored
builder = project.new_managed_dataset_creation_helper("temp_dataset")
builder.with_store_into("hdfs_connection", format_option_id="PARQUET_HIVE")
dataset = builder.create()

executor = SQLExecutor2(connection="referenciales")
executor.exec_recipe_fragment(dataset, "select * from sbl_tipo_identificacion", overwrite_output_schema=True)

When trying this, the following error is printed:

TypeError                                 Traceback (most recent call last)
<ipython-input-33-13217d6dde60> in <module>
      3 #output_dataset = dataiku.Dataset("temp_outputDataset2")
----> 4 SQLExecutor2.exec_recipe_fragment(output_dataset, streamed_query)

~/Dataiku_install/dataiku-dss-8.0.4/python/dataiku/core/ in exec_recipe_fragment(output_dataset, query, pre_queries, post_queries, overwrite_output_schema, drop_partitioned_on_schema_mismatch)
    181             data={
    182                 "outputDataset": output_dataset.full_name,
--> 183                 "activityId" : spec["currentActivityId"],
    184                 "query" : query,
    185                 "preQueries" : json.dumps(pre_queries),

TypeError: 'NoneType' object is not subscriptable

Does anyone know why these errors occur, or perhaps some other methods to try? Thanks!


  • fchataigner2 Dataiker Posts: 355 Dataiker


    For the first issue, you should update to 9.0.4 to get the fix for this bug. The second issue comes from using exec_recipe_fragment, which can only be used in a recipe (as the name implies). Also, even if it had worked, your code would have extracted the table from Hive and reloaded it into another dataset, effectively duplicating the data; this is probably not what you're looking for.
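    If the goal in a notebook is just to read the table rather than register it as a dataset, `SQLExecutor2.query_to_df` runs the query and returns a pandas DataFrame without needing a recipe. A minimal sketch (the connection name is taken from the post and assumed to be a valid SQL/Hive connection; only the query-building helper runs outside DSS):

    ```python
    def select_all(database, table):
        # Build the fully-qualified SELECT used below (assumes simple identifiers)
        return 'SELECT * FROM `{}`.`{}`'.format(database, table)

    # In a DSS notebook (not run here), this reads the result into pandas
    # instead of writing a dataset, so it works outside a recipe:
    # from dataiku.core.sql import SQLExecutor2
    # executor = SQLExecutor2(connection="referenciales")  # connection name from the post
    # df = executor.query_to_df(select_all("referenciales", "sbl_tipo_identificacion"))
    ```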

  • esteban23 Registered Posts: 4 ✭✭✭✭

    Hi! Got it. Thanks.

    However, it's very unlikely that I'll get to use version 9: I'm not the admin, so I can't run the update. Do you know any other way to solve this problem?
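
    One possible workaround on 8.x (a sketch, not verified against 8.0.4): copy the definition of a dataset that was already imported through the UI with `get_definition()`, swap in the target table, and register it with `project.create_dataset()`. The param keys `database` and `table` below are assumptions; inspect the `get_definition()` output on your own instance to confirm the actual keys:

    ```python
    import copy

    def hive_table_params(template_params, database, table):
        """Build params for a new Hive dataset by copying a UI-created
        template's connection settings and swapping in the target table.
        The "database"/"table" key names are assumptions; verify them
        against get_definition() output on your instance."""
        params = copy.deepcopy(template_params)
        params["database"] = database
        params["table"] = table
        return params

    # Usage against a live DSS instance (not run here):
    # import dataikuapi
    # client = dataikuapi.DSSClient("https://DSS_HOST:PORT", "API_KEY")  # hypothetical host/key
    # project = client.get_project("MYPROJECT")
    # template = project.get_dataset("dataset_imported_via_ui").get_definition()
    # params = hive_table_params(template["params"], "referenciales", "sbl_tipo_identificacion")
    # project.create_dataset("sbl_tipo_identificacion", template["type"], params=params,
    #                        formatType=template.get("formatType"),
    #                        formatParams=template.get("formatParams", {}))
    ```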
