UnicodeDecodeError when performs sql query with Python API

Oscar
Oscar Registered Posts: 19 ✭✭✭✭
edited July 16 in Using Dataiku

Hi everyone,

When I performs SQL query through dataiku Python API

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

it returns an

executor = SQLExecutor2(connection="postgres") 
executor.query_to_df('SELECT * from "somedb"')

I checked the column names and data, they are pure string or number without any accent. And I cannot find param to set encoding param for this connection.

Would you help me with this issue?

Thanks in advance.

Best Answer

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭
    Answer ✓

    Sorry, I removed it because it was sensitive information.

    And I just tested dataikuapi.sql_query, and it works correctly. So maybe there is some error caused by dataiku.SQLExecutor2 ?

Answers

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    Hello, does your source DB use UTF-8 collation or something else, e.g. LATIN1?

    Screenshot 2020-09-11 at 18.35.19.png

    Are you using Python 3 as code environment in this project? If so, does switching to Python 2 produce different results?

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    Collation checked, is en_US.UTF-8

    I was using Python3, it gave me the UnicodeDecodeError. If I switch to Python2, then it shows as follow: 

    Screenshot 2020-09-14 at 12.13.35.png

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    Hm, that's weird. Could you let me know more details about affected Table please. What are the column types and their content? What is the output in PgAdmin?

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    Hi, all tables are affected. I even test the hello world query ¨select 1 ;¨ and it returns the same error. The error only happens in Python environment, either PgAdmin or Dataiku SQL notebook can execute the query perfectly.

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker
    edited July 17

    Hi, I wasn't able to reproduce your issue so far. I tried changing Python locale as well as Postgres DB locale/collation.

    Still, I suspect locale settings might be related. Another possibility is that you have conflicting system locales on DSS server and PostgreSQL nodes (in case those are hosted on different nodes of course) . Hence, could you please

    1. provide output of "locale" command line output from your DSS and PostgreSQL node(s).
    2. your full code (if possible so I can compare it with mine) and Python 'locale' output
    import locale
    print (locale.getdefaultlocale(),'\n',locale.getlocale(),'\n',locale.getpreferredencoding())

    Screenshot 2020-09-17 at 17.47.40.pngP.s. For full code export you could download notebook like this

    Screenshot 2020-09-17 at 17.51.11.png

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    Here you have the notebook with its output.

    It seems the forum system doesn't allow me to upload .zip, I created a gist you can see the same notebook in here:

    https://gist.github.com/Qiaorui/97b87c059f040699199df3a1d4eafc88

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    thanks, I still can't reproduce this. What is the output of 'locale' cmd command on your DSS server and Postgres server?

    Also, do you have any custom settings set for "postgres" connection? Here is mine

    Screenshot 2020-09-21 at 20.22.13.png

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭
    edited July 17

    I don't have any custom settings in the connection. And the output of locale command are all

    'en_US', 'UTF-8'
  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker
    edited July 17

    thanks, we still think this is environment-related issue; we wonder if there is a proxy involved, hence could you please include next lines before "import" code and provide the output

    import os
    print(os.environ)
    if "http_proxy" in os.environ:
      del os.environ["http_proxy"]
    if "HTTP_PROXY" in os.environ:
      del os.environ["HTTP_PROXY"]
  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    Hi tested, there is no proxy on environment. But I found something interesting.

    When I do

    dataiku.set_remote_dss in dataiku notebook and execute the query , it will return the error.

    However if I run it on dataiku notebook (same notebook as previous one) without dataiku.set_remote_dss, it executes correctly.

    My goal is to use dataiku SQL query on dataiku user API, I cannot do this without set_remote_dss

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    Hello, I don't see 'set_remote_dss' in https://gist.github.com/Qiaorui/97b87c059f040699199df3a1d4eafc88

    Could you please update the file or provide a new link with working vs not working code?

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    So, do I read this correctly that your code works fine in Notebook but fails when being executed outside DSS? If so, you should indeed be using sql_query() for code outside DSS as stated in the doc: "This function was initially designed for usage outside of DSS and only supports returning results as an iterator. It does not support pandas dataframe"
    By the way, did you compare “os.environ” output for both notebook (datiaku package) and dataikuapi package code executed outside DSS?
  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    For your question, all codes are executed in the same dataiku notebook. The only difference is shown as below table.

    libset remote dssresult
    dataikunoworks, but can only applied inside DSS
    dataikuyesfailed
    dataikuapiyesworks

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    I'm sorry, I don't understand why do you need to use set_remote_dss() inside a Notebook? Are you connecting to remote DSS instance from another DSS instance?

    If not, you should be able to simply do client = dataiku.api_client() to get a handle of current DSS instance as per doc

  • Oscar
    Oscar Registered Posts: 19 ✭✭✭✭

    Because I want to execute the SQL_executor2 inside python API, however the API is considered as outside of DSS instance, that's why I'm testing this with notebook

  • dima_naboka
    dima_naboka Dataiker, Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS Core Concepts Posts: 28 Dataiker

    Could you please open a support ticket (https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html#editor-support-for-dataiku-customers) and attach a diagnostic of the DSS instance (Administration > Maintenance > Diagnostic tool)?

    Note that you need to be administrator of the DSS instance - else you'll need to ask your admin

    If the resulting file is too large for the suppoprt portal (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file

Setup Info
    Tags
      Help me…