Community Conundrum 25: Feature Visualization is now live! Read More

UnicodeDecodeError when performs sql query with Python API

Level 3
UnicodeDecodeError when performs sql query with Python API

Hi everyone,

     When I performs SQL query through dataiku Python API

executor = SQLExecutor2(connection="postgres") 
executor.query_to_df('SELECT * from "somedb"')

it returns an 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

     I checked the column names and data, they are pure string or number without any accent. And I cannot find param to set encoding param for this connection.

    Would you help me with this issue?

Thanks in advance.

0 Kudos
17 Replies
Dataiker
Dataiker

Hello, does your source DB use UTF-8 collation or something else, e.g. LATIN1?

Screenshot 2020-09-11 at 18.35.19.png

Are you using Python 3 as code environment in this project? If so, does switching to Python 2 produce different results?

0 Kudos
Level 3
Author

Collation checked, is en_US.UTF-8

I was using Python3, it gave me the UnicodeDecodeError. If I switch to Python2, then it shows as follow: 

Screenshot 2020-09-14 at 12.13.35.png

0 Kudos
Dataiker
Dataiker

Hm, that's weird. Could you let me know more details about affected Table please. What are the column types and their content? What is the output in PgAdmin?

0 Kudos
Level 3
Author

Hi, all tables are affected. I even test the hello world query ¨select 1 ;¨ and it returns the same error. The error only happens in Python environment, either PgAdmin or Dataiku SQL notebook can execute the query perfectly.

0 Kudos
Dataiker
Dataiker

Hi, I wasn't able to reproduce your issue so far. I tried changing Python locale as well as Postgres DB locale/collation.

Still, I suspect locale settings might be related. Another possibility is that you have conflicting system locales on DSS server and PostgreSQL nodes (in case those are hosted on different nodes of course) . Hence, could you please

  1. provide output of "locale" command line output from your DSS and PostgreSQL node(s).
  2. your full code (if possible so I can compare it with mine) and Python 'locale' output
import locale
print (locale.getdefaultlocale(),'\n',locale.getlocale(),'\n',locale.getpreferredencoding())

Screenshot 2020-09-17 at 17.47.40.pngP.s. For full code export you could download notebook like this

Screenshot 2020-09-17 at 17.51.11.png

0 Kudos
Level 3
Author

Here you have the notebook with its output.

 

It seems the forum system doesn't allow me to upload .zip, I created a gist you can see the same notebook in here:

https://gist.github.com/Qiaorui/97b87c059f040699199df3a1d4eafc88

0 Kudos

thanks, I still can't reproduce this. What is the output of 'locale' cmd command on your DSS server and Postgres server?

Also, do you have any custom settings set for "postgres" connection? Here is mine

Screenshot 2020-09-21 at 20.22.13.png

 

0 Kudos
Level 3
Author

I don't have any custom settings in the connection. And the output of locale command are all 

'en_US', 'UTF-8'
0 Kudos

thanks, we still think this is environment-related issue; we wonder if there is a proxy involved, hence could you please include next lines  before "import" code and provide the output

import os
print(os.environ)
if "http_proxy" in os.environ:
  del os.environ["http_proxy"]
if "HTTP_PROXY" in os.environ:
  del os.environ["HTTP_PROXY"]
0 Kudos
Level 3
Author

Hi tested, there is no proxy on environment. But I found something interesting.

When I do 

dataiku.set_remote_dss in dataiku notebook and execute the query , it will return the error.

However if I run it on dataiku notebook (same notebook as previous one) without dataiku.set_remote_dss, it executes correctly.

My goal is to use dataiku SQL query on dataiku user API, I cannot do this without set_remote_dss 😞

0 Kudos

Hello, I don't see 'set_remote_dss' in https://gist.github.com/Qiaorui/97b87c059f040699199df3a1d4eafc88

Could you please update the file or provide a new link with working vs not working code?

0 Kudos
Level 3
Author

Sorry,  I removed it because it was sensitive information.

And I just tested dataikuapi.sql_query, and it works correctly. So maybe there is some error caused by dataiku.SQLExecutor2 ? 

0 Kudos

 

So, do I read this correctly that your code works fine in Notebook but fails when being executed outside DSS? If so, you should indeed be using sql_query() for code outside DSS as stated in the doc: "This function was initially designed for usage outside of DSS and only supports returning results as an iterator. It does not support pandas dataframe"
 
By the way, did you compare “os.environ” output for both notebook (datiaku package) and dataikuapi package code executed outside DSS? 
0 Kudos
Level 3
Author

For your question, all codes are executed in the same dataiku notebook. The only difference is shown as below table.

libset remote dssresult
dataikunoworks, but can only applied inside DSS
dataikuyesfailed
dataikuapiyesworks

 

0 Kudos

I'm sorry, I don't understand why do you need to use set_remote_dss() inside a Notebook? Are you connecting to remote DSS instance from another DSS instance?

If not, you should be able to simply do client = dataiku.api_client() to get a handle of current DSS instance as per doc

0 Kudos
Level 3
Author

Because I want to execute the SQL_executor2 inside python API, however the API is considered as outside of DSS instance, that's why I'm testing this with notebook

0 Kudos
Dataiker
Dataiker

Could you please open a support ticket (https://doc.dataiku.com/dss/latest/troubleshooting/obtaining-support.html#editor-support-for-dataiku...) and attach a diagnostic of the DSS instance (Administration > Maintenance > Diagnostic tool)?

Note that you need to be administrator of the DSS instance - else you'll need to ask your admin

If the resulting file is too large for the suppoprt portal (> 15 MB), you can use https://dl.dataiku.com to send it to us. Please don't forget to send the link that is generated when you upload the file

0 Kudos