Text encoding different for Python script and Notebook
Hi,
When I run a Python script in the notebook editor, using the same Python environment as the default one (Python 2), and save the results to a new dataset (CSV) using:
dataiku_dataset = dataiku.Dataset("Demo_Api_REST_V1_Import")
dataiku_dataset.write_with_schema(new_dataset, dropAndCreate=True)
Then the new dataset is correctly encoded in UTF-8 and contains no u'data' values or, more importantly in my case, JSON arrays such as {u'key': u'value'}, which completely break my JSON prepare recipes afterwards.
But when I run this from either the Flow or the Python script editor (not the notebook one), all my data gets saved as unicode u'strings'.
It's the exact same script with the same environment and the same result dataset (there is no input dataset, since creating the data is what this script does). Needless to say, I am forcing UTF-8 encoding in every way I can think of:
while next_url is not None:
    api_url = next_url
    response = requests.get(api_url, headers=headers)
    json_response = json.loads(response.content.decode('utf-8'), encoding="utf-8")
    # Follow the pagination link if the API returned one
    if json_response.get('_links') is not None and json_response.get('_links').get('next') is not None:
        next_url = json_response.get('_links').get('next').get('href')
    else:
        next_url = None
    # Append this page of items to the accumulated dataframe
    if json_response.get('_embedded') is not None and json_response.get('_embedded').get('items'):
        if akeneoDf is None:
            akeneoDf = pd.read_json(json.dumps(json_response.get('_embedded').get('items')), encoding="utf-8")
        else:
            akeneoDf = akeneoDf.append(pd.read_json(json.dumps(json_response.get('_embedded').get('items')), encoding="utf-8"))
How come?
Best Answer
Hi,
Encoding is handled differently in Python 2 and Python 3.
However, when you execute the exact same Python code with the exact same versions of Python and libraries, and then write the pandas DataFrame to a Dataiku dataset using the same method (write_with_schema, for instance), there cannot be any difference in the resulting Dataiku dataset, whether the code runs from a recipe or from a notebook. This assumes that you compare the two in the Dataiku dataset view, after refreshing the sample.
However, if you compare the pandas DataFrame view you get in the notebook (before writing to a dataset) with the dataset sample view, there can be small differences, particularly in encoding. This is expected by design: a pandas DataFrame is an in-memory Python object, while a Dataiku dataset is physically stored as a text file or database table. A pandas DataFrame and a Dataiku dataset are two different types of objects.
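To illustrate the difference between an in-memory object and its stored text, here is a minimal sketch written for Python 3. It rests on an assumption that the thread does not confirm: that the {u'key': u'value'} text comes from a nested dict being written out as its Python repr. The sample record and the nested_to_json helper are hypothetical and only serve the illustration.

```python
import json

# Hypothetical record standing in for one API item (not from the original post).
item = {
    "identifier": "sku-1",
    "attributes": {"name": "Caf\u00e9", "tags": ["a", "b"]},
}


def nested_to_json(record):
    """Return a copy of record with dict/list values serialized to JSON text."""
    return {
        key: json.dumps(value, ensure_ascii=False)
        if isinstance(value, (dict, list))
        else value
        for key, value in record.items()
    }


# str() gives the Python repr of the nested dict: single-quoted keys and
# values (with u'' prefixes on Python 2), which is not valid JSON.
repr_text = str(item["attributes"])

# json.dumps gives JSON text that a JSON parser can read back.
flat = nested_to_json(item)
json_text = flat["attributes"]

print(repr_text)
print(json_text)
```

If that assumption holds, serializing nested values explicitly before building the DataFrame would make the stored text independent of how Python displays the object.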
I hope this helps clarify the matter.
Cheers,
Alex