Problem writing JSON data to "HDFS_managed" dataset storage

KjellK · May 2017

I have an input dataset which contains one column of JSON data, which needs to be:

folded into several rows,
unpacked into separate columns.

This works fine when the output dataset is "filesystem_managed", but if the output dataset is "HDFS_managed", I get a long list of errors and no output. From the log I see:


[11:56:34] [WARN] [com.dataiku.dip.input.formats.parquet.ParquetOutputWriter] - OUTPUT_DATA_BAD_TYPE: Unable to write row 3 to Parquet: Failed to write in column usage (content:{"duration":4,"start_time":1495161529377,"package_name":"com.sonyericsson.home","count":1,"origin_google_play":false}): A JSONArray text must start with '[' at 1 [character 2 line 1]

java.io.IOException: Failed to write in column usage (content:{"duration":4,"start_time":1495161529377,"package_name":"com.sonyericsson.home","count":1,"origin_google_play":false}): A JSONArray text must start with '[' at 1 [character 2 line 1]

Since the JSON data is read correctly from the original files, it seems strange that it cannot be written back in the same way. However, I wonder about the error text above: "A JSONArray text must start with '[' at 1 [character 2 line 1]". I have no idea how the string is stored, but it appears that index 1 is character 2. In the string I have, the first character is "[", so could there be a mismatch between how the string is stored internally and how the write function is implemented? Since the original data can be unpacked with the "unnest" processor, that function clearly has no problem interpreting the JSON data, so the issue seems to be with the "write to HDFS_managed" functionality.

Clément_Stenac · May 2017

It seems that the storage type here is "array" like [1,2,3] but the data is actually an "object" with multiple keys like {"a" : 1, "b" : 2}
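
To illustrate the mismatch, here is a minimal Python sketch. It assumes the source column holds a JSON array of objects and that folding produces one row per array element; the sample values and the helper names `json_kind` and `fold` are made up for illustration and are not part of DSS. After folding, each row holds an object, so a column still typed as "array" no longer matches its content.

```python
import json

# Hypothetical source cell: a JSON array of objects (values are made up).
source_cell = '[{"duration": 4, "count": 1}, {"duration": 7, "count": 2}]'


def json_kind(text):
    """Return 'array', 'object' or 'scalar' for a JSON string."""
    value = json.loads(text)
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    return "scalar"


def fold(text):
    """Mimic folding an array into one row per element (illustration only)."""
    return [json.dumps(element) for element in json.loads(text)]


rows = fold(source_cell)
print(json_kind(source_cell))        # kind of the original cell
print([json_kind(r) for r in rows])  # kind of each folded row
```

A writer that expects an array for this column would reject every folded row, which is consistent with the "must start with '['" message in the log.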

Note that if you want to write arrays or objects in Parquet, you need to fully specify all of the nested types in the dataset schema (the nested types are not automatically inferred for you). The simplest fix would be to change the storage type of the "usage" column to "string" (either in the prepare recipe editor or in the output dataset schema editor).
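
As a rough idea of what "fully specifying the nested types" involves, the Python sketch below walks one sample value from the log and lists a type for every key. The helper `describe` and the type labels ("bigint", "string", and so on) are illustrative assumptions, not the DSS schema syntax; the actual schema has to be entered in the dataset schema editor.

```python
import json


def describe(value):
    """Return an illustrative type description for a parsed JSON value."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {key: describe(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumes all elements share the type of the first one.
        return [describe(value[0])] if value else []
    return "null"


# Row content copied from the log message above.
row = ('{"duration":4,"start_time":1495161529377,'
       '"package_name":"com.sonyericsson.home","count":1,'
       '"origin_google_play":false}')

nested_types = describe(json.loads(row))
print(nested_types)
```

With the column stored as "string" instead, none of this is needed: the JSON text is written as-is and can be parsed again by whatever reads the dataset.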

You may also want to review https://doc.dataiku.com/dss/latest/schemas/

KjellK · May 2017

Thanks, I'll take a look at this.
