JSON on Hadoop?

q666
Level 1

This is related to my last question; I'm still not convinced that there is full JSON support...



So, to recreate the problem with a simpler, valid JSON file:

echo -e "{"foo": 123, "bar": 444}\n{"foo": 111, "bar": 321}" > simple_valid_json

hdfs dfs -put simple_valid_json



and then I'm able to create a simple_valid_json_dataset dataset via DSS... but when I want to do something with it...



import dataiku
import dataiku.spark as dkuspark

mydataset = dataiku.Dataset("simple_valid_json_dataset")
df = dkuspark.get_dataframe(sqlContext, mydataset)
df.count()   # -> raises an exception!




Py4JJavaError: An error occurred while calling o22.count.
: java.lang.RuntimeException: Unsupported input format : json
at com.dataiku.dip.shaker.mrimpl.formats.UniversalFileInputFormat.lazyInit(UniversalFileInputFormat.java:93)
at com.dataiku.dip.shaker.mrimpl.formats.UniversalFileInputFormat.getSplits(UniversalFileInputFormat.java:10


 

5 Replies
UserBird
Dataiker
Hi q666,

There are several issues with that JSON:
- You need to escape the quotes for the keys.
- For arrays, you need a comma-separated list surrounded by square brackets [].

With this corrected version, there should not be any issues:
echo -e "[{\"foo\": 123, \"bar\": 444},\n{\"foo\": 111, \"bar\": 321}]" > simple_valid_json
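For what it's worth, a quick sanity check (a minimal sketch, assuming you run it locally against the file before the hdfs put) that the corrected file is one valid JSON document:

import json

with open("simple_valid_json") as f:
    records = json.load(f)   # the whole file parses as a single JSON array
print(records)   # a list of two dicts with keys "foo" and "bar"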
q666
Level 1
Author
Ah, I see, so you expect one single JSON object in the file... and not, as Spark does, multiple self-contained JSON objects separated by newlines...

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

It's a pity that I can't use DSS when I have one file with multiple self-contained JSON objects, as in Spark...
q666
Level 1
Author
I could use the syntax sqlContext.read.format("json").load("simple_json") in PySpark, ignoring the input as a dataset from DSS, but that's not so nice, is it? 😛
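e.g. something like this (a rough sketch, assuming the sqlContext from my earlier snippet; the output dataset name "simple_valid_json_parsed" is just a placeholder), reading the newline-delimited file directly and only using a DSS dataset for the output:

import dataiku
import dataiku.spark as dkuspark

# read the newline-delimited JSON straight from HDFS, bypassing the DSS input dataset
df = sqlContext.read.format("json").load("simple_json")

# write the result back into a managed DSS dataset (placeholder name)
output = dataiku.Dataset("simple_valid_json_parsed")
dkuspark.write_with_schema(output, df)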
jereze
Community Manager
Can you try with "one row per line" instead of "json"? I'm not sure, but this might work.
Jeremy, Product Manager at Dataiku
q666
Level 1
Author
Tried it, it works! 🙂 But I need to parse/"flatten" the JSON later, so the process is a bit slower than using a PySpark job that connects directly to the source on HDFS.
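Roughly what I mean (a sketch, assuming the "one row per line" dataset exposes each raw line as a single string column; the dataset name "simple_valid_json_lines" is just a placeholder):

import dataiku
import dataiku.spark as dkuspark

lines_dataset = dataiku.Dataset("simple_valid_json_lines")      # placeholder name
lines_df = dkuspark.get_dataframe(sqlContext, lines_dataset)    # one string column per raw line

# re-parse every line as JSON so Spark infers the schema; this extra step is the overhead
json_df = sqlContext.read.json(lines_df.rdd.map(lambda row: row[0]))
json_df.count()   # 2 for the sample file above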