Hi, this is something for which I would normally make a pull request, but you don't have a public API. I therefore thought it best if I created a bug report here instead.
Problem:
If one accidentally (or programmatically) passes an object that isn't a Spark DataFrame to the `write_with_schema` function in dataiku.spark, the underlying code tries to access the Spark context of the assumed DataFrame and crashes with an error raised from inside Dataiku's own code:
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 139, in write_with_schema
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - write_schema_from_dataframe(dataset, dataframe)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 122, in write_schema_from_dataframe
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - dsc = __dataikuSparkContext(dataframe._sc._jvm)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - AttributeError: 'Dataset' object has no attribute '_sc'
This can easily happen when a function returns None, which then gets passed to the writer instead of a DataFrame; the result is the same kind of AttributeError.
Solution:
A single line asserting that the `dataframe` object is a Spark DataFrame could be added just before line 122 of dataiku/spark/__init__.py, where the code accesses the underlying Spark context. A TypeError would help the user more than the current stack trace.
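To illustrate the kind of check I mean, here is a minimal sketch. The helper name `require_instance` and the stand-in class `FakeDataFrame` are hypothetical and only there so the snippet runs without Spark; inside dataiku/spark/__init__.py the expected type would presumably be `pyspark.sql.DataFrame`. I raise TypeError explicitly rather than using a bare `assert`, since asserts are skipped when Python runs with `-O`.

```python
def require_instance(value, expected_type, arg_name):
    """Return value unchanged, or raise a readable TypeError if it has the wrong type.

    Hypothetical helper, not part of the dataiku package.
    """
    if not isinstance(value, expected_type):
        raise TypeError(
            "%s must be a %s, got %s"
            % (arg_name, expected_type.__name__, type(value).__name__)
        )
    return value


class FakeDataFrame(object):
    """Stand-in for pyspark.sql.DataFrame so this example runs without Spark."""


# A real instance passes through untouched.
df = require_instance(FakeDataFrame(), FakeDataFrame, "dataframe")

# A function that accidentally returned None now fails with a clear message.
try:
    require_instance(None, FakeDataFrame, "dataframe")
except TypeError as exc:
    message = str(exc)
    print(message)
```

In `write_schema_from_dataframe` the equivalent call would go right before the line that reads `dataframe._sc`, assuming `DataFrame` is imported from `pyspark.sql` there.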