AttributeError: 'Dataset' object has no attribute '_sc'
Hi, this is something for which I would normally open a pull request, but since the source isn't publicly hosted, I thought it best to file a bug report here instead.
Problem:
If one accidentally (or programmatically) passes an object that isn't a Spark DataFrame to the `write_with_schema` function in dataiku.spark, the underlying code tries to access the Spark context on the assumed DataFrame and crashes with an internal Dataiku error:
```
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 139, in write_with_schema
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - write_schema_from_dataframe(dataset, dataframe)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 122, in write_schema_from_dataframe
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - dsc = __dataikuSparkContext(dataframe._sc._jvm)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - AttributeError: 'Dataset' object has no attribute '_sc'
```
This can easily happen if a function that returns `None` gets passed to the writer instead of a DataFrame, producing the same kind of AttributeError (with 'NoneType' in place of 'Dataset').
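For illustration, a minimal repro sketch, assuming a DSS PySpark recipe environment; the dataset name "my_output" is made up:

```python
# Hypothetical repro; "my_output" is a placeholder dataset name.
import dataiku
import dataiku.spark as dkuspark

output = dataiku.Dataset("my_output")

# Passing the Dataset object itself (or None) instead of a Spark
# DataFrame reaches dataframe._sc internally and raises the
# AttributeError shown in the log above.
dkuspark.write_with_schema(output, output)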
Solution:
A single check asserting that the `dataframe` object is a Spark DataFrame could be added just before line 122 of dataiku/spark/__init__.py, where the code tries to access the underlying Spark context. Raising a TypeError there would offer the user a little more help than the current stack trace.
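A minimal sketch of such a guard, assuming pyspark is importable in the recipe environment; the function body below is paraphrased from the traceback, not the actual Dataiku source:

```python
from pyspark.sql import DataFrame

def write_schema_from_dataframe(dataset, dataframe):
    # Fail fast with an actionable message instead of an opaque
    # AttributeError deep in the internals.
    if not isinstance(dataframe, DataFrame):
        raise TypeError(
            "write_schema_from_dataframe expects a pyspark.sql.DataFrame, "
            "got %s" % type(dataframe).__name__)
    # ... existing code that accesses dataframe._sc._jvm ...
```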
Answers
Thank you very much for this report and for investigating a solution. I'll pass this information on to the development team.