AttributeError: 'Dataset' object has no attribute '_sc'

jmccartin Registered Posts: 19 ✭✭✭✭
edited July 18 in Using Dataiku

Hi, I would normally open a pull request for something like this, but you don't have a public API, so I thought it best to file a bug report here instead.

Problem:

If an object that isn't a Spark DataFrame is passed, accidentally or programmatically, to the `write_with_schema` function in `dataiku.spark`, the underlying code still tries to access the Spark context on the assumed DataFrame and crashes with an internal Dataiku error:


[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 139, in write_with_schema
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - write_schema_from_dataframe(dataset, dataframe)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 122, in write_schema_from_dataframe
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - dsc = __dataikuSparkContext(dataframe._sc._jvm)
[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils] - AttributeError: 'Dataset' object has no attribute '_sc'


This can happen easily when a function returns `None` and that value is passed to the writer instead of a DataFrame, which results in the same kind of AttributeError.
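To illustrate how a `None` can slip through, here is a minimal sketch that needs no Spark at all. The function `add_flag_column` and the dict standing in for a DataFrame are hypothetical; only the `_sc` attribute lookup mirrors what the traceback shows.

```python
def add_flag_column(df):
    """Hypothetical transform that forgets to return its result."""
    df = dict(df, flag=True)  # stand-in for a real DataFrame transformation
    # Bug: there is no `return df`, so the function implicitly returns None.


result = add_flag_column({"id": 1})  # result is None, not a DataFrame

try:
    # Same attribute lookup as in the traceback: dataframe._sc
    result._sc
except AttributeError as exc:
    message = str(exc)
    print(message)
```

The error only surfaces deep inside the writer, far away from the function that actually caused it.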

Solution:

A single line asserting that the `dataframe` object is a Spark DataFrame could be added just before line 122 of dataiku/spark/__init__.py, where the code accesses the underlying Spark context. A TypeError would be more helpful to the user than the current stack trace.
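As a rough sketch of the kind of check I mean (this is not Dataiku's code; the helper name `check_is_spark_dataframe` and the error message are my own assumptions), something along these lines could run before the Spark context is accessed. The guarded import is only there so the sketch runs where pyspark is not installed.

```python
try:
    from pyspark.sql import DataFrame  # the real Spark DataFrame class
except ImportError:
    # Assumption: without pyspark, nothing can be a Spark DataFrame,
    # so the check below rejects every object.
    DataFrame = None


def check_is_spark_dataframe(dataframe):
    """Hypothetical helper: raise TypeError unless given a Spark DataFrame."""
    if DataFrame is None or not isinstance(dataframe, DataFrame):
        raise TypeError(
            "write_with_schema expects a pyspark.sql.DataFrame, got %s"
            % type(dataframe).__name__
        )
```

With a check like this, passing `None` or a `dataiku.Dataset` would fail with a TypeError that names the offending type, instead of an AttributeError about `_sc`.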
