Logging in dataiku notebook / recipe ...
Hello Team,
I am working on PySpark recipes. I use a notebook to build the logic and then convert it back into a recipe.
The Dataiku and Spark operations (e.g. df.count()) emit a lot of log statements to the console and make the notebook very difficult to use.
Is there a way for me to suppress logging from the Dataiku and Spark APIs?
Btw, I ran the snippet sc.setLogLevel('ERROR').
Operating system used: Linux
Answers
-
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 297 Dataiker
Hi @Ravan
You can use the following code to reduce the verbosity. The first cell containing this code will still be verbose, but the rest of the notebook won't be:

import pyspark
from pyspark.sql import SQLContext
import dataiku.spark as dkuspark

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
dkuspark.__dataikuSparkContext(sqlContext._jvm)
sc.setLogLevel("WARN")
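If the notebook is still chatty after that, a further sketch, not part of the original answer and assuming the remaining noise comes from py4j's Python-side client rather than Spark itself, is to raise the level of the py4j logger via the standard logging module:

import logging

# "py4j" is the parent of py4j's module loggers (e.g. "py4j.java_gateway"),
# so raising its level quiets the client-side DEBUG/INFO chatter
logging.getLogger("py4j").setLevel(logging.ERROR)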
We will be looking into improving this.
Thanks!
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
I would also argue that you should clean up your notebook before it runs as production code. For instance, in a notebook you may run statements such as df.count(), df.info(), df.show(), df.head(), print(), etc. to check the contents of a data frame or to debug as you develop your code. These statements take time to execute, generate output that needs to be transmitted, and are useless in a non-interactive execution such as a recipe running in a scenario. They can also work against PySpark's lazy evaluation, making your code run slower. A small sketch of keeping them out of production runs follows below.
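Here is a minimal sketch of that idea (not from the original reply; the dataset name, column name and DEBUG flag are made up for illustration), gating debug-only Spark actions behind a flag so they run in the notebook but not in the recipe:

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

DEBUG = False  # flip to True while exploring interactively in the notebook

input_ds = dataiku.Dataset("my_input_dataset")   # hypothetical dataset name
df = dkuspark.get_dataframe(sqlContext, input_ds)

if DEBUG:
    # These are Spark actions: they trigger computation and defeat lazy
    # evaluation, so only run them while developing in the notebook
    print(df.count())
    df.show(5)

# The actual transformation stays lazy until the output dataset is written
result = df.filter(df["status"] == "active")     # hypothetical column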
PS: Don't use df.count() when you don't need the exact number of rows. To check whether a data frame is empty, use len(df.head(1)) > 0 or df.rdd.isEmpty() in PySpark, or simply df.empty if you are in Pandas, as these will be much faster.
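A small self-contained sketch of those checks (the sample data is made up, and the SparkSession is created directly only for illustration; in DSS you would reuse the context set up earlier):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# PySpark: fetch at most one row instead of counting the whole data frame
is_empty = len(sdf.head(1)) == 0
# or, equivalently, stop as soon as any row is found
is_empty = sdf.rdd.isEmpty()

# Pandas: .empty is a cheap property, not a method
pdf = pd.DataFrame({"id": [1, 2]})
is_empty_pd = pdf.empty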