Logging in dataiku notebook / recipe ...
Hello Team,
I am working on PySpark recipes. I use a notebook to build the logic and then convert it back into a recipe.
The Dataiku and Spark operations (e.g. df.count()) emit a lot of log statements to the console and make the notebook very difficult to use.
Is there a way for me to suppress logging from the Dataiku and Spark APIs?
Btw, I ran the snippet sc.setLogLevel('ERROR').
Operating system used: Linux
Answers
-
JordanB Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 297 Dataiker
Hi @Ravan
You can use the following code to reduce the verbosity. The first cell containing this code will still be verbose, but the rest of the notebook won't be:

import pyspark
from pyspark.sql import SQLContext
import dataiku.spark as dkuspark

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
dkuspark.__dataikuSparkContext(sqlContext._jvm)
sc.setLogLevel("WARN")
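If the notebook is still chatty after that, a further sketch, not part of the original answer and assuming the remaining noise comes from py4j's Python-side client rather than Spark itself, is to raise the level of the py4j logger via the standard logging module:

import logging

# "py4j" is the parent of py4j's module loggers (e.g. "py4j.java_gateway"),
# so raising its level quiets the client-side DEBUG/INFO chatter
logging.getLogger("py4j").setLevel(logging.ERROR)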
We will be looking into improving this.
Thanks!
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,160 Neuron
I would also argue that you should clean up your notebook before it runs as production code. For instance, in a notebook you may run statements such as df.count(), df.info(), df.show(), df.head(), print(), etc. to check the contents of a data frame or to debug as you develop your code. These statements take time to execute, generate output that needs to be transmitted, and are useless in a non-interactive execution such as a recipe running in a scenario. They can also work against PySpark's lazy evaluation, making your code run slower. A small sketch of keeping them out of production runs follows below.
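Here is a minimal sketch of that idea (not from the original reply; the dataset name, column name and DEBUG flag are made up for illustration), gating debug-only Spark actions behind a flag so they run in the notebook but not in the recipe:

import dataiku
import dataiku.spark as dkuspark
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

DEBUG = False  # flip to True while exploring interactively in the notebook

input_ds = dataiku.Dataset("my_input_dataset")   # hypothetical dataset name
df = dkuspark.get_dataframe(sqlContext, input_ds)

if DEBUG:
    # These are Spark actions: they trigger computation and defeat lazy
    # evaluation, so only run them while developing in the notebook
    print(df.count())
    df.show(5)

# The actual transformation stays lazy until the output dataset is written
result = df.filter(df["status"] == "active")     # hypothetical column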
PS: Don't use df.count() when you don't need the exact number of rows. To check whether a data frame is empty, use len(df.head(1)) > 0 or df.rdd.isEmpty() in PySpark, or simply df.empty if you are in Pandas, as these will be much faster.
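A small self-contained sketch of those checks (the sample data is made up, and the SparkSession is created directly only for illustration; in DSS you would reuse the context set up earlier):

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# PySpark: fetch at most one row instead of counting the whole data frame
is_empty = len(sdf.head(1)) == 0
# or, equivalently, stop as soon as any row is found
is_empty = sdf.rdd.isEmpty()

# Pandas: .empty is a cheap property, not a method
pdf = pd.DataFrame({"id": [1, 2]})
is_empty_pd = pdf.empty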