I've got a Jupyter notebook set up to do a series of ETL and writeback tasks that I want to log.
I'm newish to both Python and Dataiku.
Wondering what approach folks would suggest for managing the output from a Jupyter Notebook in the DSS environment.
I currently have the results showing up on screen through a number of print statements in a cell. They look sort of like this.
Note that the results from each step of the process are different, so a fixed schema for each data element may be hard.
Trimming over time would be useful. This job might run from 1 to roughly 144 times a day, indefinitely. Early on I'm looking for detailed logs; later on, I'll likely want to reduce the volume. I may eventually establish logging levels. What you see above is likely debug level.
The volume will be relatively light: hundreds to maybe thousands of text rows when in debug mode like this, and maybe tens to hundreds of rows when running in a summary mode.
Maybe what I'm asking is:
How does the "standard" Python logging library work within a DSS-hosted Jupyter notebook?
Where does the resulting content go? Is there any value to using the standard library when working with DSS?
Storing in either a dataset or a managed folder makes sense. If these are indeed logs (meaning content that is useful for understanding what happened, but not required as long as everything is working OK), a dataset may be a slightly atypical choice.
We would indeed recommend the use of Python's "logging" library, which is designed for exactly this: your program emits log items with an associated "severity" or "level" (debug, info, warning, error), and the library then sends those items to various outputs, with optional severity filters.
The logging library supports multiple outputs. The default one is the console, meaning it does the equivalent of "print", so output appears directly in the notebook. But you could indeed create a file handler in Python logging that writes into a managed folder file.
Assuming that you are using a local managed folder, that could look like:
import os
import logging
import dataiku

logging.basicConfig(level=logging.DEBUG)
managed_folder_path = dataiku.Folder("myfolder").get_path()
logging.root.addHandler(logging.FileHandler(os.path.join(managed_folder_path, "log.log")))
logging.info("hello")  # prints both in the console and in log.log in the managed folder
You can also play with formats in logging to make sure timestamps, severities, etc. are printed.
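For example, a format string that prepends a timestamp, the severity and the logger name (the exact fields are a choice here, not something DSS prescribes):

```python
import logging

# Include timestamp, severity and logger name in front of each message.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)-8s %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

logging.info("ETL step finished")
# emits something like: 2024-05-01 09:30:12 INFO     root - ETL step finished
```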
However, for a process that you are going to repeat multiple times a day, we would recommend that you use DSS jobs rather than notebooks. DSS automatically stores the "console" output of each job without any need for you to manage it (though you would still use the logging library to dispatch by severity).
@Clément_Stenac thanks for jumping in on this question.
I will check out the logging library.
I do intend to run the python code as a python code recipe.
Do I understand correctly from your note that anything printed to the console will show up in a job log? Or is it just content emitted through the logging library that ends up in these logs?
When I sit down to code later today, I'll give this a try.
Using the logging library in Python seems to work fairly well for me. It puts my logging output straight into the recipe's job result. I can change the logging level in the standard way for this library.
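For reference, changing the level "in the standard way" can be sketched like this (the per-stage logger names are illustrative, not from the thread):

```python
import logging

logging.basicConfig(level=logging.DEBUG)

# One named logger per stage keeps the output attributable.
etl_log = logging.getLogger("etl")
writeback_log = logging.getLogger("writeback")

# Raising the root level moves from debug detail to a quieter summary mode.
logging.getLogger().setLevel(logging.INFO)

etl_log.debug("row-by-row detail")      # now suppressed
writeback_log.info("wrote some rows")   # still emitted
```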
This is a good solution.
@Clément_Stenac thanks for your support.
What is the default logging level in DSS?
##setup logging level
However, I'm getting the following in the logs:
It's showing everything down to debug level.
I won't need this level of detail most of the time when running live.
In a Jupyter notebook, the level seems to be respected as written.
I've decided that I don't need separate files; the standard logs should be good enough.