Jupyter Notebook and python logging in DSS

tgb417 · ‎01-30-2020

I've got a Jupyter Notebook set up to do a series of ETL and writeback tasks that I want to log.

I'm newish to both Python and Dataiku.

Wondering what approach folks would suggest for managing the output from a Jupyter Notebook in the DSS environment.

I currently have the results showing up on screen through a number of print statements on a cell. They look sort of like this.

Note the results from each step of the process are different, so a fixed schema for each data element may be hard.

Trimming over time would be useful. This job might run from 1 to, say, 144 times a day, indefinitely. Early on I'm looking for detailed logs. Later on, I'll likely want to reduce the flow. I may eventually establish logging levels. What you see above is likely a debug level.

The volume will be relatively light, with hundreds to maybe thousands of text rows when in debug mode like this, and maybe tens to hundreds of rows when running in a summary mode.

Question:

What sort of storage approach would be suggested managed by DSS?
1. Database rows with Simple schema. Say datestamp and output?
2. File Folder somehow managed by DSS?
  1. Either as logs that grow to a certain size and rollover. Or one file per day
Are there Python libraries that are designed to make this easier?
1. How do those libraries interact with DSS?
Any other thoughts would be helpful.

Thanks.

--Tom

Clément_Stenac · ‎01-30-2020

Hi,

Storing the output in either a dataset or a managed folder makes sense. If these are indeed logs (meaning stuff that is useful to understand what happened, but not required as long as everything is working OK), a dataset may be a slightly atypical choice.

We would indeed recommend the use of Python's "logging" library which is designed exactly for that: your program emits multiple log items with an associated "severity" or "level" (debug, info, warning, error), and then the library will emit these log items to various outputs, with possible severity filters.

The logging library has multiple outputs (called "handlers"). The default one is "console", meaning that it does the equivalent of "print", so it will appear directly in the notebook. But you could indeed create a "file" handler in Python logging that would write into a file in a managed folder.

Assuming that you are using a local managed folder, that could look like:

import logging
import os
import dataiku

logging.basicConfig(level=logging.DEBUG)
managed_folder_path = dataiku.Folder("myfolder").get_path()
logging.root.addHandler(logging.FileHandler(os.path.join(managed_folder_path, "log.log")))

# This will print both in the console and in log.log in the managed folder
logging.info("hello")
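
For the rollover mentioned in the question, the standard library's logging.handlers module provides RotatingFileHandler (size-based) and TimedRotatingFileHandler (time-based). The following is only a sketch: a temporary directory stands in for the managed folder path, and the logger name, size limit and backup count are illustrative assumptions, not recommended values.

```python
import logging
import logging.handlers
import os
import tempfile

# Stand-in for the managed folder path; in DSS this would come from
# dataiku.Folder("myfolder").get_path() as in the snippet above.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "log.log")

logger = logging.getLogger("etl.rotating")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger

# Roll over at roughly 1 KB and keep at most 3 old files (illustrative values).
size_handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1024, backupCount=3
)
logger.addHandler(size_handler)

for i in range(200):
    logger.debug("step %d finished", i)

size_handler.close()
log_files = sorted(os.listdir(log_dir))
print(log_files)  # log.log plus numbered backups such as log.log.1

# For one file per day, the time-based variant could be used instead:
# logging.handlers.TimedRotatingFileHandler(log_path, when="midnight", backupCount=30)
```

Old files beyond backupCount are deleted automatically, which covers the "trimming over time" part without extra code.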

You can also play with formats in logging in order to make sure to print timestamps, severities, ...
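
As a minimal sketch of such a format, the example below attaches a logging.Formatter to a handler. The logger name "etl" and the exact format string are assumptions chosen for illustration; an in-memory stream is used so the example runs anywhere, and a FileHandler accepts the same formatter.

```python
import io
import logging

# "etl" is a hypothetical logger name used only for this illustration.
logger = logging.getLogger("etl")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger

# Write to an in-memory stream here; a FileHandler takes the same formatter.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)
logger.addHandler(handler)

logger.info("loaded %d rows", 42)
formatted_line = stream.getvalue().strip()
print(formatted_line)  # timestamp, then "INFO etl: loaded 42 rows"
```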

However, for a process that you are going to repeat multiple times a day, we would recommend that you use DSS jobs rather than notebooks. DSS will automatically store the "console" output of each job, without need for you to manage it (but you would still use the logging library to dispatch by severity)


tgb417 · ‎01-30-2020

Maybe, I'm asking:

How does the "standard" Python logging library work within a DSS-hosted Jupyter Notebook?

Where does the resulting content go? Is there any value in using the standard library when working with DSS?

--Tom

tgb417 · ‎01-30-2020

@Clément_Stenac thanks for jumping in on this question.

I will check out the logging library.

I do intend to run the Python code as a Python code recipe.

Do I understand correctly from your note that anything printed to the console will show up in the job log? Or is it just stuff that gets emitted through the logging library that ends up in these logs?

When I sit down to code later today, I'll give this a try.

Thanks.

--Tom

tgb417 · ‎01-30-2020

Thanks. I've discovered that the Jupyter Notebook kernel has to be restarted in order to change levels.
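
A likely reason for this is how logging.basicConfig behaves in the standard library: it does nothing if the root logger already has handlers, so re-running the cell with a different level has no effect. The sketch below shows that behaviour and one way around it, setting the level on the root logger directly. On Python 3.8 and later, basicConfig(force=True) is another option.

```python
import logging

root = logging.getLogger()

# Start from a clean root logger so the demo behaves the same in a fresh
# interpreter and in a long-running kernel.
for h in list(root.handlers):
    root.removeHandler(h)

logging.basicConfig(level=logging.DEBUG)    # first call: configures the root logger
logging.basicConfig(level=logging.WARNING)  # second call: ignored, handlers already exist
level_after_second_call = root.level        # still logging.DEBUG

# Changing the level directly works without restarting the kernel.
root.setLevel(logging.WARNING)
level_after_set_level = root.level
```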

--Tom

tgb417 · ‎01-31-2020

Using the logging library in Python seems to work fairly well for me. It puts my logging output straight into the recipe's job results. I can change the logging level in the standard way for this library.

This is a good solution.

@Clément_Stenac thanks for your support.

--Tom

tgb417 · ‎02-03-2020

@Clément_Stenac

What is the default level of Logging in DSS?

I've set:

##setup logging level
import logging
logging.basicConfig(
    level=logging.WARNING
)

However, I'm getting the following in the logs, showing output down to Debug level.

I'm not going to need this level of detail at most times when running live.

In a Jupyter Notebook, the level seems to be followed as written.

I've decided that I don't need separate files; the standard logs should be good enough.
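
One possible explanation, which is an assumption and not confirmed in this thread, is that the root logger is already configured when the recipe code starts, in which case basicConfig(level=...) is silently ignored. A quick way to check is to inspect the root logger before configuring anything. The helper name describe_root_logger below is hypothetical.

```python
import logging

def describe_root_logger():
    """Return the root logger's level name and its handler class names."""
    root = logging.getLogger()
    return {
        "level": logging.getLevelName(root.level),
        "handlers": [type(h).__name__ for h in root.handlers],
    }

# Run this first in the recipe to see what is already set up.
print(describe_root_logger())

# If handlers are already listed, basicConfig(level=...) is a no-op;
# setting the level explicitly still takes effect.
logging.getLogger().setLevel(logging.WARNING)
root_state = describe_root_logger()
print(root_state)
```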

--Tom

pvannies · ‎09-17-2021

Hi @Clément_Stenac!
Thanks for the solution. Is there a way to write the user-defined logs of a Python job to a file, as described in your answer, when running the job inside Kubernetes? At the moment it raises a FileNotFoundError.
Cheers,
Pauline
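
A possible cause, stated here as an assumption, is that the managed folder is not available as a local directory inside the container, so the path returned by get_path() does not exist there. One workaround to consider is collecting the log in memory and uploading it through the folder API at the end of the run. In the sketch below the logger name and the ship_logs helper are hypothetical, and a plain dict stands in for the folder so the code runs outside DSS. The commented line shows the intended use with dataiku.Folder.upload_data, which should be checked against the documentation for your DSS version.

```python
import io
import logging

# Collect log records in memory instead of on the local filesystem.
log_buffer = io.StringIO()
buffer_handler = logging.StreamHandler(log_buffer)
buffer_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s")
)

logger = logging.getLogger("etl.container")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(buffer_handler)

def ship_logs(upload, path="log.log"):
    """Send everything logged so far through upload(path, data_bytes)."""
    buffer_handler.flush()
    upload(path, log_buffer.getvalue().encode("utf-8"))

logger.info("hello")

# In a DSS recipe the uploader would presumably be the folder's upload method:
#     ship_logs(dataiku.Folder("myfolder").upload_data)
# A plain dict stands in for the folder here so the sketch runs anywhere.
fake_folder = {}
ship_logs(fake_folder.__setitem__)
```

Each call replaces the target file, so a run-specific file name (for example one containing a timestamp) may be preferable if earlier logs must be kept.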
