Jupyter Notebook and python logging in DSS

Solved!
tgb417

I've got a Jupyter Notebook set up to do a series of ETL and writeback tasks that I want to log.

I'm newish to both Python and Dataiku.

Wondering what approach folks would suggest for managing the output from a Jupyter Notebook in the DSS environment.

I currently have the results showing up on screen through a number of print statements in a cell. They look sort of like this:

(screenshot of sample print output)

Note that the results from each step of the process are different, so a fixed schema for each data element may be hard.

Trimming over time would be useful. This job might run from 1 to, say, 144 times a day, indefinitely. Early on, I'm looking for detailed logs. Later on, I'll likely want to reduce the flow, and I may eventually establish logging levels. What you see above is roughly a debug level.

The volume will be relatively light: hundreds to maybe thousands of text rows when in debug mode like this, and maybe tens to hundreds of rows when running in a summary mode.

Questions:

  1. What sort of storage approach, managed by DSS, would you suggest?
    1. Database rows with a simple schema, say datestamp and output?
    2. A file folder managed by DSS?
      1. Either as logs that grow to a certain size and roll over, or one file per day?
  2. Are there Python libraries designed to make this easier?
    1. How do those libraries interact with DSS?
  3. Any other thoughts would be helpful.

Thanks.

--Tom
1 Solution
Clément_Stenac
Dataiker

Hi,

Both storing in a dataset and storing in a managed folder make sense. If these are indeed logs (meaning information that is useful for understanding what happened, but not required as long as everything is working OK), a dataset may be a slightly atypical choice.

We would indeed recommend Python's "logging" library, which is designed exactly for that: your program emits log items with an associated "severity" or "level" (debug, info, warning, error), and the library then dispatches these log items to various outputs, with optional severity filters.
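
For illustration, a minimal sketch of that mechanism (the logger name and messages are just placeholders):

import logging

# Only items at INFO level or above pass the filter; DEBUG is suppressed.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl")

logger.debug("row counts per step ...")                # filtered out at INFO level
logger.info("step 1 finished")                         # shown
logger.warning("writeback took longer than expected")  # shown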

The logging library supports multiple outputs. The default one is the console, meaning it does the equivalent of print, so the output appears directly in the notebook. But you could indeed add a file handler in Python logging that writes into a file in a managed folder.

Assuming that you are using a local managed folder, that could look like:

import logging
import os
import dataiku

logging.basicConfig(level=logging.DEBUG)
managed_folder_path = dataiku.Folder("myfolder").get_path()
logging.root.addHandler(logging.FileHandler(os.path.join(managed_folder_path, "log.log")))

logging.info("hello")  # This will print both in the console and in log.log in the managed folder

You can also play with formats in the logging library to make sure timestamps, severities, etc. are printed.
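
For example, a sketch that attaches a formatter to the file handler (reusing the hypothetical "myfolder" managed folder from above):

import logging
import os
import dataiku

# One possible format: timestamp, severity, logger name, message
formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s - %(message)s")

managed_folder_path = dataiku.Folder("myfolder").get_path()
file_handler = logging.FileHandler(os.path.join(managed_folder_path, "log.log"))
file_handler.setFormatter(formatter)
logging.root.addHandler(file_handler)

logging.warning("writeback done")  # e.g. "2020-02-03 10:02:43,090 WARNING root - writeback done"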

However, for a process that you are going to run multiple times a day, we would recommend using DSS jobs rather than notebooks. DSS automatically stores the console output of each job, without any need for you to manage it (and you would still use the logging library to dispatch by severity).
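
A minimal sketch of what that could look like in a Python recipe (the dataset names are placeholders); anything emitted through the logging library ends up in the job's log:

import logging
import dataiku

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl_recipe")

# Hypothetical input and output datasets
input_df = dataiku.Dataset("source_data").get_dataframe()
logger.info("read %d rows from source_data", len(input_df))

# ... ETL / writeback steps would go here ...

dataiku.Dataset("etl_output").write_with_schema(input_df)
logger.info("wrote %d rows to etl_output", len(input_df))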


7 Replies
tgb417
Author

Maybe what I'm asking is:

How does the "standard" Python logging library work within a DSS-hosted Jupyter Notebook?

Where does the resulting content go? Is there any value to using the standard library when working with DSS?

--Tom


tgb417
Author

@Clément_Stenac thanks for jumping in on this question.

I will check out the logging library.

I do intend to run the Python code as a Python code recipe.

Do I understand correctly from your note that anything printed to the console will show up in the job log? Or is it just what gets emitted through the logging library that ends up in these logs?

When I sit down to code later today, I'll give this a try.

Thanks. 

--Tom
tgb417
Author

Thanks. I've discovered that the Jupyter Notebook kernel has to be restarted in order to change logging levels.
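
For reference, one workaround that may avoid the restart (a sketch, not something I have fully verified): basicConfig() does nothing once handlers are already attached to the root logger, but the level can still be changed directly.

import logging

# basicConfig() is a no-op if the root logger already has handlers,
# but the root logger's level can still be changed at any time:
logging.getLogger().setLevel(logging.WARNING)

# Handlers apply their own level filter too, so adjust them as well if needed:
for handler in logging.getLogger().handlers:
    handler.setLevel(logging.WARNING)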

 

--Tom
tgb417
Author

Using the logging library in Python seems to work fairly well for me. It sends my logging results straight to the recipe's job log, and I can change the level of logging in the standard way for this library.

This is a good solution.

@Clément_Stenac thanks for your support.

--Tom

tgb417
Author

@Clément_Stenac 

What is the default level of logging in DSS?

I've set:

# set up logging level
import logging
logging.basicConfig(level=logging.WARNING)

However, I'm getting the following in the logs, showing messages down to the DEBUG level:

(screenshot of job log output)

I'm not going to need this level of detail most of the time when running live.

In a Jupyter Notebook, the level setting seems to be respected as written.

I've decided that I don't need separate files; the standard logs should be good enough.

--Tom

Hi @Clément_Stenac!
Thanks for the solution. Is there a way to write the user-defined logs of a Python job to a file as described above when running the job inside Kubernetes? Currently I get a FileNotFoundError.
Cheers,
Pauline
