Jupyter Notebook and Python logging in DSS

tgb417

I've got a Jupyter Notebook set up to do a series of ETL and writeback tasks that I want to log.

I'm newish to both Python and Dataiku.

Wondering what approach folks would suggest for managing the output from a Jupyter Notebook in the DSS environment.

I currently have the results showing up on screen through a number of print statements in a cell. They look something like this:

[Screenshot: sample print output from the notebook]

Note that the results from each step of the process are different, so a fixed schema for each data element may be hard.

Trimming over time would be useful. This job might run from 1 to, say, 144 times a day, indefinitely. Early on I'm looking for detailed logs; later on, I'll likely want to reduce the flow. I may eventually establish logging levels. What you see above is likely a debug level.

The volume will be relatively light: hundreds to maybe thousands of text rows when in debug mode like this, and maybe tens to hundreds of rows when running in a summary mode.

Questions:

  1. What sort of storage approach, managed by DSS, would be suggested?
    1. Database rows with a simple schema, say a datestamp and the output?
    2. A file folder somehow managed by DSS?
      1. Either as logs that grow to a certain size and roll over, or as one file per day?
  2. Are there Python libraries designed to make this easier?
    1. How do those libraries interact with DSS?
  3. Any other thoughts would be helpful.

Thanks.

--Tom ...

Best Answer

  • Clément_Stenac
    Dataiker · Answer ✓

    Hi,

    Both storing in a dataset and storing in a managed folder make sense. If these are indeed logs (meaning information that is useful for understanding what happened, but not required as long as everything is working OK), a dataset may be a slightly atypical choice.

    We would indeed recommend Python's "logging" library, which is designed exactly for this: your program emits log items with an associated "severity" or "level" (debug, info, warning, error), and the library then dispatches these items to various outputs, with optional severity filters.

    The logging library supports multiple outputs. The default one is the "console", meaning that it does the equivalent of "print", so messages appear directly in the notebook. But you could indeed add a "file" handler in Python logging that writes into a file in a managed folder.

    Assuming that you are using a local managed folder, that could look like:

    import logging
    import os
    import dataiku

    logging.basicConfig(level=logging.DEBUG)
    managed_folder_path = dataiku.Folder("myfolder").get_path()
    logging.root.addHandler(logging.FileHandler(os.path.join(managed_folder_path, "log.log")))

    logging.info("hello")  # This will print both in the console and in log.log in the managed folder

    You can also play with formats in logging to make sure that timestamps, severities, etc. are printed.
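
    For example, a minimal sketch of a format configuration (the format string below is just one possible choice) could be:

    import logging

    # Include a timestamp, the severity and the logger name in every line
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] %(name)s - %(message)s"
    )

    logging.info("ETL step finished")
    # -> e.g. 2020-01-30 12:34:56,789 [INFO] root - ETL step finished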

    However, for a process that you are going to repeat multiple times a day, we would recommend using DSS jobs rather than notebooks. DSS automatically stores the "console" output of each job, without any need for you to manage it (but you would still use the logging library to dispatch by severity).
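
    In a Python recipe, a minimal sketch could look like the following (the logger name is only an example); DSS captures the recipe's console output in the job log, so no file handler is needed:

    import logging

    # Console-only configuration: DSS stores the recipe's console output
    # in the job log, so there is nothing else to manage.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("etl_writeback")

    logger.info("starting writeback step")
    logger.warning("row count lower than expected")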

Answers

  • tgb417

    Maybe what I'm asking is:

    How does the "standard" Python logging library work within a DSS-hosted Jupyter Notebook?

    Where does the resulting content go? Is there any value to using the standard library when working with DSS?

    --Tom

  • tgb417

    @Clément_Stenac
    thanks for jumping in on this question.

    I will check out the logging library.

    I do intend to run the Python code as a Python code recipe.

    Do I understand correctly from your note that anything printed to the console will show up in the job log? Or is it just what gets emitted through the logging library that ends up in these logs?

    When I sit down to code later today, I'll give this a try.

    Thanks.

  • tgb417

    Thanks. I've discovered that Jupyter Notebook kernels have to be restarted in order to change levels.
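
    A sketch of a possible workaround, assuming the cause is that logging.basicConfig() is a no-op once the root logger already has handlers:

    import logging

    # basicConfig() only configures the root logger the first time it runs,
    # so change the level on the root logger directly instead:
    logging.getLogger().setLevel(logging.INFO)

    # On Python 3.8+, basicConfig() can also be forced to reconfigure:
    # logging.basicConfig(level=logging.INFO, force=True)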

  • tgb417

    Using the logging library in Python seems to work fairly well for me. It puts my logging output straight into the recipe's job log, and I can change the logging level in the standard way for this library.

    This is a good solution.

    @Clément_Stenac
    thanks for your support.

    --Tom

  • tgb417

    @Clément_Stenac

    What is the default logging level in DSS?

    I've set:

    # Set up the logging level
    import logging
    logging.basicConfig(level=logging.WARNING)

    However, I'm getting the following in the logs, showing messages all the way down to the DEBUG level:

    [Screenshot: job log showing DEBUG-level messages]

    I'm not going to need this level of detail most of the time when running live.

    In a Jupyter Notebook, the level seems to be honored as written.

    I've decided that I don't need separate files; the standard logs should be good enough.
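
    For reference, a sketch of one thing to try, assuming the extra messages come from a root logger (or chatty third-party loggers) configured before the recipe code runs, which would make basicConfig() a no-op:

    import logging

    # If the root logger was configured before this code runs,
    # basicConfig() is silently ignored; set the level explicitly instead.
    logging.getLogger().setLevel(logging.WARNING)

    # Individual noisy loggers can also be quieted by name,
    # e.g. (the logger name here is only an example):
    logging.getLogger("requests").setLevel(logging.WARNING)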

  • pvannies

    Hi @Clément_Stenac!
    Thanks for the solution. Is there a way to write the user-defined logs of a Python job to a file, as described in the accepted answer, when running the job inside Kubernetes? Right now I get a FileNotFoundError.
    Cheers,
    Pauline
