Files in Folders Schema Inconsistencies

adamnieto
adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron

I am currently working on a DSS project that is pulling DSS Event Server audit logs in order to help compute DSS usage metrics from cloud resources.

Problem I am encountering:

I am experiencing an issue with the "Files in Folders" feature, as seen below. It seems to look only at the first non-empty file in the filesystem tree; however, that file does not contain an exhaustive set of the different types of events.

How can I make DSS recognize the full schema of the DSS events that come from the Event Server? Does anyone know the full schema for each of the event topics? Can I somehow force DSS to read this schema and propagate it fully across the pipeline? Right now it bases the schema on the first non-empty file it finds and propagates that into the data sync recipe, as seen in the picture below:

example.PNG

Essentially, does anyone at Dataiku have the data dictionaries or full schema/all potential fields for the clientEvents of each topic described in this part of the documentation: https://doc.dataiku.com/dss/latest/operations/audit-trail/data.html?highlight=events.
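In the absence of a published data dictionary, one workaround is to compute the union of all fields yourself rather than rely on the first non-empty file. The following is only a minimal sketch, not a Dataiku API: it assumes the audit files contain one JSON object per line, and the function names and the `*.log` file pattern are hypothetical and should be adjusted to your folder layout.

```python
import json
from pathlib import Path


def flatten_keys(obj, prefix=""):
    """Yield a dotted path for every leaf value in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            yield from flatten_keys(value, path)
        else:
            yield path


def union_schema(folder, pattern="*.log"):
    """Return the sorted union of dotted field names across all files.

    Assumption: each non-empty line is a standalone JSON object.
    Lines that fail to parse are skipped rather than raising.
    """
    columns = set()
    for file in sorted(Path(folder).rglob(pattern)):
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue
                if isinstance(event, dict):
                    columns.update(flatten_keys(event))
    return sorted(columns)
```

The resulting list could then be used to set the dataset schema by hand, so that it no longer depends on which file happens to be read first.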

Thank you for your help!

Answers

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron
    edited July 17

    @adamnieto

    +1 here. I think I've tried to do something similar: I'm trying to evaluate the usage of a design node.

    I've got a server filesystem connection, audit_logs, pointing to a file path something like ~/dss/run/audit.

    There I find some 130+ useful data columns that can be parsed as JSON.

    I'm also finding that the schema changes in some way. When I use downstream visual recipes to do basic cleanup and load the data into a SQL database for reporting, I often end up with schema problems.

    For example, just now when trying to refresh the dataset I got:

    The current schema of output dataset(s) doesn't match what the recipe outputs.

    Audit_Logs_prepared

    Column name mismatch at position 71 ("message.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS" was called "message.computeResourceUsage.localProcess.cpuChildrenSystemTime" previously).

    That said, the error above is not the only one I get, but all of them seem to be related to schema changes that DSS is seeing in the audit logs.
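To see exactly what changed between two runs, the previous and current column lists can be compared outside DSS. This is a minimal sketch with a hypothetical helper name, and it assumes both column lists have already been exported as plain Python lists. Positions here are 0-based, which may not match the numbering DSS uses in its error message.

```python
from itertools import zip_longest


def schema_diff(previous, current):
    """Compare two ordered column lists.

    Returns the columns added and removed, plus the first position at
    which the two lists disagree (None if they are identical).
    """
    first_mismatch = None
    for position, (old, new) in enumerate(zip_longest(previous, current)):
        if old != new:
            first_mismatch = (position, old, new)
            break
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "first_mismatch": first_mismatch,
    }


# Example using the two column names from the error message above.
before = ["message.computeResourceUsage.localProcess.cpuChildrenSystemTime"]
after = ["message.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS"]
print(schema_diff(before, after))
```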

  • adamnieto
    adamnieto Neuron 2020, Neuron, Registered, Neuron 2021, Neuron 2022, Neuron 2023 Posts: 87 Neuron

    @tgb417, thank you for your input! I wonder if Dataiku could add to their documentation, or provide an example of, all the properties an event can have for each topic, so that as we build our flow we can make sure we are not losing any columns. So far I have 52 columns for the "compute-resource-usage" topic, so maybe I am losing a lot more than I originally thought?
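One way to check how many columns each topic really has is to group the flattened field names by topic across all events already loaded in memory. This is a minimal sketch only: the helper names are hypothetical, and it assumes each event is a dict carrying its topic under a top-level `topic` key, which should be verified against your own data.

```python
from collections import defaultdict


def leaf_paths(obj, prefix=""):
    """Yield a dotted path for every leaf value in a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            yield from leaf_paths(value, path)
        else:
            yield path


def columns_by_topic(events, topic_key="topic"):
    """Map each topic to the sorted union of fields seen for it.

    Assumption: `topic_key` names the field holding the event topic;
    events without it are grouped under "<unknown>".
    """
    columns = defaultdict(set)
    for event in events:
        topic = event.get(topic_key, "<unknown>")
        columns[topic].update(leaf_paths(event))
    return {topic: sorted(cols) for topic, cols in columns.items()}
```

Comparing `len(columns)` per topic against the column count DSS detected would show whether fields are being dropped.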

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,595 Neuron

    @adamnieto,

    I would not be surprised if I'm doing something silly.
