
Files in Folders Schema Inconsistencies

adamnieto
Neuron

I am currently working on a DSS project that is pulling DSS Event Server audit logs in order to help compute DSS usage metrics from cloud resources. 

Problem I am encountering: 

I am experiencing an issue with the "Files in Folders" feature, as seen below. It seems to infer the schema from the first non-empty file in the filesystem tree; however, that file does not contain an exhaustive set of the different event types.

How can I make DSS recognize the full schema of the DSS events that come from the event server? Does anyone know the full schema for each of the event topics? Can I somehow force DSS to read this schema and propagate it fully across the pipeline? Right now it bases the schema on the first non-empty file it finds and propagates that through the data sync recipe, as seen in the picture below:

example.PNG

Essentially, does anyone at Dataiku have the data dictionaries, or the full schema with all potential fields, for the clientEvents of each topic described in this part of the documentation: https://doc.dataiku.com/dss/latest/operations/audit-trail/data.html?highlight=events
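
In the meantime, one way to see every field that actually occurs, independent of how DSS infers the schema, is to scan all the files and take the union of the flattened keys. This is only a minimal sketch: it assumes the files contain one JSON object per line, and the `*.log` pattern and the helper names `flatten` and `union_of_columns` are hypothetical, not part of DSS.

```python
import json
from pathlib import Path


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    out = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out


def union_of_columns(folder, pattern="*.log"):
    """Scan every matching file under `folder` (recursively) and return the
    sorted union of flattened keys seen in any record.

    Assumption: one JSON object per line; `pattern` is a placeholder.
    """
    columns = set()
    for path in Path(folder).rglob(pattern):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip lines that are not valid JSON
                if isinstance(record, dict):
                    columns.update(flatten(record))
    return sorted(columns)
```

The resulting list only covers fields present in the files scanned, so it is a lower bound on the real schema rather than an official data dictionary.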

Thank you for your help!

3 Replies
tgb417
Neuron

@adamnieto 

+1 here. I think I've tried to do something similar: I'm trying to evaluate the usage of a design node.

I've got a "Server's Filesystem" connection, audit_logs, pointing to a file path something like ~/dss/run/audit.

And I find some 130+ great data columns that can be parsed as JSON.

And I'm finding that the schema changes in some way. When I use downstream visual recipes to do basic cleanup and load the data into a SQL database for reporting, I often end up with schema problems when I try to load the dataset.

For example, just now when trying to refresh the dataset I got:

The current schema of output dataset(s) doesn't match what the recipe outputs.

Audit_Logs_prepared

Column name mismatch at position 71 ("message.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS" was called "message.computeResourceUsage.localProcess.cpuChildrenSystemTime" previously).

That said, the error message above is not the only one I see, but all of them seem to be related to schema changes that DSS is seeing in the audit logs.
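
To pin down which columns moved between two runs, the old and new column lists can be compared outside DSS. The following is just a sketch: `schema_diff` is a hypothetical helper, the column lists would have to be exported by hand, and positions are zero-based here, which may not match how DSS numbers them.

```python
def schema_diff(previous, current):
    """Compare two ordered column lists.

    Returns (added, removed, mismatches), where mismatches is a list of
    (position, old_name, new_name) for positions whose names differ.
    """
    prev_set, cur_set = set(previous), set(current)
    added = [col for col in current if col not in prev_set]
    removed = [col for col in previous if col not in cur_set]
    mismatches = [
        (pos, old, new)
        for pos, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]
    return added, removed, mismatches


# Example using the two column names from the error message above.
old_cols = ["message.computeResourceUsage.localProcess.cpuChildrenSystemTime"]
new_cols = ["message.computeResourceUsage.localProcess.cpuChildrenSystemTimeMS"]
added, removed, mismatches = schema_diff(old_cols, new_cols)
```

A column that appears in both `added` and `mismatches` suggests a rename or a newly seen field rather than a simple reordering.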

--Tom
adamnieto
Neuron
Author

@tgb417, thank you for your input! I wonder if Dataiku could add to their documentation, or provide an example of, all of the properties an event can have for each topic, so that we can be sure we are not losing any columns as we build our flow. So far, for the "compute-resource-usage" topic I have 52 columns, so maybe I am losing a lot more than I originally thought?
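
To check the per-topic counts myself, I could group the observed columns by topic. A rough sketch, assuming each record has already been flattened into a dict and carries its topic under a key; the key name `"topic"` and the function `columns_by_topic` are my assumptions, not something from the DSS documentation.

```python
from collections import defaultdict


def columns_by_topic(records, topic_key="topic"):
    """Group the column names seen in flattened records by their topic.

    `records` is any iterable of dicts; records without `topic_key`
    are collected under "<unknown>".
    """
    by_topic = defaultdict(set)
    for record in records:
        topic = record.get(topic_key, "<unknown>")
        by_topic[topic].update(key for key in record if key != topic_key)
    return {topic: sorted(cols) for topic, cols in by_topic.items()}


def column_counts(records, topic_key="topic"):
    """Return {topic: number of distinct columns observed for that topic}."""
    grouped = columns_by_topic(records, topic_key)
    return {topic: len(cols) for topic, cols in grouped.items()}
```

Comparing these counts over a longer time window against what the flow currently carries would show whether columns are being dropped.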

tgb417
Neuron
Neuron

@adamnieto ,

I would not be surprised if I'm doing something silly.

--Tom