Contingency for Audit Logs (EventServer, Kafka, etc.)

daniel_adornes · February 2022

Hi everyone!

At my company, we had some questions about our audit logging configuration for our APIs running in production. The discussion might be a valuable addition to this community, and evolved as follows:

1. How we were set up since we started using Dataiku DSS:

a. Our Design Node had an Event Server installed and properly receiving and saving audit logs.

b. Our other two nodes (Automation Dev node and Automation Prod node) had no Event Server running on them.

2. The need that we identified:

a. We have already deployed some APIs in production. They constantly receive requests (up to ~100 requests per minute), so they produce audit logs and send them to our Event Server (configured on our Design Node), which in turn saves the logs to our logs storage.

b. This setup means that whenever the Design Node is down, for any reason, we lose the audit logs produced in the meantime. For example, we may need to restart DSS on the Design Node, or restart the VM to upgrade its hardware, and either would lose audit logs.
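To get a rough sense of what is at stake, here is a back-of-the-envelope estimate in Python. It assumes one audit event per API request at the ~100 requests per minute peak mentioned above; the outage durations are made-up examples, not measurements from our nodes.

```python
# Rough estimate of audit events lost while the Event Server host is down.
# Assumption: one audit event per API request (not verified for every setup).

PEAK_REQUESTS_PER_MINUTE = 100  # peak rate quoted in this post


def events_lost(downtime_minutes: float,
                requests_per_minute: float = PEAK_REQUESTS_PER_MINUTE) -> int:
    """Return the approximate number of audit events dropped during an outage."""
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes must be non-negative")
    return round(downtime_minutes * requests_per_minute)


if __name__ == "__main__":
    # Hypothetical outage durations, in minutes.
    for label, minutes in [("DSS restart", 5), ("VM resize", 30)]:
        print(f"{label}: ~{events_lost(minutes)} events lost")
```

Even short maintenance windows add up to hundreds or thousands of missing events at this rate, which is why we wanted a contingency.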

3. Plan A (discarded):

a. We saw that audit logging configs offer the possibility to post messages to a Kafka topic. We haven't evaluated it to the end, but Azure EventHub might be able to process such messages as well.

b. Kafka/EventHub is naturally a temporary place for messages, which implies that something else should consume those messages from there and save them to their final destination (our logs storage).

c. As the idea is to eliminate this link/dependency on the Design Node, it would make no sense to build a batch job on DSS to read those logs from EventHub and save them to our logs storage. Instead, we would need some software engineering team to develop some service somewhere else to do this job. In summary, more points of failure, more things to orchestrate, more unnecessary complexity.

d. Also, if we evaluated this setup in parallel, we would potentially have redundant logs being saved to our logs storage, which would compromise data quality.

4. Plan B (the one we chose):

a. We installed an EventServer in our Automation Prod node, configured to save the logs to the same destination configured in the EventServer running in our Design Node, thus being a perfect contingency.

b. We tested deploying an API that sends logs to the EventServer running in our Design Node, and then redeploying the same API with its logs pointing to the EventServer running in our Automation Node.

c. The assumption was that as the API replicas were replaced in the Kubernetes cluster, the new running docker containers would start sending the logs to the new URL (the EventServer running in the Automation Node). We tested and validated this assumption, both by checking the event server logs in both nodes and by checking the data being saved to the logs storage.

d. With this setup, we should now be able to redeploy our real business APIs with a new configuration pointing to the other instance of EventServer and then do whatever we want to do with the Design node, without losing any audit logs.

5. Questions to answer:

a. Compared to the know-how from other teams using Dataiku DSS, is this a good approach? We humbly consider that we might be missing something.

b. Do larger clients also use EventServer? A note here is that the official documentation from Dataiku states: "The Event Server is not highly-available nor horizontally scalable. It should however adequately serve the needs of most customers, and can handle thousands of events per second." We are currently nowhere near "thousands of events per second", but we may be in the future.
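As a rough sanity check on that gap, the sketch below compares our peak rate with the documented figure. It assumes one audit event per request and reads "thousands of events per second" conservatively as 1,000 events per second; that number is my assumption, not a published limit.

```python
# Compare the current peak load with an assumed Event Server capacity.
# ASSUMED_CAPACITY_EPS is a conservative reading of "thousands of events per
# second" from the documentation quote, not an official figure.

ASSUMED_CAPACITY_EPS = 1000.0  # events per second (assumption)


def events_per_second(requests_per_minute: float) -> float:
    """Convert a per-minute request rate into events per second."""
    return requests_per_minute / 60.0


def headroom_factor(requests_per_minute: float,
                    capacity_eps: float = ASSUMED_CAPACITY_EPS) -> float:
    """Return how many times the current load fits into the assumed capacity."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return capacity_eps * 60.0 / requests_per_minute


if __name__ == "__main__":
    rpm = 100  # peak rate quoted in this post
    print(f"Current load: {events_per_second(rpm):.2f} events/s")
    print(f"Headroom: about {headroom_factor(rpm):.0f}x")
```

Under these assumptions our traffic would have to grow by a few hundred times before throughput became the concern, so availability is the more pressing issue for us today.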

The above questions were successfully answered by Dataiku support, but I am opening them to the community here so the discussion can help others and possibly attract new suggestions!

Thank you all!

daniel_adornes · February 2022

Hey @HarizoR!

Thank you for the answer here. However, this wouldn't be a proper solution to our main need (bullet point #2).

If we considered only the EventHub/Kafka approach, it would certainly be better to create a consumption pipeline in DSS than to create (and maintain) an external service to read from the topic and save to the final logs storage.

The reason we discarded this approach (a streaming data flow in DSS) is that it wouldn't solve the need detailed in bullet point #2, since it would still imply a dependency on a DSS node.

The redundant EventServer is the solution that we ended up using, and it served our current needs well (we wanted to upgrade the VM, which is now done, and next month we want to upgrade to DSS 10).

Our consultancy team at Dataiku suggested that a good long-term approach would be to have a dedicated node for the EventServer. This would isolate it completely from the other nodes, so that no upgrade of those nodes would affect the audit logs.

Thank you again for engaging here! I hope this is overall a valuable thread for the community!

Cheers

HarizoR · February 2022

Hi Daniel, thanks for this insightful feedback.

Regarding question 5b, as of today we have not received any feedback about scaling issues with the Event Server, which remains the simplest solution to implement.

You mentioned in 3. the possibility of targeting a message broker like Kafka, which would be an alternative if one day the load of your incoming messages goes beyond the capacity of the Event Server. Having to manually save the logs from a Kafka topic to your log storage would indeed be cumbersome. However, since DSS 9 you can handle streaming-based items in your DSS Flow. For example, you could create a streaming endpoint based on the Kafka topic where your logs are collected, then create a Continuous Sync recipe to save the log messages to a regular Dataset.

If you are interested, we published a short blog post to summarise the streaming capabilities of DSS.

Best,

Harizo

Tsen-Hung · March 2022

To recap, I believe the question is whether a continuous recipe built on a streaming endpoint can really ensure that we still consume the audit logs when we restart the instance. Put simply: will the streaming job be turned off by the instance restart?

daniel_adornes · March 2022

Hey Tsen-Hung!

I understand that the first component to ensure that logs are not lost would be the Kafka/EventHub topic, which can be configured to retain messages for 24 hours or longer.

This way, if DSS was down for some time, the logs would still be saved by the messaging component (Kafka/EventHub), and once DSS is up again, the streaming project would continue from where it stopped.
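To illustrate the reasoning, here is a toy in-memory model in Python. It is not Kafka or DSS code: the `Topic` and `Consumer` classes and the one-event-per-hour rate are hypothetical, and it simply assumes that the consumer resumes from the last offset it processed, which is the behaviour I am expecting rather than something I have verified in DSS.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy stand-in for a Kafka/EventHub topic with time-based retention."""
    retention_hours: float
    messages: list = field(default_factory=list)  # (offset, produced_at_hour, payload)
    next_offset: int = 0

    def produce(self, now_hours: float, payload: str) -> None:
        self.messages.append((self.next_offset, now_hours, payload))
        self.next_offset += 1

    def expire(self, now_hours: float) -> None:
        """Drop messages older than the retention window."""
        self.messages = [m for m in self.messages
                         if now_hours - m[1] <= self.retention_hours]

    def read_from(self, offset: int) -> list:
        return [m for m in self.messages if m[0] >= offset]


@dataclass
class Consumer:
    """Toy consumer that remembers the next offset it needs to read."""
    committed_offset: int = 0
    saved: list = field(default_factory=list)

    def poll(self, topic: Topic) -> None:
        for offset, _, payload in topic.read_from(self.committed_offset):
            self.saved.append(payload)
            self.committed_offset = offset + 1


def simulate(outage_hours: int, retention_hours: float = 24) -> list:
    """Return the payloads saved after an outage of the given length."""
    topic, consumer = Topic(retention_hours), Consumer()
    topic.produce(0, "event-0")
    consumer.poll(topic)  # consumer is up at hour 0
    for hour in range(1, outage_hours + 1):  # consumer is down, producer is not
        topic.produce(hour, f"event-{hour}")
    topic.expire(outage_hours)  # retention is applied before the consumer returns
    consumer.poll(topic)  # consumer resumes from its committed offset
    return consumer.saved


if __name__ == "__main__":
    print(len(simulate(outage_hours=3)), "events saved after a 3-hour outage")
    print(len(simulate(outage_hours=30)), "events saved after a 30-hour outage")
```

In this model nothing is lost as long as the outage is shorter than the retention window; once the outage exceeds it, the oldest messages expire before the consumer comes back.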

@HarizoR may be better placed to complement or correct our understanding here.
