The conversation starts…
…usually with the manager saying: “The users are complaining that the dashboards have not been updated since yesterday. Did we not get a load from x this morning?” You then reply, “I’ll have a look. It looks like we had a success email from the pipeline, so I’ll have to take a deeper look.” Then follows a trawl through various logs scattered across web servers, database tables and log files. In the end, you find out around lunchtime that not all of the files have arrived.
There are a number of factors here:
- The different developer groups all like to use different types of logging for their respective areas.
- This leads to a proliferation of different files and environments to look through, which takes time.
- A co-ordinated, central logging area helps speed up the initial error diagnosis.
- If the number of files expected were being monitored, the issue would have been picked up earlier (a minimal sketch of such a check follows this list).
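As a minimal sketch of that last point, here is roughly what an expected-file-count check could look like. The landing directory, file pattern and expected count are all hypothetical:

```python
from pathlib import Path

# Hypothetical values: feed "x" is expected to deliver 12 files each morning.
EXPECTED_FILE_COUNT = 12
LANDING_DIR = Path("/data/landing/feed_x")  # placeholder path

received = len(list(LANDING_DIR.glob("*.csv")))

if received < EXPECTED_FILE_COUNT:
    # In a real pipeline this would raise an alert (email, ticket, chat message)
    # rather than just printing, so the gap is spotted long before lunchtime.
    print(f"WARNING: only {received} of {EXPECTED_FILE_COUNT} expected files received")
```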
Another common scenario is a second job failing because a job it depends upon is still running. Originally the first job finished long before the second started, but the first job's run time has been creeping up for various reasons. If the overall run times were being captured, it would have been obvious that this was happening.
Centralised audit …
…I know this is not everyone’s cup of tea, but I have repeatedly found over the years that it pays dividends. The easiest option to implement is a database table. The minimum audit requirement is a start and an end audit event for each process. As we will see later, this gives us enough information to provide a basic understanding of how the various processes in the system are running.
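As a rough sketch of what that minimal table could look like, here is one possible shape, using SQLite purely for illustration. The table and column names are my own invention, not a prescribed schema; the real fields are covered in Part 2.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("audit.db")

# One row per audit event: at minimum a START and an END per process run.
conn.execute("""
    CREATE TABLE IF NOT EXISTS audit_log (
        audit_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        process     TEXT NOT NULL,   -- pipeline / process name
        event_type  TEXT NOT NULL,   -- 'START', 'END' or 'ERROR'
        event_time  TEXT NOT NULL,   -- UTC timestamp
        detail      TEXT             -- optional free-text detail, e.g. counts
    )
""")

def log_event(process: str, event_type: str, detail: str = "") -> None:
    """Write a single audit event; call at the start and end of a process."""
    conn.execute(
        "INSERT INTO audit_log (process, event_type, event_time, detail) "
        "VALUES (?, ?, ?, ?)",
        (process, event_type, datetime.now(timezone.utc).isoformat(), detail),
    )
    conn.commit()

# Typical usage around a pipeline run:
log_event("daily_load_feed_x", "START")
# ... the actual load happens here ...
log_event("daily_load_feed_x", "END", detail="records_loaded=125000")
```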
If your processes or pipelines are complex, it's sensible to add start and end events for each sub-process. Also very useful is the ability to capture the number of records and files as the data flows through the pipelines/processes; these can also be charted later. The example below shows a classic view of an audit log.
In the example above you can see a couple of processes running, with their subtasks/sub-processes also being audited. You can also see various counts being logged. The fields and their usage will be discussed in detail in Part 2 of the Audit Framework blog.
Add Visualisation …
Once you have your audit repository, you can query it for error events, or check when certain events happened (or didn’t) on a given day, each day of the week, and so on. That’s only the start: you can then run visualisations off the audit table and start looking for trends.
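A couple of example queries against the hypothetical audit_log table sketched above show the kind of questions you can start asking:

```python
import sqlite3

conn = sqlite3.connect("audit.db")

# 1. All error events logged today.
errors_today = conn.execute("""
    SELECT process, event_time, detail
    FROM audit_log
    WHERE event_type = 'ERROR'
      AND date(event_time) = date('now')
""").fetchall()

# 2. Processes that started today but have no END event yet
#    (either still running, or quietly dead).
unfinished = conn.execute("""
    SELECT s.process, s.event_time AS started_at
    FROM audit_log s
    WHERE s.event_type = 'START'
      AND date(s.event_time) = date('now')
      AND NOT EXISTS (
            SELECT 1 FROM audit_log e
            WHERE e.process = s.process
              AND e.event_type = 'END'
              AND e.event_time > s.event_time)
""").fetchall()

print("Errors today:", errors_today)
print("Still unfinished:", unfinished)
```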
I have knocked up some example visualisations below of what these could look like. You can see how long the daily runs (sessions) have taken, the number of files received each day (the straight blue line on the left), and the record count of the data received. Below that are the timings of the different stages of a pipeline. These stages may use differing technologies (Azure Databricks, for example), so a change in a stage's timing may be a sign of a problem with that technology. Microsoft had issues with their Databricks architecture early in the year, for example.
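As a hedged sketch (again assuming the hypothetical audit_log table from earlier, plus pandas and matplotlib), the daily-run-time chart could be produced along these lines:

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("audit.db")

# Pair each START event with the first END event that follows it, and work
# out the run time in minutes for each process on each day.
durations = pd.read_sql_query("""
    SELECT s.process,
           date(s.event_time) AS run_date,
           (julianday(e.event_time) - julianday(s.event_time)) * 24 * 60 AS minutes
    FROM audit_log s
    JOIN audit_log e
      ON e.process = s.process
     AND e.event_type = 'END'
     AND e.event_time = (SELECT min(event_time) FROM audit_log
                         WHERE process = s.process
                           AND event_type = 'END'
                           AND event_time > s.event_time)
    WHERE s.event_type = 'START'
""", conn)

# One line per process showing how its daily run time trends over time.
for process, grp in durations.groupby("process"):
    plt.plot(grp["run_date"], grp["minutes"], label=process)

plt.xlabel("Date")
plt.ylabel("Run time (minutes)")
plt.title("Daily session run times")
plt.legend()
plt.show()
```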
In addition to the detailed picture of a single feed, you can collate the sessions for all pipelines/processes that are running in a day. This tells you where you are: what has run, what has errored, and what is still running. Again, this can be used to see if you have any underlying issues…
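One simple way to derive that daily overview from the same hypothetical audit_log table is to work out a status per process:

```python
import sqlite3

conn = sqlite3.connect("audit.db")

# For every process with a START event today, derive a simple status:
# Error, Successful or Running. Processes with no audit rows at all simply
# won't appear, which is the "no audit present" case in the chart below.
rows = conn.execute("""
    SELECT s.process,
           CASE
             WHEN EXISTS (SELECT 1 FROM audit_log x
                          WHERE x.process = s.process
                            AND x.event_type = 'ERROR'
                            AND date(x.event_time) = date('now')) THEN 'Error'
             WHEN EXISTS (SELECT 1 FROM audit_log x
                          WHERE x.process = s.process
                            AND x.event_type = 'END'
                            AND x.event_time > s.event_time)      THEN 'Successful'
             ELSE 'Running'
           END AS status
    FROM audit_log s
    WHERE s.event_type = 'START'
      AND date(s.event_time) = date('now')
""").fetchall()

for process, status in rows:
    print(f"{process:<30} {status}")
```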
The colour coding for the chart below could be:
- Grey – No audit present
- Red – Error
- Blue – Successful
- Yellow – Currently running
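That mapping is also easy to express in code when drawing the chart (the names here are purely illustrative):

```python
# Illustrative status-to-colour mapping for the daily status chart.
STATUS_COLOURS = {
    "No audit":   "grey",
    "Error":      "red",
    "Successful": "blue",
    "Running":    "yellow",
}

def colour_for(status: str) -> str:
    # Anything unrecognised falls back to grey, i.e. "no audit present".
    return STATUS_COLOURS.get(status, "grey")
```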
So this gives an idea of how useful well-structured auditing can be. In Part 2 I will focus on how to implement the audit log, and Part 3 will deal with the visualisation side.
Hope you have enjoyed it! You can leave comments below or use the Contact form to send me a message.