Written by Abdur-Rahman Patel
I have recently come across an organisation where all of their software, whether COTS or built in-house, is free of bugs. Their LAN has quadruple redundancy and never fails them, not even a single node. They guarantee zero downtime on every piece of hardware and software within the organisation, and I’m not talking about mission-critical systems only. They also guarantee zero downtime on all external partner systems. Data is always perfectly in sync and up to date across the enterprise. When implementing, their ‘super technicians’ have a 100% record of going into production with the correct configuration across all components. Basically, they never make mistakes. Timeout, LAN latency, slow response? Not in their vocabulary.
“Yeah right!” I heard you say about 30 seconds ago. Not even Chuck Norris can pull that off. So it seems obvious, then, that we don’t live in a perfect world, right?
In the real world, things are bound to go wrong, even if we have near bug-free software. Hardware may fail, even non-IT infrastructure may fail us, and data is never going to be 100% clean, perfect and always available.
Consider that within any large organisation, there are a number of complex orchestrated processes that can be executed in a seemingly infinite number of permutations. If that’s not bad enough, add an infinite range of possible data values and data combinations. Throw in regression caused by constant change within the enterprise. It’s impossible for any architect, designer, analyst, developer or tester to foresee, let alone cater for, every possible real-life scenario.
Do you see now why hundreds of things can go wrong in even the best-built and best-tested solution?
So, if we know that the world is not perfect and that things can go wrong, what are we doing about it? Yes, we handle exceptions gracefully, and we have brilliant logging that allows some support technician to do fault-finding.
The problem I have with the approach above is that it all happens within the silo of each individual component or application, or at best within the domain where the exception occurred. We have no control over the ramifications of our exceptions for upstream or downstream systems. We are not even aware of the context in which our application or service was being used (and this is a good thing in any loosely coupled system). Added to that, each application or component handles exceptions in its own custom manner. There is duplication of effort: every component must build in retry management, store-and-forward capabilities, mechanisms to manage queues, and the list goes on.
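To make that duplication concrete, here is a minimal sketch (in Python, with hypothetical names) of the kind of retry-with-backoff boilerplate that every component ends up re-implementing on its own when there is no shared error-handling capability:

```python
import time


def call_with_retry(operation, max_attempts=3, base_delay=1.0):
    """Retry a failing call with exponential backoff.

    Every component that talks to an unreliable partner system ends up
    re-implementing some variant of this logic for itself - this is the
    duplication of effort described above.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                # Give up: the failure stays trapped inside this silo,
                # invisible to upstream and downstream systems.
                raise
            # Back off before retrying: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

Multiply this by every queue manager, store-and-forward buffer and timeout policy that each team writes in its own custom style, and the cost of the silo approach becomes clear.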
What I’m saying, folks, is that application-level exception handling is just not sufficient. An integrated solution needs to be seen as an entity in its own right, greater than the sum of its parts. Error handling is a cross-cutting concern that cannot be ignored.
I’m trying to impress upon you the need for an enterprise capability providing reusable services that address the errors and exceptions occurring within a SOA implementation in a standardised and consistent manner: a capability that keeps us aware of failed transactions and allows us to rectify them. Implemented correctly, such a capability also delivers secondary benefits: it acts as a dipstick into the health of our interfaces, with reporting that highlights problem areas, among others.
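As a rough illustration of what "standardised and consistent" might mean, here is a sketch, assuming entirely hypothetical names and fields, of a common fault record that every service could report into a central error-handling capability, instead of each component logging failures in its own format:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


# Hypothetical standardised fault record: every service reports failures
# in this one shape, so a central capability can track, retry and report
# on failed transactions across the enterprise.
@dataclass
class FaultRecord:
    service: str          # logical name of the failing service
    operation: str        # operation or interface being invoked
    error_code: str       # standardised code, e.g. "TIMEOUT", "DATA_INVALID"
    message: str          # human-readable detail for support staff
    payload: str          # original message, kept so the transaction can be resubmitted
    retryable: bool       # may the central capability safely retry this?
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def report_fault(record: FaultRecord, store: list) -> None:
    """Hand the fault to the central capability (here: an in-memory list).

    A real implementation would persist the record and feed the retry,
    resubmission and interface-health reporting processes.
    """
    store.append(record)
```

Because every fault arrives in the same shape with its original payload attached, failed transactions can be rectified centrally, and simple reporting over the store becomes the "dipstick" into interface health.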
Piqued your interest? I hope to share some of the key considerations regarding error handling in the SOA space in future writings.