This blog is the second in a series from Adam Woolhouse, NETSCOUT. The first in the series is “Running UK Gov Digital Org.”
As government departments, agencies and organisations strive for an increasingly paperless environment, the organisations become digital organisations.
When was the last time your critical services were audited for performance and security? This blog discusses a multidisciplinary approach to how this might be achieved.
Efficient applications mean more productivity, and more productivity leads to lower cost. Having an effective way of looking at performance and security issues with an application lowers the Mean Time to Resolve (MTTR) of a problem, which requires fewer man-hours and impacts fewer users and stakeholders, again lowering costs and increasing efficiency.
In a complex service, multiple disciplines, often organised as service towers, deliver the service to end users. Pointing the finger at a particular tower when a problem arises adds to MTTR. The service towers spanned may include application, networks, infrastructure, data centre and end user, among others. This discipline may see the rise of one further tower called Service Management or Service Integration and Management (SIAM), which may help to pull all the towers together to form a cohesive unit for digital government delivery.
Having a multidisciplinary, cross-tower fault resolution process can help keep the organisation running.
For this exercise, I separate out the functions of the digital service as follows: overall service health, application, service dependency mapping, end-user response time, sessions, forensics, physical and system errors, physical system and connectivity health and, finally, security. These subjects will be dealt with in turn, but at the end, a process will be shown that ties them all back together so that overall service health can be measured, quantified and triaged should problems occur.
Overall Service Health
Is there a way in your organisation of scoring the health or security of an application? Keeping tabs on this and understanding the baseline is a way of keeping the organisation’s digital health and security under constant scrutiny, maximising cost savings for the organisation and optimising the end user’s experience. The end user may be a member of the organisation or a member of the public. Maximising the end-user experience can increase adoption of the alternative digital service and improve the efficiency of the system.
A baseline is useful in understanding how the service changes over time and how it reacts under abnormal load in times of stress, for example when a deadline for tax returns approaches or the holiday season brings an uptick in passport applications. How does the service cope under pressure and, if it is cloud-based, does it expand flexibly?
Once a baseline has been measured, exceptions from it can be tracked as evidence of degradation of the service, which leads to inefficiencies and increasing costs.
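As an illustration of tracking exceptions from a baseline, here is a minimal Python sketch. It assumes the baseline is a list of response times in milliseconds and that "abnormal" means more than three standard deviations from the baseline mean. The function name, the sample figures and the threshold are all hypothetical choices for this example, not a prescribed method.

```python
from statistics import mean, stdev

def baseline_exceptions(baseline, observed, threshold=3.0):
    """Return (index, value) pairs from `observed` that fall outside the baseline.

    A value counts as an exception when it lies more than `threshold`
    standard deviations from the baseline mean (an assumed rule of thumb).
    """
    centre = mean(baseline)
    limit = threshold * stdev(baseline)
    return [(i, v) for i, v in enumerate(observed) if abs(v - centre) > limit]

# Hypothetical response times in milliseconds.
baseline_ms = [200, 210, 190, 205, 195, 200, 198, 202]
observed_ms = [201, 199, 450, 204]

exceptions = baseline_exceptions(baseline_ms, observed_ms)
print(exceptions)  # [(2, 450)]
```

In practice the baseline would probably be kept per time of day or per season, so that a predictable tax-deadline peak is not reported as an exception.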
Application
The application should be the focus of the exercise. After all, this is what replaces the paper system and is the reason for building the system in the first place. Example applications may be applying for a driving licence, filling in a tax return, applying for a pension or arranging a doctor’s appointment.
While developers build their applications in a test environment with near perfect conditions and with user loading not understood, an application once deployed becomes a system of multiple variables.
When deployed, does the organisation have a way of spotting application errors that may not be apparent to the application, the application developer or the end user? These might only start to become evident in end-user response time increases and vagaries in the system which may go unreported for a time after they start.
Very often these errors are hidden in arcane communications between components of the service. The service will have many components due to layers within it and for redundancy purposes. These different components then each have their multiple dependencies to help run the service, authenticate the users and distribute information between the components of the service.
How is each conversation working at the application layer and is any component complaining to another with an application error? Monitoring these conversations is a way of seeing this.
Service Dependency Mapping
In the early days of developing an application, the right size or structure of an application may not be apparent.
As a service matures or migrates platforms and people leave, skills and knowledge about the makeup of a service get lost in time.
Is it possible within your organisation to map out what makes up a service, how it communicates and what it is dependent on? This logical map can then inform an organisation, at a time of system or platform migration, about what in the service may be scaled, moved or changed, and how.
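One way to picture such a map is as a graph built from observed conversations between components. The Python sketch below assumes each conversation is recorded as a (client, server) pair. The component names and both function names are made up for illustration.

```python
from collections import defaultdict

def build_dependency_map(conversations):
    """Map each component to the set of components it talks to directly."""
    deps = defaultdict(set)
    for client, server in conversations:
        deps[client].add(server)
    return deps

def all_dependencies(deps, service):
    """Return everything `service` depends on, directly or indirectly."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for dep in deps.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(service)  # ignore a dependency loop back to the service itself
    return seen

# Hypothetical conversations observed between components.
conversations = [
    ("web", "app"), ("app", "db"), ("app", "auth"),
    ("auth", "ldap"), ("web", "dns"),
]
deps = build_dependency_map(conversations)
print(sorted(all_dependencies(deps, "web")))
# ['app', 'auth', 'db', 'dns', 'ldap']
```

Before a migration, a query like this shows which components need to move together or remain reachable.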
End-User Response Time
At the end of a service is a user. We have probably all experienced the turning hourglass, the spinning circle on our browsers or the filling task execution bar. Often these waits are acceptable, but all too often these delays become frustrating. These times of frustration make end users lose faith in the system and cause costs to rise and efficiencies to decrease.
Does your organisation have a way of measuring these response times? Are they baselined and understood? Are they rectified when they become unacceptable, either by increasing capacity or by fixing issues? Or are they little understood and reliant on anecdotal evidence from end users, which causes hours of endless searching and triage?
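A measured alternative to anecdotal evidence could look like the Python sketch below. It assumes response times are collected in milliseconds and compares the current 95th percentile with a baselined value. The nearest-rank percentile method, the 1.5x tolerance and the sample figures are assumptions made for this example.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def response_time_unacceptable(samples, baseline_p95, tolerance=1.5):
    """True when the current 95th percentile exceeds the baseline by the tolerance factor."""
    return percentile(samples, 95) > baseline_p95 * tolerance

# Hypothetical end-user response times in milliseconds.
samples_ms = [120, 130, 110, 900, 125, 115, 135, 140, 128, 122]
print(percentile(samples_ms, 95))                               # 900
print(response_time_unacceptable(samples_ms, baseline_p95=150)) # True
```

A percentile is used rather than an average because a few very slow responses, which are the ones users remember, barely move the mean.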
Overall Service Health
Each department, agency or organisation might have multiple services to deliver. Each of these services will have its own components and sometimes components shared with other services. Does your organisation have a way of scoring these applications independently and against each other so that priorities can be intelligently set?
These scorecards of performance could then be shared within the organisation, using reports, alerts and dashboards, to raise awareness between towers and help the organisation with its purpose.
These scorecards and overall health methods should be independent of the service and use a common methodology, so that services are compared like for like rather than by scores derived in different ways. This makes it easier to determine how services are operating from a third-party viewpoint.
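To illustrate what a common methodology might mean, the Python sketch below applies one scoring formula to every service. The formula (an equal weighting of success rate and speed against a target), the service names and the figures are invented for this example. A real scorecard would use whatever measures the organisation agrees on.

```python
def health_score(success_rate, p95_ms, target_ms):
    """Score a service from 0 to 100 using the same formula for every service.

    Half the score comes from the fraction of successful requests and half
    from how the 95th-percentile response time compares with its target.
    The equal weighting is an assumption for this sketch.
    """
    speed = min(1.0, target_ms / p95_ms) if p95_ms > 0 else 1.0
    return round(100 * (0.5 * success_rate + 0.5 * speed), 1)

# Hypothetical measurements for two services.
services = {
    "passport-application": {"success_rate": 0.99, "p95_ms": 800, "target_ms": 1000},
    "tax-return": {"success_rate": 0.90, "p95_ms": 2000, "target_ms": 1000},
}

# Lowest score first, so the service most in need of attention heads the list.
ranked = sorted(services, key=lambda name: health_score(**services[name]))
for name in ranked:
    print(name, health_score(**services[name]))
```

Because both services are scored in the same way, the ordering can be used directly to set priorities between them.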
Sessions
With a service having multiple components, each of the conversations between the components is made up of multiple sessions. Monitoring each session and then rolling the sessions up into a session health score makes it easier to spot failed sessions and to see when conversations start to become inefficient. Inefficient conversations lead to increasing end-user response times, a loss of money and increasing use of resources, both infrastructural and human.
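The roll-up from sessions to a session health score can be sketched in Python as follows. This assumes each session record notes which conversation it belongs to, whether it failed and how long it took. The record fields, the one-second "slow" threshold and the percentage scoring are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Session:
    conversation: str   # for example "web->app"
    failed: bool
    duration_ms: float

def session_health(sessions, slow_ms=1000.0):
    """Return, per conversation, the percentage of sessions that neither failed nor ran slow."""
    totals, healthy = {}, {}
    for s in sessions:
        totals[s.conversation] = totals.get(s.conversation, 0) + 1
        if not s.failed and s.duration_ms <= slow_ms:
            healthy[s.conversation] = healthy.get(s.conversation, 0) + 1
    return {c: 100.0 * healthy.get(c, 0) / n for c, n in totals.items()}

# Hypothetical session records.
sessions = [
    Session("web->app", False, 120), Session("web->app", False, 150),
    Session("web->app", True, 90),   Session("web->app", False, 130),
    Session("app->db", False, 40),   Session("app->db", False, 2500),
]
scores = session_health(sessions)
print(scores)  # {'web->app': 75.0, 'app->db': 50.0}
```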
Forensics
If there are inefficiencies in the system, to what level of forensics is your organisation able to go? Forensics are required for in-depth triage and for reporting and communicating issues within and between tower services.
If too little forensic data is available at the time of an issue, the organisation has to wait for the issue to happen again while relying on hastily set up spot forensics, which increases triage time.
Machine data is often stored, but sometimes this is not enough to understand the underlying issues.
Physical and System Errors
Even in the best-designed system, capacity will be exceeded at times. This is because a system or service is designed for optimal cost and capacity. Catering for the last few percentiles of system capacity increases the cost of the system out of all proportion to the benefits that it brings.
When a system is over capacity and data spills from the system, is it possible to spot these occurrences and make sure that they are kept within acceptable limits?
Physical System and Connectivity Health
While system monitoring may be employed, how does it tie in with the above topics and how can it be correlated with the service audit functions above? Does your organisation have a way of tying all the topics above together with a common dataset, or are all the datasets collected in isolation without a way of correlating them except maybe by time and by eye?
Is it possible to map out the physical setup of services within your organisation? This can then inform if they are all running compatibly with each other and with no unexpected or lengthy paths between service dependencies.
Security
Lastly, we reach the topic of security. Implementing the regime above is not without its cost, although the cost savings of doing so often outweigh the investment many times over. But implementing the methodology also increases security.
Legitimate conversations can be characterised, and illegitimate conversations can therefore also be highlighted.
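As a simple illustration of that idea, the Python sketch below treats the conversations seen during a baseline period as legitimate and highlights anything else. The (client, server, port) tuple shape and the component names are assumptions for this example. Real characterisation would consider far more than three fields.

```python
def unexpected_conversations(known, observed):
    """Return observed conversations that were never seen in the known-good set."""
    known_set = set(known)
    return [conv for conv in observed if conv not in known_set]

# Hypothetical (client, server, port) conversations learned during a baseline period.
known = [("web", "app", 443), ("app", "db", 5432), ("app", "auth", 636)]

# Conversations seen today, including one the baseline never contained.
observed = [("web", "app", 443), ("app", "db", 5432), ("db", "203.0.113.9", 22)]

suspicious = unexpected_conversations(known, observed)
print(suspicious)  # [('db', '203.0.113.9', 22)]
```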
Distributed denial of service (DDoS) attacks can also be seen in conversations, and these can then be mitigated. Denial of service may take the form of volumetric attacks (literally loading the system with huge amounts of illegitimate traffic), state exhaustion attacks (loading up a system with a number of sessions known to make it stop working or behave erratically, behaviour which can then be exploited by a hacker) and application layer attacks, which exploit known vulnerabilities in applications. Using the monitoring methodology above can help spot these attacks as they are happening.
Finally, with security, is the level of forensics sufficient to spot and stop infiltration and possible exfiltration of data within your service and systems? Is it possible to compare what is happening within your system with what is happening to other systems worldwide, in order to react to zero-day attacks from unscrupulous actors?
How Can NETSCOUT Help?
Business assurance combines service assurance, cyber security and business analytics solutions for IT and security operations. We harness the full power of internet protocol intelligence, obtained from network traffic, as the common data foundation for all these applications.
Our customers solve problems faster, protect their business from cyber-attacks and obtain the best intelligence for insightful and timely business analysis.
NETSCOUT is the only company to support service assurance, packet broker, cyber security and big data analytics from a common technology platform.
More information can be found here, in a 2-minute video.