High Availability Service Management in SaaS Environments

22 July 2009

Service’s gold standard
The gold standard for service availability used to be the “dial tone.” You could pick up the phone and know that the familiar tone would be there to greet you. Managing mission-critical software services needs a similar gold standard. Clients now rely on SaaS for critical information exchange across multiple processes and time zones, so they expect SaaS platforms to be always available and always performing. Even scheduled downtime for upgrades can disrupt business-critical activities.

This standard for platform availability is still out of reach for most commercial SaaS platforms today. Some of the challenges stem from a lack of rigor in software design and testing, while others stem from operational environments that were not built for high-availability services.

What is high availability and how do we measure it?
Before we delve into how to plan service management for highly available SaaS platforms, it is important to understand how to best measure availability. We also need to set the expectations for various platform components to deliver high availability service.

In general terms, business-critical applications need to be available between 99.9% (8.76 hours/year of downtime) and 99.95% (4.38 hours/year of downtime) of the time. These targets give us a downtime budget that has to cover both unplanned failures and scheduled system upgrades.
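The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (8,760 hours); the function name is illustrative only:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Return the downtime allowed per year, in hours, for an availability target."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


for target in (99.9, 99.95):
    print(f"{target}% -> {downtime_budget_hours(target):.2f} hours/year")
# prints:
# 99.9% -> 8.76 hours/year
# 99.95% -> 4.38 hours/year
```

Note that the budget is shared: every hour spent on a planned upgrade window is an hour no longer available to absorb an unplanned failure.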

Planning for high availability SaaS platforms
As we all know, a chain is only as strong as its weakest link. Looking at a SaaS environment, we see that software architecture, software quality control, operational deployment procedures, network and server architecture and proactive service monitoring are all links of the chain designed to provide a highly available environment.
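The chain analogy can be made concrete. If a request must pass through every link, and the links fail independently (a simplifying assumption), the end-to-end availability is the product of the individual availabilities. The component names and figures below are hypothetical, chosen only to illustrate the effect:

```python
from math import prod

# Hypothetical per-component availabilities; not measured values.
links = {
    "application": 0.9995,
    "database": 0.9995,
    "network": 0.9995,
    "servers": 0.9995,
}


def serial_availability(availabilities) -> float:
    """Availability of a chain in which every component must be up.

    Assumes component failures are independent of one another.
    """
    return prod(availabilities)


end_to_end = serial_availability(links.values())
print(f"end-to-end availability: {end_to_end:.4%}")
```

Under these assumptions, four links that each meet 99.95% combine to roughly 99.80%, which misses even the 99.9% target. Each link therefore has to be held to a stricter standard than the service as a whole.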

A rigorous approach to highly available service has to be built in across four major phases: architecture, implementation, deployment and monitoring.

The architecture phase
In the architecture phase, the first major challenge is to ensure that the software modules are highly decoupled, so that a failure in one module does not set off a chain of events that compromises the entire system. Service Oriented Architecture, with its focus on loose coupling and service contracts, provides an excellent foundation for application architectures that live up to high availability guidelines.
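One common way to keep a failing module from dragging down its callers is a circuit breaker. The article does not name this pattern, so the sketch below is an illustration of the decoupling idea rather than a description of any particular platform. The class name, parameters and thresholds are all assumptions:

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after_s = reset_after_s  # how long to stay open
        self.clock = clock                  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()           # open: do not touch the failing module
            self.opened_at = None           # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                   # success closes the breaker
        return result
```

A caller would wrap each cross-module request, for example `breaker.call(fetch_report, lambda: cached_report)`, where both names are hypothetical. The caller keeps responding with degraded data while the dependency recovers, instead of failing along with it.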

The second major challenge during the architecture phase is to develop a deployment architecture that is resilient and responsive to user load and adverse network conditions. Server and storage virtualization, with the capability to deploy redundant virtual instances across multiple hardware servers and to fail over or share load among those instances, are some of the building blocks of a highly available platform. At Intralinks, we have a dedicated architecture team with focused practice areas in security, application architecture and platform architecture. This team not only defines the architectural components when new projects start, but also monitors operational feedback and service incidents to continuously improve service performance.
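The value of redundant instances can be estimated with a simple model. Assuming instances fail independently and failover is instantaneous and always succeeds (both optimistic assumptions), the service is down only when every instance is down at once. The function name and the figures are illustrative:

```python
def redundant_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances when any single one can serve traffic.

    Assumes independent failures and perfect failover, which real
    deployments only approximate.
    """
    return 1 - (1 - instance_availability) ** instances


for n in (1, 2, 3):
    print(f"{n} instance(s): {redundant_availability(0.99, n):.4%}")
```

In this model, two instances that are each 99% available yield about 99.99% combined. Shared failure modes such as a common network path or a bad software release break the independence assumption, which is why spreading instances across multiple hardware servers matters.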

The implementation phase
During the implementation phase, rigorous testing procedures simulating boundary conditions for user behavior, network behavior and server behavior need to be implemented and exercised. At Intralinks, one of our best practices is a performance engineering team that proactively tests all newly developed software against service benchmarks under load and adverse conditions.

The deployment phase
Operational deployment is indeed when the rubber meets the road. This phase deals with not only the deployment of new capabilities but also planned upgrades. This is where standardization, automation and process control are most important, as they ensure that not a minute is wasted while dealing with planned or unplanned events. Each deployment needs to be clearly scripted, peer reviewed and tested in staging environments well ahead of the upgrade windows. Failover procedures need to be constantly updated to ensure that there are no surprises.

High availability means higher standards
Finally, you can only improve what you can measure. Instrumenting the SaaS platform for user experience, availability and capacity is important to ensure that any potentially service-affecting trend is spotted and reliably acted on before it becomes a real issue. Automated monitoring of both synthetic and real user response times, together with the ability to drill down into the end-to-end path of each transaction, is essential for comparing actual against planned behavior and for mitigating adverse trends.
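As a rough illustration of turning synthetic checks into the measurements described above, the sketch below summarizes a batch of probe results into an availability ratio and a 95th-percentile response time. The `Probe` record and the sample data are hypothetical. A real monitoring system would collect these continuously and alert on trends:

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Probe:
    ok: bool            # did the synthetic transaction succeed?
    response_ms: float  # how long it took, in milliseconds


def summarize(probes):
    """Return (availability, p95 response time in ms) for a batch of probes.

    Needs at least two successful probes to compute a percentile.
    """
    successes = [p for p in probes if p.ok]
    availability = len(successes) / len(probes)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles([p.response_ms for p in successes], n=20)[-1]
    return availability, p95


# Hypothetical sample: 19 successful probes and 1 failure.
sample = [Probe(True, 100.0)] * 19 + [Probe(False, 0.0)]
availability, p95 = summarize(sample)
print(f"availability={availability:.2%}, p95={p95:.0f} ms")
```

Comparing the measured availability against the target, and the percentile against a response-time benchmark, gives the actual vs. planned view the text calls for.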

At the end of the day, high availability services are a 24x7x365 commitment. A combination of excellence in people, process and technology with vigilance is the only way to ensure that the SaaS platform is indeed providing the service as committed to end users and clients.

Fahim Siddiqui

Fahim served as Chief Executive Officer at Sereniti, a privately held technology company. He was also the Managing Partner of K2 Software Group, a technology consulting partnership providing product solutions to companies in the high tech, energy and transportation industries with clients including Voyence, Inc., E-470 Public Highway Authority and Tellicent, Inc. Previously, Fahim held executive and senior management positions in engineering and information systems with ICG Telecom, Enron Energy Services, MCI, Time Warner Telecommunications and Sprint.