When potential customers are considering your company’s products, naturally everyone wants to put their best foot forward. When they ask about Service Level Agreements (SLAs), it can be easy to promise a little too much. “Our competitor claims four nines (99.99%) uptime; we’d better say the same thing.” No big deal, right? Isn’t it just a matter of more hardware?
Not so fast. Many people are surprised to learn that increasing nines is much more complicated than “throwing hardware at the problem.” Appropriately designed Distributed System Architecture (DSA) takes availability and other SLA elements into account, so going from three nines to four often has architectural impacts which may require substantial code changes, multiple testing cycles, etc.
Unfortunately, SLAs are often defined reactively, after a system is already in production. Sometimes an existing or potential customer demands one; sometimes a system outage draws attention to the lack of one.
For example, consider a website or web service hosted on one web server and one database server. Although this system lacks any supporting architecture, it can probably maintain two nines (99%) on a monthly basis. Since two nines allows for about 7.3 hours of downtime per month, engineers have room to apply application updates and security patches, and even to reboot the systems.
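The downtime budget behind each "nines" figure is simple arithmetic, and a minimal sketch may make it concrete. It assumes an average month of 730 hours (365 days × 24 hours ÷ 12); the function name and constant are illustrative, not part of any standard tool.

```python
# Assumption: an "average" month of 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 365 * 24 / 12


def allowed_downtime_minutes(nines: int, period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime budget, in minutes, for an availability of N nines."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return period_hours * 60 * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} minutes per month ({minutes / 60:.2f} hours)")
```

Each additional nine divides the budget by ten, which is why a maintenance routine that fits comfortably at two nines stops fitting at three.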
Three nines (99.9%) allows for just 43.8 minutes per month. If either server goes down for any reason, even for a reboot after patching, the risk of missing the SLA is very high. If the original application architecture planned for multiple web servers, adding more may help reduce this risk, since updating them in rotation becomes possible. But updating the database server still requires tight coordination with very little room for error, and a single unplanned database outage will probably be enough to miss the SLA for that month.
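A rough way to see why extra web servers help while the lone database server remains the limit is to combine per-component availabilities. The sketch below assumes failures are independent and uses made-up per-server figures (99.5%) purely for illustration; real systems rarely fail independently, so treat the results as an upper bound.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """At least one of N identical copies must be up (assumes independent failures)."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-server availabilities, chosen only for illustration.
WEB = 0.995
DB = 0.995

single = serial(WEB, DB)                # one web server, one database server
pooled = serial(redundant(WEB, 3), DB)  # three web servers, one database server

if __name__ == "__main__":
    print(f"one web + one db:   {single:.4%}")
    print(f"three web + one db: {pooled:.4%}")
```

Under these assumptions the pooled web tier is almost never the cause of an outage, yet the overall figure can never exceed the database server's own availability. Raising it further means changing how the data tier is built, not just adding web servers.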
This scenario barely scratches the surface of the difficulties involved in improving just one aspect (availability) of an SLA. Yet it also highlights the necessity of defining SLAs early and architecting the system accordingly. Product Managers/Planners: Take time at the beginning to document the system's SLA expectations. System Architects: Regardless of the current SLA, use DSA to accommodate likely increases in expectations in the future.