Does SLA Impact DSA?

When potential customers are considering your company’s products, naturally everyone wants to put their best foot forward.  When they ask about Service Level Agreements (SLAs), it can be easy to promise a little too much.  “Our competitor claims four nines (99.99%) up-time; we’d better say the same thing.”  No big deal, right?  Isn’t it just a matter of more hardware?

Not so fast.  Many people are surprised to learn that increasing nines is much more complicated than “throwing hardware at the problem.”  An appropriately designed Distributed System Architecture (DSA) takes availability and other SLA elements into account, so going from three nines to four often has architectural impacts that may require substantial code changes, multiple testing cycles, and so on.

Unfortunately, SLAs are often defined reactively after a system is in production.  Sometimes an existing or a potential customer requires it, sometimes a system outage raises attention to it, and so on.

For example, consider a website or web services hosted by one web server and one database server.  Although this system lacks any supporting architecture, it can probably maintain two nines on a monthly basis.  Since two nines allows for roughly 7.3 hours of downtime per month, engineers can apply application updates and security patches, and even reboot the systems.


Three nines allows for just 43.8 minutes per month.  If either server goes down for any reason, even for a reboot after patches, the risk of missing the SLA is very high.  If the original application architecture planned for multiple web servers, adding more may help reduce this risk since updating in rotation becomes possible.  But updating the database server still requires tight coordination with very little room for error, and the SLA will probably be missed if an unplanned database server outage occurs.
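
For those who want to verify the arithmetic, here is a minimal sketch (in C#, matching the rest of this blog’s examples) that computes the monthly downtime budget for a given number of nines.  It assumes an average month of 30.44 days, so the numbers are approximate.

double minutesPerMonth = 30.44 * 24 * 60;                   // ~43,830 minutes in an average month
for (int nines = 2; nines <= 5; ++nines)
{
    double availability = 1.0 - Math.Pow(10.0, -nines);     // 0.99, 0.999, 0.9999, 0.99999
    double downtimeBudget = minutesPerMonth * (1.0 - availability);
    Console.WriteLine("{0} nines: {1:N1} minutes/month", nines, downtimeBudget);
}
// 2 nines: 438.3 minutes (~7.3 hours); 3 nines: 43.8; 4 nines: 4.4; 5 nines: 0.4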

This scenario barely scratches the surface of the difficulties involved in increasing just one aspect (availability) of an SLA.  Yet it also highlights the necessity of defining SLAs early and architecting the system accordingly.  Product Managers/Planners: Take time in the beginning to document system expectations for SLA.  System Architects: Regardless of SLA, use DSA to accommodate likely expectation increases in the future.

Perils of Async: Locking Out Performance

In a previous post, Perils of Async: Data Corruption, we saw the consequences of inadequate concurrency control in asynchronous code.  The first implementation using Parallel.ForEach did not protect shared data, and its results were wrong.  The corrected implementation used C#’s lock for the necessary protection from concurrent access.

Parallel.ForEach(input, kvp =>
{
    if (0 == kvp.Value % 2)
    {
        lock (mrr)
        {
            ++mrr.Evens;
        }
    }
    else
    {
        lock (mrr)
        {
            ++mrr.Odds;
        }
    }
    if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
    {
        lock (mrr)
        {
            ++mrr.Primes;
        }
    }
});

Some may ask, “Why lock so many times? Can’t the code just lock once inside the loop?”

Parallel.ForEach(input, kvp =>
{
    lock (mrr)
    {
        if (0 == kvp.Value % 2)
        {
            ++mrr.Evens;
        }
        else
        {
            ++mrr.Odds;
        }
        if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
        {
            ++mrr.Primes;
        }
    }
});

Moving the lock just above the first if clause does have some benefits: it simplifies the code, and access to shared data is still synchronized.  But it also kills performance, making this version slower than even the non-parallel SerialMapReduceWorker.  With the entire loop body, including the expensive primality test, inside the lock, only one thread can do useful work at a time; the code effectively runs serially while still paying the overhead of acquiring and releasing the lock on every iteration.

9999999 of 9999999 input values are unique
[SerialMapReduceWorker] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:00:51.6998025
[ParallelMapReduceWorker] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:00:30.6871152
[ParallelMapReduceWorker_SingleLock] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:01:35.0778434

This situation highlights the common rule of thumb, “lock late.”  Locking late (or “low” in the code) means that code should acquire a lock just before accessing shared data and release it just afterwards.  This approach minimizes the amount of code that executes while the lock is held, giving the contenders (the other threads) more opportunities to acquire the lock.
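
Locking late helps, but Parallel.ForEach can go a step further and avoid most of the contention altogether.  The overload that accepts localInit and localFinally delegates gives each thread a private tally, which is merged under the lock only once per thread.  Here is a sketch of that approach; it assumes MapReduceResult has a parameterless constructor and writable Evens/Odds/Primes counters:

Parallel.ForEach(input,
    () => new MapReduceResult(),               // localInit: a private tally per thread
    (kvp, loopState, local) =>
    {
        if (0 == kvp.Value % 2)
        {
            ++local.Evens;                     // thread-private, so no lock needed
        }
        else
        {
            ++local.Odds;
        }
        if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
        {
            ++local.Primes;
        }
        return local;
    },
    local =>                                   // localFinally: merge once per thread
    {
        lock (mrr)
        {
            mrr.Evens += local.Evens;
            mrr.Odds += local.Odds;
            mrr.Primes += local.Primes;
        }
    });

With this shape, the lock is acquired once per worker thread instead of millions of times.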


Perils of Async: Data Corruption

One of the most common bugs in any multi-threaded or multi-process code is corruption of shared data due to poor (or missing) concurrency control.  Concurrency is one term used to describe code interactions that are not sequential in nature (equivalent or companion terms include parallel, multi-threaded and multi-process).  Within this context, concurrency control means the tactics used to ensure the integrity of shared data.

To demonstrate this problem, we’ll use a fairly simple example: counting the even, odd and prime numbers in a large set.  We’ll use different strategies over the same data set for serial and parallel processing.  The serial implementation, SerialMapReduceWorker, is straightforward since it involves no concurrency issues.

foreach (var kvp in input)
{
    if (0 == kvp.Value % 2)
    {
        ++mrr.Evens;
    }
    else
    {
        ++mrr.Odds;
    }
    if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
    {
        ++mrr.Primes;
    }
}

SerialMapReduceWorker iterates over the input set, determines whether each integer value is even, odd or prime, and increments the appropriate counter in the MapReduceResult instance, mrr.  (Although no map-reduce is involved, SerialMapReduceWorker is named for consistency with the concurrent workers.)

.NET’s Task Parallel Library (TPL) makes it very easy (too easy?) to convert this code to run concurrently.  All a developer has to do is change foreach to Parallel.ForEach, include some lambda syntax, and voilà! – the code magically runs much faster!

Parallel.ForEach(input, kvp =>
{
    if (0 == kvp.Value % 2)
    {
        ++mrr.Evens;
    }
    else
    {
        ++mrr.Odds;
    }
    if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
    {
        ++mrr.Primes;
    }
});

Just look at these results – the parallel version executed almost twice as fast!

9999999 of 9999999 input values are unique
[SerialMapReduceWorker] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:00:51.6998025
[ParallelMapReduceWorker_Unprotected] Evens: 4,996,020; Odds: 4,994,704; Primes: 244,662; Elapsed: 00:00:30.5742845

Unfortunately, this conversion is also a dangerously naive implementation of concurrent code.  Did you notice the problems?  The parallel code found a different number of even, odd and prime numbers within the same set of integers.  How is that possible?  Answer: data corruption due to a lack of concurrency control.

The implementation in ParallelMapReduceWorker_Unprotected does nothing to protect the MapReduceResult instance, mrr.  Every thread involved increments mrr.Evens, mrr.Odds and mrr.Primes.  In effect, two threads might both increment mrr.Evens from 4 to 5 simultaneously when the expectation is that one will increment it from 4 to 5 and the other from 5 to 6.  As you can see in the results above, this data corruption causes ParallelMapReduceWorker_Unprotected’s count of even integers to be wrong by more than 4,500.
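
The root cause is that ++mrr.Evens is not a single, atomic operation.  It is, roughly, a three-step read-modify-write sequence:

int temp = mrr.Evens;   // 1. read the current value
temp = temp + 1;        // 2. increment the local copy
mrr.Evens = temp;       // 3. write the result back

// If two threads both read 4 before either writes back, both write 5:
// two increments executed, but the counter advanced only once.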

In this case, correcting the error is fairly simple.  The code just needs to protect access to MapReduceResult to ensure that only one thread can increment at a time. The corrected ParallelMapReduceWorker:

Parallel.ForEach(input, kvp =>
{
    if (0 == kvp.Value % 2)
    {
        lock (mrr)
        {
            ++mrr.Evens;
        }
    }
    else
    {
        lock (mrr)
        {
            ++mrr.Odds;
        }
    }
    if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
    {
        lock (mrr)
        {
            ++mrr.Primes;
        }
    }
});

Each time this code determines it needs to increment one of the counters, it uses concurrency control by:

  1. Locking the MapReduceResult instance
  2. Incrementing the appropriate counter
  3. Unlocking the MapReduceResult instance
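
Under the hood, C#’s lock statement expands to roughly the following Monitor-based code (the exact expansion varies by compiler version):

bool lockTaken = false;
try
{
    System.Threading.Monitor.Enter(mrr, ref lockTaken);   // 1. lock
    ++mrr.Evens;                                          // 2. increment
}
finally
{
    if (lockTaken)
    {
        System.Threading.Monitor.Exit(mrr);               // 3. unlock
    }
}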

Since only one thread can hold the lock on mrr at a time, other threads must wait until it is unlocked to proceed.  This locking now guarantees that, continuing our previous case, mrr.Evens is incremented from 4 to 5 only once.  ParallelMapReduceWorker calculates the counts correctly (as compared to SerialMapReduceWorker), and does so with almost the same performance as the unprotected version.

9999999 of 9999999 input values are unique
[SerialMapReduceWorker] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:00:51.6998025
[ParallelMapReduceWorker_Unprotected] Evens: 4,996,020; Odds: 4,994,704; Primes: 244,662; Elapsed: 00:00:30.5742845
[ParallelMapReduceWorker] Evens: 5,000,533; Odds: 4,999,466; Primes: 244,703; Elapsed: 00:00:30.6871152
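
As an aside, for simple counters like these the System.Threading.Interlocked class can replace the lock entirely.  A sketch follows; note that it assumes Evens, Odds and Primes are int fields, since Interlocked requires a ref to a field and cannot accept a property:

Parallel.ForEach(input, kvp =>
{
    if (0 == kvp.Value % 2)
    {
        Interlocked.Increment(ref mrr.Evens);   // atomic increment, no lock required
    }
    else
    {
        Interlocked.Increment(ref mrr.Odds);
    }
    if (true == AMT.Math.IsPrime.TrialDivisionMethod(kvp.Value))
    {
        Interlocked.Increment(ref mrr.Primes);
    }
});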


NOTE: The count of primes appears to be incorrect.  The IsPrime.TrialDivisionMethod implementation is intentionally slow to ensure multiple threads contend for access to the same data.  Unfortunately, and unintentionally, it also appears to be incorrect. (cf. Count of Primes)

Perils of Async: Introduction

As application communications over lossy networks and “in the cloud” have grown, the necessity of performing these communications asynchronously has risen with them. Why this change has been occurring may be an interesting topic for another post, but a few simple cases demonstrate the point:

  • Web browsers make multiple, asynchronous HTTP calls per page requested. Fetching a page’s images, for example, has been an asynchronous (“out-of-band”) operation for at least a decade.
  • Many dynamic websites depend on various technologies’ (AJAX, JavaScript, jQuery, etc.) asynchronous capabilities – that’s what makes the site “dynamic.”
  • Similarly, most desktop and mobile applications rely on such technologies to communicate asynchronously.

Previously, developing asynchronous software – whether inter-process, multi-threaded, etc. – required very talented software developers. (As you’ll see soon enough, it still does.) Many companies and other groups have put forward tools, languages, methodologies, etc. to make asynchronous development more approachable (i.e., easier for less sophisticated developers).

Everyone involved in software development – developers, managers, business leaders, quality assurance, and so on – needs to be aware, however, that these “tools” have a downside. Keep this maxim in mind: Things that make asynchronous software development easier also make bad results (bugs!) easier. For example, all software involving some form of asynchronicity:

  • Not only has bugs (as all software does), but the bugs are much, much more difficult to track down and fix
  • Exhibits a higher degree of hardware-dependent behavior. Consider, for example, a new mobile app that is stable and runs well on a device using a Qualcomm Snapdragon S1 or S2 (single-core) processor. Will the same app run just as well on a similar device using a (dual-core) Snapdragon S3 or above? Don’t count on it – certainly don’t bet your business on it!

This series of posts, Perils of Async, aims to discuss many of the powerful .NET capabilities for asynchronous and parallel programming, and to help you avoid their perilous side!