Cache Consistency: Avoiding Common Issues

Michael Yeaney August 7, 2017

Leveraging a cache is an effective way to increase the performance of many applications, and can dramatically reduce the load on backend components as we are no longer repeatedly fetching the same data over and over. Since a lack of caching is often considered a performance anti-pattern, many systems tend to add in caches only after they have exhibited latency and/or performance that is unacceptable - in other words, added caching is often an afterthought.

Unfortunately, one area that is commonly overlooked is the impact of your cache consistency model on the overall behavior of the application. Before we dig into these impacts, let's back up and consider an basic application *without* caching.

Initial Design

In this example, we'll assume a canonical web application consisting of a single-page javascript client which calls a REST API tier that is making calls into a database:

Application Block Diagram

The REST API exposes two simple operations, namely GetSomeValue(key) and UpdateValue(key, newValue). In our simple implementation, the REST API proxies these calls (after validation, etc.) directly through to the underlying SQL database, invoking SELECT...FROM...WHERE for the GetSomeValue(key) call, and a simple UPDATE...SET...WHERE for the UpdateValue(key, newValue) call. Things seem to work well; that is, so long as there are no concurrent writes to the same key.

Failure Modes

So what can go wrong? In the naive example above, we have essentially implemented a Last Write Wins concurrency model. In this model, whatever statment runs "last" (relative to wall-time) against the database will overwrite any previous values. The issue here is that the last statement run by the database does not necessarily correspond to the last update sent by a client. Thread scheduling, garbage collection, event network latency/retires can cause the "last" request to be processed first, as show below:

Last Write Wins Causality Diagram

These effects are even more noticable when dealing with multiple machines (for example, a cluster of REST API servers), since there is no coordination between machines on thread scheduling, garbage collection intervals, etc. Whatever the root cause, the end result is overwritten/vanishing updates, and unhappy users.

Fixing the problem

So how can we address this issue? There are a number of options, most relying on some sort of CAS (compare-and-swap) operation to make sure the value being overwritten hasn't been changed from what was originally seen (or is at least "newer"). This is typically done by comparing a "version" field, and if it matches (or is optionally newer), the update can proceed and the version is updated. To accomodate this change, our REST API signature for updates will now need updated to something like UpdateValue(key, version, newValue). This allows us to specify not only the new value to set, but also to only set this new value if and only if the version has not changed.

This type of modification has shifted us from Last-Write-Wins to First-Write-Wins, meaning the first request that the database successfully executes is the only update applied, as shown below:

First Write Wins Causality Diagram

This is typically the exact behavior that most systems are after, and guarantees data safety in the face of concurrent updates.

Making it Faster

At this point, we may determine that there is no reason to make a network trip the database for every call to GetSomeValue(key), especially if the value hasn't changed. So we decide to introduce a cache and use the following basic psuedocode to keep it updated:

GetSomeValue(key):
    if (cache.Contains(key)):
        return cache[key]
    else 
        value = getValueFromDataBase(key)
        cache.SetValue(key, value)
        return value
    end

UpdateValue(key, version, newValue):
    if (updateValueInDatabase(key, version, newValue):
        cache.SetValue(key, newValue)
    end
    

This is a variation of the cache-aside pattern, and essentially only updates the cache store if and only if the underlying datastore was udpated.

Performance is now improved, and writes are fully protected by the underlying concurrency checks in the database. All is well, right?

Not So Fast

The issue with the caching pattern above is that the code is written to operate in a *Last-Write-Wins* manner, meaning that the last write to the cache will blindly overwrite the previous value. The naive assumption is that the last operation successfully run by the database is the last on issued to the cache - but that simply isn't true on the timescales that computers operate, as shown below:

Cache Overwrite Causality Diagram

From our discussion above, remember that thread scheduling/pausing, garbage collection, network delays, machines crashes, etc., call all make operation run out-of-order from wall time. This new failure mode will exhibit the same consistency errors of the original application design, yet the underlying databse will be correct (and also NOT match the cache).

What to do?

At this point, some teams choose to avoid the caching system altogether (sacrificing performance), attempt to resort to locking techniques that don't work across multiple machines, or event resort to distributed locking leveraging systems such as Apache Zookeeper. Alternatively, we can instead just implement the same CAS-style operations on your cache that your datastore is leveraging. The exact semantics of this vary between vendors, but the basic flow is something like this:

UpdateCacheValue(key, version, newValue):
   if (cache.Contains(key)):
        cachedValue = cache.GetValue(key)
        if (cachedValue.Version <= version):
            cache.SetValue(key, newValue)
        end
   end
    

Notice in this example code, we're no longer using a strict equality comparison when checking the version. Instead, we are making sure that the version provided by the client is equal to or "newer" (assuming version counters are monotonically increasing). This helps avoid the case of all future updates being blocked if a previous client crashes before writing a new version to cache (meaning no further matches are possible). We could instead choose to flush cache entries that do not match a provided version/threshold; however this must be done with care in order to avoid cache stampede effects under load.

The exact syntax of this pattern will vary by the caching platform (e.g., Redis uses the WATCH command to enable CAS semantics). Whatever technique is used, this enables us to maintain the desired consistency behavior in our application (namely, *First-Write-Wins*, giving us data safety in the face of concurrent updates:

Cache Version Check Causality Diagram

Wrapping Up

As with any pattern, how and when you use will always depend on your exact use case. For example, some applications can tolerate (or even require) and eventual consistency model where intermediate inconsistencies are acceptable. Regardless, it is important to remember the data consistency model of an applictaion isn't limited to the scope of a single data store - it is a composite behavior exhibited by the application, and as we've shown above can be influenced by components that have nothing to do with your primary data store.

Ramblings and thoughts on cloud, distributed systems, formal methods...maybe even some code, too!