Blogroll
6 Apr 10 Infinite web scalability & resilience with Amazon Web Services

Over the easter weekend I have been rewriting large parts of the content serving functionality of my webapps, extracting into a common code-base and getting rid of dependencies to MySQL. There are a few problems with serving web content from a relational database such as MySQL:
- Scalability: master-slave setups are complicated, and still don’t scale very well.
- Resilience: doing database restores takes preparation and time, there is a lot of margin for error.
- Impedance mismatch: let’s face it, simple content-serving, even semi-dynamic content doesn’t exactly need the overhead of a RDBMS.
Enter Amazon Web Services
The diagram at the top depicts the setup I have now moved to (but yet to deploy), the basics are that a WebPage is made up of a number of Panel, such as a ContentPanel. When a ContentPanel retrieves the content it checks for content in three steps, going to the next step if it cannot find content for a given content key in it’s current step:
- Check in-memory EhCache for content with key.
- No cache hit: Check temp-directory on the local filesystem for a file corresponding to the key. If there is a hit, put it in cache and return the content.
- No hit on local filesystem: fetch the file with the key from Amazons S3 (Simple Storage System), write it to cache and local filesystem, then return the content.
Infinite scalability & resilience achieved!
There are a number of benefits to this approach:
- Performance: most of the time, you’ll retrieve content from in-memory, or at worst local disk. The content is close at hand most of the time, especially if you get hit by a lot of traffic.
- Scalability: You can simply add more machines and scale predictably in a horizontal fashion. Also because S3 is “in the cloud”, you basically offload scaling of the “master copy” of the content to Amazon, something which they are very good at. With an RDBMS, infinite horizontal scaling would be more or less impossible, despite second-level caches and advanced bending over backwards with clustering and sharding.
- Resilience: You offload resilience of the master content copy storage to Amazon, something which they do well and guarantee by storing copies on redundant drives in data centres across the globe.
- “Crash-proof”: not quite, but because your ultimate storage is “in the cloud”, your servers are more or less stateless. If a server crashes, you simply start a new instance that will pick up it’s content from S3 transparently without complex restore-processes.
- Selective cache expiry: Given that a page is built up of n-number of Panels in the example above, each Panels content can be expired individually in a controlled manner, thus not forcing a full cache-repopulation.
What has basically been achieved here is infinite horizontal scaling and almost bullet-proof resilience, what more, you’ve done it with a downpayment on hardware of about $0!
Cache expiry and distributed environments..
As you might have guessed, I’m using Apache Wicket as a web framework, though the principle should work with most web frameworks. To achieve the ends I’ve described, I’ve used a pretty bog-standard Wicket pattern, in building a WebPage from a number of Panels, one of which is a “ContentPanel” implementation. The glue between backend and frontend in Wicket is the IModel and it’s implementations, for which I’ve written a custom S3TextModel implementation that deals with the stepwise checking of in-memory cache, local disk to S3, and populating them when any of the levels of caching are missing content.
For Cache expiry, the IModel implementation has a setObject() method. When this is called on the S3TextModel (which would happen if someone updated content), it will basically blow away the cache for that key, as well as the file in the local file-system and write the content to S3-storage.
In a distributed environment you could imagine the model still doing the same job, but also sending out an expiry message with the content-key as part of the message onto a JMS Topic, where MessageConsumers could then pick up the message and expire the cache on their respective nodes.
That’s it! In a relatively simple manner, we have achieved the holy grail of content driven websites: close enough to infinite scalability and resilience with excellent performance. To demonstrate the performance, consider these numbers I found when doing some very simple performance testing on my laptop (showing timings for all three levels of content retrieval):
INFO - S3TextModel - Temp file not found, going to S3..
INFO - S3TextModel - Content retrieved from S3 in: 1330ms
INFO - S3TextModel - Cache hit!
INFO - S3TextModel - Content retrieved in: 0ms
INFO - S3TextModel - cache miss
INFO - S3TextModel - Temp file found..
INFO - S3TextModel - Content retrieved in: 4ms
On a final note, my webapps don’t have the sort of traffic to really make use of the scalability aspects of this architecture, however I do gain the benefit of quickly and easily being able to restore any crashed servers with a minimum of hassle.
