Content management systems such as WordPress are pervasive, functionally rich and backed by active developer communities; however, they are generally not known for their ability to scale massively with ease. The benefits of Content Delivery Networks and caching for scaling CMS-driven and other applications are well proven. Akamai’s products are well suited to – and industry leading at – scaling applications: minimizing origin load, in both bandwidth volume and hits per second, and, by extension, offering significant cost savings. As anyone who follows our technology knows, Akamai’s Edge Network offers unparalleled reach into end-user networks the globe over.
Caching content closer to your end users plays a large part in reducing origin traffic, accelerating the end-user experience and absorbing flash traffic (though at Akamai we spend a lot of our time finding additional ways to make your customers’ experience faster, for example through technologies such as Akamai Adaptive Image Compression). While we are on the topic of caching, Akshay Ranganath has done a nice write-up on how caching in Akamai can increase performance: https://community.akamai.com/community/web-performance/blog/2015/08/12/measuring-impact-of-page-caching
With caching, though, there is a catch: a time trade-off. The longer you cache a piece of content in Akamai, the longer end users must wait to see changes to your site. Longer cache lifetimes appeal to the infrastructure and finance teams, as higher Akamai offload means less CapEx and OpEx in infrastructure to handle true end-user load. In contrast, longer cache lifetimes tend not to appeal to the business needs of marketing teams, news teams, sales teams, agile application teams and so on, who would prefer end-user screens to react to change quickly – a scalable CMS with rapid updates. After all, it’s a fast-forward world and no one likes waiting any more.
A rapid content update approach can support quick content changes such as fresh news, homepage updates, updates to marketing splashes, rectification of publishing mistakes, bug fixes, on-the-fly theme updates, updates to user-generated content and much more. Being able to push changes to large numbers of users in a short timeframe is, nowadays, a desirable business capability. Of course, the drawback is that more frequent content updates mean less caching, which implies that, even with a CDN in front, the origin must be scaled to handle a significant number of users in order to support rapidly updating content.
Now we are faced with a scenario: the business needs to decide how frequently end users should receive updates or changes, and then scale the origin to match that trade-off. Simply put, the more frequently rapid updates happen, the larger and more costly the origin infrastructure needs to be to support them.
So a compromise between the customer’s business teams is reached – but not until many man-hours are spent across stakeholders, who will need to meet (sometimes many times) to agree on update delay versus cache lifetime. As you can probably guess, the desire to cache for a long time in the Edge in order to reduce origin resources can pit the business needs of the infrastructure and finance teams against those of the marketing, application or other teams. Not to mention the cost to the business in man-hours that just having these conversations brings.
The thing is, there really is a way to ensure everyone’s needs are met without that time and resource overhead.
So, that aside, if we have caching of HTML in place, why would a flash crowd cause problems when a customer has done a basic self-integration (or has not planned in advance with Akamai)? The answer is that flash-crowd traffic patterns can cause significant origin load when the Akamai Edge servers must refresh content from the origin often. This is especially true where business requirements drive rapid updates through low cache lifetimes. In that case, origin traffic can be significantly (and sometimes unexpectedly) high during flash crowds. This can lead to problems and embarrassment during the first ‘call to action’ scenario, many man-hours at “battle stations” to combat it, and many man-hours afterwards providing reports on the situation.
So what we need to solve our problem is an Akamai configuration that will cache content for large traffic volumes, while updating that content from origin at regular intervals with the least number of origin requests possible. Tall order? Not for Akamai.
Now we should probably cover the Akamai Edge “synchronous refresh” mechanism. This is what is invoked when an object has expired in the Akamai Edge cache and the Edge needs to revalidate the requested object with the origin. Imagine for a moment that a user hits an Edge server and requests an object that is in cache, but whose TTL has expired. The Edge can’t serve that expired object to the end user – but it could, if the same object on the origin has not changed. So the Edge goes forward to the origin to revalidate the object via an IMS (If-Modified-Since) GET. Now imagine that, exactly as that Edge<>origin revalidation is happening, another 19 users hit that same Edge server and ask for the same piece of content. Those 19 new users don’t wait for the origin response to come back for the first user, and they can’t be served the expired content from cache, so each of them generates a new origin-bound request from the Edge. In order to refresh this one piece of expired content that 20 users want, there are 20 requests to the origin. With long TTLs this is not a problem, but when you are talking about low TTLs for rapid updates to content (add flash traffic too), things can get dicey: scale this up even conservatively and suddenly the origin has to deal with potentially significant traffic, when only one single object is refreshing in cache. If an object refreshes on every request (a 0s TTL) or has a low TTL, then for large user bases and flash traffic this soon adds up in origin resources.
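To get a feel for the numbers, here is a toy back-of-envelope model of synchronous refresh. Everything in it is illustrative: the user counts, edge-server count and TTLs are hypothetical, and real Akamai refresh behavior depends on configuration.

```python
# Toy model of synchronous refresh: while an expired object is being
# revalidated with the origin, each new request for it ALSO goes to origin.
# All names and numbers are illustrative, not Akamai internals.

def origin_requests_per_refresh(concurrent_requesters: int) -> int:
    """With synchronous refresh, every requester of an expired object
    triggers its own origin-bound revalidation (IMS GET)."""
    return concurrent_requesters

def origin_requests_per_hour(users_per_sec: float, ttl_sec: int,
                             edge_servers: int) -> float:
    """Rough upper bound for one object: each edge server refreshes once
    per TTL window, and every user arriving mid-refresh adds a request.
    Assumes users spread evenly across edge servers and refreshes that
    complete within about a second."""
    refreshes_per_hour = 3600 / ttl_sec
    users_per_refresh = users_per_sec / edge_servers  # arrivals mid-refresh
    return edge_servers * refreshes_per_hour * users_per_refresh

# 20 users hitting one edge during a single refresh -> 20 origin requests
print(origin_requests_per_refresh(20))
# 100 users/sec across 50 edge servers at a 60s TTL
print(origin_requests_per_hour(100, 60, 50))
```

Even this crude model makes the point: halving the TTL doubles the refresh traffic for the same audience.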
Sending many requests to an origin to refresh one object may sound counterintuitive for a network such as Akamai, but it is actually the default – and essential – behavior of the Akamai Edge. Imagine that, of the 20 users above, only one request was made to origin (for the first user). The remaining 19 users make no origin request each; they just sit and wait to share the response the first user gets. Now imagine the origin just happened to return an error, say a 5XX response code, for that one user. The remaining 19 users would also receive that same error, in effect magnifying a small origin error to affect a wider user base. With functionality like cache parenting, the magnification factor would be even larger. Akamai’s default behavior is therefore to send a new request to origin for each end user who requests an expired object that has already begun, but not yet completed, refreshing with origin. This default behavior can be changed, which we will cover later in the configuration part of this series.
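For illustration, the “collapsed” alternative described above – one leader request to origin while everyone else waits and shares the result, errors included – can be sketched roughly like this. This is hypothetical code, not Akamai internals; it just shows the mechanism and why an error would be shared.

```python
# Sketch of collapsed (single-flight) refresh, the OPPOSITE of the
# synchronous-refresh default: one origin request per expired object,
# with followers sharing the leader's result - errors included.
import threading

class CollapsedRefresher:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._inflight = {}  # url -> {"done": Event, "result": shared slot}

    def refresh(self, url):
        with self._lock:
            entry = self._inflight.get(url)
            if entry is None:
                entry = {"done": threading.Event(), "result": None}
                self._inflight[url] = entry
                leader = True
            else:
                leader = False
        if leader:
            try:
                entry["result"] = self._fetch(url)  # the ONE origin hit
            finally:
                entry["done"].set()
                with self._lock:
                    del self._inflight[url]
        else:
            entry["done"].wait()  # follower: share whatever the leader got
        return entry["result"]
```

If `fetch_from_origin` returns a 5XX for the leader, every waiting follower receives that same error, which is exactly the magnification risk the default behavior avoids.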
Now let’s come back to the origin load. We left it climbing as the CMS crunches HTML pages and serves expired-object updates to more and more users coming through from Akamai synchronously on a 0s or low TTL. As the origin load increases, HTTP response times inevitably climb. Now we get into a sticky situation, where those 20 users refreshing content from before are waiting longer than one second for a response. That puts us in second two, when the next wave of, say, 30 users submits requests for the same content to that same Edge server. The Edge sees the object is still expired but not yet refreshed by the origin (origin load is causing it to take its time now, remember). There is no choice but to submit an additional 30 requests towards the origin to refresh the content.
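A toy simulation of that snowball, with entirely made-up latency numbers, shows how quickly in-flight origin requests can stack up once origin response time exceeds one second:

```python
# Toy snowball simulation: origin latency grows with in-flight load, and
# while a refresh is outstanding every new arrival becomes another
# origin-bound request. Numbers are illustrative, not measured behavior.

def inflight_refresh_requests(arrivals_per_sec,
                              base_latency=0.5, slowdown=0.05):
    """arrivals_per_sec: new requesters per second for one expired object.
    Returns the in-flight origin request count at the end of each second."""
    inflight = 0
    history = []
    for arrivals in arrivals_per_sec:
        inflight += arrivals                     # all go origin-bound
        latency = base_latency + slowdown * inflight
        if latency <= 1.0:                       # refresh completes in time
            inflight = 0
        history.append(inflight)
    return history

# 20 users in second one, 30 more in second two, 40 in second three:
print(inflight_refresh_requests([20, 30, 40]))   # -> [20, 50, 90]
```

With light traffic the refresh completes within the second and the count resets to zero; under flash traffic it compounds, which is the “here be dragons” territory described next.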
Scale this up over the number of your end users, the number of Edge servers and then throw origin load based response times into the mix and we are into “here be dragons” country. The situation can snowball at a time of flash traffic when everyone is hoping for the best performance.
What we need is a way to support lots of users in flash/bursty traffic patterns – but also keep requests to the origin relatively consistent through the day, in a more uniform, predictable pattern. This is something of a pot of gold at the end of the rainbow for origin engineers, as predictable, repeating traffic patterns make for very easy planning of future scale, OpEx forecasts and the like, which are not usually easy when dealing with high-volume traffic and large flash crowds.
So how do we solve these problems together? Primarily, I would recommend always working with Akamai PS to configure correctly for high volume traffic. That being said, this series could give you some good pointers and help get things underway via the self-service route to save time.
For the rapid-update content, we need to balance caching and update times. Here is a good formula I recommend:
1. HTML caching: instead of revalidating with origin for every end-user request, or using a fixed low TTL of a minute or two, imagine that the vast majority of users receive updates every 60 seconds, with a small portion of users sometimes seeing content up to 2 minutes old – but with a very significant reduction in the number of content-refresh requests made towards origin.
2. Images, fonts and icons can cache for a long time, as these rarely change. In the rare event that they do change, new filenames can be used to ensure new assets are picked up quickly – or Akamai’s CCU/CCU API can be used to force a content refresh. TTLs of 7 or 30 days are common here, though I have seen sites caching for months (or more!) for very popular, very static content.
3. The origin needs to support conditional ‘If-Modified-Since’ HTTP GET requests and HTTP 304 (Not Modified) responses.
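As a minimal sketch of the origin-side requirement, here is a hypothetical single-page origin that honors If-Modified-Since, answering with a bodyless 304 when the page has not changed since the edge’s copy. This is illustrative Python stdlib code only; a real CMS origin would compute Last-Modified from its publish data.

```python
# Hypothetical origin honoring If-Modified-Since (IMS) conditional GETs.
# A 304 carries no body, so unchanged revalidations cost almost nothing.
from http.server import BaseHTTPRequestHandler, HTTPServer
from email.utils import formatdate, parsedate_to_datetime

PAGE = b"<html>hello</html>"   # hypothetical page body
LAST_MODIFIED = 1700000000     # epoch seconds of the last publish

class OriginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ims = self.headers.get("If-Modified-Since")
        if ims:
            try:
                if parsedate_to_datetime(ims).timestamp() >= LAST_MODIFIED:
                    self.send_response(304)   # unchanged: no body sent
                    self.end_headers()
                    return
            except (TypeError, ValueError):
                pass                          # malformed date: serve in full
        self.send_response(200)
        self.send_header("Last-Modified",
                         formatdate(LAST_MODIFIED, usegmt=True))
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):             # keep the sketch quiet
        pass

def serve(port=0):
    """Build the origin server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), OriginHandler)
```

Run it with something like `serve(8080).serve_forever()`; an edge-style revalidation is then a GET carrying the previously returned Last-Modified value in an If-Modified-Since header, and it comes back as a 304 when nothing has changed.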
The times specified in (1) can be adjusted upwards or downwards, but generally I find that a 60- or 120-second TTL on the HTML content tends to be the sweet spot for the vast majority of users, though some customers like the 3-to-5-minute range. The number could be higher if your business needs for updates are less frequent. It could also be smaller, but bear in mind that we are trying to keep origin requests down: lower numbers mean more frequent refreshes from Akamai to origin.
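As a rough illustration of that trade-off, assuming (hypothetically) that each edge region holding the page revalidates about once per TTL window, with refresh traffic otherwise kept collapsed:

```python
# Back-of-envelope: origin refresh traffic per object as a function of
# HTML TTL. The edge-region count and once-per-TTL assumption are
# illustrative, not a guarantee of real platform behavior.

def origin_refreshes_per_hour(ttl_seconds: int,
                              edge_regions: int = 100) -> float:
    """One revalidation per edge region per TTL window."""
    return edge_regions * (3600 / ttl_seconds)

for ttl in (30, 60, 120, 300):
    print(f"TTL {ttl:>3}s -> {origin_refreshes_per_hour(ttl):>6.0f} "
          f"origin refreshes/hour")
```

Going from a 120s TTL to 30s quadruples the refresh rate for the same audience, which is exactly why the lower end of the range should be chosen deliberately rather than by default.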
Let’s take a look at a graph from a customer whose CMS we have integrated to handle flash-traffic peaks:
You can see traffic of around 25k unique visitors per daytime hour, with clear flash traffic building during TV “calls to action” – up to approximately 60k unique users per hour during the later evening of the 8th, with a peak again in the late morning of the 10th. On the 11th there are two peaks, one around noon and the other in the early afternoon, but on the 12th we see a significant increase in end users (1.1m total uniques), likely due to heavy media campaigning.
What if I told you that this CMS-driven site ensures nearly all of its users’ HTML, JS and CSS content refreshes from origin every 3.5 minutes – yet at no point did the origin receive more than 28 hits per second from Akamai or use more than 3 Mbps of bandwidth?
Here we see the Akamai platform peaking at 0.5 Gbps serving this WordPress CMS site, yet origin bandwidth is generally less than a megabit per second during everyday use.
Let’s check those figures from a hits perspective: generally the origin receives around 10 hits per second, though it does peak at up to 30 hits per second on the 12th to support the 1.1m-user flash crowd.
Of course, it’s advisable not to update content in the middle of a flash-traffic situation, since during those periods you want to keep as much offload as you can by utilizing IMS GETs and HTTP 304 responses – but it is something that can be done. For the example site we are discussing, up to 35% of origin responses during the flash traffic are HTTP 200s supplying new/changed content to Akamai:
In part two of this blog series I’m going to walk you through some configuration settings on how to achieve something similar to the results here using Akamai’s Aqua ION product. I’m also going to cover a couple of the cool features of Aqua ION, which should be turned on as general practice.
Stay tuned for the next installment, and please feel free to follow me for future updates.
UPDATE: Part 2 available here