Simon Newton

Deploying a scalable CMS with frequent content updates using Aqua ION

Blog Post created by Simon Newton on Sep 14, 2015

Content management systems such as WordPress are pervasive, functionally rich and well supported by their developer communities. However, they are generally not known for their ability to scale massively with ease. The benefits of Content Delivery Networks and caching for scaling a CMS and other applications are well proven. Akamai’s products are well suited to, and industry leading at, scaling applications: they minimize origin load, both in bandwidth volume and in hits per second, and by extension offer significant cost savings. As anyone who follows our technology knows, Akamai’s Edge Network offers unparalleled reach into end user networks the globe over.

 

Caching content closer to your end users plays a large part in reducing origin traffic, accelerating the end user experience and dealing with flash traffic (though at Akamai we spend a lot of our time finding additional ways to make your customers’ experience faster, for example through technologies such as Akamai Adaptive Image Compression). While we are on the topic of caching, Akshay Ranganath has done a nice write-up on how caching in Akamai can increase performance: https://community.akamai.com/community/web-performance/blog/2015/08/12/measuring-impact-of-page-caching

 

With caching, though, there is a catch: a time trade-off. The longer you cache a piece of content in Akamai, the longer end users must wait to see changes to your site. Longer cache lifetimes appeal to the infrastructure and finance teams, as higher Akamai offload means less CapEx and OpEx spent on infrastructure to handle true end user load. In contrast, longer cache lifetimes tend not to appeal to marketing teams, news teams, sales teams, agile application teams and so on, who would prefer end user screens to react to change quickly: a scalable CMS with rapid updates. After all, it’s a faster forward world and no one likes waiting any more.

 

A rapid content update approach can support quick content changes such as fresh news, homepage updates, updates to marketing splashes, rectification of publishing mistakes, bug fixes, on-the-fly theme updates, updates to user generated content and much more. Being able to push changes to large numbers of users in a short timeframe is, nowadays, a desirable business approach. Of course, the drawback is that the more often content is updated, the less it can be cached. This implies that, even with a CDN, the origin needs to be scaled to handle a significant number of users in order to support rapidly updating content.

 

Now we are faced with a scenario: the business needs to decide how frequently end users should receive updates or changes, and the trade-off for more frequent updates is the need to scale the origin. Simply put, the more frequently updates must reach users, the larger and more costly the origin infrastructure needs to be to support that.

 

So a compromise between the customer’s business teams is reached, but not until many man-hours have been spent across stakeholders, who need to meet (sometimes many times) to agree on update delay versus cache lifetime. As you can probably guess, the desire to cache for a long time at the Edge in order to reduce origin resources can pit the needs of the infrastructure and finance teams against the needs of the marketing, application or other teams. Not to mention the cost to the business in man-hours that just having these conversations brings.

 

The thing is… really there is a way to ensure everyone’s needs are met without that time and resource cost overhead.

 

When my customers have relayed to me that they have faced similar discussions, the outcome seems to be a “happy medium”. For example, HTML, JavaScript and CSS will be cached for short periods in Akamai; values from 1 to 5 minutes are common. Assets such as images, fonts, icons and media clips are cached for a longer lifetime; 7 days and 30 days are common values I see for this category. Sometimes customers choose to refresh HTML content every time an end user asks Akamai for it (the “0 second” TTL, or using ‘no-store’ with Akamai SureRoute). This can make the marketing team happy, but if a flash crowd arrives it can make the infrastructure team most unhappy, as the origin will see a very noticeable increase in load; sometimes, embarrassingly, too much. Personally, I would advise a customer against using a “0s TTL” or “no-store” approach on any high volume CMS where flash traffic is expected, as the cost of the origin infrastructure required to serve tens of thousands of unique users per hour in a timely manner can be staggeringly high (my life before Akamai was in large scale infrastructure and applications, so I know this pain all too well).
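To make the “happy medium” concrete, here is a minimal Python sketch of the split described above. It is only an illustration and not Akamai Property Manager configuration: the function name, the extension list and the choice of 5 minutes and 7 days (picked from the ranges mentioned above) are all assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical TTLs in seconds, picked from the ranges mentioned in the text.
SHORT_TTL = 5 * 60            # HTML, JavaScript, CSS: 1 to 5 minutes is common
LONG_TTL = 7 * 24 * 60 * 60   # images, fonts, icons, media: 7 or 30 days is common

# Hypothetical list of "rarely changing" asset extensions.
LONG_LIVED = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".woff", ".woff2", ".mp4"}


def ttl_for(path: str) -> int:
    """Return the cache lifetime (seconds) this sketch would assign to a URL path."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in LONG_LIVED:
        return LONG_TTL
    # Everything else, including extensionless CMS page URLs, is treated
    # as rapidly changing content (HTML, JavaScript, CSS).
    return SHORT_TTL


if __name__ == "__main__":
    for p in ("/index.html", "/theme/site.css", "/img/logo.PNG", "/news/today"):
        print(p, ttl_for(p))
```

In this sketch JavaScript and CSS share the short lifetime with HTML; a site whose look and feel rarely changes could just as well list `.js` and `.css` among the long-lived extensions.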

 

Something that is definitely worth noting at this point is that I recommend longer cache lifetimes on JS and CSS objects when the need for rapid updates to the look and feel of the site is deemed to be rare or unlikely. Let’s imagine that your front end/UI team only changes the JavaScript and CSS every month or two. In that case, the JavaScript and CSS would go into something like the 7 day/30 day rule inside Akamai Property Manager for longer caching. If an update to the look and feel then needs to be made rapidly in an emergency, the Akamai CCU or CCU API is used to force a refresh of the front end/UI assets.

 

So, that aside, if we have caching of HTML in place, why would a flash crowd cause problems when a customer has done a basic self-integration (or has not planned in advance with Akamai)? The answer is that flash crowd traffic patterns can cause significant origin load when the Akamai Edge servers must refresh content from the origin often. This is especially true where business requirements drive rapid updates through low cache lifetimes. In this case, origin traffic can be significantly (and sometimes unexpectedly) high during flash crowds. This can lead to problems and embarrassment during the first ‘call to action’ scenario, many man-hours at “battle stations” to combat it, and then many more man-hours afterwards providing reports on the situation.

 

So what we need to solve our problem here is an Akamai configuration that will cache content for large traffic volumes, while updating that content from origin at regular intervals with as few origin requests as possible. Tall order? Not for Akamai.

 

To get there, there are a few things to consider first. Most CMSs tend to be computationally intensive when generating the actual HTML pages themselves. As a general rule, the app server spends more of its thinking time on generating the HTML than it does on dishing out CSS, JavaScript or images. So, even with a high offload on JS/CSS/image assets, the origin can still find itself having to do a lot of thinking just servicing the average user. We also need to bear in mind that the more requests for computationally expensive HTML pages reach the origin, the more the load climbs, which of course has the knock-on effect of lengthening HTTP response times. Let’s park that thought for the moment and come back to it shortly.

 

Now we should probably cover the Akamai Edge “synchronous refresh” mechanism. This is what is invoked when an object has expired in the Akamai Edge cache and the Edge needs to revalidate the requested object with the origin. Imagine for a moment that a user hits an Edge server and requests an object that is in cache, but whose TTL has expired. The Edge can’t serve that expired object to the end user as it stands, but it can if the same object on the origin has not changed. So the Edge goes forward to the origin to revalidate the object via an If-Modified-Since (IMS) GET. Imagine that exactly as that Edge-to-origin revalidation is happening, another 19 users hit that same Edge server and ask for the same piece of content. Those 19 new users don’t wait for the origin response to come back for the first user, and they can’t be served the expired content from the cache. This implies that each of those 19 users will generate a new origin-bound request from the Edge. So in order to refresh this one piece of expired content that 20 users want, there are 20 requests to the origin. With long TTLs this is not a problem, but when you are talking about low TTLs for rapid updates to content (add flash traffic too), things can get dicey. Scale this up even conservatively and suddenly the origin has to deal with potentially significant traffic when only one single object refreshes in cache. If an object is refreshed every time (‘0s’ TTL), or has a low TTL, then for large user bases and flash traffic this soon adds up in origin resources.
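The 20-users-to-20-requests arithmetic can be reproduced with a toy model. The Python sketch below is only an illustration of the behavior described above for a single object on a single Edge server; the function name, the timings and the simplification that the object becomes fresh the instant the first origin response arrives are all assumptions, not a description of how the Edge is implemented.

```python
def origin_requests(arrivals, ttl, origin_latency):
    """Count origin-bound requests for one object on one Edge server.

    arrivals:        request arrival times in seconds
    ttl:             cache lifetime in seconds after a completed refresh
    origin_latency:  seconds the origin takes to answer a refresh

    Toy model of synchronous refresh: while the object is expired and the
    first refresh has not yet completed, every request goes to the origin.
    """
    fresh_until = float("-inf")   # the object starts out expired
    refresh_done = None           # completion time of the in-flight refresh
    forwarded = 0
    for t in sorted(arrivals):
        # If the in-flight refresh has completed by now, the object is fresh again.
        if refresh_done is not None and t >= refresh_done:
            fresh_until = refresh_done + ttl
            refresh_done = None
        if t < fresh_until:
            continue              # served from cache, no origin request
        forwarded += 1            # expired: this request goes to the origin
        if refresh_done is None:
            refresh_done = t + origin_latency
    return forwarded


if __name__ == "__main__":
    # One user at t=0, then 19 more within the next 0.2 seconds.
    burst = [0.0] + [0.01 * i for i in range(1, 20)]
    print(origin_requests(burst, ttl=60, origin_latency=0.5))          # 20
    # Users arriving after the refresh completes are served from cache.
    print(origin_requests(burst + [1.0, 30.0], ttl=60, origin_latency=0.5))  # 20
```

All 20 requests in the burst reach the origin because they arrive before the first response comes back; the two later users cost the origin nothing.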

 

Sending many requests to an origin to refresh one object may sound like a counterintuitive thing for a network such as Akamai to do, but it is actually the default, and essential, functionality of the Akamai Edge. Imagine that from the 20 users above, only 1 request was made to origin (for the first user). The remaining 19 users have not made an origin request each; they just sit and wait to share the response the first user gets. Imagine then that the origin just happened to have an error for that one user and returned a 5XX response code. The remaining 19 users would also receive that same error, in effect magnifying what was a small origin error so that it affects a wider user base. With functionality like cache parenting, the magnification factor would be even larger. Akamai’s default behavior therefore is to send a new request to origin for each end user that requests an expired object whose refresh with origin has begun but not yet completed. This default behavior can be changed, which we will cover later in the configuration part of this series.

 

Now let’s come back to the origin load. We left it climbing as the CMS crunches HTML pages and serves expired object updates to more and more users coming through from Akamai synchronously on a 0s or low TTL. As the origin load increases, HTTP response times will inevitably climb. Now we get into a sticky situation, where those 20 users refreshing content from before are waiting longer than 1 second for a response. This means we are now in second 2, and the next flash crowd of, say, 30 users is also submitting requests for the content to that same Edge server. The Edge sees the object is still expired and not yet refreshed by the origin (the origin load is causing it to take its time now, remember). There is no choice but to submit an additional 30 requests towards the origin to refresh the content.
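A small Python sketch shows how quickly this adds up once the origin response time crosses the one second mark. It is a deliberately crude model with hypothetical numbers: it assumes every user who arrives in a second during which the first refresh is still outstanding is forwarded to the origin, and the function name is made up for this illustration.

```python
import math


def forwarded_during_refresh(crowd_per_second, origin_response_seconds):
    """Requests sent to origin before the first refresh response arrives.

    crowd_per_second[i] is the number of users arriving during second i + 1.
    Every second that starts while the refresh is still outstanding sends
    its whole crowd to the origin (a simplifying assumption).
    """
    seconds_waiting = math.ceil(origin_response_seconds)
    return sum(crowd_per_second[:seconds_waiting])


if __name__ == "__main__":
    crowd = [20, 30, 40]  # hypothetical flash crowd growing each second
    print(forwarded_during_refresh(crowd, 0.8))  # 20: origin answers within second 1
    print(forwarded_during_refresh(crowd, 1.4))  # 50: the 30 users of second 2 pile on
    print(forwarded_during_refresh(crowd, 2.5))  # 90: and the 40 users of second 3
```

The model leaves out the feedback loop, in which each extra origin request pushes the response time up further, so if anything it understates the problem.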

 

Scale this up over the number of your end users, the number of Edge servers and then throw origin load based response times into the mix and we are into “here be dragons” country. The situation can snowball at a time of flash traffic when everyone is hoping for the best performance.

 

What we need is a way to support lots of users in flash/bursty traffic patterns, while keeping requests to the origin relatively consistent through the day, in a more uniform, predictable pattern. This could be seen as a bit of a ‘pot of gold at the end of the rainbow’ for origin engineers, as predictable, repeating traffic patterns make it very easy to plan future scale, OpEx forecasts and so on, which is not usually easy when dealing with high volume traffic and large flash crowds.

 

So how do we solve these problems together? Primarily, I would recommend always working with Akamai PS to configure correctly for high volume traffic.  That being said, this series could give you some good pointers and help get things underway via the self-service route to save time.

 

Before you begin configuring, it’s best to get your business’s buy-in to the caching strategy we want to build around. It is also important to figure out whether your application team wishes to make regular, rapid updates to the look and feel of the site. If yes, your JavaScript and CSS will reside in the same Akamai Property Manager caching rule as your HTML, with the same features enabled. If not, your JavaScript and CSS will reside in the same caching rule as your images, fonts, icons and so on. In the latter case, if an urgent update is required to content with higher TTLs, Akamai’s CCU or CCU API can be utilized.

 

For the rapid update content, we need to balance caching and update times. Here is a good formula I recommend:

 

  1. HTML caching: instead of revalidating with origin for every end user request, or having a fixed low TTL of a minute or two, imagine that the vast majority of users receive updates every 60 seconds, with a small portion of users sometimes seeing content up to 2 minutes old, but with a very significant reduction in the number of content refresh requests made towards origin.
  2. Images, fonts and icons can be cached for a long time, as these rarely change. On the rare occasions that these items do change, new filenames can be used to ensure new images are picked up quickly, or Akamai’s CCU/CCU API can be used to force a content refresh. TTLs of 7 days or 30 days are common here, though I have seen sites caching for months (or more!) for very popular, very static content.
  3. The origin needs to be enabled to support ‘If-Modified-Since’ HTTP GET and HTTP 304 responses.
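Point 3 is standard HTTP conditional request handling. As a minimal sketch, assuming the origin knows each page’s last modification time, the decision between a full HTTP 200 and a body-less HTTP 304 might look like the Python below. The `respond` function is hypothetical and not part of any CMS or Akamai API; only the standard library’s `email.utils` date helpers are used.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def respond(last_modified: datetime, if_modified_since: Optional[str]):
    """Return (status, headers) for a GET of a page last changed at `last_modified`.

    `last_modified` must be a timezone-aware UTC datetime.
    """
    # HTTP dates have one-second resolution.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall back to a full response
        if since is not None:
            if since.tzinfo is None:
                # A "-0000" zone parses as naive; treat it as UTC.
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # unchanged: no body needs to be sent
    return 200, headers  # changed, or no usable validator: send the full page


if __name__ == "__main__":
    changed_at = datetime(2015, 9, 14, 11, 40, 31, tzinfo=timezone.utc)
    print(respond(changed_at, None))
    print(respond(changed_at, "Mon, 14 Sep 2015 11:40:31 GMT"))
    print(respond(changed_at, "Mon, 14 Sep 2015 11:00:00 GMT"))
```

The 304 path is what keeps origin bandwidth low: the origin confirms the cached copy is still good without regenerating or resending the page body.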

 

 

The times specified in (1) could be adjusted upwards or downwards, but generally I find that a 60 or 120 second TTL on the HTML content for the vast majority of users tends to be a sweet spot, though some customers like the 3 to 5 minute range. The number could be in the higher range if you find your business needs updates less frequently. The number could also be smaller, but bear in mind that we are trying to keep origin requests down: lower numbers will mean more frequent refreshes from Akamai to origin.
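The effect of the TTL on origin load can be estimated with back-of-the-envelope arithmetic. The Python sketch below assumes the ideal case of exactly one origin request per object, per refreshing server, per TTL period. The function name and the object and server counts are hypothetical and are not measurements from any real configuration.

```python
def ideal_origin_hits_per_second(objects, refreshing_servers, ttl_seconds):
    """Best-case origin request rate when every object is refreshed once per TTL.

    objects:             number of distinct cacheable pages (hypothetical)
    refreshing_servers:  servers that refresh directly from origin (hypothetical)
    ttl_seconds:         cache lifetime of each object
    """
    return objects * refreshing_servers / ttl_seconds


if __name__ == "__main__":
    for ttl in (60, 120, 300):
        rate = ideal_origin_hits_per_second(objects=100, refreshing_servers=10,
                                            ttl_seconds=ttl)
        print(f"TTL {ttl:>3}s -> {rate:.1f} origin hits/second")
```

Halving the TTL doubles the best-case origin request rate, which is why small TTL values are worth questioning even when the refresh behavior itself is well tuned.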

 

Let’s take a look at a graph for a customer whose CMS we have integrated to handle flash traffic peaks:

 

[Image: graph of unique visitors per hour]

 

You can see the traffic is around 25k unique visitors per daytime hour, with clear flash traffic building during TV “calls to action”: up to approx. 60k unique users per hour during the later evening of the 8th, with a peak again in the late morning of the 10th. On the 11th there are two peaks, one around noon and the other in the early afternoon, but on the 12th we see a significant increase in end users (1.1m total unique), likely due to heavy media campaigning.

 

What if I told you this CMS-driven site ensures nearly all of its users’ HTML, JS and CSS content is refreshed from origin every 3.5 minutes, yet at no point did the origin receive more than 28 hits per second from Akamai or use more than 3 Mbps of bandwidth?

 

Here we see the Akamai platform peaking at 0.5 Gbps serving this WordPress CMS site, yet origin bandwidth is generally less than 1 Mbps during everyday use.

[Image: graph of Akamai platform bandwidth versus origin bandwidth]

 

Let’s check those figures from a hits perspective: generally the origin receives around 10 hits per second, though it does peak at up to 30 hits per second on the 12th to support the 1.1m user flash crowd.


[Image: graph of origin hits per second]

 

Of course, it’s advisable to not update content in the middle of a flash traffic situation as during those periods you want to keep as much offload as you can by utilizing IMS GET / HTTP 304 - but it is something that can be done. For the example site we are talking about, up to 35% of origin responses during the flash traffic are HTTP 200 to supply new/changed content to Akamai:

 

[Image: graph of origin responses by HTTP status code]

 

In part two of this blog series I’m going to walk you through some configuration settings on how to achieve something similar to the results here using Akamai’s Aqua ION product. I’m also going to cover a couple of the cool features of Aqua ION, which should be turned on as general practice.

 

Stay tuned for the next installment, and please feel free to follow me for future updates.

 

UPDATE: Part 2 available here
