Tuesday, December 31, 2013

Management of Cache Defeating URLs

What is server side cache?

Server side cache is a repository where dynamic content is rendered and stored as static content.  The example we will use in this article is a landing page for email campaigns that pulls in the current sale items for the week.  Since the content changes only once a week, we can keep the cached version for a week before discarding it.

You develop your content and products for the page and it has a URL of http://www.awesomesite.com/week-1-sales.html. When you call this URL for the first time, it takes 2 seconds to generate the page and place it into the server cache.  The first run has to go to the database and find all the information on the products that are part of this week's sale.

Once the static content is assembled, the cache on the server side makes an entry for /week-1-sales.html and stores it in an easy to retrieve fashion.  The next time you call that same URL it loads nearly instantly, in let's say 500ms, which shaves 75% off the time needed to build that page.
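The behavior described above can be sketched in a few lines of Python. This is only a toy illustration, not how Magento, Varnish or any other product is implemented; the PageCache class and the render_page function are hypothetical names standing in for the real cache and the slow database-backed page build.

```python
from typing import Callable, Dict


class PageCache:
    """Toy server-side page cache keyed by the exact request URL."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render            # slow function that builds the page
        self._store: Dict[str, str] = {}  # URL -> rendered static content
        self.misses = 0                  # how many times a page was rebuilt

    def get(self, url: str) -> str:
        # Build and store the page only the first time this URL is seen.
        if url not in self._store:
            self.misses += 1
            self._store[url] = self._render(url)
        return self._store[url]


def render_page(url: str) -> str:
    # Hypothetical stand-in for the 2 second database-backed build.
    return f"<html>sale items for {url}</html>"


cache = PageCache(render_page)
first = cache.get("/week-1-sales.html")   # built and stored
second = cache.get("/week-1-sales.html")  # served from the cache
print(cache.misses)
```

The second call returns the stored copy without calling render_page again, which is where the time saving comes from.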

This is a good example of how caching works in Magento, Demandware, Drupal, Varnish and nearly any other caching mechanism.

What is a cache defeating URL?

A cache defeating URL is a unique URL to an otherwise static resource.  This may sound complicated but it is a very easy concept to understand.  Marketing campaigns generate by far the most cache defeating URLs.  So how do they do it?

When someone sets up a marketing campaign they always want to know how many people opened the email, clicked on the link and perhaps even purchased a product.  Most bulk email providers can track that through the entire process, including the sale, down to the individual.  This gives the business unit insight into who its best customers are and how effective a campaign is based on the user demographics.

This information is gathered by placing a unique identifier for that customer in the email.  In some cases it is the user's email address.  When you hover over a link in your email you may see something like http://cc.mailprovider.com/clickThrough?cID=HDY7S&email=user@isp.com. You can see that the URL is not going to www.awesomesite.com yet.  It is in fact going to the mail provider first, so the provider can record that you clicked the link.

Once you arrive at the mail provider’s website you will be redirected to your actual landing page.  The URL will often look something like this: http://www.awesomesite.com/week-1-sales.html?cID=HDY7S&email=user@isp.com&grp=3


You can see that it passed along the campaign ID (cID), the user's email, and what group (grp) it went out in.  JavaScript embedded on your website, or your PHP code, picks up these values and stores them in a cookie or session.   This is used while you visit the website to track anything from what you viewed to what you purchased.

So how does this affect the cache?

Remember that our cache was built from the URL http://www.awesomesite.com/week-1-sales.html


However, the redirect from the mail provider requested http://www.awesomesite.com/week-1-sales.html?cID=HDY7S&email=user@isp.com&grp=3


We know by looking at it that it is simply adding the tracking information to the end.  We know that despite the additional parameters the actual content on the page will not change at all.

The web application and the web cache have no way to know whether those new parameters will result in different content.  The result is that the page will be re-generated for the URL with those new parameters, taking a full 2 seconds.  The content will look exactly the same, but the cache will now hold a separate copy for that URL.  If we send the same email campaign to 5 users the cache will now look like this:

http://www.awesomesite.com/week-1-sales.html?cID=HDY7S&email=user@isp.com&grp=3
http://www.awesomesite.com/week-1-sales.html?cID=HDY7S&email=abcr@isp.com&grp=3

The CPU time needed to create all 6 pages (the original page plus one copy per recipient) is 12 seconds.  In contrast, if those same users had all gone to http://www.awesomesite.com/week-1-sales.html only, the total CPU time would be 4.5 seconds: 2 seconds for the first build plus 5 cached responses at 500ms each.  Imagine the amount of time it would take if an email blast to 20,000 visitors resulted in 500 people opening the email at the same time.  Even with 64 cores on the box you would have far more requests than the server could process.  These tracking URLs defeat the cache and will eventually cause the server to stop honoring requests.
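The arithmetic can be checked with a short Python sketch. It assumes the timings used in this article (2 seconds for an uncached build, 500ms for a cached response) and a cache keyed by the full URL; the recipient addresses are made up for illustration.

```python
BUILD_SECONDS = 2.0    # uncached page build, as assumed in this article
CACHED_SECONDS = 0.5   # cached response, as assumed in this article


def total_seconds(urls):
    """Total time to serve `urls` in order with a cache keyed by full URL."""
    seen = set()
    total = 0.0
    for url in urls:
        if url in seen:
            total += CACHED_SECONDS   # cache hit
        else:
            seen.add(url)
            total += BUILD_SECONDS    # cache miss: page is rebuilt
    return total


base = "http://www.awesomesite.com/week-1-sales.html"

# The original page plus 5 recipients, each with a unique tracking URL.
# The email addresses are hypothetical.
tracked = [base] + [
    f"{base}?cID=HDY7S&email=user{i}@isp.com&grp=3" for i in range(5)
]

# The same 6 requests if everyone had used the plain URL.
plain = [base] * 6

print(total_seconds(tracked))  # 12.0
print(total_seconds(plain))    # 4.5
```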

Mail providers are not the only cause of cache defeating URLs.  Referrals can be another source.  If your site participates in referrals then you will often see a refID parameter in your URLs.   One example might be http://www.awesomesite.com/week-1-sales.html?refID=298374

Managing the problem

The largest challenge in managing this problem is understanding why it happens.  Hopefully after reading this far you now understand how important web cache is.  So what can we do now to mitigate the issue?  Magento carts in particular have a big issue with this.

Know your traffic

First, analyze your web logs for a specific URL.  A sysadmin can easily take a few weeks of web logs and tell you which parameters are coming across for a particular URL.  Once you have the list, find out who is sending each parameter and why.  This audit will take a little bit of time but can be valuable in keeping your website running after a great marketing campaign.
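As a rough sketch of that audit, the Python below counts the query parameter names seen for one path. It assumes access log lines in the common Apache/nginx style, where the request appears in quotes as "GET /path?query HTTP/1.1"; the count_params function and the sample log lines are invented for illustration, and a real log format may need a different pattern.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Matches the quoted request portion of a common-format access log line.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_params(log_lines, path):
    """Count how often each query parameter name appears for `path`."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we recognize
        parts = urlsplit(match.group(1))
        if parts.path != path:
            continue  # a different page
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            counts[name] += 1
    return counts


# Hypothetical sample log lines.
sample_log = [
    '203.0.113.5 - - [31/Dec/2013:10:00:00 +0000] '
    '"GET /week-1-sales.html?cID=HDY7S&email=user@isp.com&grp=3 HTTP/1.1" 200 5120',
    '203.0.113.6 - - [31/Dec/2013:10:00:01 +0000] '
    '"GET /week-1-sales.html?refID=298374 HTTP/1.1" 200 5120',
    '203.0.113.7 - - [31/Dec/2013:10:00:02 +0000] '
    '"GET /week-1-sales.html HTTP/1.1" 200 5120',
    '203.0.113.8 - - [31/Dec/2013:10:00:03 +0000] '
    '"GET /other.html?cID=ZZZ HTTP/1.1" 200 100',
]

params = count_params(sample_log, "/week-1-sales.html")
for name, hits in params.most_common():
    print(name, hits)
```

The resulting list of parameter names is the starting point for asking who sends each one and whether it changes the page content.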

Throttle Traffic

This is a simple and effective way to prevent a server crash.  It won’t prevent unnecessary cache build-up, but spreading a 20,000 user email blast over 2 or 3 hours will keep your site running.  This is not always a viable option when you have a time-sensitive promotion on your site.
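Most mail providers offer their own send throttling, so the following is only a sketch of the idea in Python. The schedule_batches function, the 5 minute interval and the recipient addresses are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def schedule_batches(
    recipients: Sequence[str],
    window_minutes: int,
    interval_minutes: int,
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (minutes_after_start, batch) pairs that spread the
    recipients evenly across the sending window."""
    slots = max(1, window_minutes // interval_minutes)
    size = -(-len(recipients) // slots)  # ceiling division
    for slot in range(slots):
        batch = list(recipients[slot * size:(slot + 1) * size])
        if batch:
            yield slot * interval_minutes, batch


# Hypothetical 20,000 recipient list spread over 2 hours in 5 minute steps.
recipients = [f"user{i}@isp.com" for i in range(20000)]
plan = list(schedule_batches(recipients, window_minutes=120, interval_minutes=5))

for offset, batch in plan[:3]:
    print(f"+{offset} min: send to {len(batch)} recipients")
```

Smaller batches mean fewer people open the email in the same minute, which keeps the number of simultaneous page builds within what the server can handle.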

Static Landing Pages

If your marketing blast or referral tracking relies only on JavaScript to read the parameters, you can create a static .html file that includes that JavaScript.  Once the cookies are set via JavaScript the user can proceed to another page without issue.

Page Cache Filter

The last option is to filter out the tracking parameters at the code level. This often means tinkering with the page cache itself and the way it is called. The company I work for, Lyons Consulting Group, has developed just this type of page cache filter for Magento. It allows the Magento cart to work with virtually any mail vendor (Listrak, etc.) by letting the vendor submit parameters that can be removed from the cache key.
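To illustrate the general idea only (this is not the Lyons Consulting Group filter, and it is not Magento code), here is a minimal Python sketch that strips known tracking parameters before the URL is used as a cache key. The cache_key function and the TRACKING_PARAMS list are assumptions; the list would come from your own traffic audit.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters that never change the page content (an assumed list,
# based on the examples in this article).
TRACKING_PARAMS = frozenset({"cID", "email", "grp", "refID"})


def cache_key(url: str) -> str:
    """Return the URL with tracking parameters removed, for cache lookups."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS
    ]
    kept.sort()  # same parameters in a different order share one entry
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


tracked = (
    "http://www.awesomesite.com/week-1-sales.html"
    "?cID=HDY7S&email=user@isp.com&grp=3"
)
print(cache_key(tracked))
print(cache_key("http://www.awesomesite.com/week-1-sales.html?page=2&refID=298374"))
```

Only the cache lookup uses the filtered key. The browser still sees the full URL, so JavaScript on the page can read the tracking values as before, while parameters that do change the content (such as the hypothetical page parameter above) still get their own cache entry.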