Cache is king

Nov 24, 2024

The old investment adage says cash is king. In networking, cache is king. And you should learn to cache like a king.
I’ll show why it matters, cover a few things to know, and share my ad-hoc solution for keeping my cache warm.

Why caching matters

There are many reasons to cache HTTP responses, whether you run a website or an application that downloads assets over the web.

First, resource optimization and cost:

Caching reduces server load and bandwidth usage.
As a result, it can reduce the number of hosts your cluster needs (and thus the cost).
If the origin is cloud storage (S3 or the like), it reduces your request and egress fees.

Second, availability: if your server goes through a blip, the cache can continue serving the content; your audience won’t notice.

Finally, performance and engagement: cached assets download faster. It’s been reported that a 100-millisecond delay in website load time can hurt conversion rates by 7%. Even if you are not “selling” or “converting”, take it as a sign of how much users care. Milliseconds count.

Get ready to cache

It is crucial that your application sets the appropriate caching headers. Everything we are going to discuss is tooling and infrastructure that supports the caching policy you set. Without a proper policy, your content may not be cached at all (Cloudflare, for instance, does not cache HTML by default), making the rest of your efforts futile. Read https://web.dev/articles/http-cache if you need a refresher.

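Cache-Control: sets the cache policy (who may cache the response and for how long)
Last-Modified: enables conditional retrieval; the client sends If-Modified-Since and only downloads the content if it has changed since that date
ETag: similar to Last-Modified, but based on a content identifier (sent back as If-None-Match) rather than a date
stale-while-revalidate: a Cache-Control directive that allows serving slightly stale content while it is refreshed in the background for the next request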
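To make these headers concrete, here is a minimal Go sketch of an origin handler that sets a cache policy and answers conditional requests. The five-minute policy, the handler, and the helper names are illustrative assumptions, not a recommendation for any particular site. Real If-None-Match headers can also carry lists and weak validators, which this sketch ignores.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// etagFor derives an ETag from the body: a quoted, truncated SHA-256 digest.
func etagFor(body []byte) string {
	sum := sha256.Sum256(body)
	return `"` + hex.EncodeToString(sum[:8]) + `"`
}

// cachedHandler serves body with an explicit cache policy and replies
// 304 Not Modified when the client already holds the current version.
func cachedHandler(body []byte) http.HandlerFunc {
	etag := etagFor(body)
	return func(w http.ResponseWriter, r *http.Request) {
		// Assumed policy: fresh for 5 minutes, then servable stale for
		// 1 more minute while the cache revalidates in the background.
		w.Header().Set("Cache-Control", "public, max-age=300, stale-while-revalidate=60")
		w.Header().Set("ETag", etag)
		// Simplified comparison: exact match on a single validator only.
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Write(body)
	}
}

// statusFor sends one in-memory request through h and returns the status code.
func statusFor(h http.Handler, ifNoneMatch string) int {
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	body := []byte("<h1>hello</h1>")
	h := cachedHandler(body)
	fmt.Println(statusFor(h, ""))            // 200: first visit, full body
	fmt.Println(statusFor(h, etagFor(body))) // 304: validator still matches
}
```

For files on disk, the standard library’s http.ServeContent already handles these conditional headers, so a hand-rolled check like this is mostly useful for generated content.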

You should have a strategy for expiring long-lived caches. For assets that are supposed to be semi-static, you want very long cache periods. But what happens when there’s a change? Say you run a link-in-bio service where user profiles are updated rarely. You might want a 1-year cache policy, but if a customer happens to update their page, they will expect their update to be served immediately. Serving the previous version two hours after the update would be an unacceptable customer experience.
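One common strategy is to keep the long cache period and explicitly purge the affected URLs whenever the content changes. The Go sketch below builds such a purge request, assuming Cloudflare’s purge-by-URL endpoint; the zone ID, the API token, and the function name are placeholders, and other CDNs expose different APIs for the same idea.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// purgeBody is the JSON payload for a purge-by-URL request.
type purgeBody struct {
	Files []string `json:"files"`
}

// newPurgeRequest builds (but does not send) a request asking the CDN to
// drop the given URLs from its cache. zoneID and apiToken are placeholders
// for credentials you would load from your own configuration.
func newPurgeRequest(zoneID, apiToken string, urls []string) (*http.Request, error) {
	payload, err := json.Marshal(purgeBody{Files: urls})
	if err != nil {
		return nil, err
	}
	endpoint := "https://api.cloudflare.com/client/v4/zones/" + zoneID + "/purge_cache"
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+apiToken)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical profile page that was just edited by its owner.
	req, err := newPurgeRequest("ZONE_ID", "API_TOKEN", []string{"https://example.com/u/alice"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
	// To actually purge, send it: resp, err := http.DefaultClient.Do(req)
}
```

The application would call this from the same code path that saves the profile, so the next visitor triggers a fresh fetch from the origin.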

Every CDN vendor has a default cache, and a plethora of paid additional services to enhance it further. Cloudflare has Tiered Cache and Cache Reserve, AWS has CloudFront Origin Shield, etc. Do your research and figure out what makes sense for you. It’s outside the scope of this post. I personally don’t use any.

Keeping a great cache-hit ratio

I’d like to keep the cache of my hobby websites warm. They don’t get a lot of traffic, but I still want the best experience for my visitors. The server is in Canada, behind the Cloudflare CDN.
To keep the cache-hit ratio high, I want to prewarm the cache. That is, I want these pages to be fresh in the cache so that every user benefits, even on the first request.
This is hard to achieve because caches are spread throughout the world, and a warm cache in Los Angeles is useless to a visitor from Barcelona.

Approaches considered

I thought of two approaches:

Getting a VPN, connecting to different geographies (they even have city granularity) and running the requests through these routes. If you connect through a given city, the requests should be routed to the closest Cloudflare point of presence (PoP), which is the same one that other users in that metro area will use.
Running the requests from Cloudflare Workers or AWS Lambda functions.

I was initially leaning towards the VPN option because of the granularity, but the process of connecting to and disconnecting from each geography seemed potentially unreliable.
Cloudflare Workers sounded great as I’m already on Cloudflare. But I couldn’t figure out how to force them to run from a specific location, which defeats the point of the exercise.
AWS Lambda was the solution I picked. You can deploy and run the function in each of the AWS regions you care about. The granularity is coarse, but enough to test out the idea. The free tier is enough for my purpose.

Selecting what to warm up

My website has a very long tail. I can’t hold everything in cache, and it wouldn’t make sense even if I could. I want to keep everything that makes up the first page, plus one or two levels deep.
The problem is that its content is quite dynamic and changes often. It also uses Next.js, which serves the JavaScript bundles from different paths every time a new build is deployed.

So I periodically build a list of what needs to be kept warm. I need to know what’s linked from the first page, as I also care about those links. Next.js builds the DOM dynamically with JavaScript, so you can’t just parse the downloaded HTML. I use headless Chrome to download and render the page, as it can execute JavaScript. Then I extract the links from the rendered DOM.

Here’s the part where I do that. I run this script in Docker using zenika/alpine-chrome:

htmlContent=$(chromium-browser \
    --headless \
    --no-sandbox \
    --timeout=10000 \
    --virtual-time-budget=10000 \
    --window-size=675,11667 \
    --run-all-compositor-stages-before-draw \
    --enable-javascript \
    --javascript-delay=5000 \
    --wait-until=networkidle0 \
    --dump-dom \
    --disable-software-rasterizer \
    --disable-dev-shm-usage \
    --disable-gpu \
    --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36 XXX/1.0" \
    "https://$DOMAIN/" 2> /dev/null)

I’ll parse the resulting HTML, produce the list of targeted files, and upload it to R2 (Cloudflare’s equivalent to S3). The Lambda function will retrieve it to know the list of URLs that need to be kept fresh.
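The parsing step is not shown here, so the following Go sketch is only one way it could be done. It pulls href and src attributes out of the dumped DOM with a regular expression, resolves them against the page URL, and keeps same-host URLs. The function name and the sample markup are assumptions, and a real HTML parser such as golang.org/x/net/html would be more robust than a regular expression.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"sort"
)

// attrRe matches href="..." and src="..." attributes (single or double quotes).
var attrRe = regexp.MustCompile(`(?i)\b(?:href|src)\s*=\s*["']([^"']+)["']`)

// extractURLs returns the sorted, de-duplicated absolute http(s) URLs found
// in page that live on the same host as base.
func extractURLs(base, page string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := map[string]bool{}
	for _, m := range attrRe.FindAllStringSubmatch(page, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip malformed references
		}
		abs := baseURL.ResolveReference(ref)
		abs.Fragment = "" // "#section" does not change what gets cached
		if abs.Host != baseURL.Host {
			continue // third-party hosts are not ours to warm
		}
		if abs.Scheme != "http" && abs.Scheme != "https" {
			continue
		}
		seen[abs.String()] = true
	}
	urls := make([]string, 0, len(seen))
	for u := range seen {
		urls = append(urls, u)
	}
	sort.Strings(urls)
	return urls, nil
}

func main() {
	// Hypothetical fragment of a rendered page.
	page := `<a href="/about">About</a>
<script src="/_next/static/chunks/main.js"></script>
<a href="https://other.example/x">elsewhere</a>`
	urls, err := extractURLs("https://example.com/", page)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Writing one URL per line keeps the list trivial to upload and for the Lambda function to consume.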

Running the Lambda function

I implemented the Lambda function in Go, using the provided.al2023 runtime on arm64. Here is the skeleton:

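package main

import (
	"context"
	"net/http"
	"sync"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
)

// Event is the invocation payload; seed_url points at the list of URLs to warm.
type Event struct {
	URL string `json:"seed_url"`
}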

func handleRequest(ctx context.Context, event Event) (Response, error) {
	client := createHTTPClient()

	// Fetch the list itself first and bail out early if it is unavailable.
	listResult := fetchURL(ctx, event.URL, client, 0)
	if listResult.StatusCode != http.StatusOK {
		return Response{Results: []URLResult{listResult}}, nil
	}

	urls, err := fetchURLList(ctx, event.URL, client)
	if err != nil {
		...
	}

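	// sem caps the number of downloads in flight.
	sem := make(chan struct{}, maxConcurrentDownloads)

	// Buffered to len(urls) so that workers never block when sending.
	resultsChan := make(chan URLResult, len(urls))
	var wg sync.WaitGroup

	for _, url := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot before spawning the worker
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			result := fetchURL(ctx, url, client, 60*time.Second)
			resultsChan <- result
		}(url)
	}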

	go func() {
		wg.Wait()
		close(resultsChan)
	}()

	var results []URLResult
	for result := range resultsChan {
		results = append(results, result)
	}
	return Response{Results: results}, nil
}

func fetchURLList(ctx context.Context, listURL string, client *http.Client) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, "GET", listURL, nil)
...
	return urls, nil
}

func main() {
	lambda.Start(handleRequest)
}
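The skeleton leaves out createHTTPClient, fetchURL, and URLResult. The sketch below is a guess at what such helpers could look like, not the actual implementation: the field names, the two-minute client timeout, the reading of a zero timeout as “no extra deadline”, and the runDemo helper are all assumptions. It records Cloudflare’s CF-Cache-Status response header, which makes it easy to see whether a request was a HIT or a MISS.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// userAgent carries a tag that a CDN rule can match ("XXX/1.0" is a placeholder).
const userAgent = "Mozilla/5.0 (compatible) XXX/1.0"

// URLResult is a hypothetical shape for the per-URL outcome.
type URLResult struct {
	URL         string `json:"url"`
	StatusCode  int    `json:"status_code"`
	CacheStatus string `json:"cache_status,omitempty"`
	Bytes       int64  `json:"bytes"`
	Error       string `json:"error,omitempty"`
}

// createHTTPClient returns a client with an overall safety timeout (assumed value).
func createHTTPClient() *http.Client {
	return &http.Client{Timeout: 2 * time.Minute}
}

// fetchURL downloads target and discards the body. A timeout of 0 is treated
// as "no extra deadline beyond ctx", which is an assumption about the skeleton.
func fetchURL(ctx context.Context, target string, client *http.Client, timeout time.Duration) URLResult {
	result := URLResult{URL: target}
	if timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		result.Error = err.Error()
		return result
	}
	req.Header.Set("User-Agent", userAgent)
	resp, err := client.Do(req)
	if err != nil {
		result.Error = err.Error()
		return result
	}
	defer resp.Body.Close()
	// Drain the body so the whole object is transferred.
	n, err := io.Copy(io.Discard, resp.Body)
	if err != nil {
		result.Error = err.Error()
	}
	result.Bytes = n
	result.StatusCode = resp.StatusCode
	result.CacheStatus = resp.Header.Get("CF-Cache-Status")
	return result
}

// runDemo exercises fetchURL against a local stand-in for the CDN.
func runDemo() URLResult {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("CF-Cache-Status", "HIT")
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()
	return fetchURL(context.Background(), srv.URL, createHTTPClient(), 5*time.Second)
}

func main() {
	r := runDemo()
	fmt.Println(r.StatusCode, r.CacheStatus, r.Bytes)
}
```

Returning the cache status in the results gives a cheap way to check, region by region, whether the warm-up is having any effect.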

I then run this script periodically (again in Docker):

for region in us-east-1 us-east-2 eu-south-2 us-west-1 us-west-2; do
  aws --profile lambda-dev --region "$region" lambda invoke \
      --function-name cache-prewarm \
      --payload '{ "seed_url": "https://.../list_files" }' \
      --cli-binary-format raw-in-base64-out \
      --invocation-type Event \
      /dev/stdout > /dev/null
done

Challenges

Waste: As I’ve described it, unmodified pages will be downloaded over and over again by the Lambda. You should optimize by using Last-Modified or ETags. With a conditional request, you only get a “304 Not Modified” when the content hasn’t changed, saving the whole download. It’s challenging because Lambdas themselves are stateless. You will need persistence if you want to implement this properly; DynamoDB seems a good candidate.
Cost: Data transfer into AWS Lambda is free, and egress out of R2 buckets is also free. But be careful if your origin charges by request or bandwidth (e.g. S3).
Bot detection: At first my download attempts were blocked by Cloudflare’s bot detection. When I showed how to invoke headless Chrome, you might have noticed XXX/1.0 in the User-Agent. That’s the trick I use. I configured Cloudflare rules to skip bot detection when the User-Agent contains my specific tag for this use case.
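For the waste problem, the conditional-request idea can be sketched in Go as follows. This is an illustration under assumptions, not something the setup above implements: validatorStore is an in-memory stand-in for whatever persistent store (DynamoDB, for example) would hold the ETags between invocations, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// validatorStore remembers the last ETag seen per URL. A Lambda would need a
// persistent store instead; this map only lives for the life of the process.
type validatorStore map[string]string

// newConditionalRequest builds a GET that carries If-None-Match when an ETag
// for target is already known.
func newConditionalRequest(store validatorStore, target string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	if etag, ok := store[target]; ok {
		req.Header.Set("If-None-Match", etag)
	}
	return req, nil
}

// record updates the store from a response's status and ETag header, and
// reports whether the download was skipped (304 Not Modified).
func record(store validatorStore, target string, status int, etag string) bool {
	if status == http.StatusNotModified {
		return true // the stored validator is still current
	}
	if status == http.StatusOK && etag != "" {
		store[target] = etag
	}
	return false
}

func main() {
	store := validatorStore{}
	const page = "https://example.com/"

	req, _ := newConditionalRequest(store, page)
	fmt.Printf("first request If-None-Match: %q\n", req.Header.Get("If-None-Match"))

	// Pretend the server answered 200 with an ETag.
	record(store, page, http.StatusOK, `"v1"`)

	req, _ = newConditionalRequest(store, page)
	fmt.Printf("second request If-None-Match: %q\n", req.Header.Get("If-None-Match"))
}
```

In a real run, record would be called with resp.StatusCode and resp.Header.Get("ETag") after each fetch, and the store would be loaded before and saved after the invocation.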