drud.is

Cache is king

The old investment adage says cash is king. In networking, cache is king. And you should learn to cache like a king.
I’ll show why it matters, cover a few things to know, and share my ad-hoc solution for keeping my cache warm.

Why caching matters

There are many reasons to cache HTTP responses, whether you run a website or an application that downloads assets over the web.

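First, resource optimization and cost: every response served from a cache is a request your origin does not have to process or pay bandwidth for.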

Second, availability: if your server goes through a blip, the cache can continue serving the content; your audience won’t notice.

Finally, performance and engagement: cached assets download faster. It’s been reported that a 100-millisecond delay in website load time hurts conversion rates by 7%. Even if you are not “selling” or “converting”, take it as a sign of how much users care. Milliseconds count.

Get ready to cache

It is crucial that your application sets the appropriate caching headers. Everything we are going to discuss is tooling and infrastructure to support the caching policy you set. Without a proper policy, your content may default to non-cacheable, making the rest of your efforts futile. Read https://web.dev/articles/http-cache if you need a refresher.
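
As an illustration of what such a policy can look like in code, here is a minimal sketch of an HTTP middleware that picks a Cache-Control value per path. The path prefixes, the durations, and the names cachePolicy and withCachePolicy are assumptions made for this example, not the setup used on my sites.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// cachePolicy returns a Cache-Control value for a request path.
// The prefixes and durations below are illustrative assumptions.
func cachePolicy(path string) string {
	switch {
	case strings.HasPrefix(path, "/_next/static/"):
		// Fingerprinted build assets never change under the same URL.
		return "public, max-age=31536000, immutable"
	case strings.HasPrefix(path, "/api/"):
		// Dynamic responses that must never be stored.
		return "no-store"
	default:
		// Pages: browsers revalidate, shared caches may keep them for a day.
		return "public, max-age=0, s-maxage=86400"
	}
}

// withCachePolicy sets the header before the wrapped handler writes the response.
func withCachePolicy(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Cache-Control", cachePolicy(r.URL.Path))
		next.ServeHTTP(w, r)
	})
}

func main() {
	for _, p := range []string{"/", "/_next/static/chunks/main.js", "/api/profile"} {
		fmt.Printf("%-30s %s\n", p, cachePolicy(p))
	}
}
```

In a real server you would wrap your router with withCachePolicy before passing it to http.ListenAndServe. Whether a given CDN honors s-maxage for HTML depends on its configuration, so check your vendor's defaults.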

You should have a strategy for expiring long-lived caches. For assets that are supposed to be semi-static, you want very long cache periods. But what happens when there’s a change? Say you run a link-in-bio service where user profiles are updated rarely. You might want a 1-year cache policy, but if a customer happens to update their page, they will expect their update to be served immediately. Serving the previous version two hours after the update would be an unacceptable customer experience.

Every CDN vendor has a default cache, and a plethora of paid additional services to enhance it further. Cloudflare has Tiered Cache and Cache Reserve, AWS has CloudFront Origin Shield, etc. Do your research and figure out what makes sense for you. It’s outside the scope of this post. I personally don’t use any.

Keeping a great cache-hit ratio

I’d like to keep the cache of my hobbyist websites warm. They don’t get a lot of traffic, but I still want the best experience for my visitors. The server is in Canada and sits behind the Cloudflare CDN.
To keep the cache-hit ratio high, I want to prewarm the cache. That is, I want these pages fresh in the cache so that every user benefits, even on the first request.
This is hard to achieve because caches are spread throughout the world, and a warm cache in Los Angeles is useless to a visitor from Barcelona.

Approaches considered

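I thought of two approaches: connecting through a VPN that exits in each geography, or running serverless functions (Cloudflare Workers or AWS Lambda) close to each one.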

I was initially leaning towards the VPN option because of its granularity, but connecting to and disconnecting from each geography seemed potentially unreliable.
Cloudflare Workers sounded great as I’m already on Cloudflare, but I couldn’t find a way to force them to run from a specific location, which defeats the point of the exercise.
AWS Lambda was the solution I picked. You can deploy and run the function in each of the AWS regions you care about. The granularity is coarse, but enough to test the idea, and the free tier is enough for my purposes.

Selecting what to warm up

My website has a very long tail. I can’t hold everything in cache, and it wouldn’t make sense even if I could. I want to keep everything that makes up the first page, plus one or two levels deep.
The problem is that its content is quite dynamic and changes often. It also uses Next.js, which generates different routes for the JavaScript bundles every time a new build is deployed.

So, I periodically build a list of what needs to be kept warm. I need to know what’s linked from the first page, as I also care about those links. Next.js builds the DOM dynamically with JavaScript, so you can’t just parse the downloaded HTML. I use headless Chrome to download and render the page, as it can interpret JavaScript. Then I extract the links from the rendered DOM.

Here’s the part where I do that. I run this script in Docker using the zenika/alpine-chrome image:

htmlContent=$(chromium-browser \
    --headless \
    --no-sandbox \
    --timeout=10000 \
    --virtual-time-budget=10000 \
    --window-size=675,11667 \
    --run-all-compositor-stages-before-draw \
    --enable-javascript \
    --javascript-delay=5000 \
    --wait-until=networkidle0 \
    --dump-dom \
    --disable-software-rasterizer \
    --disable-dev-shm-usage \
    --disable-gpu \
    --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36 XXX/1.0" \
    "https://$DOMAIN/" 2> /dev/null)

I’ll parse the resulting HTML, produce the list of targeted files, and upload it to R2 (Cloudflare’s equivalent to S3). The Lambda function will retrieve it to know the list of URLs that need to be kept fresh.

Running the Lambda function

I implemented the Lambda function in Go using the provided.al2023 runtime on arm64. Here is the skeleton:

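package main

import (
	"context"
	"net/http"
	"sync"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
)

// Event is the invocation payload; seed_url points at the list of URLs to warm.
type Event struct {
	URL string `json:"seed_url"`
}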

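// handleRequest downloads the URL list from event.URL, then requests every
// listed URL with bounded concurrency.
func handleRequest(ctx context.Context, event Event) (Response, error) {
	client := createHTTPClient()

	// Check that the list itself is reachable before doing anything else.
	listResult := fetchURL(ctx, event.URL, client, 0)
	if listResult.StatusCode != http.StatusOK {
		return Response{Results: []URLResult{listResult}}, nil
	}

	urls, err := fetchURLList(ctx, event.URL, client)
	if err != nil {
		...
	}

	// sem caps the number of in-flight downloads.
	sem := make(chan struct{}, maxConcurrentDownloads)

	// Buffered to len(urls) so no worker ever blocks on send.
	resultsChan := make(chan URLResult, len(urls))
	var wg sync.WaitGroup

	for _, url := range urls {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrentDownloads fetches are running
		go func(url string) {
			defer wg.Done()
			defer func() { <-sem }()
			result := fetchURL(ctx, url, client, 60*time.Second)
			resultsChan <- result
		}(url)
	}

	// Close the channel once every worker is done so the range below ends.
	go func() {
		wg.Wait()
		close(resultsChan)
	}()

	var results []URLResult
	for result := range resultsChan {
		results = append(results, result)
	}
	return Response{Results: results}, nil
}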

func fetchURLList(ctx context.Context, listURL string, client *http.Client) ([]string, error) {
	req, err := http.NewRequestWithContext(ctx, "GET", listURL, nil)
...
	return urls, nil
}

func main() {
	lambda.Start(handleRequest)
}

I then run this script periodically, again from Docker:

for region in us-east-1 us-east-2 eu-south-2 us-west-1 us-west-2 ; do
  aws --profile lambda-dev --region "$region" lambda invoke \
      --function-name cache-prewarm \
      --payload '{ "seed_url": "https://.../list_files" }' \
      --cli-binary-format raw-in-base64-out \
      --invocation-type Event \
      /dev/stdout > /dev/null
done

Challenges