Refactoring DLTJ, Winter 2021 Part 2.5: Fixing the Webmentions Cache

Okay, a half-step backward to fix something I broke yesterday. As I described earlier this year, this static-site blog uses the Webmention protocol to notify others when I link to their content and to receive notifications when others link to mine. Behind the scenes, I'm using the Jekyll plugin called jekyll-webmention_io to integrate Webmention data into my blog's content. Each time this site is built, that plugin contacts the Webmention.IO service to retrieve its Webmention data. (Webmention.IO holds onto it between Jekyll builds since there is no always-on "dltj.org" server to receive notifications from others.) The plugin caches that information to ease the burden on the Webmention.IO service.

The previous CloudFormation-based process was using AWS CodeBuild natively, and the Webmention cache was stored in CodeBuild's caching function. CodeBuild automatically downloads the previous cache into the working directory for each build iteration and then automatically uploads the cache as the build is completed. Handy, right?

Well, AWS Amplify simplifies some of the setup of working with the underlying CodeBuild tool. One configuration option that is no longer available is the ability to specify which S3 bucket to use as the CodeBuild cache, so I couldn't point it at the previous cache files, and none of the previous Webmention entries appeared on the blog pages. Fortunately, I hadn't decommissioned the CloudFormation stuff, so I still had access to the old cache; I was able to extract the four webmention files (but see below for a discussion about that).

Since Amplify doesn't allow me to have direct access to the CodeBuild cache, I decided it was high time to use a dedicated cache location for these webmention files. That took three steps:

1. Create the S3 bucket (with no public access)
2. Add a read/write policy for that bucket to the AWS role assigned to the Amplify app
3. Add lines to the amplify.yml file to copy files from the S3 bucket into and out of the working directory

For step 2, the IAM policy for the Amplify role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::org.dltj.webmentions-cache",
                "arn:aws:s3:::org.dltj.webmentions-cache/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets"
            ],
            "Resource": "*"
        }
    ]
}
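
One detail worth spelling out: S3 checks s3:ListBucket against the bucket ARN, while s3:GetObject, s3:PutObject, and s3:DeleteObject are checked against object ARNs (the bucket ARN followed by /*). The following is a minimal sketch, not the policy actually deployed, of generating a policy document that keeps those two scopes in separate statements. The function name and the Sid values are made up for illustration, and s3:ListAllMyBuckets is left out on the assumption that copying to and from a single named bucket does not need it.

```python
import json


def cache_bucket_policy(bucket_name):
    """Build an IAM policy document granting read/write on one cache bucket.

    Hypothetical helper for illustration; the Sid values are arbitrary labels.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Sid": "ListCacheBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Get/Put/Delete are object-level actions, so they target
                # the objects inside the bucket (bucket ARN plus "/*").
                "Sid": "ReadWriteCacheObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM policy editor.
    print(json.dumps(cache_bucket_policy("org.dltj.webmentions-cache"), indent=4))
```

Run as a script, it prints JSON in the same shape as the policy above, with the bucket-level and object-level permissions split apart.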

For the amplify.yml file:

version: 1
frontend:
  phases:
    preBuild:
      commands:
        - aws s3 cp s3://org.dltj.webmentions-cache webmentions-cache --recursive
        - rvm use $VERSION_RUBY_2_6
        - bundle install --path vendor/bundle
    build:
      commands:
        - rvm use $VERSION_RUBY_2_6
        - bundle exec jekyll build --trace
    postBuild:
      commands:
        - aws s3 cp webmentions-cache s3://org.dltj.webmentions-cache --recursive
  artifacts:
    baseDirectory: _site
    files:
      - '**/*'
  cache:
    paths:
      - 'vendor/**/*'

And the webmentions part of the Jekyll _config.yml file:

webmentions:
  cache_folder: webmentions-cache
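
Because an empty cache folder fails silently (the old Webmentions simply stop appearing), a small check before the Jekyll build can make the problem visible. This is a sketch under assumptions: the helper name is hypothetical, and the four file names are the ones that appear in the old CodeBuild cache's table of contents, which may differ for other versions of the plugin.

```python
from pathlib import Path

# Assumption: these are the cache file names seen in the old CodeBuild cache;
# adjust the tuple if the plugin version in use writes different names.
EXPECTED_CACHE_FILES = ("received.yml", "lookups.yml", "bad_uris.yml", "outgoing.yml")


def missing_cache_files(cache_folder="webmentions-cache"):
    """Return the expected cache file names that are absent from cache_folder."""
    folder = Path(cache_folder)
    return [name for name in EXPECTED_CACHE_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_cache_files()
    if missing:
        # Warn rather than fail: the very first build legitimately has no cache.
        print("Webmention cache files not found:", ", ".join(missing))
```

It only warns, so a first build against an empty bucket still goes through.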

Contents of the AWS CodeBuild Cache File

Can we do a quick sidebar on the AWS CodeBuild caching mechanism? Because I was not expecting what I saw. The CodeBuild cache S3 bucket contains one file with a UUID as its name. That file is a tar-gzip'd archive of a flat directory containing sequentially numbered files (0 through 8099 in my case) and a codebuild.json table of contents:

{
  "version": "1.0",
  "content": {
    "files": [
      {
        "path": "vendor/s3deploy.tar.gz",
        "rel": "src"
      },
      {
        "path": "vendor/s3deploy",
        "rel": "src"
      },
      {
        "path": "vendor/LICENSE",
        "rel": "src"
      },
      {
        "path": "vendor/README.md",
        "rel": "src"
      },
      {
        "path": "vendor/webmentions",
        "rel": "src"
      },
      {
        "path": "vendor/webmentions/received.yml",
        "rel": "src"
      },
      {
        "path": "vendor/webmentions/lookups.yml",
        "rel": "src"
      },
      {
        "path": "vendor/webmentions/bad_uris.yml",
        "rel": "src"
      },
      {
        "path": "vendor/webmentions/outgoing.yml",
        "rel": "src"
      },
    ...

Each item in the files array corresponded, by position, to one of the numbered filenames in the directory. (For an entry that is a directory, such as the fifth item in the array, vendor/webmentions, there was no corresponding file in the tar-gzip archive.) Fortunately, the four files I was looking for were near the top of the list and I didn't have to go hunting through all eight-thousand-some-odd files to find them. (The s3deploy program is one that I found to intelligently copy modified files from the CodeBuild working directory to the S3 static website bucket.)
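
The mapping described above can be sketched in a few lines of Python. This is an illustration built on assumptions rather than a documented CodeBuild format: it assumes the file named n in the archive belongs to entry n (counting from 0) of content.files in codebuild.json, and that entries without a numbered file are directories. The function name and its wanted_prefix parameter are made up for this example.

```python
import json
import tarfile
from pathlib import Path


def restore_codebuild_cache(archive_path, dest_dir, wanted_prefix=""):
    """Copy numbered files out of a CodeBuild cache archive under their real paths.

    Assumption: archive member "<n>" holds the content of entry n (0-based)
    in codebuild.json's content.files list; entries with no numbered member
    (directories) are skipped. Returns the list of restored relative paths.
    """
    dest = Path(dest_dir).resolve()
    restored = []
    with tarfile.open(archive_path, "r:gz") as archive:
        # Index regular files by base name so "0" and "./0" are treated alike.
        members = {Path(m.name).name: m for m in archive.getmembers() if m.isfile()}
        toc = json.load(archive.extractfile(members["codebuild.json"]))
        for index, entry in enumerate(toc["content"]["files"]):
            member = members.get(str(index))
            if member is None or not entry["path"].startswith(wanted_prefix):
                continue
            target = (dest / entry["path"]).resolve()
            if dest not in target.parents:
                continue  # ignore paths that would land outside dest_dir
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.extractfile(member).read())
            restored.append(entry["path"])
    return restored
```

Calling restore_codebuild_cache with the downloaded archive, an output folder, and the prefix "vendor/webmentions/" would pull out only the Webmention files and leave the other numbered files in the archive.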

I'm really wondering about the engineering requirements for all of this overhead. Why not just use a native tar-gzip archive without the process of parsing the table of contents and renaming the files?