Local symlinks as HTTP redirects in S3

I recently had a situation where I wanted to copy a local directory to S3 and support symlinks as HTTP 301 redirects. I wrote some Python code that can do the copy in both directions – it finds symlinks in the local filesystem and creates redirect objects in S3, and finds redirect objects in S3 and creates symlinks on the local filesystem.

Redirect metadata in S3 requires that the bucket is configured for static website hosting (redirects only take effect when objects are served from the website endpoint). My implementation will print a warning and refuse to upload any symlinks that point outside of the local directory, and similarly will print a warning and refuse to download any redirects that point to http:// or https:// URLs.
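For reference, enabling website hosting is a one-time bucket setting, separate from anything the sync code below does. A minimal sketch with boto3, where the bucket name and index/error documents are placeholders:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and document names; redirects only take effect when the
# bucket is accessed through its website endpoint.
s3.put_bucket_website(
    Bucket="example-bucket",
    WebsiteConfiguration={
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "error.html"},
    },
)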

I have only a pair of one-way force-push/force-pull functions. They don't synchronize changes made to both sides at the same time; you have to choose to either overwrite S3 with local changes or overwrite the local directory with what's in S3. Any files not present on the source side are removed from the destination side.
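Calling it looks roughly like this, using the functions from s3forcecopy.py included below; the bucket name, region, and path are placeholders:

from pathlib import Path

# Passing None for the keys falls back to the AWS_* environment variables.
session = makesession(None, None, "us-east-1")

# Overwrite the bucket with the local directory...
s3_forcepush_directory(session, "example-bucket", Path("/srv/site"))

# ...or overwrite the local directory with the bucket.
s3_forcepull_directory(session, "example-bucket", Path("/srv/site"))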

It calculates an md5sum for each file and stores it in S3. For some files, we could use the ETag, which S3 generates itself, and which for files that aren't uploaded using multipart mode is actually the md5sum of the file contents. However, for files uploaded in multipart mode, the ETag is an md5sum of the concatenated part checksums plus a part count, so the easiest thing to do is just calculate the checksum ourselves and store it in a metadata property dedicated to that purpose.
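You can see the difference by inspecting a large object that boto3 uploaded in multipart mode. A quick sketch, assuming an object uploaded by the push code below (bucket and key are placeholders):

head = boto3.client("s3").head_object(Bucket="example-bucket", Key="big-tarball.tar.gz")
# A multipart ETag looks like '"9b2cf535f27731c974343645a3985328-14"':
# an md5 of the per-part checksums plus a part count, not the file's md5sum.
print(head["ETag"])
# The real content md5sum that the push code stores in object metadata:
print(head["Metadata"]["md5"])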

(It’s also worth understanding that you can upload files to S3 with a Content-MD5 header. This tells S3 to fail the upload if the bits it receives don’t hash to the md5 supplied. However, S3 doesn’t store that for retrieval later, so we still have to add the md5sum separately in the metadata for each file. The code I wrote uses upload_file(), which should handle integrity automatically, so I don’t bother with Content-MD5.)
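For completeness, supplying Content-MD5 on a plain put_object call looks roughly like this; S3 rejects the upload with a BadDigest error if the bytes it receives don't match. The bucket and key are placeholders:

import base64
import hashlib

data = b"hello world"
# Content-MD5 is the base64 of the raw digest, not the hex string.
content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode()

boto3.client("s3").put_object(
    Bucket="example-bucket",
    Key="hello.txt",
    Body=data,
    ContentMD5=content_md5,
)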

It lists all files in S3 before doing a copy in either direction. I added a simple sanity check to this – if the listing would take more than 1m API requests (the number is tunable), it raises an error. That limit is far more than the project I'm using it in will ever hit, but other projects may want to tune it.
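If your bucket is big enough to get anywhere near the limit, pass a larger value when listing; a one-liner using the function defined below (the bucket name is a placeholder):

files, redirects = s3_list_remote_files(session, "example-bucket", too_many_requests=5_000_000)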

The code was written for psyopsOS; see the current tip of the master branch or the code as it was committed. Included below is a very lightly edited version.

s3forcecopy.py:
"""Manage an S3 bucket by force-copying a local directory to it."""

import logging
import hashlib
import os
from pathlib import Path

import boto3

logger = logging.getLogger(__name__)


def makesession(aws_access_key_id, aws_secret_access_key, aws_region):
    """Return a boto3 session"""
    if aws_access_key_id is None:
        aws_access_key_id = os.environ["AWS_ACCESS_KEY_ID"]
    if aws_secret_access_key is None:
        aws_secret_access_key = os.environ["AWS_SECRET_ACCESS_KEY"]
    session = boto3.session.Session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=aws_region,
    )
    logger.debug("Created boto3 session")
    return session


class S3PriceWarningError(Exception):
    pass


def s3_list_remote_files(
    session: boto3.Session,
    bucket_name: str,
    too_many_requests=1_000_000,
) -> tuple[dict[str, str], dict[str, str]]:
    """Return a list of all remote files.

    Return a tuple[{filepath: MD5}, {filepath: WebsiteRedirectLocation}].
    The first item in the tuple represents regular files mapped to their MD5 checksum.
    The second item in the tuple represents symlinks mapped to their target.

    We assume the md5 checksums are stored in the metadata of the S3 objects.
    We do NOT use the ETag, which is not the MD5 checksum of the file contents
    for files uploaded as multipart uploads.
    Large files like our OS tarballs are uploaded as multipart uploads automatically by boto3.

    Arguments:
    - session: boto3 session
    - bucket_name: name of the S3 bucket
    - too_many_requests: maximum number of requests before raising a S3PriceWarningError
        At the time of this writing, 1m requests is $0.40.
    """

    s3 = session.client("s3")

    files = {}
    redirects = {}

    paginator = s3.get_paginator("list_objects_v2")
    objctr = 0
    api_calls_per_obj = 2  # one head_object() per object, plus a generous allowance for the list_objects_v2 pages
    logger.debug(f"Listing objects in S3 bucket {bucket_name}...")
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get("Contents", []):
            objctr += 1
            callctr = objctr * api_calls_per_obj
            filepath = obj["Key"]
            head = s3.head_object(Bucket=bucket_name, Key=obj["Key"])
            redirect = head.get("WebsiteRedirectLocation", "")
            md5sum = head["Metadata"].get("md5", "")
            if redirect:
                redirects[filepath] = redirect
            else:
                files[filepath] = md5sum
            if callctr > too_many_requests:
                raise S3PriceWarningError(
                    f"We made (at least) {too_many_requests} requests, but haven't enumerated all the files. Raise the too_many_requests linmit if you're sure you want to do this."
                )
    logger.debug(
        f"Found {len(files)} files and {len(redirects)} redirects in S3 bucket {bucket_name}."
    )

    return (files, redirects)


def compute_md5(file: Path):
    """Compute the MD5 checksum of a file"""
    hash_md5 = hashlib.md5()
    with file.open("rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def s3_forcepull_directory(
    session: boto3.Session, bucket_name: str, local_directory: Path
):
    """Force-pull a directory from S3, deleting any local files that are not in the bucket

    Use MD5 checksums to determine whether a file has been modified locally and needs to be re-downloaded.
    """
    s3 = session.client("s3")

    remote_files, remote_symlinks = s3_list_remote_files(session, bucket_name)

    # Download files from bucket
    for filepath, remote_checksum in remote_files.items():
        local_file_path = local_directory / filepath

        # Checksum comparison
        should_download = True
        if local_file_path.exists():
            local_checksum = compute_md5(local_file_path)
            should_download = local_checksum != remote_checksum

        # Download file if it doesn't exist or checksum is different
        if should_download:
            logger.debug(f"Downloading {filepath}...")
            local_file_path.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket_name, filepath, local_file_path.as_posix())

    # Set local symlinks to match remote redirects
    for s3link, s3target in remote_symlinks.items():
        if s3target.startswith(("http://", "https://")):
            logger.warning(
                f"Skipping symlink {s3link} -> {s3target} because it is a full URL and not a relative path."
            )
            continue
        local_link = local_directory / s3link
        local_link.parent.mkdir(parents=True, exist_ok=True)
        local_target_abs = local_directory / s3target.lstrip("/")
        local_target = os.path.relpath(
            local_target_abs.as_posix(), local_link.parent.as_posix()
        )
        if (
            local_link.exists()
            and local_link.is_symlink()
            and local_link.resolve() == local_target_abs.resolve()
        ):
            logger.debug(
                f"Skipping symlink {local_link.as_posix()} -> {local_target} because it is already up to date."
            )
        else:
            # unlink() does not follow symlinks, so this also removes broken symlinks
            local_link.unlink(missing_ok=True)
            logger.debug(
                f"Creating local symlink {local_link.as_posix()} -> {local_target}"
            )
            os.symlink(local_target, local_link.as_posix())

    # Delete local files not present in bucket
    local_files = set()
    directories = set()
    for local_file_path in local_directory.rglob("*"):
        if local_file_path.is_file():
            relative_path = str(local_file_path.relative_to(local_directory))
            local_files.add(relative_path)
            exists_remotely = (
                relative_path in remote_files.keys()
                or relative_path in remote_symlinks.keys()
            )
            if not exists_remotely:
                logger.debug(f"Deleting local file {relative_path}...")
                local_file_path.unlink()
        elif local_file_path.is_dir():
            directories.add(local_file_path)

    # Delete empty directories
    for directory in sorted(directories, key=lambda x: len(x.parts), reverse=True):
        if not any(directory.iterdir()):
            logger.debug(f"Deleting empty directory {directory}...")
            directory.rmdir()


def s3_forcepush_directory(
    session: boto3.Session, bucket_name: str, local_directory: Path
):
    """Force-push a directory to S3, deleting any remote files that are not in the local directory.

    Calculate a local MD5 checksum and add it as metadata to the S3 object.
    We don't use ETag for this because it is not the MD5 checksum of the file contents for files uploaded as multipart uploads.
    """
    s3 = session.client("s3")

    # Get list of all remote files in S3 bucket with their checksums
    remote_files, remote_symlinks = s3_list_remote_files(session, bucket_name)

    # Keep track of local files
    # This includes all files, including symlinks to files,
    # but not directories or symlinks to directories.
    # The loop below also ignores macOS garbage .DS_Store and ._* files
    local_files_relpath_list = []

    # Upload local files to S3 if they are different or don't exist remotely
    for local_file_path in list(local_directory.rglob("*")):
        # Ignore goddamn fucking macOS gunk files
        if local_file_path.name == ".DS_Store":
            continue
        if local_file_path.name.startswith("._"):
            continue

        # Ignore directories, as there is no concept of directories in S3.
        # Note that is_file() returns False for symlinks which point to directories, which is good too.
        if not local_file_path.is_file():
            continue

        relative_path = str(local_file_path.relative_to(local_directory))
        local_files_relpath_list.append(relative_path)

        # Handle symlinks
        if local_file_path.is_symlink():
            target = local_file_path.resolve()
            try:
                reltarget = str(target.relative_to(local_directory.resolve()))
                # S3 Object Redirects must be absolute paths from bucket root with a leading slash (or a full URL).
                abstarget = f"/{reltarget}"
            except ValueError:
                logger.warning(
                    f"Skipping symlink {relative_path} -> {target} because it points outside of the local directory."
                )
                continue
            if (
                relative_path in remote_symlinks
                and remote_symlinks[relative_path] == abstarget
            ):
                logger.debug(
                    f"Skipping symlink {relative_path} -> {abstarget} because it is already up to date."
                )
            else:
                logger.debug(
                    f"Creating S3 Object Redirect for {relative_path} -> {abstarget}"
                )
                s3.put_object(
                    Bucket=bucket_name,
                    Key=relative_path,
                    WebsiteRedirectLocation=abstarget,
                )

        # Handle regular files
        else:
            local_checksum = compute_md5(local_file_path)

            exists_remotely = relative_path in remote_files.keys()
            remote_checksum = remote_files.get(relative_path, "NONE")
            checksum_matches = local_checksum == remote_checksum
            logger.debug(
                f"path relative/existsremote: {relative_path}/{exists_remotely}, checksum local/remote/matches: {local_checksum}/{remote_checksum}/{checksum_matches}"
            )

            if exists_remotely and checksum_matches:
                logger.debug(
                    f"Skipping {relative_path} because it is already up to date."
                )
            else:
                logger.debug(f"Uploading {relative_path}...")
                extra_args = {"Metadata": {"md5": local_checksum}}

                # If you don't do this, browsing to these files will download them without displaying them.
                # We mostly just care about this for the index/error html files.
                if local_file_path.as_posix().endswith(".html"):
                    extra_args["ContentType"] = "text/html"

                s3.upload_file(
                    Filename=local_file_path.as_posix(),
                    Bucket=bucket_name,
                    Key=relative_path,
                    ExtraArgs=extra_args,
                )

    # Delete files from S3 bucket not present in local directory
    files_to_delete = [
        f for f in remote_files.keys() if f not in local_files_relpath_list
    ]
    for file_key in files_to_delete:
        logger.debug(f"Deleting remote file {file_key}...")
        s3.delete_object(Bucket=bucket_name, Key=file_key)

    # Delete redirecting objects from S3 bucket not present as symlinks in local directory
    links_to_delete = [
        f for f in remote_symlinks.keys() if f not in local_files_relpath_list
    ]
    for link_key in links_to_delete:
        logger.debug(f"Deleting remote symlink {link_key}...")
        s3.delete_object(Bucket=bucket_name, Key=link_key)
