Use appropriate replay mode when loading resources from Wayback #1069

Open
@Mr0grog

When we rewrite resource URLs in pages or diffs to load things from archive.org instead of the live web, we currently always ask for unaltered (id_) versions of the resources:

/**
 * Creates a transform that will rewrite subresource URLs to point to the
 * Wayback Machine. This is useful when we have snapshots of the page itself,
 * but not its subresources. It won't always work (Wayback won't always have
 * a snapshot of the subresource from a similar point in time), but it'll work
 * a lot better than just pointing to the original URL, which might be missing
 * or significantly altered by the time a diff is viewed.
 *
 * Note this *creates* the transform and is not the transform itself (because
 * the transform must be custom to a particular source URL and point in time).
 * @param {WebMonitoringDb.Page} page
 * @param {WebMonitoringDb.Version} version
 */
export function loadSubresourcesFromWayback (page, version) {
  return document => {
    // In some rare instances, there is old, messy version data from Versionista
    // that doesn't have a URL for the version, so fall back to page URL. :(
    const url = versionUrl(version) || page.url;
    const timestamp = createWaybackTimestamp(version.capture_time);
    document.querySelectorAll('link[rel="stylesheet"]').forEach(node => {
      for (const attribute of ['href', 'data-href']) {
        const value = node.getAttribute(attribute);
        if (value) {
          node.setAttribute(attribute, createWaybackUrl(value, timestamp, url));
        }
      }
    });
    document.querySelectorAll('script[src],img[src]').forEach(node => {
      node.src = createWaybackUrl(node.getAttribute('src'), timestamp, url);
    });
    // TODO: handle <picture> with all its subelements
    // TODO: SVG <use> directives
    // TODO: video/audio (similar structure to <picture>)
    return document;
  };
}

function createWaybackUrl (originalUrl, timestamp, baseUrl) {
  if (typeof timestamp !== 'string') {
    timestamp = createWaybackTimestamp(timestamp);
  }
  const url = resolveUrl(originalUrl, baseUrl);
  return `https://web.archive.org/web/${timestamp}id_/${url}`;
}

Instead, we should ask for the appropriate mode based on how we’re using the resource: js_ for scripts, cs_ for stylesheets, and im_ for images. We should only fall back to id_ in cases where we don’t know what type to use.
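
A rough sketch of what that could look like, not a finished patch. The names `WAYBACK_MODES`, `waybackModeFor`, and `createWaybackUrlWithMode` are hypothetical, as is the `resourceType` parameter. To keep the sketch self-contained, it assumes the timestamp is already a string and uses the standard `URL` constructor as a stand-in for `resolveUrl`:

```javascript
// Hypothetical mapping from how we use a resource to a Wayback replay mode.
// The mode suffixes are the ones named in this issue.
const WAYBACK_MODES = new Map([
  ['script', 'js_'],
  ['stylesheet', 'cs_'],
  ['image', 'im_']
]);

// Fall back to the unaltered (id_) mode when we don't know the type.
function waybackModeFor (resourceType) {
  return WAYBACK_MODES.get(resourceType) || 'id_';
}

// Stand-in for createWaybackUrl: assumes `timestamp` is already a Wayback
// timestamp string and resolves URLs with the built-in URL constructor
// instead of the project's resolveUrl helper.
function createWaybackUrlWithMode (originalUrl, timestamp, baseUrl, resourceType) {
  const url = new URL(originalUrl, baseUrl).href;
  const mode = waybackModeFor(resourceType);
  return `https://web.archive.org/web/${timestamp}${mode}/${url}`;
}
```

In `loadSubresourcesFromWayback`, the stylesheet loop would presumably pass `'stylesheet'`, and the combined `script[src],img[src]` query would need to pick `'script'` or `'image'` per node (for example by checking `node.tagName`) or be split into two queries.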
