Normalize whitespace in HTML token diff #198

Open
@Mr0grog

Description

Sometimes we wind up showing a change on text that looks unchanged because nearby whitespace changed in a non-visible way. There are just a few main versions of this:

  1. Whitespace changed in a way that’s totally meaningless in HTML (outside of <pre> elements or white-space: pre* styled elements). Since multiple spaces and line breaks all get collapsed into a single space in HTML, they’re not meaningful changes unless in specific contexts. We should normalize them to a single space.
  2. Spaces getting swapped out for non-breaking spaces (or vice-versa). These changes are technically different and may have a subtle impact on page layout, but are not semantically different for users. I’m thinking a good way to handle this is to give tokens a diffable text representation (in which non-breaking spaces are replaced by spaces) and a literal text representation (where these types of characters are unchanged). The former is used for comparisons, but the latter is used when stitching the actual diff back together.
  3. Fancier kinds of spaces get swapped out (hair space, em space, etc.). These are more visible to the user, but still usually not that meaningful for most of this diff’s use cases. The right solution here is probably the same as for (2) above.
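
To make the diffable vs. literal idea from (2) and (3) a bit more concrete, here is a rough Python sketch. Everything named here (`Token`, `SPACE_LIKE`, `diff_tokens`) is hypothetical and for illustration only, not existing code in the differ, and the set of space-like characters is just an assumed starting point. It also ignores `<pre>` and `white-space: pre*` contexts, where whitespace would need to be left alone.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

# Assumed set of characters to treat like a plain space when comparing:
# no-break space, en space, em space, thin space, hair space.
SPACE_LIKE = "\u00a0\u2002\u2003\u2009\u200a"
_SPACE_LIKE_RE = re.compile(f"[{SPACE_LIKE}]")
# Whitespace that HTML collapses into a single space (case 1).
_COLLAPSIBLE_RE = re.compile(r"[ \t\n\r\f]+")


@dataclass(frozen=True)
class Token:
    """Hypothetical token: `literal` is the exact source text."""
    literal: str

    @property
    def diffable(self) -> str:
        # Collapse ordinary whitespace runs first, then map each
        # space-like character to a plain space one-for-one.
        text = _COLLAPSIBLE_RE.sub(" ", self.literal)
        return _SPACE_LIKE_RE.sub(" ", text)


def diff_tokens(old, new):
    """Compare on `diffable` text, but emit `literal` text in the result."""
    matcher = SequenceMatcher(
        None,
        [t.diffable for t in old],
        [t.diffable for t in new],
        autojunk=False,
    )
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged for diff purposes: keep the new literal text.
            result.append(("equal", [t.literal for t in new[j1:j2]]))
            continue
        if i1 != i2:
            result.append(("delete", [t.literal for t in old[i1:i2]]))
        if j1 != j2:
            result.append(("insert", [t.literal for t in new[j1:j2]]))
    return result
```

With this, `diff_tokens([Token("10\u00a0km"), Token("north")], [Token("10 km"), Token("south")])` treats the first token as unchanged even though a non-breaking space was swapped for a regular one, and only reports `north` → `south` as a change. The stitched output still carries whatever literal characters the new version of the page actually used.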

Here’s an example of case (2) above: https://monitoring.envirodatagov.org/page/4415ea86-293e-48ab-9b4f-da2382cc4200/c43894cb-b954-40d7-be18-4ce14a22a90b..e8844efa-fd2b-41d5-a451-1cc54c1d680a
