HTML diff should tokenize on some punctuation #6

Open
@Mr0grog


This FTP diffing problem made me realize we should probably be splitting tokens in the HTML diff on periods (and maybe other punctuation?), not just on whitespace:

[Screenshot, 2018-11-21 at 9:04:50 AM]

(Of course we don’t really want to use this differ on FTP listings, but that’s a different matter.)

This requires some care, though: we probably want to treat the periods as tokens themselves (in case they change), unlike whitespace. We’ve also talked about this before in terms of general punctuation handling. It would be really useful not only to split this way, but also to tag and count punctuation changes separately from other changes. We might not prioritize a punctuation change for analysts to review the way we do a word change, and it would be nice to call out clearly when a change was merely in punctuation.
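To make the idea concrete, here is a minimal sketch of that kind of tokenization and counting. It is not the project's actual tokenizer: `tokenize`, `is_punctuation`, and `count_changes` are hypothetical names, and it leans on the standard library's `difflib.SequenceMatcher` rather than the HTML differ itself. It also assumes "punctuation" means any single character that is neither whitespace nor a word character (so `_` counts as part of a word).

```python
import re
from difflib import SequenceMatcher

# Words are runs of word characters; every other non-space character
# becomes its own token. Whitespace only separates tokens and is dropped.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def is_punctuation(token):
    """True if the token is a single non-word, non-space character."""
    return re.fullmatch(r"[^\w\s]", token) is not None


def count_changes(old_text, new_text):
    """Count inserted/deleted tokens, split into words vs. punctuation."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    counts = {"word": 0, "punctuation": 0}
    matcher = SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Every token on either side of a non-equal opcode is a change.
        for token in old_tokens[i1:i2] + new_tokens[j1:j2]:
            kind = "punctuation" if is_punctuation(token) else "word"
            counts[kind] += 1
    return counts
```

With this split, `file.txt` and `file.csv` share the `file` and `.` tokens, so only the extension shows up as a word change, while `Hello, world.` versus `Hello; world.` registers purely as a punctuation change.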

There are also punctuation changes we might want to treat specially, or even suppress in many cases. For example, changing ’ to ' (apostrophe to prime) is a change we’ve seen before, and not one we generally care about.
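One possible way to suppress such changes is to compare tokens after mapping look-alike characters to a single canonical form. The sketch below is only an illustration: the function names are hypothetical, and the set of characters treated as equivalent to a straight apostrophe is an assumption, not a list the project has agreed on.

```python
# Assumed set of characters to treat as equivalent to a straight apostrophe.
EQUIVALENT_TO_APOSTROPHE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u02bc": "'",  # modifier letter apostrophe
    "\u2032": "'",  # prime
})


def canonical(token):
    """Return the token with apostrophe look-alikes normalized."""
    return token.translate(EQUIVALENT_TO_APOSTROPHE)


def is_cosmetic_change(old_token, new_token):
    """True if the tokens differ only by apostrophe-like characters."""
    return old_token != new_token and canonical(old_token) == canonical(new_token)
```

A differ could use `is_cosmetic_change` to hide or down-rank a change such as `don’t` becoming `don't`, while still reporting `.` becoming `,` as a real punctuation change.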

Labels: enhancement, experiment, never-stale
