Explore AST based diff tools, diff minimisation, etc #3

0xdevalias · 2024-01-30T01:11:30Z

There can be a lot of 'noise' when diffing minimised bundled code, as the bundler will often change the minified variable names it uses at times between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

https://git-scm.com/docs/diff-options/2.6.7#Documentation/diff-options.txt---patience
https://git-scm.com/docs/diff-options/2.6.7#Documentation/diff-options.txt---diff-algorithmpatienceminimalhistogrammyers
https://stackoverflow.com/questions/4045017/what-is-git-diff-patience-for
- https://web.archive.org/web/20200128181055/http://git.661346.n2.nabble.com/Bram-Cohen-speaks-up-about-patience-diff-td2277041.html
- https://bryanpendleton.blogspot.com/2010/05/patience-diff.html
- https://alfedenzo.livejournal.com/170301.html
  - Patience Diff, a brief summary
  - Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a new line are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost definitely correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on those parts that were left out, in the hopes that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that more closely corresponds to what humans would identify.
- https://fabiensanglard.net/git_code_review/diff.php
  - Git Source Code Review: Diff Algorithms

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would be text still (not the AST tree), but the actually diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minimised source builds looking for the 'real changes' between the builds (not just minimised variable names churning, etc)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are either pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in ast-grep/ast-grep#901

https://twitter.com/_devalias/status/1752257275585265862
- Curious if anyone has some good ideas/tools for doing fancy things to minimise the diff noise when diffing very large bundled/minified JavaScript code?
  
  Would appreciate any thoughts/tips/insights!
[feature] AST diff, with option to ignore variable/function name changes ast-grep/ast-grep#901
https://github.com/Wilfred/difftastic
- a structural diff that understands syntax
- https://difftastic.wilfred.me.uk/diffing.html
- https://difftastic.wilfred.me.uk/tree_diffing.html
  - This page summarises some of the other tree diffing tools available.
- Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) Wilfred/difftastic#631
https://github.com/afnanenayet/diffsitter
- A tree-sitter based AST difftool to get meaningful semantic diffs
- You can also filter which tree sitter nodes are considered in the diff through the config file.
- https://github.com/afnanenayet/diffsitter#node-filtering
  - You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.
    
    This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.
- Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) afnanenayet/diffsitter#819
https://github.com/dandavison/delta
- A syntax-highlighting pager for git, diff, and grep output
- https://dandavison.github.io/delta/tips-and-tricks/export-to-html.html
  - Save output with colors to HTML/PDF etc
  - https://gitlab.com/saalen/ansifilter
    - ANSI sequence filter
      Ansifilter handles text files containing ANSI terminal escape codes.
      The command sequences may be stripped or be interpreted to generate formatted output (HTML, RTF, TeX, LaTeX, BBCode, Pango)
https://chat.openai.com/c/db393c4c-4131-4a0f-899f-acebcb28e813
- Private ChatGPT chat exploring options for minimising diff noise
https://git-scm.com/docs/diff-format
- https://git-scm.com/docs/diff-format#_combined_diff_format
https://github.com/waigani/diffparser
- A Golang package for parsing git diffs
- https://pkg.go.dev/github.com/waigani/diffparser
https://github.com/sergeyt/parse-diff
- parse-diff
  Simple unified diff parser for JavaScript
https://github.com/yeonjuan/parse-git-diff
- parse-git-diff
  A parser for git diff
https://github.com/ecomfe/gitdiff-parser
- gitdiff-parser
  A fast and reliable git diff parser.
https://github.com/kpdecker/jsdiff
- A javascript text differencing implementation.
- http://incaseofstairs.com/jsdiff/ (playground)
- https://github.com/kpdecker/jsdiff#defining-custom-diffing-behaviors
https://github.com/AsyncBanana/microdiff
- A fast, zero dependency object and array comparison library. Significantly faster than most other deep comparison libraries and has full TypeScript support.
- https://news.ycombinator.com/item?id=29130661
  - The fastest object diff library in JavaScript
https://github.com/trailofbits/graphtage
- Graphtage
  A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.
- https://github.com/trailofbits/graphtage#why-does-graphtage-exist
  - Diffing tree-like structures with unordered elements is tough. Say you want to compare two JSON files. There are limited tools available, which are effectively equivalent to canonicalizing the JSON (e.g., sorting dictionary elements by key) and performing a standard diff. This is not always sufficient. For example, if a key in a dictionary is changed but its value is not, a traditional diff will conclude that the entire key/value pair was replaced by the new one, even though the only change was the key itself. See our documentation for more information.
https://semanticdiff.com/
https://www.difflens.com/
- The Developer's Diff Tool
- Language Aware Diffs for your GitHub Pull Requests.
- https://github.com/marketplace/difflens
- https://marketplace.visualstudio.com/items?itemName=DiffLens.difflens
https://code.visualstudio.com/docs/sourcecontrol/overview#_vs-code-as-git-editor
https://github.com/mmueller2012/awesome-diff-tools
- Awesome Diff Tools
  Awesome tools that show differences between files and folders.

Additional links to review

https://www.monperrus.net/martin/tree-differencing
https://handmade.network/p/366/diffest/
https://x.com/NikosTsantalis/status/1767540305618780569
https://arxiv.org/abs/2403.05939
https://github.com/tsantalis/RefactoringMiner
https://www.reddit.com/r/javascript/comments/2ydmtm/gumtreediff_diff_based_on_the_ast_of_the_code/
https://courses.cs.vt.edu/cs6704/spring17/slides_by_students/CS6704_gumtree_Kijin_AN_Feb15.pdf
https://github.com/GumTreeDiff/gumtree
https://github.com/ganarajpr/ast-diff
https://github.com/prettydiff/prettydiff
https://github.com/SpoonLabs/gumtree-spoon-ast-diff
https://github.com/balayette/ast-diff
https://arxiv.org/pdf/2011.10268
https://hal.science/hal-04423080/document
https://tekin.co.uk/2020/10/better-git-diff-output-for-ruby-python-elixir-and-more
https://news.ycombinator.com/item?id=24828509
https://news.ycombinator.com/item?id=24831437
https://bryanpendleton.blogspot.com/2010/04/more-study-of-diff-walter-tichys-papers.html
https://www.researchgate.net/publication/220439403_The_String-to-String_Correction_Problem_With_Block_Moves
https://news.ycombinator.com/item?id=39778412
https://semanticdiff.com/blog/semanticdiff-vs-difftastic/
https://codinuum.github.io/gallery-cca/
https://github.com/codinuum/cca/tree/develop?tab=readme-ov-file
https://ieeexplore.ieee.org/document/4656419
https://github.com/ErnestMazurin/object-diff-ast
https://codsen.com/os/ast-compare
https://github.com/codsen/codsen/tree/main/packages/ast-compare#readme
https://lib.rs/crates/diffsitter
https://blog.atlantistech.com/semantic-diff-tool/
- https://github.com/atlantistechnology/sdt
  - Semantic Diff Tool (sdt)
  - The command-line tool sdt compares source files to identify which changes create semantic differences in the program operation, and specifically to exclude many changes which cannot be functionally important to the operation of a program or library.
https://github.com/jhchen/fast-diff
- Fast Diff
- This is a simplified import of the excellent diff-match-patch library by Neil Fraser into the Node.js environment. The match and patch parts are removed, as well as all the extra diff options. What remains is incredibly fast diffing between two strings.
  
  The diff function is an implementation of "An O(ND) Difference Algorithm and its Variations" (Myers, 1986) with the suggested divide and conquer strategy along with several optimizations Neil added.
https://github.com/jonTrent/PatienceDiff
- PatienceDiff & PatienceDiffPlus
- A concise javascript implementation of the Patience Diff algorithm
  
  Plus, an implementation of a new algorithm dubbed Patience Diff Plus, which in addition to the usual Patience Diff, identifies lines that moved.
https://github.com/yoshuawuyts/playground-patience-diff
- playground-patience-diff
- Convert the Ruby patience diff algo into javascript.
- https://gist.github.com/yoshuawuyts/86b30f07ad30104edf4671b332883908
  - patience-diff.rb
https://github.com/yoshuawuyts/changes
- Changes
- Compute the differences between two states.
https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
- The Myers diff algorithm: part 1
https://blog.jcoglan.com/2017/02/15/the-myers-diff-algorithm-part-2/
- The Myers diff algorithm: part 2
https://blog.jcoglan.com/2017/02/17/the-myers-diff-algorithm-part-3/
- The Myers diff algorithm: part 3
https://blog.jcoglan.com/2017/03/22/myers-diff-in-linear-space-theory/
- Myers diff in linear space: theory
https://blog.jcoglan.com/2017/04/25/myers-diff-in-linear-space-implementation/
- Myers diff in linear space: implementation
https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/
- The patience diff algorithm
https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/
- Implementing patience diff
https://github.com/ltwlf/json-diff-ts
- json-diff-ts
- A diff tool for JavaScript written in TypeScript.

The text was updated successfully, but these errors were encountered:

HerringtonDarkholme · 2024-01-30T01:17:19Z

I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.

0xdevalias · 2024-01-30T01:22:50Z

I would recommend difftastic for this

rspack has already used it for checking diff between its output with that of webpack

@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:

https://github.com/web-infra-dev/rspack
- A fast Rust-based web bundler

Edit: Opened the following issue on difftastic:

Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) Wilfred/difftastic#631

Edit 2: And this one on diffsitter:

Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) afnanenayet/diffsitter#819

I was also just re-reading through the diffsitter README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:

https://github.com/afnanenayet/diffsitter
- A tree-sitter based AST difftool to get meaningful semantic diffs
- You can also filter which tree sitter nodes are considered in the diff through the config file.
- https://github.com/afnanenayet/diffsitter#node-filtering
  - You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.
    
    This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.

0xdevalias · 2024-01-30T06:42:32Z

Though playing with diffsitter just now, it's output format seems to leave a lot to be desired compared to typical git diff unified output; and it doesn't show any context/etc currently:

provide context in the form of unchanged lines or AST nodes afnanenayet/diffsitter#744
- provide context in the form of unchanged lines or AST nodes afnanenayet/diffsitter#744 (comment)
Support git identifiers when running under git diff afnanenayet/diffsitter#627
Add support for git diff afnanenayet/diffsitter#155
Options to only show functional changes in diffs afnanenayet/diffsitter#349
feature request: show output in 2 columns afnanenayet/diffsitter#158
Widen language support to all languages supported by tree-sitter afnanenayet/diffsitter#152
- Usage with JavaScript: Widen language support to all languages supported by tree-sitter afnanenayet/diffsitter#152 (comment)
How diffsitter works, How performant it is and What cases are exemplatory good and bad afnanenayet/diffsitter#149

eg. On a very minimal example, diffsitter:

⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Vs git diff (with delta):

⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
     "/payments/success-trial": [
       "static/chunks/pages/payments/success-trial-84597e34390c1506.js",
     ],
-    "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+    "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
     "/share/e/[[...shareParams]]": [
       "static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
     ],
@@ -136,6 +136,6 @@
   "static/chunks/5017-f7c5e142fc7f0516.js",
   "static/chunks/3975-37a9301353b29c5d.js",
   "static/chunks/3754-ae5dc2fb759ecfc1.js",
-  "static/chunks/5054-8ad3d13d663a6185.js"
+  "static/chunks/5054-e2060ddbea2abdb7.js"
 )),
   self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();

This is the diffsitter config I was using:

~/.config/diffsitter/config.json5

// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
  "formatting": {
    "default": "unified",
    "unified": {
      "addition": {
        "highlight": "green",
        "regular-foreground": "green",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "+ "
      },
      "deletion": {
        "highlight": "red",
        "regular-foreground": "red",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "- "
      }
    },
    "json": {
      "pretty_print": false
    },
    "custom": {}
  },
  "grammar": {
    "dylib-overrides": null,
    "file-associations": {
      "js": "typescript",
      "jsx": "tsx"
    },
  },
  "input-processing": {
    "split-graphemes": true,
    // You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
    "exclude-kinds": null,
    // "exclude-kinds": ["string"],
    // You can specifically allow only certain tree sitter node types
    "include-kinds": null
    // "include-kinds": ["method_definition"],
  },
  // Specify a fallback command if diffsitter can't parse the given input
  // files. This is invoked by diffsitter as:
  //
  // ```sh
  // ${fallback_cmd} ${old} ${new}
  // ```
  "fallback-cmd": null,
  // "fallback-cmd": "diff",
}

And this is the .gitconfig I was using to run it as a git difftool:

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
  cmd = diffsitter "$LOCAL" "$REMOTE"

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
  cmd = diffsitter --debug "$LOCAL" "$REMOTE"

Running it with --debug allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:

diffsitter debug output from minimal example

⇒ git difftool --tool diffsitter-debug --color=always HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
 2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js to string
 2024-01-30T07:09:30.627Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js to string
 2024-01-30T07:09:30.628Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.629Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
 2024-01-30T07:09:30.631Z INFO  libdiffsitter::render::unified > Using stack style vertical for title
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)

diffsitter debug output from large example

⇒ du -sh unpacked/_next/static/chunks/pages/_app.js
8.4M	unpacked/_next/static/chunks/pages/_app.js

⇒ cat unpacked/_next/static/chunks/pages/_app.js | wc -l
  249128

⇒ git difftool --tool diffsitter-debug HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js
 2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:07:58.287Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js to string
 2024-01-30T07:07:58.287Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.291Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:07:59.790Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
 2024-01-30T07:07:59.795Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js to string
 2024-01-30T07:07:59.795Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:59.800Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:08:01.160Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
 2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
 2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
 2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
 2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
 2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
 2024-01-30T07:20:22.502Z INFO  libdiffsitter::render::unified > Using stack style horizontal for title
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 52698 - 52698)
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Title string has length of 7
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing line 52698
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > End line 52698

// ..snip 17,524 lines..

 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > End line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301

// I'm not sure if this actually finished successfully, or just silently crashed at this point... I think it might have crashed, as I would have expected to see a log line like:
//   DEBUG libdiffsitter::render::unified > End hunk
// As well as the actual diff output

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.

As a point of comparison, difftastic was seemingly able to diff this file in ~2.46sec total:

⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#044-released-2nd-march-2023
- If a file exceeds DFT_BYTE_LIMIT, difftastic now displays its size in the header.
https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#042-released-15th-january-2023
- Fixed an issue with unwanted underlines with textual diffing when DFT_BYTE_LIMIT is reached.
https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#020-released-20th-february-2022
- difftastic will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with --byte-limit and DFT_BYTE_LIMIT.

It seems that when DFT_BYTE_LIMIT is exceeded, difftastic falls back to just doing a basic text diff, so those above timings aren't really a fair comparison. We can override this with --byte-limit and DFT_BYTE_LIMIT:

⇒ difftastic -h

Difftastic 0.52.0

..snip..

OPTIONS:

..snip..

        --byte-limit <LIMIT>           Use a text diff if either input file exceeds this size. [env:
                                       DFT_BYTE_LIMIT=] [default: 1000000]

..snip..

        --graph-limit <LIMIT>          Use a text diff if the structural graph exceed this number of
                                       nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]

Let's set this to something much higher than we need for now:

20*1024*1024 = 20971520

Updating our .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

Running that command again:

 ⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

It takes a bit longer, but then we hit a different error/limit:

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

From the help:

⇒ difft --help | subl

..snip..

        --parse-error-limit <LIMIT>
            Use a text diff if the number of parse errors exceeds this value.
            
            [env: DFT_PARSE_ERROR_LIMIT=]
            [default: 0]

There seem to be 674 matches of DFT_PARSE_ERROR_LIMIT in the output, which corresponds to the 674 chunks of output.

I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?

It seems we can dump the AST either from tree-sitter directly, or the difftastic filtered version of it as a debug:

⇒ difft --help | subl

..snip..

DEBUG OPTIONS:
        --dump-syntax <PATH>
            Parse a single file with tree-sitter and display the difftastic syntax tree.

        --dump-ts <PATH>
            Parse a single file with tree-sitter and display the tree-sitter parse tree.

Full tree-sitter AST:

⇒ time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js >   5.08s user 9.49s system 95% cpu 15.312 total

⇒ du -h _app.js-difftastic-tree-sitter.ast
224M	_app.js-difftastic-tree-sitter.ast

⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
 2533687

difftastic 'syntax' AST:

⇒ time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js >   434.80s user 43.33s system 93% cpu 8:29.31 total

⇒ du -h _app.js-difftastic.ast
2.2G	_app.js-difftastic.ast

⇒ cat _app.js-difftastic.ast | wc -l
 10571815

0xdevalias · 2024-02-04T01:05:49Z

After the above explorations, I ended up taking a different tact and started exploring the 'post processing git diff' approach.

Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff passed in from a file, or STDIN:

https://twitter.com/_devalias/status/1752260576066301956
- Some alternative potential workarounds I've considered are: pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names; then suppressing that chunk.
https://twitter.com/_devalias/status/1752646128372535530
- After exploring/playing around with a bunch of tools yesterday, and a hyperfocus of hacking out some PoC code this afternoon/evening, I think I have the elements of a workable solution! 🎉
  
  Just need to pull it all together and refine it once the 🧠 is refreshed/rested..
https://twitter.com/_devalias/status/1753322272293896385
- The code is still pretty hacky.. but pulled the PoC's together into a usable script for filtering the noise in the diff output!
  
  Still not perfect.. but I am really liking these stats so far!
  
  For an 8.4mb _app.js file (250,022 lines):
  
  Original diff: 33,399
  Filtered: 7,516
https://twitter.com/_devalias/status/1753322275984867513
- There are still a bunch of areas where my partial AST parsing is getting errors, which means more noise will slip through till I can fix those.. but this is already super useful!
  
  Hopefully will get a version of it cleaned up/committed/pushed sooner rather than later.
https://twitter.com/_devalias/status/1753711267137892684
- A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.
  - Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous 7100 lines to 6455 lines
  - Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.
    
    That’s currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.
    
    Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.
    
    I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..
  - I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.
    
    With those, we’re now down to 3913 lines.
    
    Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.
    
    Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.

Edit: Newer updates to this Twitter thread continue in this comment below: #3 (comment)

0xdevalias · 2024-02-24T03:01:17Z

With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.

Sometimes a section of code will churn and move a large amount of lines from one chunk, to another chunk, which won't get noticed by the minimiser currently.

It would be good if we could process these in a similar way to git diff's color-moved, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')

Edit: See also:

Highlight moved, novel regions Wilfred/difftastic#508

--

VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate similarity(deletion, insertion) for every deletion & insertion pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.

But intuitively demonstrating the moving relationship on TUI might be a challenge.

Originally posted by @QuarticCat in Wilfred/difftastic#508 (comment)

--

Have you looked into git diff's --color-moved at all?

https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt

Personally I tend to use the zebra mode of it. From my git aliases:

https://github.com/0xdevalias/dotfiles/blob/a59d60e18b58069fe1d2c21dc5ef42a1c0afe797/git/gitconfig.symlink#L248-L259
  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-color
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt
  diff-refactor = \
    -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal
Originally posted by @0xdevalias in Wilfred/difftastic#539 (comment)

0xdevalias · 2024-03-28T07:24:24Z

A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:

https://github.com/0xdevalias/poc-ast-tools

PoC scripts and tools for working with (primarily JavaScript) ASTs.

What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:

0xdevalias/poc-ast-tools@9f7071f

https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js

https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js

When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.

Originally posted by @0xdevalias in #10 (comment)

…ripts/ast-poc

0xdevalias · 2024-04-24T04:06:52Z

Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):

Diff 2 files, run them through the minimiser, save that diff to a file

Apply the output of the minimised diff to the original file, save that to a file

Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):
⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
Edit: I wrote a new scripts/fix-diff-headers.js (f7deec7) script that seems to get it closer to working..
⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js
That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:

An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.

Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.

Originally posted by @0xdevalias in #10 (comment)

0xdevalias · 2024-12-14T00:44:28Z

From some Twitter chat with @michaelskyba, talking about git diff --color-moved and related:

https://x.com/sucralose__/status/1866280320737132796
- @michaelskyba (Dec 10)
  Wait, --color-moved just works. I'm pretty sure @0xdevalias was using something similar. Maybe it's fine to just remove all moved lines from the diff and call it a day
https://x.com/sucralose__/status/1866282450197909758
- @michaelskyba (Dec 10)
  Uh there's no option for that? Online I only see people using heuristics like "same line added and removed same number of times" but I feel like git is doing something more sophisticated, especially given the whitespace-related options
https://x.com/_devalias/status/1866633367560872105
- @0xdevalias (Dec 11)
  Yeah.. the code for it is open source, so you could go try and understand how the algorithm works. I wanted to, but I never got around to spending the time to figure it out.
  
  I think I was going to see if I could port it out of git into something I could use in my own code
https://x.com/sucralose__/status/1866839846901387687
- @michaelskyba (Dec 12)
  Right now the time investment for that is lower ROI, but by EOY 2025, consumer LLMs will hopefully be able to just take the 600K lines of git's source code and make a nice single-paragraph summary of any individual feature. The input diffs will be 100x more complex too though
https://x.com/_devalias/status/1867721773070135799
- @0xdevalias (Dec 14)
  True.. but you can also probably find it pretty quickly with a couple of normal searches of the git source repo.
https://x.com/_devalias/status/1867721812924412021
- @0xdevalias (Dec 14)
  Tangentially related, just stumbled upon this again, I always forget that git ships with a bunch of extra contrib scripts that aren't exposed through the main CLI: https://github.com/git/git/tree/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/contrib/diff-highlight
https://x.com/_devalias/status/1867723412082503694
- @0xdevalias (Dec 14)
  Back to --color-moved: https://github.com/git/git/blob/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/diff.c#L5825-L5827
  
  diff_opt_color_moved: https://github.com/git/git/blob/2ccc89b0c16c51561da90d21cfbb4b58cc877bf6/diff.c#L5214-L5233
  
  Then it looks like options->color_moved is referenced a bunch throughout that diff.c
https://x.com/_devalias/status/1867724495718690871
- @0xdevalias (Dec 14)
  GitHub Copilot isn't the most useful.. at least from a first prompt
https://x.com/_devalias/status/1867724970648121836
- @0xdevalias (Dec 14)
  Claude 3.5 Sonnet does better, with that source file attached
  - https://claude.ai/chat/768b1075-5517-49b3-be72-11867c0bac9e
https://x.com/_devalias/status/1867725273720139901
- @0xdevalias (Dec 14)
  As does ChatGPT 4o
  - https://chatgpt.com/c/675ccd5a-dbf4-8008-b964-beb295ed07d0
https://x.com/_devalias/status/1867727351490261458
- @0xdevalias (Dec 14)
  Oh, I remember why I wanted to be able to re-implement --color-moved in my own code... it was because I wanted to be able to add an 'ignore identifiers' concept to it, based on parsing the code into an AST. Could also work around that by stabilising the identifiers before diffing
https://x.com/_devalias/status/1867731895058149712
- @0xdevalias (Dec 14)
  But parsing the AST's across so many huge files takes a lot of time, so I was leveraging git diff to 'lower the noise' before needing to do the AST parsing.. but I guess either way if I wanted to ignore the identifiers I would have to parse it at some point...
https://x.com/_devalias/status/1867731965576872100
- @0xdevalias (Dec 14)
  Jumping back to those AI summaries now that I've actually read them.. Claude's is nice as a higher level overview, but ChatGPT 4o is definitely getting into the nittier grittier details.
https://x.com/_devalias/status/1867733682695680329
- @0xdevalias (Dec 14)
  Well.. thats a side rabbithole distraction I don't have capacity for right now.. but based on the performance of those queries.. it seems way more plausible to have it generate a standalone implementation of --color-moved these days than it was when I was first exploring it
https://x.com/_devalias/status/1867736368971231739
- @0xdevalias (Dec 14)
  Lol.. I went to document all of the above on one of my GitHub comments, and then found this one in the same issue that talks about my color-moved thoughts: Explore AST based diff tools, diff minimisation, etc #3 (comment)
  
  And it also references a vscode 'moved code detection' feature that might also be useful to look at:
  - Explore AST based diff tools, diff minimisation, etc #3 (comment)
https://x.com/_devalias/status/1867736959747330065
- @0xdevalias (Dec 14)
  Lol.. apparently also the git diff -> minimiser -> patch -> apply on old version -> diff again with colour method I mentioned somewhere the other day:
  - Explore AST based diff tools, diff minimisation, etc #3 (comment)

Edit (2025-04-10): Decided to see how Claude 3.7 does with looking at diff.c and explaining how --color-moved works:

https://claude.ai/share/9fac6b65-2865-458d-9e60-e29b7dbb89f1
- Claude Response
  
  The git diff --color-moved feature is designed to help developers more easily identify code blocks that have been moved within a file. Let me walk through how it's implemented in the Git codebase based on the provided diff.c file.
  
  Overview of the Implementation
  
  The implementation involves several key components:
  1. Detection of moved lines: Finding blocks of lines that appear in different locations
  2. Marking those blocks with special flags: To indicate they are moved content
  3. Applying special coloring: Rendering the moved blocks with distinct colors
  4. Special handling for whitespace changes: Optional handling for whitespace differences
  Core Data Structures
  
  Key data structures for this functionality include:
  - struct emitted_diff_symbol: Represents a line in the diff output with metadata
  - struct moved_entry: Represents a line from the original content that might have been moved
  - struct moved_block: Represents a contiguous block of lines that moved together
  Detailed Implementation Flow
  
  1. Configuration and Setup
  
  When a user specifies --color-moved[=<mode>], the diff_opt_color_moved function processes this option:
```
static int diff_opt_color_moved(const struct option *opt,
                              const char *arg, int unset)
{
    struct diff_options *options = opt->value;

    if (unset) {
        options->color_moved = COLOR_MOVED_NO;
    } else if (!arg) {
        if (diff_color_moved_default)
            options->color_moved = diff_color_moved_default;
        if (options->color_moved == COLOR_MOVED_NO)
            options->color_moved = COLOR_MOVED_DEFAULT;
    } else {
        int cm = parse_color_moved(arg);
        if (cm < 0)
            return error(_("bad --color-moved argument: %s"), arg);
        options->color_moved = cm;
    }
    return 0;
}
```
  The color_moved field can have several values, including:
  - COLOR_MOVED_NO: Disabled
  - COLOR_MOVED_DEFAULT: Default coloring
  - COLOR_MOVED_BLOCKS: Color only the old/new blocks
  - COLOR_MOVED_ZEBRA: Alternating colors for different blocks
  - COLOR_MOVED_ZEBRA_DIM: Like zebra but with dimmed uninteresting parts
  2. Diff Generation Process
  
  During the diff generation, when the color-moved feature is active:
  1. The system first generates all the diff symbols normally
  2. Instead of directly emitting them, it collects them in options->emitted_symbols
  3. After all symbols are collected, it performs the move detection
  4. Finally, it renders the symbols with appropriate coloring
  3. Move Detection Algorithm
  
  The move detection happens in diff_flush_patch_all_file_pairs:
```
if (o->color_moved) {
    struct mem_pool entry_pool;
    struct moved_entry_list *entry_list;

    mem_pool_init(&entry_pool, 1024 * 1024);
    entry_list = add_lines_to_move_detection(o, &entry_pool);
    mark_color_as_moved(o, entry_list);
    if (o->color_moved == COLOR_MOVED_ZEBRA_DIM)
        dim_moved_lines(o);

    mem_pool_discard(&entry_pool, 0);
    free(entry_list);
}
```
  The core of the algorithm is in add_lines_to_move_detection() and mark_color_as_moved():
  
  a. Line Hashing and Identification
  
  add_lines_to_move_detection() builds a specialized data structure that:
  - Scans all added and removed lines in the diff
  - Hashes the content of each line for efficient comparison
  - Groups lines into a hashmap to quickly find matching lines
  - Connects adjacent lines to help identify blocks
  b. Block Matching
  
  mark_color_as_moved() uses the data structure to:
  - Find blocks of lines that were moved
  - Mark these blocks with special flags (like DIFF_SYMBOL_MOVED_LINE)
  - Alternate between different styles for adjacent blocks (when zebra mode is on)
```
static void mark_color_as_moved(struct diff_options *o,
                              struct moved_entry_list *entry_list)
{
    // Find matching blocks of moved code
    // Mark them with DIFF_SYMBOL_MOVED_LINE flags
    // Apply alternating styles when in zebra mode
}
```
  c. Special Whitespace Handling
  
  If color_moved_ws_handling is enabled, the code adds special handling for whitespace changes:
```
if (o->color_moved_ws_handling & COLOR_MOVED_WS_ALLOW_INDENTATION_CHANGE)
    fill_es_indent_data(l);
```
  This allows moved blocks with only whitespace changes to still be recognized.
  
  4. Rendering the Output
  
  Finally, when rendering the diff, different colors are applied based on the flags:
```
case DIFF_SYMBOL_PLUS:
    switch (flags & (DIFF_SYMBOL_MOVED_LINE |
                   DIFF_SYMBOL_MOVED_LINE_ALT |
                   DIFF_SYMBOL_MOVED_LINE_UNINTERESTING)) {
    case DIFF_SYMBOL_MOVED_LINE |
         DIFF_SYMBOL_MOVED_LINE_ALT |
         DIFF_SYMBOL_MOVED_LINE_UNINTERESTING:
        set = diff_get_color_opt(o, DIFF_FILE_NEW_MOVED_ALT_DIM);
        break;
    case DIFF_SYMBOL_MOVED_LINE |
         DIFF_SYMBOL_MOVED_LINE_ALT:
        set = diff_get_color_opt(o, DIFF_FILE_NEW_MOVED_ALT);
        break;
    /* ... other cases for other combinations of flags ... */
    }
```
  Summary
  
  The --color-moved feature implements a sophisticated algorithm that:
  1. Identifies content blocks that appear to have moved rather than being deleted/added
  2. Marks them with special flags
  3. Renders them with distinctive colors to make the movement visually apparent
  4. Optionally handles whitespace differences intelligently
  This helps developers understand diffs more quickly by visually distinguishing between content that's genuinely new/removed and content that's simply been relocated.

Some more Twitter chat with @michaelskyba, talking about diff minimisation:

https://x.com/sucralose__/status/1866270226951688481
- @michaelskyba (Dec 10)
  hhh I had a seemingly robust setup from iterating with o1 but the diff was still too large. I thought the identifier differences might be fairly limited and detectable after all, but I still don't know the pattern yet. So for now I'm going with the simple, general normalization
https://x.com/_devalias/status/1866629461061537982
- @0xdevalias (Dec 11)
  Instead of trying to figure identifiers by regex/etc, I went with AST parsing of the diff lines in my diff minifier. I first tried using a 'strict' parser (babel), and coming up with a bunch of hacks to 'fix' the diff lines to make them parse; but that was super hacky.
https://x.com/_devalias/status/1866629664518836679
- @0xdevalias (Dec 11)
  I then tried doing it with less strict parser, and that seemed to work without needing any hacks.
  
  It's possible I could also maybe just do it using the tokenizer part and not even need the parser, but I don't think I ever tested that part deeply.
https://x.com/_devalias/status/1866630342742061226
- @0xdevalias (Dec 11)
  Obviously there's this old issue of mine: Explore AST based diff tools, diff minimisation, etc #3
  
  And this of yours: Simple diff analysis based on strings and identifiers #10
  
  Both of which I believe contain some of the notes/thoughts around this process, eg. from a quick search, this comment: Simple diff analysis based on strings and identifiers #10 (comment)
https://x.com/_devalias/status/1866630553367351470
- @0xdevalias (Dec 11)
  And this comment seems to link to where I uploaded my PoC scripts: Simple diff analysis based on strings and identifiers #10 (comment)
  
  https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js
  
  https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js
https://x.com/sucralose__/status/1866847921661956247
- @michaelskyba (Dec 12)
  Hmm, even if hacky, the approach is interesting. It's good design on acorn's part that it separates its different parsing modes like this; if diff did the same, the moved lines interpretation to wrappers or jsdiff etc. would be simpler
https://x.com/_devalias/status/1867739336651026611
- @0xdevalias (Dec 14)
  nods true. Though acorn is designed as a library to be consumed by others, whereas git's diff internals aren't really.
https://x.com/_devalias/status/1867739549176303779
- @0xdevalias (Dec 14)
  I suspect if you dug into the c code there would likely be aspects you could rip out and sort of reuse as a library.. but at that point may as well reimplement the good bits; which ties back into our other thread:
  - https://x.com/_devalias/status/1866633367560872105

0xdevalias · 2025-04-10T04:40:03Z

https://semanticdiff.com/

https://semanticdiff.com/docs/understand-diff/

https://app.semanticdiff.com/

https://marketplace.visualstudio.com/items?itemName=semanticdiff.semanticdiff

https://github.com/Sysmagine

https://github.com/Sysmagine/SemanticDiff

semanticdiff.com/blog/semanticdiff-vs-difftastic

Not sure if I realised this earlier and forgot, but seems semanticdiff has an online version that allows us to see GitHub commits/PRs/etc (and from a quick peek in DevTools/curl, seems that it's API could be potentially used):

https://app.semanticdiff.com/

Here's an example of the last build I committed to chatgpt-source-watch with GitHub's diff, and SemanticDiff's:

It seemed to work ok on some small files (that were already not too bad with GitHub), but when I tried to get it to parse some of the bigger chunk files, it generally just gave an error:

The computation took too long.

{
  "type": "error",
  "error": {
    "type": "timeout_error",
    "message": "The provided timeout was reached"
  }
}

0xdevalias · 2025-04-10T05:11:48Z

From Twitter / X:

https://x.com/_devalias/status/1910197309679575226
- @0xdevalias (April 10th, 2025)
  
  Was skimming through old notes on diffing stuff today ( Explore AST based diff tools, diff minimisation, etc #3 ), and thought this project from @trailofbits might be of interest:
  
  https://github.com/trailofbits/graphtage
  
  A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV.

0xdevalias · 2025-04-10T07:12:08Z

RE: My 'post processing git diff' diff minimiser, as referenced in previous comments such as:

2024-02-04: Explore AST based diff tools, diff minimisation, etc #3 (comment)
2024-02-24: Explore AST based diff tools, diff minimisation, etc #3 (comment)
2024-03-28: Explore AST based diff tools, diff minimisation, etc #3 (comment)
2024-12-14: Explore AST based diff tools, diff minimisation, etc #3 (comment)

Since it's been a while since I looked at this, and the state of LLMs being able to handle large/complex code has come a long way since the start of 2024 (When OpenAI's GPT-4 was... not that great for it), I figured I'd ask a few different LLMs to evaluate the diff minimiser code with this prompt, and attaching the following two files:

What are these scripts attempting to do, how effective do they seem at it, and how can we make them better?

Here are the results for that single starter prompt, with no deeper follow up questions:

Claude 3.7:
- https://claude.ai/share/fdb6da96-ea24-4e1f-b58a-8d424e560d41
  - My private chat link: https://claude.ai/chat/d44d5f07-5f5f-46b4-ac7c-4003d3a26b72
- Response
  I'll analyze these JavaScript scripts to understand their purpose, effectiveness, and potential improvements.
  
  What These Scripts Are Attempting to Do
  
  These scripts (diff-minimiser.js and diff-minimiser-poc-acorn.js) are trying to build a tool that intelligently filters Git diffs to reduce noise from non-meaningful changes. They specifically target scenarios like:
  1. When minified JavaScript variable names change but the code functionality remains identical
  2. When code is structurally equivalent after normalization of identifiers
  3. When changes are merely cosmetic or represent refactoring with no semantic difference
  The core concept is to parse code fragments from diff chunks, normalize identifiers (renaming all variables to generic names like symbol1, symbol2), and then compare the normalized versions. If they're identical after normalization, the tool filters them out as "non-meaningful" changes.
  
  How Effective They Seem
  
  The scripts show significant promise but are still in development phase:
  1. Core Functionality: They successfully parse Git diffs, group related changes, normalize JavaScript code fragments, and can filter out chunks with only identifier changes.
  2. Parser Robustness: They use multiple parsing approaches (Babel parser, acorn's loose parser, tokenization) to handle partial code fragments which is smart since diffs often contain incomplete code.
  3. Debugging: The extensive debug logging shows the authors are actively testing and refining the algorithm.
  4. Statistics: The code calculates effectiveness metrics like how many groups were identified as having equivalent normalized code.
  The main challenge appears to be parsing incomplete/partial JavaScript snippets from diffs reliably, which is inherently difficult. The extensive TODOs in diff-minimiser.js indicate several known edge cases that need addressing.
  
  How to Make Them Better
  1. Refine the parsing strategy:
    
    Fully leverage the acornLoose parser which seems most promising for handling fragments
    
    Implement the tokenizer-based approach (normalizeIdentifierNamesInCodeV3) as the default since it appears more robust than full AST parsing
  2. Improve the diff grouping algorithm:
    
    Implement the suggested compareNormalizedCodeDiffsV2 to handle context lines better
    
    Use the diffArrays results to identify minimal sets of actual changes
  3. Handle more edge cases:
    
    Implement detection for shifted lines (same code moved up/down slightly)
    
    Add pattern recognition for common minifier patterns like variable renaming sequences
  4. Code structure improvements:
    
    Refactor the main logic into a more modular design
    
    Implement a proper streaming processor rather than loading entire diffs in memory
    
    Consolidate the multiple parsing approaches into a single strategy with fallbacks
  5. Testing and validation:
    
    Add unit tests for each component (parser, normalizer, differ)
    
    Create a benchmark dataset of typical diffs to measure improvement
  6. Additional features:
    
    Add command-line options to control filtering strictness
    
    Support more languages beyond JavaScript
    
    Preserve Git's color-moved feature when present
  Specific Implementation Improvements
  1. Complete the arrayDiff processing in compareNormalizedCodeDiffs to minimize the changes returned based on token-level differences
  2. Implement the planned consolidated approach of filtering chunks:
    
    function processChunk(chunk) { const groups = groupChunkChanges(chunk.changes); const filteredGroups = groups.filter(group => { if (group.isContextGroup) return true; if (!group.isEqualModifiedLinesCount) return true; const normalizedResult = compareNormalizedCodeDiffs(group.changes); return !normalizedResult.isNormalisedCodeEqual; }); return reconstructChunkWithFilteredGroups(chunk, filteredGroups); }
  3. Prioritize the tokenizer approach over full AST parsing for better robustness with code fragments:
    
    function normalizeCode(code) { try { return normalizeIdentifierNamesInCodeV3(code).mappedTokensString; } catch (err) { try { return normalizeIdentifierNamesInCodeV2(code); } catch (err2) { return code; // Fallback to original if parsing fails } } }
  These scripts show a sophisticated approach to a real developer pain point, and with the improvements outlined above, could become an essential tool for developers working with minified JavaScript codebases.
Gemini 2.5 Pro Preview 03-25 (via Google AI Studio)
- https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221R2N_Sgb09cV8WYaey9t9Y7ysMndFVjFK%22%5D,%22action%22:%22open%22,%22userId%22:%22102649988257823587404%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing,
- Response
  Okay, let's break down these two scripts.
  
  Overall Goal
  
  Both scripts are working towards the same ultimate goal: minimizing JavaScript diffs, especially for minified or transpiled code, by identifying and suppressing changes that are likely due to superficial differences like variable renaming, rather than significant logic changes. The core idea is to parse or tokenize code snippets from a diff, normalize them (e.g., rename all variables to generic placeholders like symbol1, symbol2), and then compare the normalized versions. If the normalized versions of an added block and a removed block are identical, the script assumes the change was non-functional and filters it out of the final diff output.
  
  Analysis of diff-minimiser-poc-acorn.js
  1. What it's attempting to do:
    
    This script is a Proof-of-Concept (POC) specifically exploring the use of the acorn-loose parser.
    
    acorn-loose is designed to be error-tolerant, making it potentially suitable for parsing potentially incomplete or slightly malformed code snippets often found in diff chunks.
    
    The script demonstrates a workflow:
    
    Parse potentially broken code with acorn-loose.
    
    Use estraverse to traverse the resulting Abstract Syntax Tree (AST) and remove "dummy" nodes that acorn-loose inserts during error recovery.
    
    Use escodegen to generate JavaScript code back from the cleaned Acorn AST.
    
    (Potentially) Take the regenerated code and parse it again with a stricter parser like @babel/parser (though the immediate benefit of this last step within the POC isn't fully realized).
    
    It also includes stubbed/POC versions of normalization functions (normalizeIdentifierNamesInCode) using Babel, likely carried over from other experiments.
  2. How effective it seems:
    
    As a POC: It's effective in demonstrating the technical feasibility of using acorn-loose -> estraverse -> escodegen to process potentially broken code snippets and generate potentially cleaner, more valid code.
    
    Practicality: It doesn't actually use this capability to minimize a diff. It works on a hardcoded snippet and just logs the intermediate steps. It proves the tools can work together but doesn't apply them to the actual problem yet.
  3. How to make it better:
    
    Integration: The useful techniques demonstrated here (using acorn-loose for error tolerance) should be integrated as one possible strategy within the main diff-minimiser.js script, specifically within the normalization functions (as seems to have been attempted in normalizeIdentifierNamesInCodeV2).
    
    Clarify Goal: The purpose of the final Babel parse/generate step within this specific POC is unclear. If the goal is just to get a cleaned AST/code string from acorn-loose, the Babel steps might be unnecessary overhead unless the intention is specifically to leverage Babel's normalization or transformation capabilities after the initial cleanup.
    
    Input: Replace the hardcoded snippet with logic to accept dynamic input (e.g., diff chunks).
  Analysis of diff-minimiser.js
  1. What it's attempting to do:
    
    This is the main script attempting the full diff minimization task.
    
    It reads a diff from a file or stdin.
    
    It uses parse-diff to parse the unified diff format into structured data (files, chunks, changes).
    
    It iterates through chunks and groups consecutive changes, paying special attention to groups containing both additions (+) and deletions (-).
    
    For groups with an equal number of added and deleted lines, it extracts the code content of these lines.
    
    It normalizes the extracted added code and deleted code separately. It explores three main normalization strategies:
    
    normalizeIdentifierNamesInCode: Uses Babel parser/traverse/generator. Prone to failing on invalid snippets, requires complex prepareCodeForASTParsing helpers.
    
    normalizeIdentifierNamesInCodeV2: Uses acorn-loose parser, estraverse, escodegen. More error-tolerant but still requires AST manipulation and preparation (prepareCodeForASTParsingV2).
    
    normalizeIdentifierNamesInCodeV3: (Currently Used) Uses acorn.tokenizer. Avoids full parsing, tokenizes the code, and normalizes only identifier tokens (specifically short ones <= 2 chars). This is likely the most robust approach for handling snippets that won't parse cleanly.
    
    It compares the normalized strings (normalisedCodeByType.del === normalisedCodeByType.add).
    
    If they match, it flags the group (isNormalisedCodeEqual = true).
    
    It filters out groups flagged as equal (and potentially surrounding context lines if they become isolated).
    
    It reconstructs and prints the filtered diff.
    
    It includes experiments using jsdiff's diffArrays to compare normalized lines within a group, although the results aren't currently used to further reduce the diff output.
  2. How effective it seems:
    
    Concept: The core concept is solid and addresses a real pain point in reviewing generated code diffs.
    
    Diff Parsing: Handles standard diffs correctly.
    
    Grouping: Basic grouping of adjacent changes works. It correctly identifies the primary target: adjacent add/del blocks of equal size.
    
    Normalization Strategy: The evolution towards the tokenizer-based approach (V3) is a pragmatic and likely effective choice given the difficulty of reliably parsing arbitrary diff snippets. Only normalizing short identifiers is a clever heuristic.
    
    Comparison & Filtering: The comparison of normalized strings and subsequent filtering appears logically sound based on the isNormalisedCodeEqual flag.
    
    Robustness: The script's main weakness lies in reliably handling the wide variety of potentially broken or incomplete code snippets found in diffs. The prepareCode... functions are complex heuristics that might not always work or could even introduce new errors. The tokenizer approach (V3) mitigates this significantly but is less powerful than full AST analysis.
    
    Limitations (Acknowledged in TODOs):
    
    Doesn't handle cases where added/deleted blocks are separated by context lines (code movement).
    
    Doesn't handle groups where add/del counts differ.
    
    Doesn't yet use the diffArrays output to minimize within a group that isn't a perfect match after normalization.
    
    Preserving git diff --color-moved information through the pipeline is challenging.
  3. How to make it better:
    
    Refine Tokenizer Normalization (V3): This seems the most promising path.
    
    Expand Normalization: Consider normalizing whitespace between tokens or other token types if needed.
    
    Refine Pre-processing: Ensure prepareCodeForASTParsingV2PreProcess robustly cleans snippets for tokenization, not necessarily full parsing. Balancing brackets might still be useful here.
    
    Utilize diffArrays: Implement the logic suggested in the TODOs. After normalizing lines in a group using V3, use diffArrays to compare normalisedDels and normalisedAdds. Reconstruct the group in the output diff including only the lines that actually differ after normalization, potentially shrinking many groups instead of just filtering perfect matches.
    
    Code Movement Detection: This is complex.
    
    Hashing: Calculate a hash of the normalized token string for each added/deleted line or block. Store these hashes. Look for identical hashes between deleted blocks and added blocks elsewhere in the file.
    
    Leverage Git: Explore if git diff --color-moved=... output can be parsed reliably or if there's a way to invoke git plumbing commands to get this info programmatically. This might be the most robust way if feasible.
    
    Improve Grouping/Chunk Handling: If code movement is detected, the concept of processing only adjacent add/del blocks needs rethinking. You might need to process adds/dels globally within a file.
    
    Refactor & Cleanup: Remove the less effective normalization strategies (V1, V2 unless a specific use case remains) and their associated preparation functions to simplify the codebase.
    
    Configuration: Add command-line flags for tuning, e.g., the identifier length threshold for normalization (<= 2), whether to ignore whitespace differences, etc.
    
    Error Handling: Improve error reporting for failed normalization/tokenization steps.
    
    Output: Ensure the final diff output is valid and that the @@ ... @@ headers accurately reflect the remaining lines and their original positions as closely as possible (this is hard when lines are removed).
  In summary, diff-minimiser.js is a capable script tackling a complex problem. Its effectiveness hinges on the robustness of its code snippet normalization strategy. The current tokenizer-based approach (V3) is a good direction. The next major improvements would be utilizing diffArrays for partial matching within groups and tackling code movement detection. The POC script served its purpose in exploring acorn-loose as a potential tool for this.

0xdevalias · 2025-04-17T06:54:06Z

RE: My 'post processing git diff' diff minimiser

Some more updates, from a private chat the other day:

If you're wanting to explore tools in the diff minimisation space, there's the code for what I was working on in the past as well. Some notes/links in this issue/comment:

Explore AST based diff tools, diff minimisation, etc #3 (comment)

And the latter part of this one:

Explore AST based diff tools, diff minimisation, etc #3 (comment)

This was a hacky side experiment:

https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser-poc-acorn.js

This was the main one I was working on:

https://github.com/0xdevalias/poc-ast-tools/blob/main/diff-minimiser.js

I haven't touched either in a long time now though.. and LLMs have come a long way since, so they'd likely be able to help improve the code a lot better a lot easier these days

It basically lets git do the initial diff, and then post-processes that to minimise it even further

And then some further updates from today:

Did you get a chance to look at this at all? I just re-ran it on one of my old test files, and it cut the diff down from 33,399 lines to 4,182 lines.

There's still definitely some noise in that.. but even if you then ran that through whatever your normal diff viewer tool is that minimises stuff, it'd cut down on a lot of noise probably
⇒ cat ../chatgpt-source-watch/_app-git.diff | wc -l
33399

⇒ cat ../chatgpt-source-watch/_app-git.diff | ./diff-minimiser.js 2>/dev/null | wc -l
4182
Noticed that it wasn't de-noising identifier churn in class private fields/similar, so made a few tweaks and now that diff is down to:
⇒ cat ../chatgpt-source-watch/_app-git.diff | ./diff-minimiser.js 2>/dev/null | wc -l
3991
These were the relevant changes to get those extra line savings:

Not sure if this one was needed specifically for this, but it bumps some of the relevant dependencies (eg. acorn specifically): 0xdevalias/poc-ast-tools@91e94ee

Then this change updates the acorn parsing to use ecmaVersion: 'latest' so it supports more modern syntax than it was before: 0xdevalias/poc-ast-tools@994f73b

And then this one updates the normalizeIdentifierNamesInCodeV3 function so it handles private identifiers (like a class's #foo): 0xdevalias/poc-ast-tools@8d13027

That last one was the main fix that got that extra 191 lines of noise reduction I mentioned above.

0xdevalias · 2025-04-21T03:33:44Z

Adding on to the old (Feb 4, 2024) Twitter thread summarised above in #3 (comment)

https://x.com/_devalias/status/1914159016940585451
- @0xdevalias (April 21, 2025)
  
  Randomly decided to look at my diff minimiser code again the other day. Not sure what state it was in when I posted the last updates here, but the version I had locally seemed to be using acorn's tokeniser instead of the full babel AST parsing that I originally started with.
https://x.com/_devalias/status/1914159020069491132
- @0xdevalias (April 21, 2025)
  
  And with what seems to be the same _app-git.diff that I was testing previously (33,399) lines, that was seemingly cutting things down to 4,182 lines (without any unsafe hacks)
https://x.com/_devalias/status/1914159022456107329
- @0xdevalias (April 21, 2025)
  
  But I noticed that it wasn't de-noising identifier churn in class private fields/similar, so made a few tweaks and now that diff is down to 3,991 lines! 🎉
https://x.com/_devalias/status/1914159025169797121
- @0xdevalias (April 21, 2025)
  
  At some stage I should probably check what an actual 'fully de-noised' line count would look like for that sample; but from a quick skim I can still seem some areas for improvement around moved code/similar.
https://x.com/_devalias/status/1914159027568927079
- @0xdevalias (April 21, 2025)
  
  And maybe another 'unsafe' mode or complimentary tool for de-noising / summarising module identifier churn; as that seems to add quite a few lines of noise when a module is imported a lot.
https://x.com/_devalias/status/1914159029942931527
- @0xdevalias (April 21, 2025)
  
  Anyway, here's the summary notes I made in a new comment on the relevant issue on my repo:
  - Explore AST based diff tools, diff minimisation, etc #3 (comment)

Also this standalone RT just in case the old thread wasn't super visible:

https://x.com/_devalias/status/1914162225193140733
- @0xdevalias (April 21, 2025)
  
  Made a few more tweaks to my PoC web app build diff minimiser script the other day.
  
  It now can de-noise my 33,399 line sample diff down to just 3,991 lines; much easier to review/see the salient changes.
  
  https://x.com/_devalias/status/1914159022456107329

As well as a crosspost to BlueSky:

https://bsky.app/profile/devalias.net/post/3lncard5b2c26

0xdevalias · 2025-05-29T05:14:38Z

See the following blog posts for a deeper dive into the Myers and Patience diff algorithms:

https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/

The Myers diff algorithm: part 1

https://blog.jcoglan.com/2017/02/15/the-myers-diff-algorithm-part-2/

The Myers diff algorithm: part 2

https://blog.jcoglan.com/2017/02/17/the-myers-diff-algorithm-part-3/

The Myers diff algorithm: part 3

https://blog.jcoglan.com/2017/03/22/myers-diff-in-linear-space-theory/

Myers diff in linear space: theory

https://blog.jcoglan.com/2017/04/25/myers-diff-in-linear-space-implementation/

Myers diff in linear space: implementation

https://blog.jcoglan.com/2017/09/19/the-patience-diff-algorithm/

The patience diff algorithm

https://blog.jcoglan.com/2017/09/28/implementing-patience-diff/

Implementing patience diff

Originally posted by @0xdevalias in #3 (comment)

0xdevalias mentioned this issue Jan 30, 2024

Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) Wilfred/difftastic#631

Open

0xdevalias added the enhancement New feature or request label Feb 24, 2024

0xdevalias mentioned this issue Feb 24, 2024

Update README to align with current scripts/process #7

Open

0xdevalias self-assigned this Feb 29, 2024

veilm mentioned this issue Mar 20, 2024

Simple diff analysis based on strings and identifiers #10

Open

0xdevalias referenced this issue in 0xdevalias/poc-ast-tools Mar 28, 2024

add hacky WIP/PoC diff minimiser scripts from chatgpt-source-watch/sc…

9f7071f

…ripts/ast-poc

0xdevalias mentioned this issue Sep 13, 2024

More deterministic renames across different versions of the same code jehna/humanify#97

Open

This was referenced Mar 10, 2025

Integrate wakaru and/or webcrack JS unminimisers (and maybe humanify too) #2

Open

"stable" identifier demangling j4k0xb/webcrack#154

Open

0xdevalias changed the title ~~Explore AST based diff tools~~ Explore AST based diff tools, diff minimisation, etc Apr 10, 2025

0xdevalias mentioned this issue Apr 10, 2025

Script to identify language/translation files + list them + diff them #4

Open

0xdevalias mentioned this issue Feb 29, 2024

Script to calculate the raw/minimised lines of diff change for each file #5

Open

This was referenced Apr 22, 2025

[diff-minimiser] Idea: Lightweight tracking and AST-based renaming of churned identifiers via localised diff analysis 0xdevalias/poc-ast-tools#4

Open

Explore AI / LLM based overview / summary of changes between web app builds #17

Open

Uh oh!

Explore AST based diff tools, diff minimisation, etc #3

Explore AST based diff tools, diff minimisation, etc #3

Comments

0xdevalias commented Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Musings

See Also

Additional links to review

HerringtonDarkholme commented Jan 30, 2024

Uh oh!

0xdevalias commented Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Feb 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Feb 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Mar 28, 2024

Uh oh!

0xdevalias commented Apr 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Dec 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview of the Implementation

Core Data Structures

Detailed Implementation Flow

1. Configuration and Setup

2. Diff Generation Process

3. Move Detection Algorithm

a. Line Hashing and Identification

b. Block Matching

c. Special Whitespace Handling

4. Rendering the Output

Summary

Uh oh!

0xdevalias commented Apr 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Apr 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented Apr 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What These Scripts Are Attempting to Do

How Effective They Seem

How to Make Them Better

Specific Implementation Improvements

Uh oh!

0xdevalias commented Apr 17, 2025

Uh oh!

0xdevalias commented Apr 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0xdevalias commented May 29, 2025

Uh oh!

0xdevalias commented Jan 30, 2024 •

edited

Loading

0xdevalias commented Jan 30, 2024 •

edited

Loading

0xdevalias commented Jan 30, 2024 •

edited

Loading

0xdevalias commented Feb 4, 2024 •

edited

Loading

0xdevalias commented Feb 24, 2024 •

edited

Loading

0xdevalias commented Apr 24, 2024 •

edited

Loading

0xdevalias commented Dec 14, 2024 •

edited

Loading

0xdevalias commented Apr 10, 2025 •

edited

Loading

0xdevalias commented Apr 10, 2025 •

edited

Loading

0xdevalias commented Apr 10, 2025 •

edited

Loading

0xdevalias commented Apr 21, 2025 •

edited

Loading