Skip to content

Explore AST based diff tools, diff minimisation, etc #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
0xdevalias opened this issue Jan 30, 2024 · 14 comments
Open

Explore AST based diff tools, diff minimisation, etc #3

0xdevalias opened this issue Jan 30, 2024 · 14 comments
Assignees
Labels
enhancement New feature or request

Comments

@0xdevalias
Copy link
Owner

0xdevalias commented Jan 30, 2024

There can be a lot of 'noise' when diffing minimised bundled code, as the bundler will often change the minified variable names it uses at times between builds (even if the rest of the code hasn't changed)

We can attempt to reduce this by using non-default git diff modes such as patience / histogram / minimal:

⇒ git diff --diff-algorithm=default -- unpacked/_next/static/chunks/pages/_app.js | wc -l
  116000

⇒ git diff --diff-algorithm=patience -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35826

⇒ git diff --diff-algorithm=histogram -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35835

⇒ git diff --diff-algorithm=minimal -- unpacked/_next/static/chunks/pages/_app.js | wc -l
   35844

Musings

⭐ Suggestion

It would be cool if ast-grep was able to show a diff between 2 files, but do it using the AST rather than just a raw text compare. Ideally we would be able to provide options to this, such as ignoring chunks where the only change is to a variable/function name (eg. for diffing minimised JavaScript webpack builds)

Ideally the output would be text still (not the AST tree), but the actually diffing could be done at the AST level.

💻 Use Cases

This would be really useful for minimising the noise when diffing minimised source builds looking for the 'real changes' between the builds (not just minimised variable names churning, etc)

Looking through current diff output formats shows all of the variable name changes as well, which equates to a lot of noise while looking for the relevant changes.

Some alternative potential workarounds I've considered are either pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names, and then suppressing that chunk. Currently I'm just relying on git diff --diff-algorithm=minimal -- thefile.js

Originally posted by @0xdevalias in ast-grep/ast-grep#901

See Also

Additional links to review

@HerringtonDarkholme
Copy link

I would recommend difftastic for this! Actually rspack has already used it for checking diff between its output with that of webpack.

@0xdevalias
Copy link
Owner Author

0xdevalias commented Jan 30, 2024

I would recommend difftastic for this

rspack has already used it for checking diff between its output with that of webpack

@HerringtonDarkholme Interesting.. do you know if they did so while suppressing the 'noise' of changed variables? Or was it more just generally to ensure they were doing compatible things. I had a quick google, but didn't seem to turn up anything specific beyond the repo/etc:

Edit: Opened the following issue on difftastic:

Edit 2: And this one on diffsitter:


I was also just re-reading through the diffsitter README and noticed this section that I somehow missed in the past; which sounds like it might be exactly like what I want:

  • https://github.com/afnanenayet/diffsitter
    • A tree-sitter based AST difftool to get meaningful semantic diffs

    • You can also filter which tree sitter nodes are considered in the diff through the config file.

    • https://github.com/afnanenayet/diffsitter#node-filtering
      • You can filter the nodes that are considered in the diff by setting include_nodes or exclude_nodes in the config file. exclude_nodes always takes precedence over include_nodes, and the type of a node is the kind of a tree-sitter node.

        This feature currently only applies to leaf nodes, but we could exclude nodes recursively if there's demand for it.

@0xdevalias
Copy link
Owner Author

0xdevalias commented Jan 30, 2024

Though playing with diffsitter just now, it's output format seems to leave a lot to be desired compared to typical git diff unified output; and it doesn't show any context/etc currently:

eg. On a very minimal example, diffsitter:

image

⇒ git difftool --tool diffsitter HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-AOnHKy/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-ahyGIo/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Vs git diff (with delta):

image

⇒ git diff HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
diff --git a/unpacked/_next/static/[buildHash]/_buildManifest.js b/unpacked/_next/static/[buildHash]/_buildManifest.js
index 851a8f0..5004cc7 100644
--- a/unpacked/_next/static/[buildHash]/_buildManifest.js
+++ b/unpacked/_next/static/[buildHash]/_buildManifest.js
@@ -78,7 +78,7 @@
     "/payments/success-trial": [
       "static/chunks/pages/payments/success-trial-84597e34390c1506.js",
     ],
-    "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],
+    "/search": ["static/chunks/pages/search-d835393483b5432a.js"],
     "/share/e/[[...shareParams]]": [
       "static/chunks/pages/share/e/[[...shareParams]]-899e50f90dac9ff5.js",
     ],
@@ -136,6 +136,6 @@
   "static/chunks/5017-f7c5e142fc7f0516.js",
   "static/chunks/3975-37a9301353b29c5d.js",
   "static/chunks/3754-ae5dc2fb759ecfc1.js",
-  "static/chunks/5054-8ad3d13d663a6185.js"
+  "static/chunks/5054-e2060ddbea2abdb7.js"
 )),
   self.__BUILD_MANIFEST_CB && self.__BUILD_MANIFEST_CB();

This is the diffsitter config I was using:

  • ~/.config/diffsitter/config.json5
// Default: `diffsitter dump-default-config`
// See also: https://github.com/afnanenayet/diffsitter/blob/v0.8.1/assets/sample_config.json5
// Colours: `color256`, `black`, `red`, `green`, `yellow`, `blue`, `magenta`, `cyan`, `white`
{
  "formatting": {
    "default": "unified",
    "unified": {
      "addition": {
        "highlight": "green",
        "regular-foreground": "green",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "+ "
      },
      "deletion": {
        "highlight": "red",
        "regular-foreground": "red",
        "emphasized-foreground": "white",
        "bold": true,
        "underline": false,
        "prefix": "- "
      }
    },
    "json": {
      "pretty_print": false
    },
    "custom": {}
  },
  "grammar": {
    "dylib-overrides": null,
    "file-associations": {
      "js": "typescript",
      "jsx": "tsx"
    },
  },
  "input-processing": {
    "split-graphemes": true,
    // You can exclude different tree sitter node types - this rule takes precedence over `include_kinds`.
    "exclude-kinds": null,
    // "exclude-kinds": ["string"],
    // You can specifically allow only certain tree sitter node types
    "include-kinds": null
    // "include-kinds": ["method_definition"],
  },
  // Specify a fallback command if diffsitter can't parse the given input
  // files. This is invoked by diffsitter as:
  //
  // ```sh
  // ${fallback_cmd} ${old} ${new}
  // ```
  "fallback-cmd": null,
  // "fallback-cmd": "diff",
}

And this is the .gitconfig I was using to run it as a git difftool:

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter"]
  cmd = diffsitter "$LOCAL" "$REMOTE"

# https://github.com/afnanenayet/diffsitter
[difftool "diffsitter-debug"]
  cmd = diffsitter --debug "$LOCAL" "$REMOTE"

Running it with --debug allows us to see the steps it takes during processing, and how long they take. On our small example file above, the output looks like this:

diffsitter debug output from minimal example
⇒ git difftool --tool diffsitter-debug --color=always HEAD~1 HEAD -- unpacked/_next/static/\[buildHash\]/_buildManifest.js
 2024-01-30T07:09:30.627Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js can be parsed
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js to string
 2024-01-30T07:09:30.627Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.627Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js to string
 2024-01-30T07:09:30.628Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:09:30.628Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:09:30.628Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:09:30.629Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
 2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
 2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
 2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
 2024-01-30T07:09:30.631Z INFO  libdiffsitter::render::unified > Using stack style vertical for title
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 4
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 80
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 80 - 80)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 138 - 138)
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Title string has length of 5
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > Printing line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End line 138
 2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-oujkn7/_buildManifest.js
/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-s3LsC2/_buildManifest.js
===================================================================================

80:
---
-     "/search": ["static/chunks/pages/search-8da35bbb0f092dc3.js"],

80:
---
+     "/search": ["static/chunks/pages/search-d835393483b5432a.js"],

138:
----
+   "static/chunks/5054-e2060ddbea2abdb7.js"

138:
----
-   "static/chunks/5054-8ad3d13d663a6185.js"

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:09:30.627Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:09:30.628Z INFO  TimerFinished        > parse::parse_file(), Elapsed=968.896µs
// ..snip..
2024-01-30T07:09:30.629Z INFO  TimerFinished        > parse::parse_file(), Elapsed=590.376µs
2024-01-30T07:09:30.629Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=388.582µs
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::process(), Elapsed=1.083248ms
2024-01-30T07:09:30.630Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=309.698µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > ast::process(), Elapsed=997.049µs
2024-01-30T07:09:30.631Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=127.357µs
// ..snip..
2024-01-30T07:09:30.631Z DEBUG libdiffsitter::render::unified > End hunk (lines 138 - 138)
// Finish

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)

diffsitter debug output from large example
⇒ du -sh unpacked/_next/static/chunks/pages/_app.js
8.4M	unpacked/_next/static/chunks/pages/_app.js

⇒ cat unpacked/_next/static/chunks/pages/_app.js | wc -l
  249128

⇒ git difftool --tool diffsitter-debug HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js
 2024-01-30T07:07:58.282Z DEBUG diffsitter > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Checking if /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js can be parsed
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.282Z DEBUG diffsitter           > Extensions for both input files are supported
 2024-01-30T07:07:58.287Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-wpgXXK/_app.js to string
 2024-01-30T07:07:58.287Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:58.291Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:58.291Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:07:59.790Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
 2024-01-30T07:07:59.795Z DEBUG libdiffsitter        > Reading /var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T//git-blob-6fi10i/_app.js to string
 2024-01-30T07:07:59.795Z INFO  libdiffsitter        > Will deduce filetype from file extension
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Using tree-sitter parser for language typescript
 2024-01-30T07:07:59.800Z INFO  libdiffsitter::parse > Succeeded loading grammar for typescript
 2024-01-30T07:07:59.800Z DEBUG libdiffsitter::parse > Constructed parser
 2024-01-30T07:08:01.160Z DEBUG libdiffsitter::parse > Parsed AST
 2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
 2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
 2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
 2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
 2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
 2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
 2024-01-30T07:20:22.502Z INFO  libdiffsitter::render::unified > Using stack style horizontal for title
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing hunk (lines 52698 - 52698)
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Title string has length of 7
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > Printing line 52698
 2024-01-30T07:20:22.502Z DEBUG libdiffsitter::render::unified > End line 52698

// ..snip 17,524 lines..

 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > End line 96300
 2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301

// I'm not sure if this actually finished successfully, or just silently crashed at this point... I think it might have crashed, as I would have expected to see a log line like:
//   DEBUG libdiffsitter::render::unified > End hunk
// As well as the actual diff output

Reducing that output to just show the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.

As a point of comparison, difftastic was seemingly able to diff this file in ~2.46sec total:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

It seems that when DFT_BYTE_LIMIT is exceeded, difftastic falls back to just doing a basic text diff, so those above timings aren't really a fair comparison. We can override this with --byte-limit and DFT_BYTE_LIMIT:

⇒ difftastic -h

Difftastic 0.52.0

..snip..

OPTIONS:

..snip..

        --byte-limit <LIMIT>           Use a text diff if either input file exceeds this size. [env:
                                       DFT_BYTE_LIMIT=] [default: 1000000]

..snip..

        --graph-limit <LIMIT>          Use a text diff if the structural graph exceed this number of
                                       nodes in memory. [env: DFT_GRAPH_LIMIT=] [default: 3000000]

Let's set this to something much higher than we need for now:

20*1024*1024 = 20971520

Updating our .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

Running that command again:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

It takes a bit longer, but then we hit a different error/limit:

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

From the help:

⇒ difft --help | subl

..snip..

        --parse-error-limit <LIMIT>
            Use a text diff if the number of parse errors exceeds this value.
            
            [env: DFT_PARSE_ERROR_LIMIT=]
            [default: 0]

There seem to be 674 matches of DFT_PARSE_ERROR_LIMIT in the output, which corresponds to the 674 chunks of output.

I didn't see any specific errors output related to the parsing though.. I wonder how we can see what the actual issue was?

It seems we can dump the AST either from tree-sitter directly, or the difftastic filtered version of it as a debug:

⇒ difft --help | subl

..snip..

DEBUG OPTIONS:
        --dump-syntax <PATH>
            Parse a single file with tree-sitter and display the difftastic syntax tree.

        --dump-ts <PATH>
            Parse a single file with tree-sitter and display the tree-sitter parse tree.

Full tree-sitter AST:

time difft --dump-ts unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic-tree-sitter.ast
difft --dump-ts unpacked/_next/static/chunks/pages/_app.js >   5.08s user 9.49s system 95% cpu 15.312 total

⇒ du -h _app.js-difftastic-tree-sitter.ast
224M	_app.js-difftastic-tree-sitter.ast

⇒ cat _app.js-difftastic-tree-sitter.ast| wc -l
 2533687

difftastic 'syntax' AST:

time difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js > _app.js-difftastic.ast
difft --dump-syntax unpacked/_next/static/chunks/pages/_app.js >   434.80s user 43.33s system 93% cpu 8:29.31 total

⇒ du -h _app.js-difftastic.ast
2.2G	_app.js-difftastic.ast

⇒ cat _app.js-difftastic.ast | wc -l
 10571815

@0xdevalias
Copy link
Owner Author

0xdevalias commented Feb 4, 2024

After the above explorations, I ended up taking a different tact and started exploring the 'post processing git diff' approach.

Initial PoC tests seemed to show some merit, so I hacked them together into a script that can filter an existing git diff passed in from a file, or STDIN:

  • https://twitter.com/_devalias/status/1752260576066301956
    • Some alternative potential workarounds I've considered are: pre-processing the files to standardize their variable/function names; and/or post-processing the diff output to try and detect when the only changes in a chunk are variable/function names; then suppressing that chunk.

  • https://twitter.com/_devalias/status/1752646128372535530
    • After exploring/playing around with a bunch of tools yesterday, and a hyperfocus of hacking out some PoC code this afternoon/evening, I think I have the elements of a workable solution! 🎉

      Just need to pull it all together and refine it once the 🧠 is refreshed/rested..

  • https://twitter.com/_devalias/status/1753322272293896385
    • The code is still pretty hacky.. but pulled the PoC's together into a usable script for filtering the noise in the diff output!

      Still not perfect.. but I am really liking these stats so far!

      For an 8.4mb _app.js file (250,022 lines):

      Original diff: 33,399
      Filtered: 7,516

  • https://twitter.com/_devalias/status/1753322275984867513
    • There are still a bunch of areas where my partial AST parsing is getting errors, which means more noise will slip through till I can fix those.. but this is already super useful!

      Hopefully will get a version of it cleaned up/committed/pushed sooner rather than later.

  • https://twitter.com/_devalias/status/1753711267137892684
    • A few more updates that I posted elsewhere as I was going; from 7100, to 6455, to 4268, and now down to 3913 lines.

      • Added a few more ‘pre AST parsing’ fixups that get the diff down even further: from the previous 7100 lines to 6455 lines

      • Trialled parsing the AST with a more lenient parser as a pre-step; filtering out unknown bits, then generating that back into code; that I then apply some of my previous ‘preparation hacks’ to, before parsing it with the more strict parser and doing the identifier normalisation.

        That’s currently reduced the 6455 lines down to 4268 lines; and there are still leftover parsing errors that could potentially get that number lower still that I’m chipping away at.

        Another alternative could be to completely replace the stricter parser with the more lenient one; but then I would have to rewrite a bunch of other code as well.

        I could also potentially try a different lenient parser; or pre-process the code being passed to this one; as I believe it’s failing to parse some statements even; which is potentially leading to more of that noise filtering through..

      • I added a pre-processing step to the lenient parser which improved things a bit; then I added an actual fix to the AST from the lenient parser to help fix more complex ternaries that were challenging to fix in my simpler string append pre/post fix methods.

        With those, we’re now down to 3913 lines.

        Being able to properly detect and fix object properties that are missing their wrapping object would probably fix a bunch more of the issues.

        Though maybe when I look at this next; I’ll just see whether using the lenient parser on its own gives a quicker/better result.


Edit: Newer updates to this Twitter thread continue in this comment below: #3 (comment)

@0xdevalias 0xdevalias added the enhancement New feature or request label Feb 24, 2024
@0xdevalias
Copy link
Owner Author

0xdevalias commented Feb 24, 2024

With the current PoC implementations in the diff minimiser, we're grouping by the diff chunks, and then by added/removed lines within those chunks.

Sometimes a section of code will churn and move a large amount of lines from one chunk, to another chunk, which won't get noticed by the minimiser currently.

It would be good if we could process these in a similar way to git diff's color-moved, with our 'ignore identifier changes', so that we can suppress both the 'added' and 'removed' chunks if they're otherwise unchanged (or minimise the diff between them to only show what has changed, rather than showing them as large 'disconnected blocks')


Edit: See also:

--

VSCode recently added a new feature called move code detection. The algorithm is surprisingly simple: after normal diff, calculate similarity(deletion, insertion) for every deletion & insertion pair. The code can be found here https://github.com/microsoft/vscode/blob/166097a20cbd06d10d255ef561837c439f372de3/src/vs/editor/common/diff/defaultLinesDiffComputer/computeMovedLines.ts#L44-L85.

But intuitively demonstrating the moving relationship on TUI might be a challenge.

Originally posted by @QuarticCat in Wilfred/difftastic#508 (comment)

--

Have you looked into git diff's --color-moved at all?

Personally I tend to use the zebra mode of it. From my git aliases:

  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-movedltmodegt
  # https://git-scm.com/docs/git-diff#Documentation/git-diff.txt---color-moved-wsltmodesgt
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-color
  # https://git-scm.com/docs/git-config#Documentation/git-config.txt-colordiffltslotgt
  diff-refactor = \
    -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal

Originally posted by @0xdevalias in Wilfred/difftastic#539 (comment)

@0xdevalias
Copy link
Owner Author

A lot of this will probably be uninteresting/useless noise, but I just made a repo and pushed all of my hacky WIP/PoC AST exploration scripts to it so they're captured in one place:

What might be of interest though, is that I pushed the current hacky state of my diff minimiser code there too. I haven't checked it for a good few weeks, so can't even remember if it's in a runnable state currently, but if nothing else, there might be some ideas hidden away in there that help explain what I was referring to above:

When I get a chance to get back to them, I plan to finish my research/refinements on the best method(s), and then clean it all up back to a single useful script that I will commit to this repo. But until then, those may be of some interest to someone.

Originally posted by @0xdevalias in #10 (comment)

0xdevalias referenced this issue in 0xdevalias/poc-ast-tools Mar 28, 2024
@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 24, 2024

Here's a fun little hack I just thought up (while the diff minimiser script still breaks --color-moved):

  • Diff 2 files, run them through the minimiser, save that diff to a file
  • Apply the output of the minimised diff to the original file, save that to a file
  • Re-run diff with --color-moved between the original file, and the original+minimisedDiffPatch file

Currently this doesn't work fully, as I think I'm not properly updating the line metadata in the diff hunks when I've removed sections from them in the diff minimiser.. but something like this (in theory):

⇒ git diff 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js 6eab9108d9ed3124b4ba757ee9f29e892082deb5:unpacked/_next/static/chunks/pages/_app.js | ./scripts/ast-poc/diff-minimiser.js 2>/dev/null | \
sed 's|a/unpacked/_next/static/chunks/pages/_app.js|a/_app.original.js|g' | \
sed 's|b/unpacked/_next/static/chunks/pages/_app.js|b/_app.patched.js|g' \
> _app.minimised.diff

⇒ git show 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js > _app.original.js

⇒ git apply _app.minimised.diff

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

Edit: I wrote a new scripts/fix-diff-headers.js (f7deec7) script that seems to get it closer to working..

⇒ ./scripts/fix-diff-headers.js _app.minimised.diff _app.fixed.diff
Reading diff from _app.minimised.diff and writing corrected diff to _app.fixed.diff
Diff successfully corrected and written to _app.fixed.diff

⇒ patch _app.original.js _app.fixed.diff -o _app.patched.js
patching file _app.original.js
6 out of 90 hunks failed--saving rejects to _app.patched.js.rej

⇒ git -c color.diff.oldMoved='white dim' \
    -c color.diff.oldMovedAlternative='white dim' \
    -c color.diff.newMoved='white dim' \
    -c color.diff.newMovedAlternative='white dim' \
    -c color.diff.newMovedDimmed='white dim' \
    -c color.diff.newMovedAlternativeDimmed='white dim' \
    diff --ignore-blank-lines --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space --minimal 6d96656142aabcc862846ad7852422f0a8f14dbe:unpacked/_next/static/chunks/pages/_app.js _app.patched.js

That way, instead of just seeing a giant chunk of 'removed' and a giant chunk of 'added' when this module moved; we can see the dimmed lines which were moved (but otherwise unchanged), and spend less time/effort having to consider them:

image

An even better version of this would be a diff --color-moved implementation that was able to ignore identifier names changing entirely (in which case I suspect it would mark this entire block as 'moved but not changed'); but as far as I am currently aware, the only way to do that would be to basically re-implement the whole diff --color-moved logic ourselves; which would basically end up with us writing our own AST differ I believe.

Though the way I was originally intending on fixing/improving the diff minimiser script was to try just maintaining the ANSI colouring of the input lines, and outputting the lines with the same colouring at the end.

Originally posted by @0xdevalias in #10 (comment)

@0xdevalias
Copy link
Owner Author

0xdevalias commented Dec 14, 2024

From some Twitter chat with @michaelskyba, talking about git diff --color-moved and related:

Edit (2025-04-10): Decided to see how Claude 3.7 does with looking at diff.c and explaining how --color-moved works:

  • https://claude.ai/share/9fac6b65-2865-458d-9e60-e29b7dbb89f1
    • Claude Response

      The git diff --color-moved feature is designed to help developers more easily identify code blocks that have been moved within a file. Let me walk through how it's implemented in the Git codebase based on the provided diff.c file.

      Overview of the Implementation

      The implementation involves several key components:

      1. Detection of moved lines: Finding blocks of lines that appear in different locations
      2. Marking those blocks with special flags: To indicate they are moved content
      3. Applying special coloring: Rendering the moved blocks with distinct colors
      4. Special handling for whitespace changes: Optional handling for whitespace differences

      Core Data Structures

      Key data structures for this functionality include:

      • struct emitted_diff_symbol: Represents a line in the diff output with metadata
      • struct moved_entry: Represents a line from the original content that might have been moved
      • struct moved_block: Represents a contiguous block of lines that moved together

      Detailed Implementation Flow

      1. Configuration and Setup

      When a user specifies --color-moved[=<mode>], the diff_opt_color_moved function processes this option:

      static int diff_opt_color_moved(const struct option *opt,
                                    const char *arg, int unset)
      {
          struct diff_options *options = opt->value;
      
          if (unset) {
              options->color_moved = COLOR_MOVED_NO;
          } else if (!arg) {
              if (diff_color_moved_default)
                  options->color_moved = diff_color_moved_default;
              if (options->color_moved == COLOR_MOVED_NO)
                  options->color_moved = COLOR_MOVED_DEFAULT;
          } else {
              int cm = parse_color_moved(arg);
              if (cm < 0)
                  return error(_("bad --color-moved argument: %s"), arg);
              options->color_moved = cm;
          }
          return 0;
      }

      The color_moved field can have several values, including:

      • COLOR_MOVED_NO: Disabled
      • COLOR_MOVED_DEFAULT: Default coloring
      • COLOR_MOVED_BLOCKS: Color only the old/new blocks
      • COLOR_MOVED_ZEBRA: Alternating colors for different blocks
      • COLOR_MOVED_ZEBRA_DIM: Like zebra but with dimmed uninteresting parts

      2. Diff Generation Process

      During the diff generation, when the color-moved feature is active:

      1. The system first generates all the diff symbols normally
      2. Instead of directly emitting them, it collects them in options->emitted_symbols
      3. After all symbols are collected, it performs the move detection
      4. Finally, it renders the symbols with appropriate coloring

      3. Move Detection Algorithm

      The move detection happens in diff_flush_patch_all_file_pairs:

      if (o->color_moved) {
          struct mem_pool entry_pool;
          struct moved_entry_list *entry_list;
      
          mem_pool_init(&entry_pool, 1024 * 1024);
          entry_list = add_lines_to_move_detection(o, &entry_pool);
          mark_color_as_moved(o, entry_list);
          if (o->color_moved == COLOR_MOVED_ZEBRA_DIM)
              dim_moved_lines(o);
      
          mem_pool_discard(&entry_pool, 0);
          free(entry_list);
      }

      The core of the algorithm is in add_lines_to_move_detection() and mark_color_as_moved():

      a. Line Hashing and Identification

      add_lines_to_move_detection() builds a specialized data structure that:

      • Scans all added and removed lines in the diff
      • Hashes the content of each line for efficient comparison
      • Groups lines into a hashmap to quickly find matching lines
      • Connects adjacent lines to help identify blocks

      b. Block Matching

      mark_color_as_moved() uses the data structure to:

      • Find blocks of lines that were moved
      • Mark these blocks with special flags (like DIFF_SYMBOL_MOVED_LINE)
      • Alternate between different styles for adjacent blocks (when zebra mode is on)
      static void mark_color_as_moved(struct diff_options *o,
                                    struct moved_entry_list *entry_list)
      {
          // Find matching blocks of moved code
          // Mark them with DIFF_SYMBOL_MOVED_LINE flags
          // Apply alternating styles when in zebra mode
      }

      c. Special Whitespace Handling

      If color_moved_ws_handling is enabled, the code adds special handling for whitespace changes:

      if (o->color_moved_ws_handling & COLOR_MOVED_WS_ALLOW_INDENTATION_CHANGE)
          fill_es_indent_data(l);

      This allows moved blocks with only whitespace changes to still be recognized.

      4. Rendering the Output

      Finally, when rendering the diff, different colors are applied based on the flags:

      case DIFF_SYMBOL_PLUS:
          switch (flags & (DIFF_SYMBOL_MOVED_LINE |
                         DIFF_SYMBOL_MOVED_LINE_ALT |
                         DIFF_SYMBOL_MOVED_LINE_UNINTERESTING)) {
          case DIFF_SYMBOL_MOVED_LINE |
               DIFF_SYMBOL_MOVED_LINE_ALT |
               DIFF_SYMBOL_MOVED_LINE_UNINTERESTING:
              set = diff_get_color_opt(o, DIFF_FILE_NEW_MOVED_ALT_DIM);
              break;
          case DIFF_SYMBOL_MOVED_LINE |
               DIFF_SYMBOL_MOVED_LINE_ALT:
              set = diff_get_color_opt(o, DIFF_FILE_NEW_MOVED_ALT);
              break;
          /* ... other cases for other combinations of flags ... */
          }

      Summary

      The --color-moved feature implements a sophisticated algorithm that:

      1. Identifies content blocks that appear to have moved rather than being deleted/added
      2. Marks them with special flags
      3. Renders them with distinctive colors to make the movement visually apparent
      4. Optionally handles whitespace differences intelligently

      This helps developers understand diffs more quickly by visually distinguishing between content that's genuinely new/removed and content that's simply been relocated.


Some more Twitter chat with @michaelskyba, talking about diff minimisation:

@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 10, 2025

https://semanticdiff.com/

  • semanticdiff.com/blog/semanticdiff-vs-difftastic

Not sure if I realised this earlier and forgot, but seems semanticdiff has an online version that allows us to see GitHub commits/PRs/etc (and from a quick peek in DevTools/curl, seems that it's API could be potentially used):

Here's an example of the last build I committed to chatgpt-source-watch with GitHub's diff, and SemanticDiff's:

It seemed to work ok on some small files (that were already not too bad with GitHub), but when I tried to get it to parse some of the bigger chunk files, it generally just gave an error:

The computation took too long.

{
  "type": "error",
  "error": {
    "type": "timeout_error",
    "message": "The provided timeout was reached"
  }
}

@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 10, 2025

From Twitter / X:

@0xdevalias 0xdevalias changed the title Explore AST based diff tools Explore AST based diff tools, diff minimisation, etc Apr 10, 2025
@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 10, 2025

RE: My 'post processing git diff' diff minimiser, as referenced in previous comments such as:

Since it's been a while since I looked at this, and the state of LLMs being able to handle large/complex code has come a long way since the start of 2024 (When OpenAI's GPT-4 was... not that great for it), I figured I'd ask a few different LLMs to evaluate the diff minimiser code with this prompt, and attaching the following two files:

What are these scripts attempting to do, how effective do they seem at it, and how can we make them better?

Here are the results for that single starter prompt, with no deeper follow up questions:

  • Claude 3.7:
    • https://claude.ai/share/fdb6da96-ea24-4e1f-b58a-8d424e560d41

    • Response

      I'll analyze these JavaScript scripts to understand their purpose, effectiveness, and potential improvements.

      What These Scripts Are Attempting to Do

      These scripts (diff-minimiser.js and diff-minimiser-poc-acorn.js) are trying to build a tool that intelligently filters Git diffs to reduce noise from non-meaningful changes. They specifically target scenarios like:

      1. When minified JavaScript variable names change but the code functionality remains identical
      2. When code is structurally equivalent after normalization of identifiers
      3. When changes are merely cosmetic or represent refactoring with no semantic difference

      The core concept is to parse code fragments from diff chunks, normalize identifiers (renaming all variables to generic names like symbol1, symbol2), and then compare the normalized versions. If they're identical after normalization, the tool filters them out as "non-meaningful" changes.

      How Effective They Seem

      The scripts show significant promise but are still in development phase:

      1. Core Functionality: They successfully parse Git diffs, group related changes, normalize JavaScript code fragments, and can filter out chunks with only identifier changes.

      2. Parser Robustness: They use multiple parsing approaches (Babel parser, acorn's loose parser, tokenization) to handle partial code fragments which is smart since diffs often contain incomplete code.

      3. Debugging: The extensive debug logging shows the authors are actively testing and refining the algorithm.

      4. Statistics: The code calculates effectiveness metrics like how many groups were identified as having equivalent normalized code.

      The main challenge appears to be parsing incomplete/partial JavaScript snippets from diffs reliably, which is inherently difficult. The extensive TODOs in diff-minimiser.js indicate several known edge cases that need addressing.

      How to Make Them Better

      1. Refine the parsing strategy:

        • Fully leverage the acornLoose parser which seems most promising for handling fragments
        • Implement the tokenizer-based approach (normalizeIdentifierNamesInCodeV3) as the default since it appears more robust than full AST parsing
      2. Improve the diff grouping algorithm:

        • Implement the suggested compareNormalizedCodeDiffsV2 to handle context lines better
        • Use the diffArrays results to identify minimal sets of actual changes
      3. Handle more edge cases:

        • Implement detection for shifted lines (same code moved up/down slightly)
        • Add pattern recognition for common minifier patterns like variable renaming sequences
      4. Code structure improvements:

        • Refactor the main logic into a more modular design
        • Implement a proper streaming processor rather than loading entire diffs in memory
        • Consolidate the multiple parsing approaches into a single strategy with fallbacks
      5. Testing and validation:

        • Add unit tests for each component (parser, normalizer, differ)
        • Create a benchmark dataset of typical diffs to measure improvement
      6. Additional features:

        • Add command-line options to control filtering strictness
        • Support more languages beyond JavaScript
        • Preserve Git's color-moved feature when present

      Specific Implementation Improvements

      1. Complete the arrayDiff processing in compareNormalizedCodeDiffs to minimize the changes returned based on token-level differences

      2. Implement the planned consolidated approach of filtering chunks:

        function processChunk(chunk) {
          const groups = groupChunkChanges(chunk.changes);
          const filteredGroups = groups.filter(group => {
            if (group.isContextGroup) return true;
            if (!group.isEqualModifiedLinesCount) return true;
            
            const normalizedResult = compareNormalizedCodeDiffs(group.changes);
            return !normalizedResult.isNormalisedCodeEqual;
          });
          
          return reconstructChunkWithFilteredGroups(chunk, filteredGroups);
        }
      3. Prioritize the tokenizer approach over full AST parsing for better robustness with code fragments:

        function normalizeCode(code) {
          try {
            return normalizeIdentifierNamesInCodeV3(code).mappedTokensString;
          } catch (err) {
            try {
              return normalizeIdentifierNamesInCodeV2(code);
            } catch (err2) {
              return code; // Fallback to original if parsing fails
            }
          }
        }

      These scripts show a sophisticated approach to a real developer pain point, and with the improvements outlined above, could become an essential tool for developers working with minified JavaScript codebases.

  • Gemini 2.5 Pro Preview 03-25 (via Google AI Studio)
    • https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221R2N_Sgb09cV8WYaey9t9Y7ysMndFVjFK%22%5D,%22action%22:%22open%22,%22userId%22:%22102649988257823587404%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing,

    • Response

      Okay, let's break down these two scripts.

      Overall Goal

      Both scripts are working towards the same ultimate goal: minimizing JavaScript diffs, especially for minified or transpiled code, by identifying and suppressing changes that are likely due to superficial differences like variable renaming, rather than significant logic changes. The core idea is to parse or tokenize code snippets from a diff, normalize them (e.g., rename all variables to generic placeholders like symbol1, symbol2), and then compare the normalized versions. If the normalized versions of an added block and a removed block are identical, the script assumes the change was non-functional and filters it out of the final diff output.


      Analysis of diff-minimiser-poc-acorn.js

      1. What it's attempting to do:

        • This script is a Proof-of-Concept (POC) specifically exploring the use of the acorn-loose parser.
        • acorn-loose is designed to be error-tolerant, making it potentially suitable for parsing potentially incomplete or slightly malformed code snippets often found in diff chunks.
        • The script demonstrates a workflow:
          1. Parse potentially broken code with acorn-loose.
          2. Use estraverse to traverse the resulting Abstract Syntax Tree (AST) and remove "dummy" nodes that acorn-loose inserts during error recovery.
          3. Use escodegen to generate JavaScript code back from the cleaned Acorn AST.
          4. (Potentially) Take the regenerated code and parse it again with a stricter parser like @babel/parser (though the immediate benefit of this last step within the POC isn't fully realized).
        • It also includes stubbed/POC versions of normalization functions (normalizeIdentifierNamesInCode) using Babel, likely carried over from other experiments.
      2. How effective it seems:

        • As a POC: It's effective in demonstrating the technical feasibility of using acorn-loose -> estraverse -> escodegen to process potentially broken code snippets and generate potentially cleaner, more valid code.
        • Practicality: It doesn't actually use this capability to minimize a diff. It works on a hardcoded snippet and just logs the intermediate steps. It proves the tools can work together but doesn't apply them to the actual problem yet.
      3. How to make it better:

        • Integration: The useful techniques demonstrated here (using acorn-loose for error tolerance) should be integrated as one possible strategy within the main diff-minimiser.js script, specifically within the normalization functions (as seems to have been attempted in normalizeIdentifierNamesInCodeV2).
        • Clarify Goal: The purpose of the final Babel parse/generate step within this specific POC is unclear. If the goal is just to get a cleaned AST/code string from acorn-loose, the Babel steps might be unnecessary overhead unless the intention is specifically to leverage Babel's normalization or transformation capabilities after the initial cleanup.
        • Input: Replace the hardcoded snippet with logic to accept dynamic input (e.g., diff chunks).

      Analysis of diff-minimiser.js

      1. What it's attempting to do:

        • This is the main script attempting the full diff minimization task.
        • It reads a diff from a file or stdin.
        • It uses parse-diff to parse the unified diff format into structured data (files, chunks, changes).
        • It iterates through chunks and groups consecutive changes, paying special attention to groups containing both additions (+) and deletions (-).
        • For groups with an equal number of added and deleted lines, it extracts the code content of these lines.
        • It normalizes the extracted added code and deleted code separately. It explores three main normalization strategies:
          • normalizeIdentifierNamesInCode: Uses Babel parser/traverse/generator. Prone to failing on invalid snippets, requires complex prepareCodeForASTParsing helpers.
          • normalizeIdentifierNamesInCodeV2: Uses acorn-loose parser, estraverse, escodegen. More error-tolerant but still requires AST manipulation and preparation (prepareCodeForASTParsingV2).
          • normalizeIdentifierNamesInCodeV3: (Currently Used) Uses acorn.tokenizer. Avoids full parsing, tokenizes the code, and normalizes only identifier tokens (specifically short ones <= 2 chars). This is likely the most robust approach for handling snippets that won't parse cleanly.
        • It compares the normalized strings (normalisedCodeByType.del === normalisedCodeByType.add).
        • If they match, it flags the group (isNormalisedCodeEqual = true).
        • It filters out groups flagged as equal (and potentially surrounding context lines if they become isolated).
        • It reconstructs and prints the filtered diff.
        • It includes experiments using jsdiff's diffArrays to compare normalized lines within a group, although the results aren't currently used to further reduce the diff output.
      2. How effective it seems:

        • Concept: The core concept is solid and addresses a real pain point in reviewing generated code diffs.
        • Diff Parsing: Handles standard diffs correctly.
        • Grouping: Basic grouping of adjacent changes works. It correctly identifies the primary target: adjacent add/del blocks of equal size.
        • Normalization Strategy: The evolution towards the tokenizer-based approach (V3) is a pragmatic and likely effective choice given the difficulty of reliably parsing arbitrary diff snippets. Only normalizing short identifiers is a clever heuristic.
        • Comparison & Filtering: The comparison of normalized strings and subsequent filtering appears logically sound based on the isNormalisedCodeEqual flag.
        • Robustness: The script's main weakness lies in reliably handling the wide variety of potentially broken or incomplete code snippets found in diffs. The prepareCode... functions are complex heuristics that might not always work or could even introduce new errors. The tokenizer approach (V3) mitigates this significantly but is less powerful than full AST analysis.
        • Limitations (Acknowledged in TODOs):
          • Doesn't handle cases where added/deleted blocks are separated by context lines (code movement).
          • Doesn't handle groups where add/del counts differ.
          • Doesn't yet use the diffArrays output to minimize within a group that isn't a perfect match after normalization.
          • Preserving git diff --color-moved information through the pipeline is challenging.
      3. How to make it better:

        • Refine Tokenizer Normalization (V3): This seems the most promising path.
          • Expand Normalization: Consider normalizing whitespace between tokens or other token types if needed.
          • Refine Pre-processing: Ensure prepareCodeForASTParsingV2PreProcess robustly cleans snippets for tokenization, not necessarily full parsing. Balancing brackets might still be useful here.
        • Utilize diffArrays: Implement the logic suggested in the TODOs. After normalizing lines in a group using V3, use diffArrays to compare normalisedDels and normalisedAdds. Reconstruct the group in the output diff including only the lines that actually differ after normalization, potentially shrinking many groups instead of just filtering perfect matches.
        • Code Movement Detection: This is complex.
          • Hashing: Calculate a hash of the normalized token string for each added/deleted line or block. Store these hashes. Look for identical hashes between deleted blocks and added blocks elsewhere in the file.
          • Leverage Git: Explore if git diff --color-moved=... output can be parsed reliably or if there's a way to invoke git plumbing commands to get this info programmatically. This might be the most robust way if feasible.
        • Improve Grouping/Chunk Handling: If code movement is detected, the concept of processing only adjacent add/del blocks needs rethinking. You might need to process adds/dels globally within a file.
        • Refactor & Cleanup: Remove the less effective normalization strategies (V1, V2 unless a specific use case remains) and their associated preparation functions to simplify the codebase.
        • Configuration: Add command-line flags for tuning, e.g., the identifier length threshold for normalization (<= 2), whether to ignore whitespace differences, etc.
        • Error Handling: Improve error reporting for failed normalization/tokenization steps.
        • Output: Ensure the final diff output is valid and that the @@ ... @@ headers accurately reflect the remaining lines and their original positions as closely as possible (this is hard when lines are removed).

      In summary, diff-minimiser.js is a capable script tackling a complex problem. Its effectiveness hinges on the robustness of its code snippet normalization strategy. The current tokenizer-based approach (V3) is a good direction. The next major improvements would be utilizing diffArrays for partial matching within groups and tackling code movement detection. The POC script served its purpose in exploring acorn-loose as a potential tool for this.

@0xdevalias
Copy link
Owner Author

RE: My 'post processing git diff' diff minimiser

Some more updates, from a private chat the other day:

If you're wanting to explore tools in the diff minimisation space, there's the code for what I was working on in the past as well. Some notes/links in this issue/comment:

And the latter part of this one:

This was a hacky side experiment:

This was the main one I was working on:

I haven't touched either in a long time now though.. and LLMs have come a long way since, so they'd likely be able to help improve the code a lot better a lot easier these days

It basically lets git do the initial diff, and then post-processes that to minimise it even further

And then some further updates from today:

Did you get a chance to look at this at all? I just re-ran it on one of my old test files, and it cut the diff down from 33,399 lines to 4,182 lines.

There's still definitely some noise in that.. but even if you then ran that through whatever your normal diff viewer tool is that minimises stuff, it'd cut down on a lot of noise probably

⇒ cat ../chatgpt-source-watch/_app-git.diff | wc -l
33399

⇒ cat ../chatgpt-source-watch/_app-git.diff | ./diff-minimiser.js 2>/dev/null | wc -l
4182

Noticed that it wasn't de-noising identifier churn in class private fields/similar, so made a few tweaks and now that diff is down to:

⇒ cat ../chatgpt-source-watch/_app-git.diff | ./diff-minimiser.js 2>/dev/null | wc -l
3991

These were the relevant changes to get those extra line savings:

  • Not sure if this one was needed specifically for this, but it bumps some of the relevant dependencies (eg. acorn specifically): 0xdevalias/poc-ast-tools@91e94ee
  • Then this change updates the acorn parsing to use ecmaVersion: 'latest' so it supports more modern syntax than it was before: 0xdevalias/poc-ast-tools@994f73b
  • And then this one updates the normalizeIdentifierNamesInCodeV3 function so it handles private identifiers (like a class's #foo): 0xdevalias/poc-ast-tools@8d13027

That last one was the main fix that got that extra 191 lines of noise reduction I mentioned above.

@0xdevalias
Copy link
Owner Author

0xdevalias commented Apr 21, 2025

Adding on to the old (Feb 4, 2024) Twitter thread summarised above in #3 (comment)


Also this standalone RT just in case the old thread wasn't super visible:

As well as a crosspost to BlueSky:

@0xdevalias
Copy link
Owner Author

See the following blog posts for a deeper dive into the Myers and Patience diff algorithms:

Originally posted by @0xdevalias in #3 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants