Check page URLs for extension before direct fetch attempt #830

ikreymer · 2025-05-06T00:51:13Z

Fixes #829

Only attempt a direct fetch (non-browser fetch()) for page URLs with known non-HTML extensions; otherwise, load the page in the browser. (This could perhaps be optimized further to discover new non-HTML extensions.)
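
As a rough illustration of the extension check described above, here is a minimal sketch. The names `DIRECT_FETCH_EXTS`, `getExtension`, and `shouldDirectFetch` are hypothetical, and the extension list is only an example, not the list used in this PR.

```typescript
// Hypothetical allowlist of non-HTML extensions; the PR's actual list is not shown here.
const DIRECT_FETCH_EXTS = new Set([".pdf", ".zip", ".jpg", ".png", ".mp4"]);

// Returns the lowercased extension of the last path segment, or "" if there is none.
function getExtension(pageUrl: string): string {
  let pathname: string;
  try {
    pathname = new URL(pageUrl).pathname;
  } catch {
    // Not a parseable absolute URL: treat as having no extension.
    return "";
  }
  const lastSegment = pathname.slice(pathname.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  // No dot, or only a leading dot (e.g. ".well-known"), means no extension.
  return dot > 0 ? lastSegment.slice(dot).toLowerCase() : "";
}

// True only for URLs whose extension is on the allowlist; everything else
// would be loaded in the browser.
function shouldDirectFetch(pageUrl: string): boolean {
  return DIRECT_FETCH_EXTS.has(getExtension(pageUrl));
}
```

Because the extension is read from the parsed pathname, query strings and fragments do not affect the result.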
Also: async fetch dedup now treats an unknown status (0) or 206 the same as 200 for dedup purposes, to avoid duplicate loading.
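
The dedup rule can be sketched roughly as follows. This assumes, purely for illustration, that dedup is keyed on status plus URL; `normalizeStatusForDedup` and `FetchDedup` are hypothetical names and not the crawler's actual code.

```typescript
// Maps statuses that represent a successful load onto 200 for dedup purposes.
function normalizeStatusForDedup(status: number): number {
  // 0 = status unknown, 206 = Partial Content.
  return status === 0 || status === 206 ? 200 : status;
}

// Hypothetical in-memory dedup keyed on normalized status + URL.
class FetchDedup {
  private seen = new Set<string>();

  // Returns true if this (status, url) pair has not been seen yet and should be kept.
  shouldStore(url: string, status: number): boolean {
    const key = `${normalizeStatusForDedup(status)}|${url}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```

With this normalization, a 206 or unknown-status response for a URL that was already stored with a 200 is treated as a duplicate rather than fetched again.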

tw4l · 2025-05-06T14:52:39Z

I wonder if it might be better to direct fetch any URL that ends in a file extension (and that's not .html or .htm, since some older sites followed that convention)? I think introducing a relatively short list of acceptable file formats is going to result in us not fetching a lot of files we'd want to - just off the top of my head, common file extensions that wouldn't get fetched with this implementation would include CSVs, plaintext files, PowerPoint presentations, TIFFs, GIFs, videos in other container formats like .avi/.mov/.mkv, and so on...

If we are going to move forward with an allowlist of extensions, I think we should look for a third-party-managed list that would be a bit more comprehensive.

ikreymer · 2025-05-06T17:19:39Z

> I wonder if it might be better to direct fetch any URL that ends in a file extension (and that's not .html or .htm, since some older sites followed that convention)? I think introducing a relatively short list of acceptable file formats is going to result in us not fetching a lot of files we'd want to - just off the top of my head, common file extensions that wouldn't get fetched with this implementation would include CSVs, plaintext files, PowerPoint presentations, TIFFs, GIFs, videos in other container formats like .avi/.mov/.mkv, and so on...
>
> If we are going to move forward with an allowlist of extensions, I think we should look for a third-party-managed list that would be a bit more comprehensive.

Yeah, maybe that's a smaller list to maintain; it would also include .asp, .php, etc. Another option is to always try the browser load first, and then, if the response is non-HTML, add the extension to the direct fetch check list for later.
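
A rough sketch of that second option, where extensions are learned from browser loads. The class `LearnedDirectFetchExts` and its methods are hypothetical, and deciding "non-HTML" from the Content-Type header is an assumption made for the example.

```typescript
// Hypothetical tracker: remembers extensions that turned out to be non-HTML
// after a browser load, so later URLs with the same extension can be direct fetched.
class LearnedDirectFetchExts {
  private exts = new Set<string>();

  // Lowercased extension of the last path segment, or "" if none.
  private static extOf(pageUrl: string): string {
    const last = new URL(pageUrl).pathname.split("/").pop() ?? "";
    const dot = last.lastIndexOf(".");
    return dot > 0 ? last.slice(dot).toLowerCase() : "";
  }

  // Call after a browser load, passing the response's Content-Type header value.
  recordBrowserLoad(pageUrl: string, contentType: string): void {
    const mime = contentType.split(";")[0].trim().toLowerCase();
    const isHtml = mime === "text/html" || mime === "application/xhtml+xml";
    const ext = LearnedDirectFetchExts.extOf(pageUrl);
    if (!isHtml && ext) this.exts.add(ext);
  }

  // True only if this extension was previously seen serving non-HTML content.
  shouldDirectFetch(pageUrl: string): boolean {
    const ext = LearnedDirectFetchExts.extOf(pageUrl);
    return ext !== "" && this.exts.has(ext);
  }
}
```

The first URL with a new extension always goes through the browser; only later URLs with the same extension would benefit.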

tw4l · 2025-05-07T15:15:34Z

> Yeah, maybe that's a smaller list to maintain; it would also include .asp, .php, etc. Another option is to always try the browser load first, and then, if the response is non-HTML, add the extension to the direct fetch check list for later.

Yeah, I think we'd be better off avoiding an allowlist altogether. But a shorter "don't direct fetch these extensions" list could work, or, going off of your second idea, maybe we just always try the browser load and then, if it's non-HTML, directly fetch it regardless of extension?
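
For comparison, the "don't direct fetch these extensions" variant could look something like the sketch below. `SKIP_DIRECT_FETCH_EXTS` and `shouldDirectFetchByDenylist` are hypothetical names, and the list contains only the extensions mentioned in this thread.

```typescript
// Extensions named in this thread as usually serving HTML; illustrative only.
const SKIP_DIRECT_FETCH_EXTS = new Set([".html", ".htm", ".asp", ".php"]);

// Direct fetch any URL that ends in a file extension, unless that extension
// is on the skip list. URLs without an extension go to the browser.
function shouldDirectFetchByDenylist(pageUrl: string): boolean {
  const last = new URL(pageUrl).pathname.split("/").pop() ?? "";
  const dot = last.lastIndexOf(".");
  if (dot <= 0) return false; // no extension: load in the browser
  return !SKIP_DIRECT_FETCH_EXTS.has(last.slice(dot).toLowerCase());
}
```

One caveat with this approach: a final path segment such as `v1.2` would be read as having the extension `.2` and be direct fetched, so a real list would likely need some extra handling for cases like that.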

ikreymer added 3 commits May 5, 2025 17:12

direct fetch optimization: add 'skipDirectFetchByExt' to skip direct fetch and just load via the browser (0582608)

optimization: only do direct fetch if filename ends in known extension, otherwise just load in browser (cc1b52b)

direct fetch dedup: treat 206 and 0 (status unknown) as 200 to avoid duplicate fetches (1fb6c90)

ikreymer requested a review from tw4l May 6, 2025 00:51
