Skip to content

Check page URLs for extension before direct fetch attempt #830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented May 6, 2025

Fixes #829

  • Only attempt direct fetch (non-browser fetch()) of page URLs with known non-HTML extensions, otherwise attempt loading in the browser. (Can perhaps further optimize to discover new non-HTML extensions)
  • Also: Async fetch dedup: treat unknown status / 206 same as 200 for dedup purposes, to avoid duplicate loading

@ikreymer ikreymer requested a review from tw4l May 6, 2025 00:51
@tw4l
Copy link
Member

tw4l commented May 6, 2025

I wonder if it might be better to direct fetch any URL that ends in a file extension (and that's not .html or .htm, since some older sites followed that convention)? I think introducing a relatively short list of acceptable file formats is going to result in us not fetching a lot of files we'd want to - just off the top of my head, common file extensions that wouldn't get fetched with this implementation would include CSVs, plaintext files, Powerpoint presentations, TIFFs, GIFs, videos in other container formats like .avi/.mov/.mkv, and so on...

If we are going to move forward with an allowlist of extensions, I think we should look for a third party-managed list that would be a bit more comprehensive.

@ikreymer
Copy link
Member Author

ikreymer commented May 6, 2025

I wonder if it might be better to direct fetch any URL that ends in a file extension (and that's not .html or .htm, since some older sites followed that convention)? I think introducing a relatively short list of acceptable file formats is going to result in us not fetching a lot of files we'd want to - just off the top of my head, common file extensions that wouldn't get fetched with this implementation would include CSVs, plaintext files, Powerpoint presentations, TIFFs, GIFs, videos in other container formats like .avi/.mov/.mkv, and so on...

If we are going to move forward with an allowlist of extensions, I think we should look for a third party-managed list that would be a bit more comprehensive.

Yeah, maybe that's a smaller list to maintain, would also include .asp, .php, etc.. Another option is to always try browser load, and then if non-HTML, add extension to direct fetch check list for later..

@tw4l
Copy link
Member

tw4l commented May 7, 2025

Yeah, maybe that's a smaller list to maintain, would also include .asp, .php, etc.. Another option is to always try browser load, and then if non-HTML, add extension to direct fetch check list for later..

Yeah I think we'd be better off avoiding an allowlist altogether. But a shorter "don't direct fetch these extensions" list could work, or going off of your second idea, maybe we just always try browser load and then if it's non-HTML, directly fetch it regardless of extension?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't do direct fetch for every page URL
2 participants