Skip to content

Fix Unicode matching for regex options #26100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented May 19, 2025

The Pattern class needs to operate on ordinary "Unicode" Java string, whereas both option values and strings matched against the pattern are in Bazel's internal string encoding. This requires reencoding in the right places.

Along the way, optimize literal regex matches - this was missed in 662c0f0.

Fixes #26070

@fmeum fmeum requested a review from a team as a code owner May 19, 2025 14:56
@fmeum fmeum requested review from aranguyen and removed request for a team May 19, 2025 14:56
@github-actions github-actions bot added team-Configurability platforms, toolchains, cquery, select(), config transitions awaiting-review PR is awaiting review from an assigned reviewer labels May 19, 2025
@fmeum fmeum requested review from tjgq and removed request for aranguyen May 19, 2025 14:57
@fmeum fmeum force-pushed the 26070-unicode-regex branch from f70160b to f478b87 Compare May 19, 2025 14:57
@tjgq tjgq added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels May 21, 2025
@meteorcloudy
Copy link
Member

We are encountering https://buildkite.com/bazel/google-bazel-presubmit/builds/92308#0196f4a3-acb1-44c4-8178-f3fa0130833d while importing this, @fmeum can you take a look what's missing?

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

We are encountering https://buildkite.com/bazel/google-bazel-presubmit/builds/92308#0196f4a3-acb1-44c4-8178-f3fa0130833d while importing this, @fmeum can you take a look what's missing?

Is this running on more than just the Bazel source code? I can't reproduce this issue.

My guess is that you might still be using java_11_library to force Java 11 language level for some libraries, perhaps options. The string_encoding target uses newer features.

@meteorcloudy
Copy link
Member

The change is exported at https://bazel-review.git.corp.google.com/c/bazel/+/281454, but I'm not seeing any obvious difference.

@meteorcloudy
Copy link
Member

I can reproduce with bazel build //src/test/shell/bazel:bazel_java17_test but not by bazel build //src/main/java/com/google/devtools/build/lib/unsafe:string

@meteorcloudy
Copy link
Member

Probably caused by https://cs.opensource.google/bazel/bazel/+/master:src/BUILD;l=549?q=java_tools_java8&ss=bazel, but not sure why presubmit on github didn't fail

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

bazel build //src/test/shell/bazel:bazel_java17_test

passes for me, but I agree that the Java 8 transition should cause compilation to fail for that archive. I can still build it locally, not sure why.

https://bazel-review.git.corp.google.com/c/bazel/+/281454

is still corp-only, unfortunately.

@meteorcloudy
Copy link
Member

Looks like this change doesn't play well with cff5907 on the master branch. If you rebase, you'll see the failure in Github presubmit.

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

Right, just found cff5907 as well. I will rebase and fix this.

@fmeum fmeum force-pushed the 26070-unicode-regex branch from f478b87 to de2c470 Compare May 22, 2025 09:16
@meteorcloudy
Copy link
Member

is still corp-only, unfortunately.

It redirect to the corp address for me, the original URL can be found in the Buildkite pipeline:

image

https://bazel-review.googlesource.com/q/55b225e5e6a5c8aa33de575c840efd41983b0462 should probably work?

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

Yes, that link works.

While I can fix the CI failure quite easily, the java_tools setup still doesn't work properly: It forces the Java language level to 8 in a transition, but keeps the Java 21 runtime as the compilation target. This means that it won't detect usage of newer API, which makes the resulting java_tools incompatible with JDK 8.

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

I will send a separate PR to fix the version confusion.

CC @hvadehra

Prebuilt Java tools meant to run on the user-specified `--java_{tool_,}runtime_version` must only use Java 8 language features and APIs. Previously, these tools were compiled with `--java_language_version=8`, but had access to a JDK 21 as the bootclasspath. In fact, `junitrunner` used `Stream#dropWhile`, which isn't available in a JDK 8.
Similarly, Java tools meant to run on the Java runtime provided by a Java (compilation) toolchain must only use Java 11 language features and APIs. This was the case, but they were in fact build with `--java_language_version=8` and a JDK 21, just like the other tools, thus lacking validation for this property.

This PR splits tools into two groups and enforces consistent Java language and bootclasspath runtime for them via `--@rules_java//toolchains:incompatible_language_version_bootclasspath`. Constants in `//src:release_archive.bzl` can be used to set the two minimum versions globally.
@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

I rebased onto #26126 and the issue should now be fixed.

@fmeum fmeum force-pushed the 26070-unicode-regex branch from f0dcdc3 to e1ee0fd Compare May 22, 2025 13:10
fmeum added 2 commits May 22, 2025 15:24
The `Pattern` class needs to operate on ordinary "Unicode" Java string, whereas both option values and strings matched against the pattern are in Bazel's internal string encoding. This requires reencoding in the right places.

Along the way, optimize literal regex matches - this was missed in 662c0f0.
@fmeum fmeum force-pushed the 26070-unicode-regex branch from e1ee0fd to 88a4543 Compare May 22, 2025 13:29
@fmeum fmeum marked this pull request as draft May 22, 2025 17:22
@iancha1992
Copy link
Member

@fmeum Could you please take a look at the presubmit errors? Thanks!

@fmeum
Copy link
Collaborator Author

fmeum commented May 22, 2025

@iancha1992 Let's get #26126 reviewed and merged first, then CI can be fixed by rebasing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally team-Configurability platforms, toolchains, cquery, select(), config transitions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex options match with inconsistent string encodings
4 participants