Skip to content
This repository was archived by the owner on Jun 2, 2025. It is now read-only.

Fix identifier (un)escaping #47

Merged
merged 7 commits into from
May 7, 2025
Merged

Fix identifier (un)escaping #47

merged 7 commits into from
May 7, 2025

Conversation

JanJakes
Copy link
Contributor

@JanJakes JanJakes commented May 6, 2025

The translate_string_literal logic in the SQLite driver that handles "unescaping" (interpreting) of MySQL escape sequences is only applied to a textStringLiteral AST node. It turns out, the same logic is needed in translate_pure_identifier, or, in other words, also for WP_MySQL_Lexer::DOUBLE_QUOTED_TEXT and WP_MySQL_Lexer::BACK_TICK_QUOTED_ID tokens.

Therefore, it's probably better to move this logic to the tokenization phase and make token instances return the correct unquoted token values.

Surfaced in #42.

*
* @return string The token value.
*/
public function get_value(): string {
Copy link
Contributor Author

@JanJakes JanJakes May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I like that using get_value() is lazy and can generally nicely work for any token type where we need to interpret or normalize any values, I'm wondering how to solve the NO_BACKSLASH_ESCAPES SQL mode.

It's a very simple IF, but in the token instance, we just know nothing about SQL modes 🤔 The tokenizer knows it, so it could pass in a flag, or use a different token instance, but that makes it a bit less elegant.

Copy link
Contributor

@adamziel adamziel May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be a constructor argument? The mode is already determined when the token is created. If that was a boolean flag baked into the Token instance, we could still keep the get_value() method argument-less.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Done in 6e5a8f5.

@JanJakes JanJakes requested a review from adamziel May 6, 2025 14:52
* > of pattern-matching contexts, they evaluate to the strings \% and
* > \_, not to % and _.
*/
'\%' => '\\\\%',
Copy link
Contributor

@adamziel adamziel May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using '\\%' instead of \% just to make sure PHP won't surprise us by treating the backslash as an escape sequence? Or, alternatively, $backslash . '%' or '\x5C%'?

Copy link
Contributor

@adamziel adamziel May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought maybe we're good since PHP docs says this:

To specify a literal single quote, escape it with a backslash (\). To specify a literal backslash, double it (\\). All other instances of backslash will be treated as a literal backslash

But this gives me trust issues:

// This backslash is not treated as a literal backslash:
php > var_dump("\x5C");
string(1) "\"

// Neither is this:
php > var_dump("\n");
string(1) "
"

Copy link
Contributor

@adamziel adamziel May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no, that was just me using the wrong quotes. We are good indeed:

php > var_dump('\x5C');
string(4) "\x5C"
php > var_dump('\n');
string(2) "\n"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're good here, but it's a good point to be a bit more explicit. I don't like the maps like '\n' => "\n", where at a quick sight, the left and right sides look the same.

I followed the $baskslash idea in 931c82b, and I think it looks better now.

/*
* Apply the replacements.
*
* It is important to use "strtr()" and not "str_replace()", because
Copy link
Contributor

@adamziel adamziel May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a brilliant find ❤️

* A backslash with any other character represents the character itself.
* That is, \x evaluates to x, \\ evaluates to \, and \🙂 evaluates to 🙂.
*/
$value = preg_replace( '/\\\\(.)/u', '$1', $value );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I feel nostalgic. The first time I've read how a quadruple backslash evaluates to a single backslash in preg_* functions was about 20 years ago in a PHP4 book. I'm old. 👴 Can we either document this or express this in a different way? Perhaps $escaped_backslash = preg_quote("\x5C"); and preg_replace( '/'. $escaped_backslash .'(.)/u', '$1', $value );? Maybe that's an overkill. Feel free to make any call here, I just want to make sure this gets brought up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I did that together with the other improvements in 931c82b.

Copy link
Contributor

@adamziel adamziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, thank you Jan! I'm approving provisionally – I'd still like to see a rigorous test case that directly targets the quote_mysql_utf8_string_literal method. I know it's private. Perhaps it's generic and useful enough to be exposed publicly? And if not, there are ways to test private methods, too (although protected ones are easier).

@JanJakes
Copy link
Contributor Author

JanJakes commented May 7, 2025

I'm approving provisionally – I'd still like to see a rigorous test case that directly targets the quote_mysql_utf8_string_literal method.

@adamziel I added a test in 5196a05. Are there any more cases we should cover there?

Otherwise, this should be ready now.

@JanJakes JanJakes requested a review from adamziel May 7, 2025 15:03
@adamziel
Copy link
Contributor

adamziel commented May 7, 2025

I left a nitpick about a comment, but the substance of the PR looks great. Thank you for additional tests!

@JanJakes JanJakes force-pushed the string-escaping branch from a6b20c5 to 20f82be Compare May 7, 2025 19:31
@JanJakes
Copy link
Contributor Author

JanJakes commented May 7, 2025

@adamziel I improved the invalid UTF-8 tests and docs: 20f82be

I hope it makes more sense now.

@adamziel adamziel merged commit 278d41c into develop May 7, 2025
12 checks passed
@JanJakes JanJakes deleted the string-escaping branch May 8, 2025 05:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants