# Infixes Update Not Applying Properly to Tokenizer

## Description
I tried updating the infix patterns in spaCy, but the changes are not applying correctly to the tokenizer. Specifically, I'm trying to modify how apostrophes and other symbols (`'`) are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.

## Steps to Reproduce

Here are the two approaches I tried:

1️⃣ Removing apostrophe-related rules from `infixes` and recompiling.

Issue: Even after modifying the infix rules, contractions like `"can't"` still split incorrectly.

2️⃣ Manually adding new infix rules (including hyphens, plus signs, and dollar signs).
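The exact snippets didn't survive in this report, so here is a minimal sketch of the two attempts, assuming a `spacy.blank("en")` pipeline and the default English `infixes`; the apostrophe filter condition and the extra `[+$\-]` pattern are approximations of what I ran, not the original code:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# 1️⃣ Drop any default infix pattern that mentions an apostrophe,
# then recompile and reassign the tokenizer's infix_finditer.
infixes = [p for p in nlp.Defaults.infixes if "'" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

# Contractions like "can't" still come back split for me.
print([t.text for t in nlp("can't stop")])

# 2️⃣ Add extra infix patterns (hyphen, plus sign, dollar sign)
# on top of the defaults and recompile again.
custom_infixes = list(nlp.Defaults.infixes) + [r"[+$\-]"]
nlp.tokenizer.infix_finditer = compile_infix_regex(custom_infixes).finditer

print([t.text for t in nlp("cost+tax $5 re-run")])
```

In both cases I am reassigning `nlp.tokenizer.infix_finditer` directly rather than building a new `Tokenizer` object, in case that matters.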
## Expected Behavior

The tokenizer should pick up the updated infix rules, so that (for example) `"can't"` is no longer split once the apostrophe patterns are removed.
## Actual Behavior

Changes assigned to `nlp.tokenizer.infix_finditer` do not seem to take effect.

## Question
Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
Thanks for your help!