Infixes Update Not Applying Properly to Tokenizer #13785

Open
Rayan-Allali opened this issue Apr 2, 2025 · 0 comments


Description

I tried updating the infix patterns in spaCy, but the changes are not being applied by the tokenizer. Specifically, I'm trying to modify how the apostrophe (') and similar symbols are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.

Steps to Reproduce

Here are the two approaches I tried:

1️⃣ Removing apostrophe-related rules from infixes and recompiling:

from spacy.util import compile_infix_regex

default_infixes = [pattern for pattern in nlp.Defaults.infixes if "'" not in pattern]
infix_re = compile_infix_regex(default_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

Issue: Even after modifying the infix rules, contractions like "can't" still split incorrectly.
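To see which rule is actually responsible for a given split, tokenizer.explain reports the rule name behind each token. A minimal check (assuming a blank English pipeline rather than a trained model):

```python
import spacy

# Blank English pipeline: no trained model download needed.
nlp = spacy.blank("en")

# explain() returns (rule_name, token_text) pairs showing which
# prefix/suffix/infix/special-case rule produced each token.
for rule, token in nlp.tokenizer.explain("can't"):
    print(rule, token)
```

On a blank English pipeline this reports SPECIAL-* rules for "can't", i.e. the split comes from the tokenizer's exception rules, not from an infix pattern.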

2️⃣ Manually appending an apostrophe rule to the default infixes and recompiling:

import spacy

infixes = nlp.Defaults.infixes + [r"'"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
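As a sanity check that assigning infix_finditer takes effect at all, adding a plus-sign infix does change tokenization between letters (a sketch with a blank English pipeline; the default math infixes only apply between digits):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp("x+y")])  # one token by default

# Add "+" as an infix and recompile:
infixes = nlp.Defaults.infixes + [r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("x+y")])  # now splits at the +
```

So the assignment itself works; the surprise is limited to the apostrophe case.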

Expected Behavior

  • The tokenizer should correctly apply the new infix rules.

Actual Behavior

  • Changes to nlp.tokenizer.infix_finditer do not seem to take effect.

Question

Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
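For context: as far as I can tell, contractions like "can't" are split by the tokenizer's special-case rules (exceptions), which take precedence over infixes. A sketch of dropping those rules instead (assuming spaCy v3, where tokenizer.rules is writable, and a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")

# Remove the apostrophe special cases (tokenizer exceptions), which is
# where contraction splitting actually happens:
nlp.tokenizer.rules = {
    key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key
}
print([t.text for t in nlp("can't")])
```

If this is the cause, the reported behavior would be expected: infix changes do apply, but the exception rules for contractions fire first.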

Thanks for your help!
