Infixes Update Not Applying Properly to Tokenizer #13785

Open
Rayan-Allali opened this issue Apr 2, 2025 · 0 comments


Description

I tried updating the infix patterns in spaCy, but the changes are not being applied by the tokenizer. Specifically, I'm trying to modify how the apostrophe (') and similar symbols are handled. However, even after setting a new regex, the tokenizer does not reflect these changes.

Steps to Reproduce

Here are the two approaches I tried:

1️⃣ Removing apostrophe-related rules from infixes and recompiling:

from spacy.util import compile_infix_regex

default_infixes = [pattern for pattern in nlp.Defaults.infixes if "'" not in pattern]
infix_re = compile_infix_regex(default_infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer

Issue: Even after modifying the infix rules, contractions like "can't" still split incorrectly.
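To see which rule is actually responsible for a given split, tokenizer.explain reports the rule name behind each token. A minimal check (assuming a blank English pipeline rather than a trained model):

```python
import spacy

# Blank English pipeline: no trained model download needed.
nlp = spacy.blank("en")

# explain() returns (rule_name, token_text) pairs showing which
# prefix/suffix/infix/special-case rule produced each token.
for rule, token in nlp.tokenizer.explain("can't"):
    print(rule, token)
```

On a blank English pipeline this reports SPECIAL-* rules for "can't", i.e. the split comes from the tokenizer's exception rules, not from an infix pattern.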

2️⃣ Manually appending an apostrophe rule to the default infixes and recompiling:

import spacy

infixes = nlp.Defaults.infixes + [r"'"]
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
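As a sanity check that assigning infix_finditer takes effect at all, adding a plus-sign infix does change tokenization between letters (a sketch with a blank English pipeline; the default math infixes only apply between digits):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp("x+y")])  # one token by default

# Add "+" as an infix and recompile:
infixes = nlp.Defaults.infixes + [r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("x+y")])  # now splits at the +
```

So the assignment itself works; the surprise is limited to the apostrophe case.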

Expected Behavior

  • The tokenizer should correctly apply the new infix rules.

Actual Behavior

  • Changes to nlp.tokenizer.infix_finditer do not seem to take effect.

Question

Am I missing something in how infix rules should be updated? Is there a correct way to override infix splitting?
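For context: as far as I can tell, contractions like "can't" are split by the tokenizer's special-case rules (exceptions), which take precedence over infixes. A sketch of dropping those rules instead (assuming spaCy v3, where tokenizer.rules is writable, and a blank English pipeline):

```python
import spacy

nlp = spacy.blank("en")

# Remove the apostrophe special cases (tokenizer exceptions), which is
# where contraction splitting actually happens:
nlp.tokenizer.rules = {
    key: value for key, value in nlp.tokenizer.rules.items() if "'" not in key
}
print([t.text for t in nlp("can't")])
```

If this is the cause, the reported behavior would be expected: infix changes do apply, but the exception rules for contractions fire first.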

Thanks for your help!
