Describe the bug
Control characters such as \x1f and \x1d break German sentence segmentation: segmenter.segment() raises a ValueError at the format_numbered_list_with_periods step.
To Reproduce
Steps to reproduce the behavior:
Input text - '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
Code:
import pysbd
example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True)
sents_char_spans = segmenter.segment(example_text)
Expected behavior
Expected output:
['1.\x1f\x1fApfel\x1d', '2.\x1f\x1fBanana']
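Until the segmenter handles these characters itself, one possible caller-side workaround is to mask the separator control characters before segmenting. The sketch below is only an illustration under assumptions: mask_control_separators is a hypothetical helper, not part of pysbd, and it replaces each of \x1c to \x1f with a single space so that the text keeps its length.

```python
# Separator control characters FS, GS, RS, US (\x1c to \x1f).
CONTROL_SEPARATORS = '\x1c\x1d\x1e\x1f'

# Map each separator to exactly one space so every character keeps its offset.
_TO_SPACE = str.maketrans(CONTROL_SEPARATORS, ' ' * len(CONTROL_SEPARATORS))


def mask_control_separators(text):
    """Hypothetical helper (not a pysbd API): same-length copy with separators masked."""
    return text.translate(_TO_SPACE)


example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'
masked = mask_control_separators(example_text)
print(repr(masked))
```

The masked string could be passed to segmenter.segment() in place of the original. Because both strings have the same length, the start and end offsets returned with char_span=True should still index into example_text, so the original sentences (control characters included) could be recovered by slicing it. Whether the resulting split matches the expected output above has not been verified here.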
Additional context
pysbd version: '0.3.4'
Python 3.8.10
Reproduced on both Windows and Linux
Traceback (most recent call last) ────────────────────────────────╮
│ in <module> │
│ │
│ 1 segmenter = pysbd.Segmenter(language="de", clean=False, char_span=True) │
│ ❱ 2 sents_char_spans = segmenter.segment(example_text) │
│ 3 │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\segme │
│ nter.py:87 in segment │
│ │
│ 84 │ │ if self.clean or self.doc_type == 'pdf': │
│ 85 │ │ │ text = self.cleaner(text).clean() │
│ 86 │ │ │
│ ❱ 87 │ │ postprocessed_sents = self.processor(text).process() │
│ 88 │ │ sentence_w_char_spans = self.sentences_with_char_spans(postprocessed_sents) │
│ 89 │ │ if self.char_span: │
│ 90 │ │ │ return sentence_w_char_spans │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\proce │
│ ssor.py:33 in process │
│ │
│ 30 │ │ │ return self.text │
│ 31 │ │ self.text = self.text.replace('\n', '\r') │
│ 32 │ │ li = ListItemReplacer(self.text) │
│ ❱ 33 │ │ self.text = li.add_line_break() │
│ 34 │ │ self.replace_abbreviations() │
│ 35 │ │ self.replace_numbers() │
│ 36 │ │ self.replace_continuous_punctuation() │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:61 in add_line_break │
│ │
│ 58 │ def add_line_break(self): │
│ 59 │ │ self.format_alphabetical_lists() │
│ 60 │ │ self.format_roman_numeral_lists() │
│ ❱ 61 │ │ self.format_numbered_list_with_periods() │
│ 62 │ │ self.format_numbered_list_with_parens() │
│ 63 │ │ return self.text │
│ 64 │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:80 in format_numbered_list_with_periods │
│ │
│ 77 │ │ │ │ │ │ '♨', strip=True) │
│ 78 │ │
│ 79 │ def format_numbered_list_with_periods(self): │
│ ❱ 80 │ │ self.replace_periods_in_numbered_list() │
│ 81 │ │ self.add_line_breaks_for_numbered_list_with_periods() │
│ 82 │ │ self.text = Text(self.text).apply(self.SubstituteListPeriodRule) │
│ 83 │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:76 in replace_periods_in_numbered_list │
│ │
│ 73 │ │ self.text = Text(self.text).apply(self.ListMarkerRule) │
│ 74 │ │
│ 75 │ def replace_periods_in_numbered_list(self): │
│ ❱ 76 │ │ self.scan_lists(self.NUMBERED_LIST_REGEX_1, self.NUMBERED_LIST_REGEX_2, │
│ 77 │ │ │ │ │ │ '♨', strip=True) │
│ 78 │ │
│ 79 │ def format_numbered_list_with_periods(self): │
│ │
│ C:\Users\ekaterina.loginova\AppData\Local\Programs\Python\Python38\lib\site-packages\pysbd\lists │
│ _item_replacer.py:114 in scan_lists │
│ │
│ 111 │ │
│ 112 │ def scan_lists(self, regex1, regex2, replacement, strip=False): │
│ 113 │ │ list_array = re.findall(regex1, self.text) │
│ ❱ 114 │ │ list_array = list(map(int, list_array)) │
│ 115 │ │ for ind, item in enumerate(list_array): │
│ 116 │ │ │ # to avoid IndexError │
│ 117 │ │ │ # ruby returns nil if index is out of range │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: invalid literal for int() with base 10: '\x1d2'
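The failing value '\x1d2' hints at the mechanism: in Python, \s in a str pattern also matches the separator control characters \x1c to \x1f, so a list-number pattern that allows a leading whitespace character captures the \x1d together with the digit, and int() then rejects the result, as the traceback shows. The following is a minimal sketch of that mechanism and of a more tolerant conversion. The pattern is a simplified stand-in, not pysbd's actual NUMBERED_LIST_REGEX_1, and parse_list_numbers is a hypothetical helper, not part of the library.

```python
import re

# Simplified stand-in for a numbered-list pattern (an assumption, not pysbd's regex):
# one or two digits followed by "." and whitespace, preceded by whitespace or text start.
LIST_NUMBER = re.compile(r'\s\d{1,2}(?=\.\s)|^\d{1,2}(?=\.\s)')


def parse_list_numbers(text):
    """Hypothetical helper: return list numbers as ints, keeping only each match's digits."""
    numbers = []
    for match in LIST_NUMBER.findall(text):
        digits = re.search(r'\d+', match)  # drop any leading whitespace-like character
        if digits:
            numbers.append(int(digits.group()))
    return numbers


example_text = '1.\x1f\x1fApfel\x1d2.\x1f\x1fBanana'

# The second raw match carries the leading \x1d, which is what int() chokes on.
raw_matches = LIST_NUMBER.findall(example_text)
print(raw_matches)
print(parse_list_numbers(example_text))
```

Extracting the digits explicitly, rather than relying on int() to strip whatever the pattern's \s happened to match, keeps the conversion independent of which characters the regex engine counts as whitespace.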