Commit 89f671e

add pre-baked lemma option
1 parent: d6c7de2

1 file changed: +29 −4 lines

episodes/03-preprocessing.md

Lines changed: 29 additions & 4 deletions
````diff
@@ -409,26 +409,51 @@ distress
 vex
 ```
 
-To help make this quick for all the text in all our books, we'll use a helper function we prepared for learners to use our tokenizer, do the casing and lemmatization we discussed earlier, and write the results to a file:
+To help make this *relatively* quick for all the text in all our books, we'll use a helper function we prepared for learners to use our tokenizer, do the casing and lemmatization we discussed earlier, and write the results to a file:
 
 ```python
 from helpers import lemmatize_files
 lemma_file_list = lemmatize_files(tokenizer, corpus_file_list)
 ```
 
-This process may take several minutes to run. Doing this preprocessing now however will save us much, much time later.
+This process may take several minutes to run. If you don't want to wait, you can stop the running cell and use our pre-baked solution (the lemma files in `data/book_lemmas`). The next section walks you through both options.
 
-## Saving Our Progress
+## Creating a dataframe to work with files and lemmas easily
 
-Let's save our progress by storing a spreadsheet (```*.csv``` or ```*.xlsx``` file) that lists all our authors, books, and associated filenames, both the original and lemmatized copies.
+Let's create a dataframe that lists all our authors, books, and associated filenames, both the original and the lemmatized copies; we'll save it to a spreadsheet at the end of this section.
 
 We'll use another helper we prepared to make this easy:
 
 ```python
 from helpers import parse_into_dataframe
 pattern = "/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/{author}-{title}.txt"
 data = parse_into_dataframe(pattern, corpus_file_list)
+data.head()
+```
+
+Next, we can add the lemma files to the dataframe. If you ran the `lemmatize_files()` function above successfully, you can use:
+```python
 data["Lemma_File"] = lemma_file_list
+data.head()
+```
+
+Otherwise, we can add the "pre-baked" lemmas to our dataframe using:
+
+```python
+from pathlib import Path
+
+def get_lemma_path(file_path):
+    # Convert to Path object for easier manipulation
+    p = Path(file_path)
+    # Extract the filename like 'austen-sense.txt'
+    file_name = p.name
+    # Create new path with 'book_lemmas' instead of 'books' and add .lemmas
+    lemma_name = file_name + ".lemmas"
+    return str(p.parent.parent / "book_lemmas" / lemma_name)
+
+# Add new column
+data["Lemma_File"] = data["File"].apply(get_lemma_path)
+data.head()
 ```
 
 Finally, we'll save this table to a file:
````
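As a quick sanity check on the new mapping, calling `get_lemma_path` on one of the lesson's book files should point at the matching pre-baked lemma file. A minimal sketch, reusing the function from the diff above (the sample path follows the episode's Colab layout; `austen-sense.txt` is the filename the code's own comment uses as an example):

```python
# Sketch: check that a book file maps to its pre-baked lemma file.
# Assumes get_lemma_path from the commit above has been defined.
book = "/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-sense.txt"
print(get_lemma_path(book))
# -> /content/drive/My Drive/Colab Notebooks/text-analysis/data/book_lemmas/austen-sense.txt.lemmas
```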
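The hunk's trailing context stops at "Finally, we'll save this table to a file:" without showing the save itself. A minimal sketch of that step, assuming `data` is a pandas DataFrame; the output filename is illustrative, not taken from the commit:

```python
# Assumed save step (not shown in this hunk); the CSV filename is illustrative.
data.to_csv(
    "/content/drive/My Drive/Colab Notebooks/text-analysis/data/book_files.csv",
    index=False,
)
```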
