You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: episodes/03-preprocessing.md
+27-4Lines changed: 27 additions & 4 deletions
Original file line number
Diff line number
Diff line change
@@ -409,26 +409,49 @@ distress
409
409
vex
410
410
```
411
411
412
-
To help make this quick for all the text in all our books, we'll use a helper function we prepared for learners to use our tokenizer, do the casing and lemmatization we discussed earlier, and write the results to a file:
412
+
To help make this *relatively*quick for all the text in all our books, we'll use a helper function we prepared for learners to use our tokenizer, do the casing and lemmatization we discussed earlier, and write the results to a file:
This process may take several minutes to run. Doing this preprocessing now however will save us much, much time later.
419
+
This process may take several minutes to run. If you don't want to wait, you can stop the cell running and use our pre-baked solution (lemma files) found in data/book_lemmas. The next section will walk you through both options.
420
420
421
-
## Saving Our Progress
421
+
## Creating dataframe to work with files and lemmas easily
422
422
423
-
Let's save our progress by storing a spreadsheet (```*.csv``` or ```*.xlsx``` file) that lists all our authors, books, and associated filenames, both the original and lemmatized copies.
423
+
Let's save a dataframe / spreadsheet that lists all our authors, books, and associated filenames, both the original and lemmatized copies.
424
424
425
425
We'll use another helper we prepared to make this easy:
0 commit comments