Commit 09d65d9

Update 03-preprocessing.md
Parent: 5681be6


episodes/03-preprocessing.md

Lines changed: 1 addition & 258 deletions
@@ -85,7 +85,7 @@ drive.mount('/content/drive')
 
 # Show existing colab notebooks and helpers.py file
 from os import listdir
-wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis'
+wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis/code'
 listdir(wksp_dir)
 
 # Add folder to colab's path so we can import the helper functions
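The comment above refers to the usual Colab pattern of putting the workspace folder on Python's import path. A minimal sketch of that step, assuming the `wksp_dir` defined in this hunk and the lesson's `helpers.py` module:

```python
import sys

# Make helpers.py importable from the mounted workspace folder
sys.path.insert(0, wksp_dir)
```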
@@ -126,21 +126,13 @@ corpus_file_list = create_file_list(corpus_dir)
 print(corpus_file_list)
 ```
 
-```txt
-['/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-olivertwist.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-knewtoomuch.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-tenyearslater.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-twentyyearsafter.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-pride.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-taleoftwocities.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-whitehorse.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-hardtimes.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-emma.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-thursday.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-threemusketeers.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-ball.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-ladysusan.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-persuasion.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-conman.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-napoleon.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-brown.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-maninironmask.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-blacktulip.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-greatexpectations.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-ourmutualfriend.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-sense.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-christmascarol.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-davidcopperfield.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-pickwickpapers.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-bartleby.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-bleakhouse.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-montecristo.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-northanger.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-moby_dick.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-twelfthnight.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-typee.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-romeo.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-omoo.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-piazzatales.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-muchado.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-midsummer.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-lear.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-pierre.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-caesar.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-othello.txt']
-```
-
 We will use the full corpus later, but it might be useful to filter to just a few specific files. For example, if I want just documents written by Austen, I can filter on part of the file path name:
 
 ```python
 austen_list = create_file_list(corpus_dir, 'austen*')
 print(austen_list)
 ```
 
-```txt
-['/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-pride.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-emma.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-ladysusan.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-persuasion.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-sense.txt', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-northanger.txt']
-```
-
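Both calls above use `create_file_list` from the lesson's `helpers.py`, whose implementation is not shown in this diff. A plausible sketch of such a helper, assuming it simply globs the corpus directory:

```python
from pathlib import Path

def create_file_list(directory, pattern='*.txt'):
    # Return every file path under `directory` matching the glob pattern
    return [str(path) for path in Path(directory).glob(pattern)]
```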
 Let's take a closer look at Emma. We are looking at the first full sentence, which begins at character 50 and ends at character 290.
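In code, that is plain Python string slicing (a sketch; `emmapath` is defined in the cell that follows):

```python
# Read the file and slice the first sentence by its character offsets
with open(emmapath, 'r') as f:
    text = f.read()
print(text[50:290])
```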
 
 ```python
@@ -154,14 +146,6 @@ with open(emmapath, 'r') as f:
     print(sentence)
 ```
 
-```txt
-/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-emma.txt
-Emma Woodhouse, handsome, clever, and rich, with a comfortable home
-and happy disposition, seemed to unite some of the best blessings
-of existence; and had lived nearly twenty-one years in the world
-with very little to distress or vex her.
-```
-
 ## Preprocessing
 
 Currently, our data is still in a format that is best for humans to read. Humans, without having to think too consciously about it, understand how words and sentences group up and divide into discrete units of meaning. We also understand that the words *run*, *ran*, and *running* are just different grammatical forms of the same underlying concept. Finally, not only do we understand how punctuation affects the meaning of a text, we can also make sense of texts that have odd amounts or odd placements of punctuation.
@@ -214,64 +198,6 @@ for t in tokens:
     print(t.text)
 ```
 
-```text
-Emma
-Woodhouse
-,
-handsome
-,
-clever
-,
-and
-rich
-,
-with
-a
-comfortable
-home
-
-
-and
-happy
-disposition
-,
-seemed
-to
-unite
-some
-of
-the
-best
-blessings
-
-
-of
-existence
-;
-and
-had
-lived
-nearly
-twenty
--
-one
-years
-in
-the
-world
-
-
-with
-very
-little
-to
-distress
-or
-vex
-her
-.
-```
-
 The single sentence has been broken down into a set of tokens. Tokens in spaCy aren't just strings: they're Python objects with a variety of attributes. Full documentation for these attributes can be found at <https://spacy.io/api/token>.
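A quick illustration of a few of those attributes, assuming the `tokens` object from the cell above:

```python
# Each Token exposes the raw text plus linguistic annotations and flags
for t in tokens:
    print(t.text, t.lemma_, t.pos_, t.is_stop, t.is_punct)
```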
 
 ### Stems and Lemmas
@@ -295,127 +221,13 @@ for t in tokens:
     print(t.lemma)
 ```
 
-```txt
-14931068470291635495
-17859265536816163747
-2593208677638477497
-7792995567492812500
-2593208677638477497
-5763234570816168059
-2593208677638477497
-2283656566040971221
-10580761479554314246
-2593208677638477497
-12510949447758279278
-11901859001352538922
-2973437733319511985
-12006852138382633966
-962983613142996970
-2283656566040971221
-244022080605231780
-3083117615156646091
-2593208677638477497
-15203660437495798636
-3791531372978436496
-1872149278863210280
-7000492816108906599
-886050111519832510
-7425985699627899538
-5711639017775284443
-451024245859800093
-962983613142996970
-886050111519832510
-4708766880135230039
-631425121691394544
-2283656566040971221
-14692702688101715474
-13874798850131827181
-16179521462386381682
-8304598090389628520
-9153284864653046197
-17454115351911680600
-14889849580704678361
-3002984154512732771
-7425985699627899538
-1703489418272052182
-962983613142996970
-12510949447758279278
-9548244504980166557
-9778055143417507723
-3791531372978436496
-14526277127440575953
-3740602843040177340
-14980716871601793913
-6740321247510922449
-12646065887601541794
-962983613142996970
-```
-
 spaCy stores words by an ID number rather than as a full string, to save space in memory. Many spaCy functions will therefore return numbers, not the words you might expect. Fortunately, appending an underscore to an attribute name (`lemma_` instead of `lemma`) makes spaCy return the text representation instead. We will also apply the lower-case function so that all words are lower case.
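As an aside, the mapping between these hash IDs and strings lives in the pipeline's `StringStore`, which converts in both directions. An illustrative sketch, assuming the lesson's `nlp` pipeline and the `tokens` object from the cell above:

```python
# Convert a hash from the output above back to its string, and a string to its hash
first_hash = tokens[0].lemma           # the first ID printed above
print(nlp.vocab.strings[first_hash])   # the lemma text for that token
print(nlp.vocab.strings['emma'])       # the 64-bit hash for 'emma'
```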
 
 ```python
 for t in tokens:
     print(str.lower(t.lemma_))
 ```
 
-```txt
-emma
-woodhouse
-,
-handsome
-,
-clever
-,
-and
-rich
-,
-with
-a
-comfortable
-home
-
-
-and
-happy
-disposition
-,
-seem
-to
-unite
-some
-of
-the
-good
-blessing
-
-
-of
-existence
-;
-and
-have
-live
-nearly
-twenty
--
-one
-year
-in
-the
-world
-
-
-with
-very
-little
-to
-distress
-or
-vex
-she
-.
-```
-
 Notice how words like *best* and *her* have been changed to their root forms, *good* and *she*. Let's change our tokenizer to save the lower-cased, lemmatized versions of words instead of the original words.
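A sketch of what that change might look like, using the lesson's naming (the actual cell follows in the file):

```python
def tokenizer(document):
    # Return the lower-cased lemma of every token instead of the raw text
    doc = nlp(document)
    return [str.lower(token.lemma_) for token in doc]
```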
 
 ```python
@@ -439,10 +251,6 @@ from spacy.lang.en.stop_words import STOP_WORDS
 print(STOP_WORDS)
 ```
 
-```txt
-{'‘s', 'must', 'again', 'had', 'much', 'a', 'becomes', 'mostly', 'once', 'should', 'anyway', 'call', 'front', 'whence', '‘ll', 'whereas', 'therein', 'himself', 'within', 'ourselves', 'than', 'they', 'toward', 'latterly', 'may', 'what', 'her', 'nowhere', 'so', 'whenever', 'herself', 'other', 'get', 'become', 'namely', 'done', 'could', 'although', 'which', 'fifteen', 'seems', 'hereafter', 'whereafter', 'two', "'ve", 'to', 'his', 'one', '‘d', 'forty', 'being', 'i', 'four', 'whoever', 'somehow', 'indeed', 'that', 'afterwards', 'us', 'she', "'d", 'herein', '’ll', 'keep', 'latter', 'onto', 'just', 'too', "'m", '‘re', 'you', 'no', 'thereby', 'various', 'enough', 'go', 'myself', 'first', 'seemed', 'up', 'until', 'yourselves', 'while', 'ours', 'can', 'am', 'throughout', 'hereupon', 'whereupon', 'somewhere', 'fifty', 'those', 'quite', 'together', 'wherein', 'because', 'itself', 'hundred', 'neither', 'give', 'alone', 'them', 'nor', 'as', 'hers', 'into', 'is', 'several', 'thus', 'whom', 'why', 'over', 'thence', 'doing', 'own', 'amongst', 'thereupon', 'otherwise', 'sometime', 'for', 'full', 'anyhow', 'nine', 'even', 'never', 'your', 'who', 'others', 'whole', 'hereby', 'ever', 'or', 'and', 'side', 'though', 'except', 'him', 'now', 'mine', 'none', 'sixty', "n't", 'nobody', '‘m', 'well', "'s", 'then', 'part', 'someone', 'me', 'six', 'less', 'however', 'make', 'upon', '’s', '’re', 'back', 'did', 'during', 'when', '’d', 'perhaps', "'re", 'we', 'hence', 'any', 'our', 'cannot', 'moreover', 'along', 'whither', 'by', 'such', 'via', 'against', 'the', 'most', 'but', 'often', 'where', 'each', 'further', 'whereby', 'ca', 'here', 'he', 'regarding', 'every', 'always', 'are', 'anywhere', 'wherever', 'using', 'there', 'anyone', 'been', 'would', 'with', 'name', 'some', 'might', 'yours', 'becoming', 'seeming', 'former', 'only', 'it', 'became', 'since', 'also', 'beside', 'their', 'else', 'around', 're', 'five', 'an', 'anything', 'please', 'elsewhere', 'themselves', 'everyone', 'next', 'will', 'yourself', 'twelve', 'few', 'behind', 'nothing', 'seem', 'bottom', 'both', 'say', 'out', 'take', 'all', 'used', 'therefore', 'below', 'almost', 'towards', 'many', 'sometimes', 'put', 'were', 'ten', 'of', 'last', 'its', 'under', 'nevertheless', 'whatever', 'something', 'off', 'does', 'top', 'meanwhile', 'how', 'already', 'per', 'beyond', 'everything', 'not', 'thereafter', 'eleven', 'n‘t', 'above', 'eight', 'before', 'noone', 'besides', 'twenty', 'do', 'everywhere', 'due', 'empty', 'least', 'between', 'down', 'either', 'across', 'see', 'three', 'on', 'formerly', 'be', 'very', 'rather', 'made', 'has', 'this', 'move', 'beforehand', 'if', 'my', 'n’t', "'ll", 'third', 'without', '’m', 'yet', 'after', 'still', 'same', 'show', 'in', 'more', 'unless', 'from', 'really', 'whether', '‘ve', 'serious', 'these', 'was', 'amount', 'whose', 'have', 'through', 'thru', '’ve', 'about', 'among', 'another', 'at'}
-```
-
 It's possible to add and remove words as well; for example, *zebra*:
 
 ```python
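The customization cell itself is elided in this diff. For reference, `STOP_WORDS` is a plain Python set, so standard set operations work on it (an illustrative sketch, separate from the lesson's cell):

```python
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS.add('zebra')       # add a custom stop word
print('zebra' in STOP_WORDS)  # True
STOP_WORDS.remove('zebra')    # and remove it again
```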
@@ -482,34 +290,6 @@ for token in tokens:
     print(str.lower(token.lemma_))
 ```
 
-```txt
-woodhouse
-handsome
-clever
-rich
-comfortable
-home
-
-
-happy
-disposition
-unite
-good
-blessing
-
-
-existence
-live
-nearly
-year
-world
-
-
-little
-distress
-vex
-```
-
 Notice that because we added *emma* to our stopwords, it no longer appears in our preprocessed sentence. Other stopwords are missing as well, such as the numbers *twenty* and *one*.
 
 Let's now filter out stopwords and punctuation in our custom tokenizer as well:
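The updated tokenizer cell is elided in this diff; a sketch of the filtering step it describes, using spaCy's built-in token flags and the lesson's naming:

```python
def tokenizer(document):
    # Keep lower-cased lemmas, skipping stopwords and punctuation
    doc = nlp(document)
    return [str.lower(token.lemma_)
            for token in doc
            if not token.is_stop and not token.is_punct]
```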
@@ -541,36 +321,6 @@ for token in tokens:
     print(str.lower(token.lemma_)+" "+token.pos_)
 ```
 
-```txt
-woodhouse PROPN
-handsome ADJ
-clever ADJ
-rich ADJ
-comfortable ADJ
-home NOUN
-
- SPACE
-happy ADJ
-disposition NOUN
-unite VERB
-good ADJ
-blessing NOUN
-
- SPACE
-existence NOUN
-live VERB
-nearly ADV
-year NOUN
-world NOUN
-
- SPACE
-little ADJ
-distress VERB
-vex VERB
-
- SPACE
-```
-
 Because our dataset is relatively small, we may find that character names and places weigh very heavily in our early models. We also have a number of blank or whitespace tokens, which we will want to remove as well.
 
 We will finish our special tokenizer by removing punctuation and proper nouns from our documents:
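Again, the cell itself is elided here; a sketch of the finished tokenizer the text describes, under the same assumed naming:

```python
def tokenizer(document):
    # Keep lower-cased lemmas, skipping stopwords, punctuation,
    # whitespace tokens, and proper nouns such as character names
    doc = nlp(document)
    return [str.lower(token.lemma_)
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.is_space
            and token.pos_ != 'PROPN']
```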
@@ -621,10 +371,6 @@ tokens = tokenizer(sentence)
 print(tokens)
 ```
 
-```txt
-['handsome', 'clever', 'rich', 'comfortable', 'home', 'happy', 'disposition', 'unite', 'good', 'blessing', 'existence', 'live', 'nearly', 'year', 'world', 'little', 'distress', 'vex']
-```
-
 ## Putting it All Together
 
 Now that we've built a tokenizer we're happy with, let's use it to create lemmatized versions of all the books in our corpus.
@@ -670,9 +416,6 @@ from helpers import lemmatize_files
 lemma_file_list = lemmatize_files(tokenizer, corpus_file_list)
 ```
 
-```txt
-['/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-olivertwist.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-knewtoomuch.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-tenyearslater.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-twentyyearsafter.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-pride.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-taleoftwocities.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-whitehorse.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-hardtimes.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-emma.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-thursday.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-threemusketeers.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-ball.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-ladysusan.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-persuasion.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-conman.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-napoleon.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/chesterton-brown.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-maninironmask.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-blacktulip.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-greatexpectations.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-ourmutualfriend.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-sense.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-christmascarol.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-davidcopperfield.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-pickwickpapers.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-bartleby.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dickens-bleakhouse.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/dumas-montecristo.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/austen-northanger.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-moby_dick.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-twelfthnight.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-typee.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-romeo.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-omoo.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-piazzatales.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-muchado.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-midsummer.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-lear.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/melville-pierre.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-caesar.txt.lemmas', '/content/drive/My Drive/Colab Notebooks/text-analysis/data/books/shakespeare-othello.txt.lemmas']
-```
 This process may take several minutes to run. Doing this preprocessing now, however, will save us much, much time later.
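The real `lemmatize_files` lives in the lesson's `helpers.py`; a plausible sketch consistent with the output above, which writes each book's lemmas to a sibling `.lemmas` file:

```python
def lemmatize_files(tokenizer, file_list):
    # Tokenize each file and save its lemmas next to it as '<name>.lemmas'
    lemma_files = []
    for path in file_list:
        with open(path, 'r') as f:
            lemmas = tokenizer(f.read())
        out_path = path + '.lemmas'
        with open(out_path, 'w') as f:
            f.write(' '.join(lemmas))
        lemma_files.append(out_path)
    return lemma_files
```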
 
 ## Saving Our Progress
