
Commit 69ae8ce

add recent edits
1 parent 9acb776 commit 69ae8ce


episodes/05-tf-idf-documentEmbeddings.md

Lines changed: 18 additions & 41 deletions
@@ -85,10 +85,27 @@ Earlier, we preprocessed our data to lemmatize each file in our corpus, then sav

Let's load our data back in to continue where we left off. First, we'll mount our Google Drive to get access to our data folder again.

```python
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

# Show existing Colab notebooks and the helpers.py file
from os import listdir
wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis/code'
listdir(wksp_dir)

# Add the folder to Colab's path so we can import the helper functions
import sys
sys.path.insert(0, wksp_dir)
```
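With the folder on `sys.path`, the helper functions can then be imported by module name. A one-line sketch, assuming the module corresponds to the helpers.py file mentioned in the comment above:

```python
# Assumes helpers.py lives in the wksp_dir folder that was added to sys.path above.
import helpers
```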

Then, read the data.csv file we output in the last episode.

```python
from pandas import read_csv
data = read_csv("/content/drive/My Drive/Colab Notebooks/text-analysis/data/data.csv")
data.head()
```

#### TF-IDF Vectorizer

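The `vectorizer` used in the next hunk is scikit-learn's `TfidfVectorizer`; its constructor call sits outside the lines shown in this diff. A minimal sketch of how it might be set up, assuming the `Lemma_File` column holds paths to the lemmatized files (if it holds raw text instead, the default `input='content'` applies):

```python
# A minimal sketch, not the lesson's exact call: the constructor arguments are
# not shown in this diff. input='filename' is an assumption based on
# data["Lemma_File"] holding paths to the lemmatized files.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='filename')
tfidf = vectorizer.fit_transform(list(data["Lemma_File"]))  # as shown in the next hunk
print(tfidf.shape)
```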
@@ -109,10 +126,6 @@ tfidf = vectorizer.fit_transform(list(data["Lemma_File"]))
print(tfidf.shape)
```

```output
(41, 9879)
```

Here, `tfidf.shape` shows us how many rows (books) and columns (words) are in our model.

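Each row is one book's TF-IDF vector, and because the matrix is stored sparsely most entries are zero. A small sketch, not part of the lesson, of one way to peek at the highest-scoring terms for the first book, assuming the `tfidf` matrix and `vectorizer` from above:

```python
import numpy as np

# Dense TF-IDF scores for the first book (row 0 of the sparse matrix).
row = tfidf[0].toarray().flatten()
words = vectorizer.get_feature_names_out()

# Indices of the five largest scores, highest first.
top = np.argsort(row)[::-1][:5]
for idx in top:
    print(words[idx], round(float(row[idx]), 3))
```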
::::::::::::::::::::::::::::::::::::::: challenge
@@ -145,20 +158,13 @@ Let's take a look at some of the words in our documents. Each of these represent
vectorizer.get_feature_names_out()[0:5]
```

```output
array(['15th', '1st', 'aback', 'abandonment', 'abase'], dtype=object)
```

What is the weight of those words?

```python
print(vectorizer.idf_[0:5]) # weights for each token
```

```output
[2.79175947 2.94591015 2.25276297 2.25276297 2.43508453]
```

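With scikit-learn's default `smooth_idf=True`, each of these weights follows idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. A quick sketch of that arithmetic, using the 41-book corpus size from `tfidf.shape` and an assumed document frequency:

```python
import numpy as np

n_docs = 41   # number of books, from tfidf.shape above
df = 5        # assumption: a rare term that appears in only 5 of the books
idf = np.log((1 + n_docs) / (1 + df)) + 1
print(idf)    # ~2.9459, matching the largest weights printed above
```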
Let's show the weights for all the words:

```python
@@ -167,41 +173,12 @@ tfidf_data = DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out()
tfidf_data
```

```output
              Weight
15th        2.791759
1st         2.945910
aback       2.252763
abandonment 2.252763
abase       2.435085
...              ...
zealously   2.945910
zenith      2.791759
zest        2.791759
zigzag      2.945910
zone        2.791759
```

That was ordered alphabetically. Let's try from lowest to highest weight:

```python
tfidf_data.sort_values(by="Weight")
```

```output
                 Weight
unaccountable  1.518794
nest           1.518794
needless       1.518794
hundred        1.518794
hunger         1.518794
...                 ...
incurably      2.945910
indecent       2.945910
indeed         2.945910
incantation    2.945910
gentlest       2.945910
```
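To see the other end of the scale (the rarest words, which carry the highest weights), the same call can be reversed; a small sketch, assuming the `tfidf_data` DataFrame from above:

```python
# Rarest words first: sort descending instead of ascending.
tfidf_data.sort_values(by="Weight", ascending=False).head()
```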

::::::::::::::::::::::::::::::::::::::::: callout