@@ -85,10 +85,27 @@ Earlier, we preprocessed our data to lemmatize each file in our corpus, then sav

Let's load our data back in to continue where we left off. First, we'll mount our Google Drive to get access to our data folder again.

+ ``` python
+ # Run this cell to mount your Google Drive.
+ from google.colab import drive
+ drive.mount('/content/drive')
+
+ # Show existing Colab notebooks and the helpers.py file
+ from os import listdir
+ wksp_dir = '/content/drive/My Drive/Colab Notebooks/text-analysis/code'
+ listdir(wksp_dir)
+
+ # Add the folder to Colab's path so we can import the helper functions
+ import sys
+ sys.path.insert(0, wksp_dir)
+ ```
+
+ Then, read the data.csv file we saved in the last episode.

``` python
from pandas import read_csv
data = read_csv("/content/drive/My Drive/Colab Notebooks/text-analysis/data/data.csv")
+ data.head()
```

#### TF-IDF Vectorizer
@@ -109,10 +126,6 @@ tfidf = vectorizer.fit_transform(list(data["Lemma_File"]))
print(tfidf.shape)
```

- ``` output
- (41, 9879)
- ```
-
Here, `tfidf.shape` shows us how many rows (books) and columns (words) are in our model.
::::::::::::::::::::::::::::::::::::::: challenge
@@ -145,20 +158,13 @@ Let's take a look at some of the words in our documents. Each of these represent
``` python
vectorizer.get_feature_names_out()[0:5]
```

- ``` output
- array(['15th', '1st', 'aback', 'abandonment', 'abase'], dtype=object)
- ```

What is the weight of those words?

``` python
print(vectorizer.idf_[0:5])  # weights for each token
```

- ``` output
- [2.79175947 2.94591015 2.25276297 2.25276297 2.43508453]
- ```
-

Let's show the weight for all the words:

``` python
@@ -167,41 +173,12 @@ tfidf_data = DataFrame(vectorizer.idf_, index=vectorizer.get_feature_names_out()
tfidf_data
```

- ``` output
-              Weight
- 15th         2.791759
- 1st          2.945910
- aback        2.252763
- abandonment  2.252763
- abase        2.435085
- ...               ...
- zealously    2.945910
- zenith       2.791759
- zest         2.791759
- zigzag       2.945910
- zone         2.791759
- ```
+ That was ordered alphabetically. Let's try from lowest to highest weight:

``` python
tfidf_data.sort_values(by="Weight")
```
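Low weights mark words that appear in many documents; high weights mark rarer words. If you only want the extremes rather than the full sorted table, pandas' `nsmallest`/`nlargest` work too (a sketch with made-up stand-in values, not the lesson's actual weights):

``` python
# Sketch: pulling out only the lowest- and highest-weighted tokens.
# The values below are stand-ins, not the real weights from the lesson.
from pandas import DataFrame

demo = DataFrame(
    {"Weight": [1.52, 2.95, 2.25, 2.79]},
    index=["hundred", "zigzag", "aback", "zenith"],
)
print(demo["Weight"].nsmallest(2))  # most widespread words (lowest IDF)
print(demo["Weight"].nlargest(2))   # rarest words (highest IDF)
```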

- That was ordered alphabetically. Let's try from lowest to heighest weight:
-
- ``` output
-                Weight
- unaccountable  1.518794
- nest           1.518794
- needless       1.518794
- hundred        1.518794
- hunger         1.518794
- ...                 ...
- incurably      2.945910
- indecent       2.945910
- indeed         2.945910
- incantation    2.945910
- gentlest       2.945910
- ```

::::::::::::::::::::::::::::::::::::::::: callout