Commit 33e5884

Update 07-wordEmbed_intro.md
1 parent 1062443 commit 33e5884

1 file changed: +1 -1 lines changed

_episodes/07-wordEmbed_intro.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ wv = api.load('word2vec-google-news-300') # takes 3-10 minutes to load
 So far, we’ve seen how word counts (bag of words), TF-IDF, and LSA can help us embed a document or set of documents into useful vector spaces that allow us to gain insights from text data. Let's review the embeddings covered thus far...
 * **Word count embeddings**: Word count embeddings are a simple yet powerful method that represent text data as a sparse vector where each dimension corresponds to a unique word in the vocabulary, and the value in each dimension indicates the frequency of that word in the document. This approach disregards word order and context, treating each document as an unordered collection of words or tokens.

-* **TF-IDF embeddings:** Term Frequency Inverse Document Frequency (TF-IDF) determines the mathematical significance of words across multiple documents. It's embedding is based on token/word frequency within each document and relative to how many documents a token appears in.
+* **TF-IDF embeddings:** Term Frequency Inverse Document Frequency (TF-IDF) is a fancier word-count method. It emphasizes words that are both frequent within a specific document *and* rare across the entire corpus.

 * **LSA embeddings:** Latent Semantic Analysis (LSA) is used to find the hidden topics represented by a group of documents. It involves running singular-value decomposition (SVD) on a document-term matrix (typically the TF-IDF matrix), producing a vector representation of each document. This vector scores each document's representation in different topic/concept areas which are derived based on word co-occurences (e.g., 45% topic A, 35% topic B, and 20% topic C). Importantly, LSA is considered a *bag of words* method since the order of words in a document is not considered.
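
To make the reworded TF-IDF bullet concrete, here is a minimal sketch that builds word-count vectors and then reweights them with TF-IDF, using only the Python standard library. The toy corpus, the whitespace tokenizer, and the helper name `count_vector` are hypothetical and not part of the lesson file. The smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant, chosen here as an assumption; the lesson may rely on a library with different defaults.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Naive whitespace tokenization; real pipelines usually do more preprocessing
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def count_vector(tokens):
    """Word-count embedding: one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


count_vectors = [count_vector(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each word
n_docs = len(tokenized)
doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}

# Smoothed inverse document frequency (assumed variant, see lead-in)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

# TF-IDF embedding: raw count scaled by how rare the word is in the corpus
tfidf_vectors = [
    [count * idf[word] for count, word in zip(vec, vocab)]
    for vec in count_vectors
]

i_mat, i_sat = vocab.index("mat"), vocab.index("sat")
print(count_vectors[0][i_mat], count_vectors[0][i_sat])  # 1 1
print(round(tfidf_vectors[0][i_mat], 3), round(tfidf_vectors[0][i_sat], 3))  # 1.693 1.288
```

In the first document, "mat" and "sat" both occur once, so their word counts are identical. "mat" appears in only one document while "sat" appears in two, so TF-IDF gives "mat" the larger weight, which is the "frequent in this document and rare across the corpus" behavior the new wording describes.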
