Update 07-wordEmbed_intro.md

qualiaMachine · web-flow · commit 33e58846e3aa · 2024-04-15T15:35:09.000-05:00
diff --git a/_episodes/07-wordEmbed_intro.md b/_episodes/07-wordEmbed_intro.md
@@ -37,7 +37,7 @@ wv = api.load('word2vec-google-news-300') # takes 3-10 minutes to load
 So far, we’ve seen how word counts (bag of words), TF-IDF, and LSA can help us embed a document or set of documents into useful vector spaces that allow us to gain insights from text data. Let's review the embeddings covered thus far...
 * **Word count embeddings**: Word count embeddings are a simple yet powerful method that represents text data as a sparse vector where each dimension corresponds to a unique word in the vocabulary, and the value in each dimension indicates the frequency of that word in the document. This approach disregards word order and context, treating each document as an unordered collection of words or tokens.
   
-* **TF-IDF embeddings:** Term Frequency Inverse Document Frequency (TF-IDF) determines the mathematical significance of words across multiple documents. It's embedding is based on token/word frequency within each document and relative to how many documents a token appears in. 
+* **TF-IDF embeddings:** Term Frequency Inverse Document Frequency (TF-IDF) is a fancier word-count method. It emphasizes words that are both frequent within a specific document *and* rare across the entire corpus.
 
 * **LSA embeddings:** Latent Semantic Analysis (LSA) is used to find the hidden topics represented by a group of documents. It involves running singular-value decomposition (SVD) on a document-term matrix (typically the TF-IDF matrix), producing a vector representation of each document. This vector scores each document's representation in different topic/concept areas which are derived based on word co-occurrences (e.g., 45% topic A, 35% topic B, and 20% topic C). Importantly, LSA is considered a *bag of words* method since the order of words in a document is not considered.
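
To make the difference between word counts and TF-IDF concrete, here is a minimal sketch using only the Python standard library. The three-sentence corpus is invented for illustration, and the IDF used is the plain textbook form `log(N / df)`, which is an assumption: libraries such as scikit-learn apply a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Whitespace tokenization keeps the sketch short; real text needs more care.
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Word-count embedding: one dimension per vocabulary word.
count_vectors = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# Document frequency: how many documents contain each word.
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Textbook IDF (assumption: unsmoothed log(N / df)).
idf = {w: math.log(n_docs / doc_freq[w]) for w in vocab}

# TF-IDF embedding: raw count scaled by how rare the word is in the corpus.
tfidf_vectors = [
    [count * idf[w] for count, w in zip(vec, vocab)]
    for vec in count_vectors
]

# Inspect the first document as a word -> weight mapping.
first_doc = dict(zip(vocab, tfidf_vectors[0]))
for word, weight in first_doc.items():
    print(f"{word:>7}: {weight:.3f}")
```

In the first document, "the" has the highest raw count but appears in every document, so its TF-IDF weight drops to zero, while "mat" (which appears in only one document) receives the largest weight.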