_episodes/05-tf-idf-documentEmbeddings.md
Lines changed: 5 additions & 5 deletions
@@ -57,18 +57,18 @@ One method for constructing more advanced word embeddings is a model called TF-I
TF-IDF stands for term frequency-inverse document frequency and can be calculated for each document, *d*, and term, *t*, in a corpus. The calculation consists of two parts: term frequency and inverse document frequency. We multiply the two terms to get the TF-IDF value.
-**Term frequency(t,d)** is a measure for how frequently a term, *t*, occurs in a document, *d*. The simplest way to calculate term frequency is by simply adding up the number of times a term occurs in a document, and dividing by the total word count in the document.
+**Term frequency(*t*,*d*)** is a measure for how frequently a term, *t*, occurs in a document, *d*. The simplest way to calculate term frequency is by simply adding up the number of times a term occurs in a document, and dividing by the total word count in the document.
**Inverse document frequency** measures a term's importance. Document frequency is the number of documents a term occurs in, so inverse document frequency gives higher scores to words that occur in fewer documents.
This is represented by the equation:
-IDF(t) = ln[(N+1) / (DF(T)+1)]
+IDF(*t*) = ln[(*N*+1) / (DF(*t*)+1)]
where...
-*N represents the total number of documents in the corpus
-* DF(t) represents document frequency for a particular term/word, t. This is the number of documents a term occurs in.
+**N* represents the total number of documents in the corpus
+* DF(*t*) represents document frequency for a particular term/word, *t*. This is the number of documents a term occurs in.
-The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(x).
+The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(*t*).
We can also embed documents in vector space using TF-IDF scores rather than simple word counts. This also weakens the impact of stop words: because they occur in almost every document, they receive very low scores.
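
For reviewers who want to sanity-check the formulas in this hunk, the following is a minimal sketch of the calculation, not the lesson's own code. It assumes whitespace tokenization and a made-up three-document corpus; the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`, `embed`) are hypothetical and chosen only for illustration. It follows the equation IDF(*t*) = ln[(*N*+1) / (DF(*t*)+1)] exactly as written in the episode.

```python
import math


def term_frequency(term, doc_tokens):
    """Occurrences of `term` in the document divided by the document's word count."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t) = ln[(N + 1) / (DF(t) + 1)], following the episode's equation."""
    n_docs = len(corpus_tokens)  # N: total number of documents
    doc_freq = sum(1 for doc in corpus_tokens if term in doc)  # DF(t)
    return math.log((n_docs + 1) / (doc_freq + 1))


def tf_idf(term, doc_tokens, corpus_tokens):
    """Multiply the two parts to get the TF-IDF value."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)


def embed(doc_tokens, corpus_tokens, vocabulary):
    """Represent a document as a vector of TF-IDF scores, one per vocabulary term."""
    return [tf_idf(term, doc_tokens, corpus_tokens) for term in vocabulary]


# Toy corpus (an assumption for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
vocabulary = sorted({token for doc in corpus for token in doc})
vectors = [embed(doc, corpus, vocabulary) for doc in corpus]

for term in ("the", "cat", "bird"):
    print(term, round(inverse_document_frequency(term, corpus), 3))
```

In this toy corpus "the" occurs in every document, so DF equals *N* and its IDF is ln(1) = 0, while "bird" occurs in only one document and gets the largest IDF of the three terms printed. That matches the claim that stop words are down-weighted in TF-IDF document vectors.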