Commit d8db24c

Fix equation for TF calculation. Should divide by number of words in document -- not the corpus.
1 parent 33e5884 commit d8db24c

1 file changed: +4 -4 lines changed

_episodes/05-tf-idf-documentEmbeddings.md

Lines changed: 4 additions & 4 deletions
@@ -57,16 +57,16 @@ One method for constructing more advanced word embeddings is a model called TF-I
 
 TF-IDF stands for term frequency-inverse document frequency. The model consists of two parts: term frequency and inverse document frequency. We multiply the two terms to get the TF-IDF value.
 
-Term frequency is a measure how frequently a term occurs in a document. The simplest way to calculate term frequency is by simply adding up the number of times a term occurs in a document, and dividing by the total word count in the corpus.
+**Term frequency(t,d)** is a measure of how frequently a term, *t*, occurs in a document, *d*. The simplest way to calculate term frequency is to count the number of times the term occurs in the document and divide by the total word count of that document.
 
-Inverse document frequency measures a term's importance. Document frequency is the number of documents a term occurs in, so inverse document frequency gives higher scores to words that occur in fewer documents.
+**Inverse document frequency** measures a term's importance. Document frequency is the number of documents a term occurs in, so inverse document frequency gives higher scores to words that occur in fewer documents.
 This is represented by the equation:
 
-IDF(x) = ln[(N+1) / (DF(x)+1)]
+IDF(t) = ln[(N+1) / (DF(t)+1)]
 
 where...
 * N represents the total number of documents in the corpus
-* DF(x) represents document frequency for a particular term/word, x. This is the number of documents a term occurs in.
+* DF(t) represents document frequency for a particular term/word, t. This is the number of documents a term occurs in.
 
 The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(t).
 
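
To make the corrected definitions concrete, here is a minimal Python sketch of the two quantities and their product. It is an illustration under stated assumptions, not code from the lesson: the function names, the whitespace tokenization, and the three-document toy corpus are all hypothetical. Term frequency divides by the word count of the document (the point of this commit), and IDF follows the smoothed formula ln[(N+1) / (DF(t)+1)] from the hunk above.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """TF(t, d): occurrences of `term` in the document / total words in that document."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)


def inverse_document_frequency(term, corpus_tokens):
    """IDF(t) = ln[(N + 1) / (DF(t) + 1)], where N is the number of documents."""
    n_documents = len(corpus_tokens)
    # DF(t): number of documents that contain the term at least once.
    document_frequency = sum(1 for doc in corpus_tokens if term in doc)
    return math.log((n_documents + 1) / (document_frequency + 1))


def tf_idf(term, document_tokens, corpus_tokens):
    """TF-IDF is the product of the two parts."""
    return term_frequency(term, document_tokens) * inverse_document_frequency(
        term, corpus_tokens
    )


# Hypothetical toy corpus, tokenized by whitespace purely for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

for word in ("the", "cat", "mat"):
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

With this formula, a word that appears in every document (here "the") has DF(t) = N, so its IDF is ln(1) = 0 and its TF-IDF is zero regardless of how often it occurs in the document.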
