Commit bf9f78c

format equation a bit more
1 parent 6482828 commit bf9f78c

1 file changed

+5
-5
lines changed

_episodes/05-tf-idf-documentEmbeddings.md

Lines changed: 5 additions & 5 deletions
@@ -57,18 +57,18 @@ One method for constructing more advanced word embeddings is a model called TF-I

 TF-IDF stands for term frequency-inverse document frequency and can be calculated for each document, *d*, and term, *t*, in a corpus. The calculation consists of two parts: term frequency and inverse document frequency. We multiply the two terms to get the TF-IDF value.

-**Term frequency(t,d)** is a measure for how frequently a term, *t*, occurs in a document, *d*. The simplest way to calculate term frequency is by simply adding up the number of times a term occurs in a document, and dividing by the total word count in the document.
+**Term frequency(*t*,*d*)** is a measure for how frequently a term, *t*, occurs in a document, *d*. The simplest way to calculate term frequency is by simply adding up the number of times a term occurs in a document, and dividing by the total word count in the document.

 **Inverse document frequency** measures a term's importance. Document frequency is the number of documents, *N*, a term occurs in, so inverse document frequency gives higher scores to words that occur in fewer documents.
 This is represented by the equation:

-IDF(t) = ln[(N+1) / (DF(T)+1)]
+IDF(*t*) = ln[(*N*+1) / (DF(*t*)+1)]

 where...
-* N represents the total number of documents in the corpus
-* DF(t) represents document frequency for a particular term/word, t. This is the number of documents a term occurs in.
+* *N* represents the total number of documents in the corpus
+* DF(*t*) represents document frequency for a particular term/word, *t*. This is the number of documents a term occurs in.

-The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(x).
+The key thing to understand is that words that occur in many documents produce smaller IDF values since the denominator grows with DF(*t*).

 We can also embed documents in vector space using TF-IDF scores rather than simple word counts. This also weakens the impact of stop-words, since due to their common nature, they have very low scores.
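
The calculation described in this hunk can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus, the whitespace tokenization, and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are hypothetical and do not come from the lesson. Here *N* is taken as the total number of documents and DF(*t*) as the number of documents containing *t*, following the bullet definitions in the diff.

```python
import math
from collections import Counter

def term_frequency(term, tokens):
    """Occurrences of `term` in one document, divided by that document's word count."""
    return Counter(tokens)[term] / len(tokens)

def inverse_document_frequency(term, corpus):
    """IDF(t) = ln[(N + 1) / (DF(t) + 1)], the form shown in the diff."""
    n_docs = len(corpus)                                      # N: total documents
    doc_freq = sum(1 for tokens in corpus if term in tokens)  # DF(t)
    return math.log((n_docs + 1) / (doc_freq + 1))

def tf_idf(term, tokens, corpus):
    """Product of the two parts for one term in one document."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus (assumption), tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

# Embed each document as a vector of TF-IDF scores over a shared vocabulary.
vocabulary = sorted({term for tokens in corpus for term in tokens})
doc_vectors = [[tf_idf(term, tokens, corpus) for term in vocabulary] for tokens in corpus]
```

In this toy corpus "the" occurs in all three documents, so its IDF is ln(4/4) = 0 and its TF-IDF score is zero in every document vector, which matches the remark about stop-words. Library implementations may use a different variant of the formula; scikit-learn's `TfidfVectorizer`, for example, adds 1 to the smoothed IDF by default, so its scores will not match this sketch exactly.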

0 commit comments