|
153 | 153 | "The IMDB movie review set can be downloaded from [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/).\n",
|
154 | 154 | "After downloading the dataset, decompress the files.\n",
|
155 | 155 | "\n",
|
156 |
| - "A) If you are working with Linux or MacOS X, open a new terminal windowm `cd` into the download directory and execute \n", |
| 156 | + "A) If you are working with Linux or MacOS X, open a new terminal window, `cd` into the download directory and execute \n", |
157 | 157 | "\n",
|
158 | 158 | "`tar -zxf aclImdb_v1.tar.gz`\n",
|
159 | 159 | "\n",
|
|
522 | 522 | "cell_type": "markdown",
|
523 | 523 | "metadata": {},
|
524 | 524 | "source": [
|
525 |
| - "As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words that are mapped to integer indices. Next let us print the feature vectors that we just created:" |
| 525 | + "As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary, which maps the unique words to integer indices. Next let us print the feature vectors that we just created:" |
526 | 526 | ]
|
527 | 527 | },
|
528 | 528 | {
|
529 | 529 | "cell_type": "markdown",
|
530 | 530 | "metadata": {},
|
531 | 531 | "source": [
|
532 |
| - "Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the rst feature at index position 0 resembles the count of the word and, which only occurs in the last document, and the word is at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: *tf (t,d)*—the number of times a term t occurs in a document *d*." |
| 532 | + "Each index position in the feature vectors shown here corresponds to the integer values that are stored as dictionary items in the CountVectorizer vocabulary. For example, the first feature at index position 0 resembles the count of the word \"and\", which only occurs in the last document, and the word \"is\" at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those values in the feature vectors are also called the raw term frequencies: *tf (t,d)*—the number of times a term t occurs in a document *d*." |
533 | 533 | ]
|
534 | 534 | },
|
535 | 535 | {
|
|
578 | 578 | "cell_type": "markdown",
|
579 | 579 | "metadata": {},
|
580 | 580 | "source": [
|
581 |
| - "When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweight those frequently occurring words in the feature vectors. The tf-idf can be de ned as the product of the term frequency and the inverse document frequency:\n", |
| 581 | + "When we are analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called term frequency-inverse document frequency (tf-idf) that can be used to downweigh those frequently occurring words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the inverse document frequency:\n", |
582 | 582 | "\n",
|
583 | 583 | "$$\\text{tf-idf}(t,d)=\\text{tf (t,d)}\\times \\text{idf}(t,d)$$\n",
|
584 | 584 | "\n",
|
|
621 | 621 | "cell_type": "markdown",
|
622 | 622 | "metadata": {},
|
623 | 623 | "source": [
|
624 |
| - "As we saw in the previous subsection, the word is had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word is is\n", |
625 |
| - "now associated with a relatively small tf-idf (0.45) in document 3 since it is\n", |
626 |
| - "also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.\n" |
| 624 | + "As we saw in the previous subsection, the word \"is\" had the largest term frequency in the 3rd document, being the most frequently occurring word. However, after transforming the same feature vector into tf-idfs, we see that the word \"is\" is now associated with a relatively small tf-idf (0.45) in document 3 since it is also contained in documents 1 and 2 and thus is unlikely to contain any useful, discriminatory information.\n" |
627 | 625 | ]
|
628 | 626 | },
|
629 | 627 | {
|
630 | 628 | "cell_type": "markdown",
|
631 | 629 | "metadata": {},
|
632 | 630 | "source": [
|
633 |
| - "However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we de ned earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:" |
| 631 | + "However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd have noticed that the `TfidfTransformer` calculates the tf-idfs slightly differently compared to the standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were implemented in scikit-learn are:" |
634 | 632 | ]
|
635 | 633 | },
|
636 | 634 | {
|
|
649 | 647 | "\n",
|
650 | 648 | "$$v_{\\text{norm}} = \\frac{v}{||v||_2} = \\frac{v}{\\sqrt{v_{1}^{2} + v_{2}^{2} + \\dots + v_{n}^{2}}} = \\frac{v}{\\big (\\sum_{i=1}^{n} v_{i}^{2}\\big)^\\frac{1}{2}}$$\n",
|
651 | 649 | "\n",
|
652 |
| - "To make sure that we understand how TfidfTransformer works, let us walk\n", |
653 |
| - "through an example and calculate the tf-idf of the word is in the 3rd document.\n", |
| 650 | + "To make sure that we understand how `TfidfTransformer` works, let us walk through an example and calculate the tf-idf of the word \"is\" in the 3rd document.\n", |
654 | 651 | "\n",
|
655 |
| - "The word is has a term frequency of 3 (tf = 3) in document 3 ($d_3$), and the document frequency of this term is 3 since the term is occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:\n", |
| 652 | + "The word \"is\" has a term frequency of 3 (tf = 3) in document 3 ($d_3$), and the document frequency of this term is 3 since the term \"is\" occurs in all three documents (df = 3). Thus, we can calculate the idf as follows:\n", |
656 | 653 | "\n",
|
657 | 654 | "$$\\text{idf}(\"is\", d_3) = log \\frac{1+3}{1+3} = 0$$\n",
|
658 | 655 | "\n",
|
|
686 | 683 | "cell_type": "markdown",
|
687 | 684 | "metadata": {},
|
688 | 685 | "source": [
|
689 |
| - "If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vectors: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the TfidfTransformer that we used previously. The nal step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:" |
| 686 | + "If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf vectors: [3.39, 3.0, 3.39, 1.29, 1.29, 1.29, 2.0 , 1.69, 1.29]. However, we notice that the values in this feature vector are different from the values that we obtained from the `TfidfTransformer` that we used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization, which can be applied as follows:" |
690 | 687 | ]
|
691 | 688 | },
|
692 | 689 | {
|
|
1286 | 1283 | "cell_type": "markdown",
|
1287 | 1284 | "metadata": {},
|
1288 | 1285 | "source": [
|
1289 |
| - "As we can see, the result above is consistent with the average score computed the `cross_val_score`." |
| 1286 | + "As we can see, the result above is consistent with the average score computed with `cross_val_score`." |
1290 | 1287 | ]
|
1291 | 1288 | },
|
1292 | 1289 | {
|
|
1841 | 1838 | "name": "python",
|
1842 | 1839 | "nbconvert_exporter": "python",
|
1843 | 1840 | "pygments_lexer": "ipython3",
|
1844 |
| - "version": "3.9.7" |
| 1841 | + "version": "3.8.12" |
1845 | 1842 | },
|
1846 | 1843 | "toc": {
|
1847 | 1844 | "nav_menu": {},
|
|