
Commit f33cc30

Code files added
1 parent 3e04089 commit f33cc30

13 files changed, +1474 -0 lines changed
@@ -0,0 +1,95 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"# Hola Recurrent Neural Nets!\n",
8+
"\n",
9+
"\n",
10+
"<center> The sun rises in the ____. </center>\n",
11+
"\n",
12+
"\n",
13+
"If we were asked to predict the blank in the above sentence, we might probably predict as\n",
14+
"'east'. How did we predict that the word 'east' would be the right word? Because we read\n",
15+
"the whole sentence, understood the context and predicted that the word 'east' would be an\n",
16+
"appropriate word here.\n",
17+
"\n",
18+
"If we use a feedforward neural network to predict the blank, it would not predict the right\n",
19+
"word. This is due to the fact that in feedforward network, each input is independent of each\n",
20+
"other and they make predictions only based on the current input and they don't remember\n",
21+
"previous inputs.\n",
22+
"\n",
23+
"Thus, input to the network will be just the word before the blank which is, 'the'. With this\n",
24+
"word alone as an input, our network cannot predict the correct word because it doesn't\n",
25+
"know the context of the sentence which means - it doesn't know the previous set of words\n",
26+
"to understand the context of the sentence and to predict an appropriate next word.\n",
27+
"\n",
28+
"Here is where we use Recurrent Neural networks. It predicts output not only based on the\n",
29+
"current input but also on the previous hidden state. But why does it have to predict the\n",
30+
"output based on the current input and the previous hidden state and why it can't just use\n",
31+
"the current input and the previous input?\n",
32+
"\n",
33+
"Because the previous input will store information just about the previous word while the\n",
34+
"previous hidden state captures the contextual information about all the words in the\n",
35+
"sentence that the network has seen so far. Basically, the previous hidden state acts like a\n",
36+
"memory and it captures the context of the sentence. With this context and the current input,\n",
37+
"we can predict the relevant word.\n",
38+
"\n",
39+
"For instance, let us take the same sentence, The sun rises in the ____. As shown in the\n",
40+
"following figure, we first pass the word 'the' as an input and then pass the next word 'sun'\n",
41+
"as input but along with this we also pass the previous hidden state $h_0$. So every time, we\n",
42+
"pass the input word - we also pass a previous hidden state.\n",
43+
"\n",
44+
"In the final step, we pass the word 'the' and also the previous hidden state $h_3$ which\n",
45+
"captures the contextual information about the sequence of words that the network has seen\n",
46+
"\n",
47+
"so far. Thus, $h_3$ acts as memory and stores information about all the previous words that\n",
48+
"the network has seen. With $h_3$ and the current input word 'the', we can now predict the\n",
49+
"relevant next word. \n",
50+
"\n",
51+
"![image](images/1.png)\n",
52+
"\n",
53+
"_In a nutshell, RNN uses previous hidden state as memory which captures and stores the\n",
54+
"information (inputs) that the network has seen so far._\n",
55+
"\n",
56+
"\n",
57+
"\n",
58+
"RNN is widely applied for use cases that involves sequential data like time series, text,\n",
59+
"audio, speech, video, weather and many more. It has been greatly used in various Natural\n",
60+
"Language Processing (NLP) tasks such as language translation, sentiment analysis, text\n",
61+
"generation and so on. \n",
62+
"\n",
63+
"\n"
64+
]
65+
},
66+
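  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Below is a minimal NumPy sketch (not part of the original text) of the idea described above, using the standard RNN recurrence $h_t = \\tanh(U x_t + W h_{t-1})$. The toy vocabulary, one-hot encoding, hidden size and weight scales are illustrative assumptions._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "# Toy vocabulary and one-hot encoding (illustrative assumption)\n",
    "vocab = ['the', 'sun', 'rises', 'in', 'east']\n",
    "word_to_idx = {w: i for i, w in enumerate(vocab)}\n",
    "\n",
    "vocab_size = len(vocab)   # input dimension\n",
    "hidden_size = 4           # size of the hidden state (arbitrary choice)\n",
    "\n",
    "# U maps the current input to the hidden state,\n",
    "# W maps the previous hidden state to the current one\n",
    "U = np.random.randn(hidden_size, vocab_size) * 0.01\n",
    "W = np.random.randn(hidden_size, hidden_size) * 0.01\n",
    "\n",
    "def one_hot(word):\n",
    "    x = np.zeros((vocab_size, 1))\n",
    "    x[word_to_idx[word]] = 1.0\n",
    "    return x\n",
    "\n",
    "# Forward pass over the sentence: every step combines the current word\n",
    "# with the previous hidden state, which acts as the network's memory\n",
    "h = np.zeros((hidden_size, 1))  # initial hidden state (all zeros)\n",
    "for t, word in enumerate(['the', 'sun', 'rises', 'in', 'the']):\n",
    "    h = np.tanh(np.dot(U, one_hot(word)) + np.dot(W, h))\n",
    "    print('step %d, input %-5s -> hidden state %s' % (t, word, h.ravel()))"
   ]
  },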
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next section, we will learn about the difference between feedforward networks and RNNs."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@@ -0,0 +1,122 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
"# Vanishing and Exploding Gradients\n",
8+
"\n",
9+
"\n",
10+
"We just learned how backpropagation through time works and we saw how the gradient of\n",
11+
"loss can be computed with respect to all the weights in RNN. But here we encounter a\n",
12+
"problem called the vanishing and exploding gradients.\n",
13+
"While computing derivatives of loss with respect to $W$ and $U$ , we saw that we have to\n",
14+
"traverse all the way back to the first hidden state, as each hidden state at a time $t$ is\n",
15+
"dependent on its previous hidden state at a time $t-1$ .\n",
16+
"\n",
17+
"\n",
18+
"For instance, say we compute the gradient of loss $L_2$ with respect to $W$:\n",
19+
"\n",
20+
"$$ \\frac{\\partial L_{2}}{\\partial W}=\\frac{\\partial L_{2}}{\\partial y_{2}} \\frac{\\partial y_{2}}{\\partial h_{2}} \\frac{\\partial h_{2}}{\\partial W}$$"
21+
]
22+
},
23+
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you look at the term $\\frac{\\partial h_{2}}{\\partial W}$ in the above equation, we can't calculate the derivative\n",
    "of $h_2$ with respect to $W$ directly, because $h_{2}=\\tanh \\left(U x_{2}+W h_{1}\\right)$ is a\n",
    "function that depends on both $h_1$ and $W$. So we need to calculate the derivative with respect\n",
    "to $h_1$ as well. But even $h_{1}=\\tanh \\left(U x_{1}+W h_{0}\\right)$ is a function that depends on $h_0$\n",
    "and $W$. Thus we need to calculate the derivative with respect to $h_0$ as well.\n",
    "\n",
    "As shown in the following figure, to compute the derivative of $L_2$ we need to go back all\n",
    "the way to the initial hidden state $h_0$, as each hidden state depends on its previous\n",
    "hidden state:\n",
    "\n",
    "![image](images/7.png)\n",
    "\n",
    "So to compute the gradient of any loss $L_j$ we need to traverse all the way back to the initial hidden state\n",
    "$h_0$, as each hidden state depends on its previous hidden state. Say we have a deep\n",
    "recurrent network unrolled over 50 time steps. To compute the loss $L_{50}$ we need to traverse all the way\n",
    "back to $h_0$, as shown in the figure below.\n",
    "\n",
    "![image](images/8.png)"
   ]
  },
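  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_One way to make this dependency explicit is to sum over all the earlier hidden states (a sketch of the standard BPTT expansion; the equation image in a later cell expresses the same idea):_\n",
    "\n",
    "$$\\frac{\\partial L_{2}}{\\partial W}=\\sum_{k=0}^{2} \\frac{\\partial L_{2}}{\\partial y_{2}} \\frac{\\partial y_{2}}{\\partial h_{2}} \\frac{\\partial h_{2}}{\\partial h_{k}} \\frac{\\partial h_{k}}{\\partial W}, \\qquad \\frac{\\partial h_{2}}{\\partial h_{k}}=\\prod_{j=k+1}^{2} \\frac{\\partial h_{j}}{\\partial h_{j-1}}$$\n",
    "\n",
    "_Here $\\frac{\\partial h_{k}}{\\partial W}$ denotes the immediate derivative, with $h_{k-1}$ treated as a constant. Each factor $\\frac{\\partial h_{j}}{\\partial h_{j-1}}$ involves $W$ and the derivative of tanh, which is exactly the repeated product discussed next._"
   ]
  },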
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what exactly is the problem here? While backpropagating towards the initial hidden\n",
    "state we lose information, and the RNN will not backpropagate perfectly.\n",
    "\n",
    "Remember that $h_{t}=\\tanh \\left(U x_{t}+W h_{t-1}\\right)$; every time we move backward, we\n",
    "compute the derivative of $h_t$. The derivative of tanh is bounded by 1 (it lies between 0 and 1). We know that\n",
    "multiplying any two values between 0 and 1 gives us an even smaller number. We\n",
    "usually initialize the weights of the network to small values. When we multiply the\n",
    "derivatives and weights while backpropagating, we are essentially multiplying small\n",
    "numbers over and over.\n",
    "\n",
    "So when we multiply these small numbers at every step while moving backward, our gradient\n",
    "becomes infinitesimally small, effectively shrinking towards zero so that the network stops learning.\n",
    "This is called the __vanishing gradient problem__.\n",
    "\n",
    "Recall the equation for the gradient of the loss with respect to $W$ that we saw in the previous section:\n",
    "\n",
    "![image](images/9.png)\n",
    "\n",
    "As you can see, we are multiplying the weights and the derivative of the tanh function at every\n",
    "time step. Repeated multiplication of these two leads to a very small number and causes the\n",
    "vanishing gradients problem.\n"
   ]
  },
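  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_The following cell is a small illustrative NumPy experiment, not part of the original text: with small weights, repeatedly multiplying the backpropagated gradient by $W^{T}$ and by the tanh derivative $(1 - h_{t}^{2})$ at every time step shrinks its norm towards zero. The hidden size, number of time steps and weight scale are arbitrary assumptions._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "hidden_size = 4   # illustrative hidden state size\n",
    "steps = 50        # number of time steps to backpropagate through\n",
    "\n",
    "# Small weight initialization, as discussed above (illustrative scale)\n",
    "U = np.random.randn(hidden_size, 1) * 0.01\n",
    "W = np.random.randn(hidden_size, hidden_size) * 0.01\n",
    "\n",
    "# Forward pass on random inputs so that we have the hidden states h_t\n",
    "hs = []\n",
    "h = np.zeros((hidden_size, 1))\n",
    "for t in range(steps):\n",
    "    x = np.random.randn(1, 1)\n",
    "    h = np.tanh(np.dot(U, x) + np.dot(W, h))\n",
    "    hs.append(h)\n",
    "\n",
    "# Backpropagate a gradient of ones through time: at every step it is\n",
    "# multiplied by the tanh derivative (1 - h_t^2) and by W^T\n",
    "grad = np.ones((hidden_size, 1))\n",
    "for t in reversed(range(steps)):\n",
    "    grad = np.dot(W.T, grad * (1.0 - hs[t] ** 2))\n",
    "    if t % 10 == 0:\n",
    "        print('after backpropagating through step %2d: gradient norm = %.3e' % (t, np.linalg.norm(grad)))"
   ]
  },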
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The vanishing gradients problem occurs not only in RNNs but also in other deep networks\n",
    "where we use sigmoid or tanh as the activation function. To overcome this, we can use\n",
    "ReLU as the activation function instead of tanh. However, we also have a variant of the RNN called the LSTM network, which can handle the\n",
    "vanishing gradient problem effectively. We will see how it works in the next chapter.\n",
    "\n",
    "Similarly, when we initialize the weights of the network to very large values, the gradients\n",
    "become very large at every step. While backpropagating, we multiply these large\n",
    "numbers together at every time step, and the gradient blows up towards infinity. This is called the\n",
    "__exploding gradient problem__."
   ]
  },
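  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "_Similarly, the sketch below (again an illustrative assumption, not part of the original text) shows how repeated multiplication by a large recurrent weight matrix makes the gradient norm blow up; for clarity it isolates only the $W^{T}$ factor of the backpropagated product and omits the tanh derivative._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "hidden_size = 4\n",
    "steps = 50\n",
    "\n",
    "# Large recurrent weights (illustrative scale)\n",
    "W_large = np.random.randn(hidden_size, hidden_size) * 5.0\n",
    "\n",
    "# Repeatedly multiply the gradient by W^T alone: with large weights the\n",
    "# norm grows at every step and heads towards overflow\n",
    "grad = np.ones((hidden_size, 1))\n",
    "for t in range(1, steps + 1):\n",
    "    grad = np.dot(W_large.T, grad)\n",
    "    if t % 10 == 0:\n",
    "        print('after %2d multiplications: gradient norm = %.3e' % (t, np.linalg.norm(grad)))"
   ]
  },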
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next section, we will learn how we can use gradient clipping to avoid the exploding gradients problem."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda root]",
   "language": "python",
   "name": "conda-root-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
