PacktPublishing
diff --git a/Chapter04/4.01 Hola Recurrent Neural Networks.ipynb b/Chapter04/4.01 Hola Recurrent Neural Networks.ipynb
diff --git a/Chapter04/4.05 Vanishing and Exploding Gradients.ipynb b/Chapter04/4.05 Vanishing and Exploding Gradients.ipynb
@@ -0,0 +1,95 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hola Recurrent Neural Nets!\n",
+    "\n",
+    "\n",
+    "<center> The sun rises in the ____. </center>\n",
+    "\n",
+    "\n",
+    "If we were asked to fill in the blank in the above sentence, we would probably predict\n",
+    "'east'. How did we know that 'east' is the right word? We read the whole sentence,\n",
+    "understood the context, and predicted that 'east' would be an appropriate word here.\n",
+    "\n",
+    "If we use a feedforward neural network to predict the blank, it is unlikely to predict the\n",
+    "right word. In a feedforward network, each input is treated independently of the others:\n",
+    "the network makes its prediction based only on the current input and does not remember\n",
+    "previous inputs.\n",
+    "\n",
+    "Thus, the input to the network will be just the word before the blank, which is 'the'. With\n",
+    "this word alone as input, the network cannot predict the correct word because it has not\n",
+    "seen the preceding words and therefore does not know the context of the sentence.\n",
+    "\n",
+    "This is where we use recurrent neural networks (RNNs). An RNN predicts the output based not\n",
+    "only on the current input but also on the previous hidden state. But why does it use the\n",
+    "previous hidden state, and why can't it just use the current input and the previous input?\n",
+    "\n",
+    "The previous input stores information about the previous word only, while the previous\n",
+    "hidden state captures contextual information about all the words in the sentence that the\n",
+    "network has seen so far. In other words, the previous hidden state acts like a memory that\n",
+    "captures the context of the sentence. With this context and the current input, we can\n",
+    "predict the relevant word.\n",
+    "\n",
+    "For instance, let us take the same sentence, 'The sun rises in the ____'. As shown in the\n",
+    "following figure, we first pass the word 'the' as input. We then pass the next word, 'sun',\n",
+    "as input, and along with it we also pass the previous hidden state $h_0$. So every time we\n",
+    "pass an input word, we also pass the previous hidden state.\n",
+    "\n",
+    "In the final step, we pass the word 'the' along with the previous hidden state $h_3$, which\n",
+    "captures contextual information about the sequence of words that the network has seen so\n",
+    "far. Thus, $h_3$ acts as memory and stores information about all the previous words that\n",
+    "the network has seen. With $h_3$ and the current input word 'the', we can now predict the\n",
+    "relevant next word.\n",
+    "\n",
+    "![image](images/1.png)\n",
+    "\n",
+    "_In a nutshell, an RNN uses the previous hidden state as memory, which captures and stores\n",
+    "the information (inputs) that the network has seen so far._\n",
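+    "\n",
+    "As a compact way to read the statement above, the update can be sketched with the recurrence\n",
+    "used in section 4.05 of this chapter, where $U$ and $W$ are the network weights. The word\n",
+    "index is an assumption made here to match $h_0, \\ldots, h_3$ in the text: $x_t$ is the $t$-th\n",
+    "word, counting from 0, so $x_4$ is the second 'the'.\n",
+    "\n",
+    "$$h_{t}=\\tanh \\left(U x_{t}+W h_{t-1}\\right), \\qquad h_{4}=\\tanh \\left(U x_{4}+W h_{3}\\right)$$\n",
+    "\n",
+    "The prediction for the blank is then computed from $h_4$, which depends on every earlier\n",
+    "word only through $h_3$.\n",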
+    "\n",
+    "\n",
+    "\n",
+    "RNNs are widely applied to use cases that involve sequential data such as time series, text,\n",
+    "audio, speech, video, and weather data. They are used extensively in natural language\n",
+    "processing (NLP) tasks such as language translation, sentiment analysis, and text\n",
+    "generation.\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the next section, we will learn about the difference between feedforward networks and RNNs."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [default]",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
@@ -0,0 +1,122 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Vanishing and Exploding Gradients\n",
+    "\n",
+    "\n",
+    "We just learned how backpropagation through time works and saw how the gradient of the\n",
+    "loss can be computed with respect to all the weights in an RNN. But here we encounter a\n",
+    "problem known as vanishing and exploding gradients.\n",
+    "While computing the derivatives of the loss with respect to $W$ and $U$, we saw that we have\n",
+    "to traverse all the way back to the first hidden state, as each hidden state at time $t$\n",
+    "depends on the previous hidden state at time $t-1$.\n",
+    "\n",
+    "\n",
+    "For instance, say we compute the gradient of the loss $L_2$ with respect to $W$:\n",
+    "\n",
+    "$$ \\frac{\\partial L_{2}}{\\partial W}=\\frac{\\partial L_{2}}{\\partial y_{2}} \\frac{\\partial y_{2}}{\\partial h_{2}} \\frac{\\partial h_{2}}{\\partial W}$$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you look at the term $\\frac{\\partial h_{2}}{\\partial W}$ in the above equation, we can't calculate the\n",
+    "derivative of $h_2$ with respect to $W$ directly, because $h_{2}=\\tanh \\left(U x_{2}+W h_{1}\\right)$ is a\n",
+    "function that depends on both $h_1$ and $W$. So we need to calculate the derivative with respect\n",
+    "to $h_1$ as well. But $h_{1}=\\tanh \\left(U x_{1}+W h_{0}\\right)$ is itself a function that depends on $h_0$\n",
+    "and $W$. Thus we need to calculate the derivative with respect to $h_0$ as well.\n",
+    " \n",
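+    "To make this dependency concrete, here is a worked sketch for a simplified scalar case. The\n",
+    "simplification is an assumption made only for illustration: $U$, $W$ and the hidden states are\n",
+    "treated as single numbers, and $h_0$ is a fixed initial value. Applying the chain rule to the\n",
+    "definitions of $h_2$ and $h_1$, and using the fact that the derivative of $\\tanh(z)$ is\n",
+    "$1-\\tanh^{2}(z)$, gives:\n",
+    "\n",
+    "$$\\frac{d h_{2}}{d W}=\\left(1-h_{2}^{2}\\right)\\left(h_{1}+W \\frac{d h_{1}}{d W}\\right), \\qquad \\frac{d h_{1}}{d W}=\\left(1-h_{1}^{2}\\right) h_{0}$$\n",
+    "\n",
+    "The first term inside the bracket is the direct dependence of $h_2$ on $W$. The second term\n",
+    "reaches $W$ through $h_1$, which is why the computation has to go back to $h_0$.\n",
+    "\n",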
+    " \n",
+    "As shown in the following figure, to compute the derivative of $L_2$ we need to go all the\n",
+    "way back to the initial hidden state $h_0$, as each hidden state depends on its previous\n",
+    "hidden state:\n",
+    "\n",
+    "![image](images/7.png)\n",
+    "\n",
+    "So to compute the gradient of any loss $L_j$, we need to traverse all the way back to the\n",
+    "initial hidden state $h_0$, as each hidden state depends on its previous hidden state. Say we\n",
+    "have a recurrent network unrolled over 50 time steps, which behaves like a deep network with\n",
+    "50 layers. To compute the gradient of the loss $L_{50}$, we need to traverse all the way back\n",
+    "to $h_0$, as shown in the figure below.\n",
+    "\n",
+    "![image](images/8.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "So what exactly is the problem here? While backpropagating towards the initial hidden\n",
+    "state, the gradient signal shrinks at every step, so the weights receive very little\n",
+    "information from the early time steps.\n",
+    "\n",
+    "\n",
+    "Recall that $h_{t}=\\tanh \\left(U x_{t}+W h_{t-1}\\right)$. Every time we move one step backward, we\n",
+    "compute the derivative of $h_t$. The derivative of tanh lies between 0 and 1, and the\n",
+    "product of two values between 0 and 1 is no larger than either of them. We also usually\n",
+    "initialize the weights of the network to small numbers. So when we multiply the\n",
+    "derivatives and weights while backpropagating, we are essentially multiplying small\n",
+    "numbers together.\n",
+    "\n",
+    "When we multiply small numbers at every step while moving backward, the gradient\n",
+    "becomes vanishingly small: too small to produce a meaningful weight update, and it can\n",
+    "even underflow to zero in floating-point arithmetic. This is called the __vanishing gradient problem__.\n",
+    "\n",
+    "\n",
+    "Recall the equation for the gradient of the loss with respect to $W$ that we saw in the previous section:\n",
+    "\n",
+    "![image](images/9.png)\n",
+    "\n",
+    "As you can see, we multiply the weights and the derivative of the tanh function at every\n",
+    "time step. Repeated multiplication of these two leads to a very small number and causes\n",
+    "the vanishing gradient problem.\n"
+   ]
+  },
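+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The repeated multiplication described above can be written out as a rough worked bound. This\n",
+    "is a sketch for the scalar case, an assumption made for illustration in which $W$ and the\n",
+    "hidden states are single numbers; the example value $W=0.5$ is hypothetical. Differentiating\n",
+    "$h_{m}=\\tanh \\left(U x_{m}+W h_{m-1}\\right)$ with respect to $h_{m-1}$ gives $\\left(1-h_{m}^{2}\\right) W$, so:\n",
+    "\n",
+    "$$\\frac{\\partial h_{t}}{\\partial h_{k}}=\\prod_{m=k+1}^{t}\\left(1-h_{m}^{2}\\right) W, \\qquad\\left|\\frac{\\partial h_{t}}{\\partial h_{k}}\\right| \\leq|W|^{t-k}$$\n",
+    "\n",
+    "With $W=0.5$ and 50 steps between $h_k$ and $h_t$, this factor is at most\n",
+    "$0.5^{50} \\approx 8.9 \\times 10^{-16}$, so the early time steps contribute practically nothing\n",
+    "to the gradient."
+   ]
+  },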
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "The vanishing gradient problem occurs not only in RNNs but also in other deep networks\n",
+    "where we use sigmoid or tanh as the activation function. One way to mitigate it is to use\n",
+    "ReLU as the activation function instead of tanh.\n",
+    "\n",
+    "However, there is a variant of the RNN called the LSTM network, which can mitigate the\n",
+    "vanishing gradient problem effectively. We will see how it works in the next chapter.\n",
+    "\n",
+    "Similarly, when the weights of the network are initialized to very large numbers, the\n",
+    "gradients become very large at every step. While backpropagating, we multiply large\n",
+    "numbers together at every time step, and the result grows until it overflows. This is called\n",
+    "the __exploding gradient problem__."
+   ]
+  },
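+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The same kind of rough arithmetic illustrates the exploding case. As a sketch under a scalar\n",
+    "simplification (an assumption for illustration, with $W$ and the hidden states treated as\n",
+    "single numbers and a hypothetical value $W=2$), the factor that carries the gradient from\n",
+    "step $t$ back to step $k$ is:\n",
+    "\n",
+    "$$\\frac{\\partial h_{t}}{\\partial h_{k}}=W^{t-k} \\prod_{m=k+1}^{t}\\left(1-h_{m}^{2}\\right)$$\n",
+    "\n",
+    "If the hidden states stay close to 0, the tanh derivatives $1-h_{m}^{2}$ stay close to 1 and\n",
+    "the factor approaches $W^{t-k}$. With $W=2$ and 50 steps, that is $2^{50} \\approx 1.1 \\times 10^{15}$."
+   ]
+  },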
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In the next section, we will learn how we can use gradient clipping to avoid the exploding gradient problem."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda root]",
+   "language": "python",
+   "name": "conda-root-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}