At this point, you have become familiar with vanilla RNNs, which are powerful architectures, but they're limited in the sense that for long sequences of words, the information tends to vanish. However, there are more complex models that you can use to handle long sequences, like the Gated Recurrent Unit. Here, I'm going to introduce you to Gated Recurrent Units, GRUs for short, with a comparison to vanilla RNNs. One important difference is that GRUs work in a way that allows relevant information to be kept in the hidden state even over long sequences. For example, with a GRU, you'll be able to train a model that takes the sentence "Ants are really interesting. ___ are everywhere" and easily predicts the word "they" to fill in the blank, because the GRU learns to keep information about the subject, in this case whether it is plural or singular, in the hidden state. GRUs accomplish this by computing relevance and update gates, which I'll show you next.

You can think of GRUs as vanilla RNNs with additional computations. They take two inputs at every time step: the variable x at time t, and the hidden state h, which is passed from the previous unit. The first two computations made in a GRU are the relevance gate, Gamma subscript r, and the update gate, Gamma subscript u. These gates use the sigmoid activation function, so the result is a vector of values that have been squeezed to fit between zero and one. The update and relevance gates are the most important computations in a GRU. Their outputs help determine which information from the previous hidden state is relevant, and which values should be updated with current information. After the relevance gate is computed, a candidate h prime for the hidden state is found. Its computation takes as parameters the previous hidden state times the relevance gate, and the variable x for the current time step. This value stores all the candidate information that could override the contents of the previous hidden state. After that, a new value for the hidden state is calculated using the information from the previous hidden state, the candidate hidden state, and the update gate. The update gate determines how much of the information from the previous hidden state will be overwritten. Finally, a prediction y hat is computed using the current hidden state.

Let's compare GRUs with vanilla RNNs. Remember that a vanilla RNN such as this one computes an activation function with the previous hidden state and the current variable x as parameters to get the current hidden state. With the current hidden state, another activation function is computed to get the current prediction y hat. This architecture updates the hidden state at every time step, so for long sequences the information tends to vanish. This is one cause of the so-called vanishing gradients problem. On the other hand, GRUs compute significantly more operations, which can cause longer processing times and higher memory usage. The relevance and update gates determine which information from the previous hidden state is relevant and which information should be updated. The candidate hidden state stores the information that could override the one passed from the previous hidden state. Then the current hidden state is computed, updating some of the information from the last hidden state, and a prediction y hat is made with the updated hidden state. All of these computations allow the network to learn what type of information to keep and when to override it.
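To make these computations concrete, here is a minimal NumPy sketch of one GRU time step next to one vanilla RNN time step. The parameter names (W_r, W_u, W_h, W_y and their biases), the stacked-vector layout, and the softmax prediction are illustrative assumptions for this sketch, not the exact notation used elsewhere in this course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gru_step(x_t, h_prev, params):
    """One GRU time step, following the gates described above.

    Shapes (illustrative): x_t is (n_x, 1), h_prev is (n_h, 1).
    Parameter names are placeholders, not an official notation.
    """
    W_r, b_r = params["W_r"], params["b_r"]   # relevance gate parameters
    W_u, b_u = params["W_u"], params["b_u"]   # update gate parameters
    W_h, b_h = params["W_h"], params["b_h"]   # candidate hidden state parameters
    W_y, b_y = params["W_y"], params["b_y"]   # prediction parameters

    concat = np.vstack([h_prev, x_t])

    # Relevance (Gamma_r) and update (Gamma_u) gates:
    # the sigmoid squeezes every value into (0, 1)
    gamma_r = sigmoid(W_r @ concat + b_r)
    gamma_u = sigmoid(W_u @ concat + b_u)

    # Candidate hidden state h': built from the previous hidden state
    # scaled by the relevance gate, plus the current input x_t
    h_candidate = np.tanh(W_h @ np.vstack([gamma_r * h_prev, x_t]) + b_h)

    # New hidden state: the update gate decides how much of the
    # previous hidden state is overwritten by the candidate
    h_t = gamma_u * h_candidate + (1.0 - gamma_u) * h_prev

    # Prediction y_hat from the current hidden state
    y_hat = softmax(W_y @ h_t + b_y)
    return h_t, y_hat

def vanilla_rnn_step(x_t, h_prev, params):
    """One vanilla RNN time step, for comparison: no gates, so the
    hidden state is fully recomputed at every step."""
    W_h, b_h = params["W_h"], params["b_h"]
    W_y, b_y = params["W_y"], params["b_y"]
    h_t = np.tanh(W_h @ np.vstack([h_prev, x_t]) + b_h)
    y_hat = softmax(W_y @ h_t + b_y)
    return h_t, y_hat
```

Notice that the new hidden state is an interpolation controlled by the update gate: when Gamma_u is close to zero, the previous hidden state passes through almost unchanged, which is how relevant information can survive over long sequences.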
I just demonstrated how GRUs decide which information in the hidden state to override at each time step. The relevance and update gates in GRUs allow models to keep values in the hidden state over longer word sequences, which is very useful for many NLP tasks. GRUs are simplified versions of the popular LSTMs, which you'll be encountering a little later in this specialization. Now you know that GRUs are very similar to a simple RNN, except that they have two gates that allow you to update the information in the hidden state and tell you how relevant each input is. In the next video, we will talk about bidirectional and deep RNNs.