RNN, LSTM and Bi-Directional LSTM
Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. Normal neural networks don't have persistence. For example, when predicting the next word in a sentence, a human can easily understand each word based on their understanding of the previous words. Traditional neural networks can't do this; they can't remember context. Recurrent neural networks address this issue. An RNN accepts an input vector x and gives you an output vector y. Crucially, however, this output vector's contents are influenced not only by the input you just fed in, but also by the entire history of inputs you have fed in in the past.
The RNN class has some internal state that it gets to update every time step() is called. In the simplest case this state consists of a single hidden vector h. Here is a simple implementation of the step function in a vanilla RNN:
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size):
        # small random weight matrices (illustrative) and a zero initial hidden state
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.h = np.zeros(hidden_size)
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

rnn = RNN(input_size=10, hidden_size=100, output_size=10)
y = rnn.step(np.random.randn(10))  # one forward step for a single input vector x
The above specifies the forward pass of a vanilla RNN. This RNN's parameters are the three matrices W_hh, W_xh and W_hy. The hidden state self.h is initialized with the zero vector. The np.tanh function implements a non-linearity that squashes the activations to the range [-1, 1]. There are two terms inside the tanh: one is based on the previous hidden state and one is based on the current input. In numpy, np.dot is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector. We can also write the hidden state update as
h_t = tanh(W_hh h_{t-1} + W_xh x_t)
where tanh is applied elementwise. Notice that there are three weight matrices because there are two inputs and one output: W_hh transforms what the model has carried over from previous steps (the hidden state), W_xh transforms the current input, and W_hy maps the hidden state to the output.
An RNN can be thought of as a network that has loops in it, which allow information to persist and to be passed from one step of the network to the next. Equivalently, an RNN can be viewed as multiple copies of the same network, each passing a message to its successor.
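To make this concrete, here is a small sketch (reusing the RNN class and the numpy import from the snippet above; the sizes and random inputs are just placeholders) that unrolls the network over a short sequence by calling step() once per time step:
rnn = RNN(input_size=10, hidden_size=100, output_size=10)
sequence = [np.random.randn(10) for _ in range(5)]  # five dummy input vectors x_1 ... x_5
outputs = []
for x_t in sequence:
    # each y_t depends on x_t and, through self.h, on every earlier input
    outputs.append(rnn.step(x_t))
print(len(outputs), outputs[0].shape)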
Here x_t is the input and y_t the output at time step t. When the network is unrolled this way, its chain-like nature closely resembles sequences and lists. Here is the code for a recurrent model in TensorFlow, trained on the MNIST dataset of 28x28 images of handwritten digits and their labels:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

mnist = tf.keras.datasets.mnist

# splitting the dataset into train and test: images go to x_train/x_test, labels to y_train/y_test
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# normalizing pixel values to [0, 1]
x_train = x_train / 255.0
x_test = x_test / 255.0
print(x_train.shape)
print(x_train[0].shape)

model = Sequential()
model.add(LSTM(128, input_shape=x_train.shape[1:], activation='relu', return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))

opt = tf.keras.optimizers.Adam(learning_rate=0.001, decay=1e-6)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
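After training, a quick way to check how the model does on the held-out digits (a small usage sketch, assuming the model, x_test and y_test defined above):
import numpy as np

val_loss, val_acc = model.evaluate(x_test, y_test, verbose=0)
print(val_loss, val_acc)
# predicted digit for the first test image vs. the true label
print(np.argmax(model.predict(x_test[:1]), axis=1)[0], y_test[0])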
RNNs work well when only recent information is needed. For example, given the sentence "fishes are swimming in the water" and the task of predicting its last word, it is obvious the last word is going to be "water". Here the gap between the relevant information and where it is needed is small, and an RNN can learn to use the recent past. But there are cases where we need more context. Consider "I grew up in France ... (2000 words later) ... I speak fluent French." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from much further back. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large. Unfortunately, as that gap grows, an RNN becomes unable to learn to connect the information.
This is because the gradient of the loss function decays exponentially with time (called the vanishing gradient problem). LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a ‘memory cell’ that can maintain information in memory for long periods of time.
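To see the vanishing gradient problem concretely, here is a minimal numpy sketch (the hidden size, random inputs and the 0.01 weight scale are arbitrary choices for illustration): it unrolls a vanilla RNN for 50 steps and tracks the norm of the Jacobian of the current hidden state with respect to the initial one, which shrinks toward zero because every step multiplies by the tanh derivative and by W_hh.
import numpy as np

np.random.seed(0)
H = 100
W_hh = 0.01 * np.random.randn(H, H)  # small recurrent weights (illustrative)
h = np.zeros(H)
jac = np.eye(H)                      # d h_t / d h_0, accumulated step by step

for t in range(1, 51):
    h = np.tanh(W_hh @ h + np.random.randn(H))  # a vanilla RNN step with random inputs
    jac = np.diag(1.0 - h ** 2) @ W_hh @ jac    # chain rule through tanh and W_hh
    if t % 10 == 0:
        print(t, np.linalg.norm(jac))           # the norm collapses toward zero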
LSTM Network
Long Short Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies.
All recurrent neural networks have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
LSTMs also have this chain-like structure, but the repeating module is different. Instead of a single neural network layer, there are four, interacting in a very special way.
The key to LSTMs is the cell state, usually drawn as the horizontal line running along the top of the LSTM diagram.
The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.
The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.
Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.
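Putting these pieces together, here is a minimal numpy sketch of a single LSTM time step (the helper name lstm_step, the single stacked weight matrix W and the toy sizes are my own simplifications for illustration, not any library's API): each gate is a sigmoid layer whose output is multiplied pointwise into the cell state or the hidden state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev, x] to the four gate pre-activations
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])        # forget gate: what to erase from the cell state
    i = sigmoid(z[H:2*H])      # input gate: what to write to the cell state
    o = sigmoid(z[2*H:3*H])    # output gate: what to expose as the hidden state
    g = np.tanh(z[3*H:4*H])    # candidate values
    c = f * c_prev + i * g     # the gated, mostly linear cell state update
    h = o * np.tanh(c)
    return h, c

H, D = 4, 3                    # toy hidden and input sizes
W = 0.1 * np.random.randn(4 * H, H + D)
b = np.zeros(4 * H)
h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, b)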
Bi-LSTM (Bidirectional Long Short Term Memory):
Bidirectional recurrent neural networks (RNNs) are really just two independent RNNs put together. This structure allows the network to have both backward and forward information about the sequence at every time step.
A bidirectional layer runs your inputs in two ways, one from past to future and one from future to past. What differs from the unidirectional approach is that the LSTM running backward preserves information from the future; by combining the two hidden states, the network can, at any point in time, preserve information from both the past and the future.
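In Keras this is just a matter of wrapping an LSTM layer in tf.keras.layers.Bidirectional, which by default concatenates the forward and backward hidden states. A minimal sketch, reusing x_train, y_train, x_test and y_test from the MNIST example above (the layer sizes here are arbitrary):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense, Dropout

bi_model = Sequential()
# forward and backward LSTMs over the 28 rows of each image; their outputs are concatenated
bi_model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(28, 28)))
bi_model.add(Dropout(0.2))
bi_model.add(Bidirectional(LSTM(128)))
bi_model.add(Dense(10, activation='softmax'))
bi_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
bi_model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))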
References
http://introtodeeplearning.com/slides/6S191_MIT_DeepLearning_L2.pdf