Introduction to Natural Language Processing using TensorFlow

According to Wikipedia,

Natural language processing is a sub-field of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

We can divide the NLP process roughly into two parts:

  1. Preprocessing
  2. Training

Preprocessing

The most important part of natural language processing is embedding: how do we represent sentences as numbers? For example, suppose we label every single character in a word. Here is the problem: the words SILENT and LISTEN are made of exactly the same letters, so character-level labels cannot tell them apart.

SILENT VS LISTEN

So here is another approach: represent whole words with labels. As shown below, the label for a word such as I (e.g. 001) stays fixed during both training and testing.

Tokenization of sentence
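To make this concrete, here is a minimal hand-built sketch of word-level labels. The label values (and the word_to_label name) are purely illustrative; in the next section the Keras Tokenizer builds this mapping for us.

# Hypothetical word-level labels (values chosen arbitrarily for illustration)
word_to_label = {'i': 1, 'love': 2, 'my': 3, 'dog': 4}

sentence = 'I love my dog'
encoded = [word_to_label[w] for w in sentence.lower().split()]
print(encoded)  # [1, 2, 3, 4]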

Encoding Sentences in TensorFlow / Keras

Let's import the required packages and libraries.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences  # We will use it later

Now, let's take some sentences and tokenize them.

sentences = [
    'I love my dog',
    'I love my cat so much',
    'You love my dog!'
]

# Create object of Tokenizer and tokenize sentences
tokenizer = Tokenizer(num_words=100)  # keep only the 100 most frequent words
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index  # returns dictionary
print(word_index)

output

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'so': 6, 'much': 7, 'you': 8}

As you can see (this is due to the Tokenizer's defaults, sketched below),

  1. Words are lower-cased
  2. The ! mark is removed
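This behaviour comes from the Tokenizer's default arguments. As a rough sketch (check the Keras docs for the exact default filter string), the call above is roughly equivalent to:

# Tokenizer(num_words=100) with its (approximate) defaults made explicit
tokenizer = Tokenizer(
    num_words=100,
    lower=True,                                      # lower-case every word
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'   # punctuation characters stripped out
)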

Creating Sequences from sentences

Now we need a sequence of integer codes for each sentence so that we can feed these sequences to a neural network.

We already created the Tokenizer instance and passed our sentences to its method tokenizer.fit_on_texts(sentences).

# Let's append a new sentence to `sentences`
sentences.append("I love my Tom")

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

output

[[3, 1, 2, 4], [3, 1, 2, 5, 6, 7], [8, 1, 2, 4], [3, 1, 2]]

But here we have another problem: what if a word is new to the dictionary? See the last output — our dictionary doesn't have the word Tom, so TensorFlow simply ignored it, and it always does the same for unseen words. That is not ideal; instead, we can reserve a special token for unseen (out-of-vocabulary) words.

Let's start again

# at Tokenizer instance creation add another keyword argument
# OOV means Out Of Vocabulary
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

output

{'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'so': 7, 'much': 8, 'you': 9}
[[4, 2, 3, 5], [4, 2, 3, 6, 7, 8], [9, 2, 3, 5], [4, 2, 3, 1]]

As you can see, Tom is not in the corpus, so it is assigned the <OOV> token (index 1). As our dataset grows, the chance of missing words shrinks; alternatively, we can use pre-trained GloVe or word2vec embeddings. We will see that later.

Padding

But there is still a problem: neural networks work only with fixed-size input, so we need to pad or trim these sequences to a fixed length.

So far, the sequences we got from our sentences all have different lengths. Let's pad them to the same length.

padded_sequences = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
print(padded_sequences)

output

[[4 2 3 5 0]
 [4 2 3 6 7]    <- the last word 'much' was truncated
 [9 2 3 5 0]
 [4 2 3 1 0]]
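Note that by default pad_sequences pads and truncates at the beginning ('pre') rather than the end. A quick sketch of the difference, using the same sequences:

# Default behaviour: zeros are added (and extra tokens dropped) at the front
print(pad_sequences(sequences, maxlen=5))
# [[0 4 2 3 5]
#  [2 3 6 7 8]    <- the first word 'i' was truncated this time
#  [0 9 2 3 5]
#  [0 4 2 3 1]]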

Training

We will build a positive/negative review classifier from the IMDB dataset. To set things up, we need to install tensorflow-datasets (or use Google Colab instead).

parikshit@Vostro-5568:~$ pip install -q tensorflow-datasets

Once tensorflow-datasets is installed, load the IMDB dataset.

import tensorflow_datasets as tfds

imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

This block downloads the dataset if it is not found locally and stores it in the efficient TFRecord format.
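The returned info object describes the dataset, so we can quickly check the splits and their sizes; for example:

# Inspect the dataset metadata returned by tfds.load
print(info.splits['train'].num_examples)  # 25000
print(info.splits['test'].num_examples)   # 25000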

Split the data into training and testing sets, and convert it into the format we used in the example above, i.e. separate lists of sentences and labels.

import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

for sentence, label in train_data:
    training_sentences.append(str(sentence.numpy()))
    training_labels.append(label.numpy())

for sentence, label in test_data:
    testing_sentences.append(str(sentence.numpy()))
    testing_labels.append(label.numpy())

print("Number of training samples: " + str(len(training_sentences)))
print("Number of testing samples: " + str(len(testing_sentences)))

output

Number of training samples: 25000
Number of testing samples: 25000
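As a side note, the same extraction can be written with tfds.as_numpy, which yields plain NumPy tuples; this sketch is equivalent to the loop above (run one or the other, not both), and decoding the raw bytes avoids the b"..." prefix visible in the printed review below.

# Alternative: iterate NumPy tuples instead of calling .numpy() per element
for sentence, label in tfds.as_numpy(train_data):
    training_sentences.append(sentence.decode('utf-8'))
    training_labels.append(label)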

Let's do some exploratory data analysis

# print the 1st sentence and corresponding label in the training data
print("Sentence: " + training_sentences[0])
print("Label: " + str(training_labels[0]))

output

Sentence: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label: 0

Our training and testing labels are Python lists, but for training we need to convert them into NumPy arrays.

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

## check by printing shapes
print("The shape of training labels: " + str(training_labels_final.shape))
print("The shape of testing labels: " + str(testing_labels_final.shape))

output

The shape of training labels: (25000,)
The shape of testing labels: (25000,)

Preprocessing as we did above

We need to preprocess the data as we did above; the code is essentially the same.

## define params so that we can change them easily
vocab_size = 100000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
pad_type = 'post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=pad_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length,  padding=pad_type, truncating=trunc_type)

Let's check the shape of our padded training and testing data

print("Training Input: " + str(training_padded.shape))

print("Testing Input: " + str(testing_padded.shape))

output

Training Input: (25000, 120)
Testing Input: (25000, 120)
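To sanity-check the preprocessing, we can map a padded review back to words with the Tokenizer's sequences_to_texts method (out-of-vocabulary words come back as <OOV> and padding positions are dropped, so the round trip is not exact):

# Decode the first padded training example back into words
decoded = tokenizer.sequences_to_texts([training_padded[0].tolist()])[0]
print(decoded[:120])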

Create Model in Keras

Keras provides a high-level API on top of TensorFlow, so it is very easy to create and train neural networks.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

print(model.summary())

output

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           1600000
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0
_________________________________________________________________
dense (Dense)                (None, 6)                 11526
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 1,611,533
Trainable params: 1,611,533
Non-trainable params: 0
_________________________________________________________________
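The parameter counts in the summary can be verified by hand:

# Embedding: one 16-dim vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim              # 100000 * 16 = 1,600,000
# Flatten outputs 120 * 16 = 1920 values feeding Dense(6), plus 6 biases
dense_params = (max_length * embedding_dim) * 6 + 6        # 1920 * 6 + 6 = 11,526
# Final Dense(1): 6 weights + 1 bias
output_params = 6 * 1 + 1                                   # = 7
print(embedding_params + dense_params + output_params)      # 1,611,533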

Compile and Start Training

## Compile Model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

num_epochs = 10

history = model.fit(training_padded,
         training_labels_final,
         epochs=num_epochs,
         validation_data=(testing_padded, testing_labels_final))

output

Epoch 1/10
782/782 [==============================] - 8s 10ms/step - loss: 0.5179 - accuracy: 0.7336 - val_loss: 0.3833 - val_accuracy: 0.8250
Epoch 2/10
782/782 [==============================] - 8s 10ms/step - loss: 0.2824 - accuracy: 0.8834 -
...
Epoch 10/10
782/782 [==============================] - 8s 10ms/step - loss: 2.0107e-04 - accuracy: 0.9999 - val_loss: 1.3380 - val_accuracy: 0.7828

Let's visualize the training and validation accuracy and loss.

import matplotlib.pyplot as plt
# accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# "Loss"
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

output

Training and Validation Accuracy VS epoch
Training and Validation Loss VS epoch

What is an embedding?

tf.keras.layers.Embedding turns positive integers (the codes we produced in the preprocessing stage) into dense vectors of a fixed size. This layer can only be used as the first layer in a model.
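A tiny standalone example of what the layer does (shapes only; the vectors themselves are randomly initialized and only become meaningful after training):

# Map a batch of one sequence of 3 word indices to 3 dense 4-dim vectors
layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=4)
vectors = layer(tf.constant([[3, 1, 2]]))
print(vectors.shape)  # (1, 3, 4)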

We will learn about embeddings in detail later; for now, let's visualize our embedding in the TensorFlow Embedding Projector.

Find the reverse map: index → word

# Reverse map from index to word
inv_map = {v: k for k, v in word_index.items()}

Now get the weights of the embedding layer, i.e. the first layer. Its shape must be vocab_size × embedding_dim.

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)

output

(100000, 16)

Now write these weights and the corresponding words to .tsv files, then go to the TensorFlow Embedding Projector and visualize them.

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

vocab_size = len(inv_map.keys())  # number of words actually in the index (overwrites the earlier value)

for word_num in range(1, vocab_size):  # start at 1; index 0 is reserved for padding
    word = inv_map[word_num]
    embeddings = weights[word_num]
    out_m.write(word + '\n')
    out_v.write('\t'.join([str(x) for x in embeddings]) + '\n')

out_v.close()
out_m.close()
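If you are running in Google Colab, the two files can be fetched to your machine like this (a sketch; when running locally the .tsv files are already in the current directory):

try:
    from google.colab import files
    files.download('vecs.tsv')
    files.download('meta.tsv')
except ImportError:
    pass  # not running in Colab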

References and Code

[1] Natural Language Processing in TensorFlow by Laurence Moroney

You can find code