Sequence Based Text Classification with Convolution Nets

Apparently Feynman once said about the universe, “it’s not complicated, it’s just a lot of it”. A machine might feel the same way when dealing with a plethora of words and trying to make sense of the relations between them. We use and interpret words and sentences in a context, and much of that context arises from sequence. Unfortunately, infinitely many sequences of words are possible, and most of them are gibberish that yields no context. With approaches like bag of words, we usually side-step this complexity by ignoring the specific sequences of words altogether and treating documents as bags of words, as in the VSM approach. But there are some newer-generation models in town (or at least they feel new, even though they have existed for decades) that can account for sequences of words without breaking the bank, and we are going to see one of them in action today.

The problem with approaches like bag of words and document vectors is that they place no emphasis on the sequence of the words or the relations between their occurrences. Two sentences can have the same representation but totally different meanings. Consider the sentence ‘Thank you for helping me’; by rearranging a few words it becomes ‘You thank me for helping’. The rearrangement produces a completely different sentence, yet the two look identical from a bag-of-words point of view.
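This collision is easy to demonstrate with scikit-learn's CountVectorizer (used here purely for illustration; the original post does not show this step in code):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words in a different order
sentences = ["thank you for helping me", "you thank me for helping"]

# Bag-of-words counts ignore word order entirely
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences).toarray()

# Both rows contain each vocabulary word exactly once,
# so the two representations are identical
print((bow[0] == bow[1]).all())  # True
```

Since both sentences contain each of the five words exactly once, their count vectors are indistinguishable.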

This is where we need a model that extracts relations or patterns from groups of words rather than from individual words. In such a scenario, Convolutional Neural Networks (CNNs) can produce some really good results. CNNs are known for extracting patterns from the underlying sentences by focusing on a sequence of words at a time.

Before moving on to the actual classification task, let’s briefly discuss Convolution Nets. A CNN, i.e. a ‘Convolutional Neural Network’, is a class of deep, feed-forward artificial neural networks, most commonly used for analyzing images. These networks use a variation of multi-layer perceptrons designed to require minimal pre-processing.

A typical CNN model consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN consist of convolutional layers, pooling layers, fully connected layers and, optionally, normalization layers.

A convolution layer receives the input and convolves a filter over the underlying data points to extract meaningful patterns from them. For a 2D image, this looks something like this:
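A minimal NumPy sketch of this operation (strictly speaking, the cross-correlation that CNN layers actually compute), applied to a toy 4×4 “image” with a hypothetical edge filter:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode sliding window: multiply each patch by the filter and sum."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 4x4 toy "image" with a diagonal block pattern
image = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]])

# A simple 2x2 vertical-edge filter
edge = np.array([[1, -1],
                 [1, -1]])

print(convolve2d(image, edge))  # 3x3 feature map; peaks where edges occur
```

The filter responds strongly (values of ±2) exactly where the image transitions between the two blocks, which is the kind of localized pattern a convolution layer learns to detect.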

The extracted set of features is then passed to a MaxPooling layer, which keeps the dominant terms among the extracted features. The illustration for that would look like this:
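A toy sketch of max pooling over a one-dimensional feature vector (the values here are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    n = len(feature_map) // size
    return np.array([feature_map[i * size:(i + 1) * size].max() for i in range(n)])

features = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])
print(max_pool(features))  # [0.9 0.3 0.8]
```

Each window of two values is reduced to its maximum, halving the length while keeping the strongest activations.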

These filtered terms are then passed to the final layers before classification: a flatten layer followed by fully connected layers. The output of the fully connected layers is what we see as the final classification result.

So, in this post we classify custom text sequences using a CNN, with Naive Bayes on tf-idf vectors for comparison. The CNN is implemented with Keras using TensorFlow as its back-end; for Naive Bayes we use scikit-learn.

Now we’ll see how the encoding and CNN parts are implemented to obtain the final results. The main imports for this task are as follows:
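The original import cell is not reproduced in this rendering; a plausible reconstruction, assuming the tf.keras API and scikit-learn, would be:

```python
import numpy as np
import tensorflow as tf

# Keras pieces for encoding and the CNN model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# scikit-learn pieces for the splits and the Naive Bayes baseline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
```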

In the code snippet below, we fix random seeds for TensorFlow and NumPy so that we get reproducible results.
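A minimal sketch of that seed fixing; the seed values here are placeholders, and note that TensorFlow 1.x used `tf.set_random_seed` where 2.x uses `tf.random.set_seed`:

```python
import numpy as np
import tensorflow as tf

# Fix seeds so weight initialization and shuffling are repeatable
np.random.seed(1)
tf.random.set_seed(2)  # tf.set_random_seed(2) in TensorFlow 1.x
```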

Use the Keras text processor on all the sentences/sequences so it can generate a word index and encode each sequence accordingly. Note that we do not need padding, as all our sequences are exactly 15 words long.
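A sketch of that encoding step with the Keras Tokenizer, using toy sentences in place of the actual synthetic corpus:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Illustrative stand-ins; the real corpus has 15-word sequences
sentences = ["thank you for helping me", "you thank me for helping"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                   # build the word index
encoded = tokenizer.texts_to_sequences(sentences)   # words -> integer ids

# +1 because index 0 is reserved (never assigned to a word)
vocab_size = len(tokenizer.word_index) + 1
```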

The CNN implementation consists of an embedding layer, a convolution layer, a pooling layer (MaxPooling here), a flatten layer, and the final dense (output) layer.

Fig 4. Convolution & MaxPooling Layers

Fifteen words (each a one-hot vector) in sequence are fed as input to an embedding layer, which learns the weights for reducing the order from 995-long one-hot vectors to 248-long numerical vectors. This sequence of 248-long vectors is fed to the first layer of the convolution model, i.e. the convolutional layer.
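The model described above can be sketched as follows. The vocabulary size (995), embedding width (248), sequence length (15), and three output classes come from the post; the filter count and kernel size are assumptions, since the original hyper-parameters are not shown:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Input(shape=(15,)),                         # 15 words per sequence
    Embedding(input_dim=995, output_dim=248),   # 995-word vocab -> 248-long vectors
    Conv1D(filters=128, kernel_size=3,          # filter count/size are assumptions
           activation='relu'),
    MaxPooling1D(pool_size=2),                  # keep the dominant activations
    Flatten(),
    Dense(3, activation='softmax'),             # three target classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```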

In multiple training/validation simulations, 80% of the data is split into a validation set (20% of this 80%, i.e. 16% overall) and a training set (64% of the overall data), with the remaining 20% held out for testing.
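Such a nested split can be sketched with scikit-learn's `train_test_split`; the stand-in arrays and `random_state` values here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100)                 # stand-in for the encoded sequences
y = np.random.randint(0, 3, 100)   # stand-in for the three class labels

# Hold out 20% for testing, then carve 20% of the remainder for validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.2, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 64 16 20
```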

After training the model, the following results are obtained:

Fig 7 Confusion Matrix & Classification Report

As you can observe, with the CNN we obtained f1-scores as high as 0.99 for some classes, which shows that a CNN can perform this kind of text classification task really well.

Fig 8. CNN Training/Validation accuracy & Loss — Avg weighted Confusion Matrix across 10 iterations

With Naive Bayes, though, the results aren’t that promising: we obtained an f1-score of only 0.23, which is very low compared to roughly 0.95 for the CNN.
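The Naive Bayes baseline can be sketched as a scikit-learn pipeline of tf-idf features feeding a multinomial Naive Bayes classifier, shown here on toy data in place of the synthetic corpus:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# tf-idf bag-of-words features -> multinomial Naive Bayes
nb = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Illustrative toy data; the real task uses the synthetic 15-word sequences
texts = ["thank you for helping me", "you thank me for helping",
         "the cat sat on the mat", "the dog ran in the park"]
labels = [0, 0, 1, 1]
nb.fit(texts, labels)

print(nb.predict(["thank you for helping"]))
```

Note that because tf-idf discards word order, this baseline cannot distinguish the two reordered class-0 sentences, which is exactly why it struggles on sequence-dependent targets.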

The figure below shows the plots for Naive Bayes and the CNN side by side. The diagonal dominance observed in the CNN’s confusion matrix reflects its high f1-scores.

Fig 9. Confusion Matrix & Classification report for Naive Bayes
Fig 10. CNN does pretty well in predicting all three classes whereas Naive Bayes seems to be struggling in all three.

We worked with a synthetic text corpus designed to exhibit extremely sequence-dependent classification targets, and we have seen that a pattern-recognizing approach such as a CNN does extremely well compared to traditional classifiers working on bag-of-words vectors, which have no respect for sequence.

The question is how a CNN might do on regular text, where we know the sequence of words may not be the be-all and end-all. We will take that up in the next post.
