A collection of must-know prerequisite resources for every Natural Language Processing (NLP) practitioner


The ultimate guide you won't want to miss!


Hey, are you a newcomer wondering how to dive into the world of NLP, or a regular NLP practitioner confused by the vast amount of information available on the web, unsure where to start? Relax, I was the same until I decided to spend a good amount of time gathering all the required resources in one place.

After thorough reading from multiple sources over the last year, here is my compiled list of the best learning resources, which can help anyone start their journey into the fascinating world of NLP. A wide variety of tasks come under the broader area of NLP, such as Machine Translation, Question Answering, Text Summarization, Dialogue Systems, Speech Recognition, etc. However, to work in any of these fields, the underlying prerequisite knowledge is the same, and that is what I am going to discuss briefly in this blog.

Just a quick disclaimer about the contents:
1. The contents I am going to discuss mostly belong to modern NLP rather than classical NLP techniques.
2. It's impossible for anyone to go through all the available resources. I have applied my utmost effort in whatever way I could.
3. I assume that the reader has at least a decent amount of knowledge of Machine Learning (ML) and Deep Learning (DL) algorithms.
4. For all the topics I am going to cover, I have mainly cited the best resources in the form of blogs or videos. Readers can easily find the research papers for each individual topic; I feel the mentioned blogs are more than sufficient for anyone to fully understand the respective topics.

Here is my roadmap to the NLP world:
1. Word Embeddings — Word2Vec, GloVe, FastText
2. Language Models & RNN
3. Contextual Word Embeddings — ELMo
4. Transfer Learning in NLP — ULMFiT
5. Sentence Embeddings
6. Seq2Seq & Attention Mechanism
7. Transformers
8. OpenAI GPT & BERT
9. GPT-2, XLNet
10. Summary

Let's briefly go through each of these 10 topics:

1. Word Embeddings — Word2Vec, GloVe, FastText

Well, the first question that comes to mind when we start studying NLP is how we can represent words as numbers so that any ML or DL algorithm can be applied to them. That's where word vectors/embeddings come into play. As the name suggests, the aim here is to take any given word as input and output a meaningful vector representation that characterizes that word.
There exist different approaches to obtain this representation, based on underlying techniques such as Word2Vec, GloVe, and FastText.

Word2Vec:

To begin with this topic, I would suggest the reader watch lectures 1 & 2 of Stanford CS224N: NLP with Deep Learning (Winter 2019), freely available on YouTube.

https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=1

These two lectures form a solid background on semantic word representation. Apart from that, you also get to know the detailed mathematics involved in the workings of both the Word2Vec and GloVe models. Once you are comfortable with this, I would like to refer you to some of the blogs that I found most useful on this topic. In these blogs, you will find examples and visualizations that help you gain a better understanding.

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
http://jalammar.github.io/illustrated-word2vec/

I hope these readings are more than sufficient to give you a solid understanding of Word2Vec. Let’s move ahead.
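To make the skip-gram idea concrete, here is a toy sketch of the (center, context) training pairs that the skip-gram model actually consumes; the sentence and window size are arbitrary choices for illustration:

```python
# Toy sketch: generate skip-gram (center, context) training pairs,
# the raw positive examples Word2Vec's skip-gram model learns from.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        # every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=2)
# each pair is one positive example; negative sampling then draws
# random non-context words as negative examples
```

Note that words more than `window` positions apart never form a pair, which is why the window size controls how "syntactic" versus "topical" the learned vectors tend to be.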

GloVe:

GloVe is explained very well in lecture 3 of Stanford Natural Language Processing with Deep Learning (Winter 2017).

https://www.youtube.com/watch?v=ASn7ExxLZws

Apart from this, the following blogs help you obtain a clear picture of the topic and the mathematics behind it.

https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representationexplained/
https://towardsdatascience.com/emnlp-what-is-glove-part-iv-e605a4c407c8

By now, I hope you understand how GloVe, unlike Word2Vec, takes advantage of global co-occurrence statistics and optimizes a completely different objective.
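The global statistics GloVe uses are just word-word co-occurrence counts, and its objective down-weights rare pairs with the weighting function f(x) = (x/x_max)^α, capped at 1 (the paper uses x_max = 100 and α = 0.75). A minimal sketch, with a made-up toy corpus:

```python
from collections import defaultdict

# Toy sketch: the global co-occurrence counts GloVe factorizes,
# plus the weighting function f(x) from the GloVe paper.

def cooccurrence(tokens, window=2):
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1.0
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    # down-weights rare pairs; caps the influence of very frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

tokens = "the cat sat on the mat".split()
counts = cooccurrence(tokens)
```

The key contrast with Word2Vec: these counts are collected over the whole corpus once, and the model fits log-counts directly rather than streaming over individual windows.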

FastText:

FastText is a library created by the Facebook Research team for efficient learning of word representations and sentence classification. It supports training CBOW or Skip-Gram models similar to Word2Vec, but it operates on the character n-gram representation of a word. By doing so, it can produce vector representations even for rare or unseen words by making use of character-level information.
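The subword idea is easy to see in code. FastText wraps each word in boundary markers `<` and `>` and extracts its character n-grams; a word's vector is the sum of its n-gram vectors, so an out-of-vocabulary word still gets a representation from the subwords it shares with known words. A sketch of the n-gram extraction:

```python
# Toy sketch of FastText's character n-gram extraction.

def char_ngrams(word, n_min=3, n_max=6):
    # boundary markers give prefixes/suffixes distinct n-grams
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the full word itself is also kept as a feature
    return grams

grams = char_ngrams("where", n_min=3, n_max=3)
# a word's embedding is the sum of the vectors of these n-grams
```

Note how `<wh` (a prefix) and `her` (an interior trigram) are different features even though `her` is itself a word; the boundary markers are what make that distinction possible.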

Kindly refer to the following links for a better understanding:

https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3
https://arxiv.org/pdf/1607.04606v1.pdf
https://arxiv.org/pdf/1607.01759.pdf

If you are done with the above-mentioned pointers, you should now have a deeper understanding of word embedding approaches. It's time to move on to the backbone of NLP: Language Models.

2. Language Models(LM) & RNN

Language models are something we use on a daily basis, for example while typing a message on your cellphone, in Gmail, or on LinkedIn: the LM provides the most probable suggestions for what you would like to type next. In simple words, language modelling is the task of predicting what word comes next. My argument for LM being the backbone of NLP is that all the current state-of-the-art transfer learning models depend on language modelling as the underlying task. You will learn more about this later in the journey. But before that, let's look at the resources for understanding LMs.

As usual, my first go-to suggestion here is to go through some of the wonderful Stanford lectures on this particular topic.
Lecture 6 of CS224N covers the topic beautifully. It gives you a glimpse of how LMs were built prior to neural networks and what advantages neural networks, essentially RNNs, bring. Also, if you would like to brush up your knowledge of RNNs, kindly refer to lecture 7.

https://www.youtube.com/watch?v=iWea12EAu6U&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=6
https://www.youtube.com/watch?v=QEw0qEa0E50&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=7

Also, if you feel less familiar with the inner workings of RNNs, you can go for a wonderful course that I studied on Udemy:

Deep Learning: Advanced NLP and RNNs
https://www.udemy.com/course/deep-learning-advanced-nlp/

This is one of the best courses I found in the vast collection of online courses available on the web. In it, you can understand the workings of RNNs by unrolling them, making them bidirectional, etc. You can also learn to code these models in Keras, one of the simplest deep learning frameworks to get started with.
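Before moving on, the core LM task of predicting the next word can be illustrated with a count-based bigram model; a neural LM learns the same conditional distribution, just with an RNN in place of the counts. A toy sketch (the corpus is made up for illustration):

```python
from collections import Counter, defaultdict

# Toy sketch: a count-based bigram language model.
# P(next | prev) is estimated from bigram counts; an RNN LM learns
# this conditional distribution with a neural network instead.

def train_bigram_lm(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    # return the most probable next word given the previous word
    return counts[prev].most_common(1)[0][0]

corpus = ["i am happy", "i am sad", "i am happy today"]
lm = train_bigram_lm(corpus)
```

An RNN LM improves on this in two ways: it conditions on the whole history rather than just the previous word, and it generalizes to word sequences it has never counted.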

3. Contextual Word Embeddings — ELMo

Guess what, word embeddings are back!! Wait a second though, there is the term 'contextual', which makes them different from the approaches we studied earlier.
OK, then why did we not study this topic together with the first one? Only because we need the knowledge of LMs to understand it. And yes, as I mentioned earlier, we have now come across the first application of LM in our journey. Trust me, by the end you will agree with me in awarding LM the title of backbone of NLP. Enough said, let's jump into our current topic: Contextual Word Embeddings.

Embeddings from Language Models (ELMo) uses an LM to obtain embeddings of individual words. Until now we had a single embedding for any input word, for example 'bank'. Now suppose I have two different sentences: "I went to withdraw money from the bank" and "I was standing near the bank of a river". In these sentences, the meanings of the word 'bank' are completely different, and therefore the word should have different vector representations. This is what contextual embeddings aim at. ELMo is one such approach, based on multi-layer bidirectional LSTMs, for obtaining contextual word embeddings. Please go through the following blogs to learn about it.

https://mlexplained.com/2018/06/15/paper-dissected-deep-contextualizedword-representations-explained/
https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
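The mechanism at ELMo's core is simple to sketch: the final contextual embedding of each token is a softmax-weighted sum of the biLM's per-layer representations, scaled by a learned scalar gamma. All shapes and values below are arbitrary placeholders, not real ELMo weights:

```python
import numpy as np

# Toy sketch of ELMo's task-specific layer mixing: each token's final
# embedding is a softmax-weighted sum of the biLM layer representations,
# scaled by a learned scalar gamma.

rng = np.random.default_rng(0)
n_layers, seq_len, dim = 3, 5, 8                    # biLM layers, tokens, hidden size
layers = rng.normal(size=(n_layers, seq_len, dim))  # per-layer token vectors

w = np.array([0.2, 1.0, -0.3])   # learned mixing weights (one per layer)
gamma = 0.5                      # learned global scale

s = np.exp(w) / np.exp(w).sum()                    # softmax-normalize the weights
elmo = gamma * np.einsum("l,lsd->sd", s, layers)   # weighted sum over layers
# result: one context-dependent vector per token position,
# not one fixed vector per word type as in Word2Vec/GloVe
```

The per-task weights `w` and `gamma` are what make ELMo flexible: a downstream task can learn to emphasize lower (more syntactic) or higher (more semantic) biLM layers.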

I hope that the above two resources are sufficient to give you a good understanding of ELMo. It's time to move ahead…

4. Transfer Learning in NLP — ULMFiT

Transfer learning has completely revolutionized the NLP domain over the last year. Most of the current state-of-the-art algorithms being developed make use of this technique. After making a significant contribution to the field of Computer Vision, transfer learning has finally reached NLP practitioners.
Universal Language Model Fine-tuning for Text Classification (ULMFiT) is one approach that should be credited for this wonderful change.
ULMFiT introduced methods to effectively utilize a lot of what the model learns during pre-training: more than just embeddings, and more than contextualized embeddings. It introduced a language model and a process to effectively fine-tune that language model for various tasks.
Finally, the pre-training and fine-tuning paradigm started showing its magic in the NLP field. The ULMFiT paper also introduced techniques such as Discriminative Fine-Tuning and Slanted Triangular Learning Rates that improved the way the transfer learning approach could be applied.

Ready to explore these exciting terms? Then keep calm and refer to the following blogs:

http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html
https://ahmedhanibrahim.wordpress.com/2019/07/01/a-study-on-cove-context2vec-elmo-ulmfit-and-bert/
https://yashuseth.blog/2018/09/12/awd-lstm-explanation-understanding-language-model/

By now, you should be quite familiar with ULMFiT. Next in our journey come sentence embeddings.

5. Sentence Embeddings

We have learnt enough about word embeddings. What about sentences? Can we obtain a representation of a sentence similar to that of a word? One very naive but strong baseline is to average the sentence's word vectors (the so-called Bag-of-Words approach). Beyond this, there are approaches based on unsupervised, supervised, and multi-task learning setups.
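The averaging baseline takes only a few lines. The tiny vocabulary and 2-dimensional vectors below are made up purely for illustration; real systems would use pretrained 100+ dimensional embeddings:

```python
import numpy as np

# The bag-of-words baseline: a sentence vector is simply the
# average of the word vectors of the words it contains.

word_vecs = {
    "the":   np.array([0.1, 0.3]),
    "movie": np.array([0.7, 0.2]),
    "was":   np.array([0.0, 0.4]),
    "great": np.array([0.9, 0.8]),
}

def sentence_embedding(sentence):
    # words outside the vocabulary are silently skipped
    vecs = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    return np.mean(vecs, axis=0)

emb = sentence_embedding("the movie was great")
```

Its obvious weakness motivates the methods below: averaging discards word order entirely, so "dog bites man" and "man bites dog" get identical vectors.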

Unsupervised schemes learn sentence embeddings as a byproduct of learning to predict a coherent succession of sentences. The main advantage here is that you can get plenty of unsupervised data, since the internet is full of text. Skip-Thought Vectors and Quick-Thought Vectors are two successful approaches developed in this unsupervised setting.
On the other hand, supervised learning requires a labelled dataset annotated for a given task; accomplishing that task lets you learn a good sentence embedding. InferSent, from the Facebook Research team, is one such interesting approach.
To get the best of both unsupervised and supervised embeddings, multi-task learning setups come into the picture. Several multi-task proposals have been published, such as MILA/MSR's General Purpose Sentence Representation and Google's Universal Sentence Encoder.

Excited to enter this world? Then explore the links below:

https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a
https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
https://towardsdatascience.com/deep-transfer-learning-for-natural-language-processing-text-classification-with-universal-1a2c69e5baa9

6. Seq2Seq & Attention Mechanism

Having learnt the variants of RNN models and gained a good understanding of word and sentence embeddings, it's time to move on to an exciting NLP architecture known as the Sequence-to-Sequence (Seq2Seq) model. This architecture is used in a variety of NLP tasks such as Neural Machine Translation, Text Summarization, Conversational Systems, Image Captioning, etc. A sequence-to-sequence model takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. The best way to understand these models is with the help of visualizations, and this is where I would like to refer you to the blog of one of my most loved NLP authors: Jay Alammar. Believe me, you will love going through each of his posts; the effort he puts into explaining these concepts is outstanding. Click the link below to enter this beautiful world.

http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

I don't think I need to explain Seq2Seq any further, as by now you must be well conversant with it.
However, I would now like to refer you again to the Stanford lectures to learn more about Statistical and Neural Machine Translation. Knowledge of Seq2Seq will help you move fluently through these lectures. Attention, one of the most important topics, is also discussed there in detail. Together with that, you will get to know about Beam Search Decoding and the BLEU metric used for evaluating NMT models.

Kindly refer to CS224N Lecture 8:
https://www.youtube.com/watch?v=XXtpJxZBa2c&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z&index=8
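One decoder step of dot-product attention can be sketched in a few lines of numpy; the shapes and random values below are purely illustrative:

```python
import numpy as np

# Toy sketch of dot-product attention at one Seq2Seq decoder step:
# score each encoder state against the current decoder state, softmax
# the scores, and take the weighted sum as the context vector.

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(4, 6))   # 4 source tokens, hidden size 6
dec_state = rng.normal(size=(6,))      # current decoder hidden state

scores = enc_states @ dec_state                  # one score per source token
weights = np.exp(scores) / np.exp(scores).sum()  # attention distribution
context = weights @ enc_states                   # weighted sum of encoder states
# `context` is combined with the decoder state to predict the next word
```

The point of this mechanism is that the decoder no longer has to cram the whole source sentence into one fixed vector; at every step it can look back at whichever source positions are relevant.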

7. Transformers

It's time for the beast: Transformers. While LSTM models were revolutionizing the NLP industry, the Transformer was developed as an improved, out-of-the-box replacement for RNN models.
The Transformer is a model that uses attention to boost the speed with which these models can be trained, and it lends itself naturally to parallelization. It was proposed in the paper Attention Is All You Need. Because of its parallel nature, it frees us from the recurrent connections involved in RNN models; this not only reduces training time but also improves accuracy by a great margin on various NLP tasks. It is similar to the Seq2Seq architecture, but it depends only on the attention mechanism and its variants. Again, the best blog to understand this topic is by Jay Alammar. In fact, as mentioned earlier, you can follow all his posts to learn about these advanced NLP techniques.

http://jalammar.github.io/illustrated-transformer/

Apart from this, if you want to understand the paper from an implementation point of view, please refer to this awesome annotated blog by the Harvard NLP group.

https://nlp.seas.harvard.edu/2018/04/03/attention.html

If you have successfully understood the above two blogs, give yourself a big thumbs up!! Believe me, it was not an easy task.
Let's now explore how researchers have utilized this newer architecture to build state-of-the-art models like BERT, GPT-2, etc.
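Before moving on, the Transformer's core operation, scaled dot-product attention (Attention(Q, K, V) = softmax(QKᵀ/√d_k)V), fits in a few lines. The shapes here are arbitrary; real models add multiple heads and learned projection matrices:

```python
import numpy as np

# Toy sketch of the Transformer's scaled dot-product attention.

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    # numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights       # weighted sum of values, plus the weights

rng = np.random.default_rng(2)
Q = rng.normal(size=(3, 4))   # 3 query positions, d_k = 4
K = rng.normal(size=(5, 4))   # 5 key/value positions
V = rng.normal(size=(5, 4))

out, attn = scaled_dot_product_attention(Q, K, V)
```

Every query position attends to all key positions in a single matrix multiply, which is exactly the parallelism that recurrent models lack.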

8. OpenAI GPT & BERT

Transfer learning is back, but now, of course, with Transformers. It's as simple as this: utilize the Transformer's decoder stack to build a newer model called GPT, or utilize the encoder part of the Transformer to build an amazing model named BERT.
Believe me, even if you are very new to the NLP field and have just been hearing NLP buzzwords over the last year, BERT and GPT top that list.

Generative Pre-Training (GPT) has a goal similar to ULMFiT's, i.e. to apply transfer learning in NLP, but there is one major difference. Yes, you got it right: it uses a Transformer instead of an LSTM. Apart from that, there are also some differences in the training objective, which you can learn about from the blogs mentioned below. To summarize, the overall idea of GPT is to train the Transformer decoder on the language modelling task, also known as pre-training. Once it is pre-trained, we can use it for downstream tasks, applying a number of input transformations to handle the variety of such tasks.
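The reason GPT is a forward (left-to-right) language model comes down to one detail in the decoder: a causal mask that stops each position from attending to positions after it. A minimal sketch of that mask:

```python
import numpy as np

# Toy sketch: the causal (look-ahead) mask a GPT-style decoder applies,
# so position i can only attend to positions <= i. Masked score entries
# are set to -inf before the softmax, zeroing their attention weight.

def causal_mask(seq_len):
    # lower-triangular matrix of ones: 1 where attention is allowed
    return np.tril(np.ones((seq_len, seq_len)))

mask = causal_mask(4)
scores = np.zeros((4, 4))              # placeholder attention scores
masked = np.where(mask == 1, scores, -np.inf)   # block future positions
```

After the softmax, the -inf entries become exactly zero weight, which is what makes the model's prediction at each step depend only on the words seen so far.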

Here comes the most buzzed-about word in NLP: BERT.

The main objective was to build a Transformer-based model whose language model is conditioned on both left and right context. This was the limitation of GPT, since GPT only trains a forward language model. To achieve bidirectional conditioning, BERT makes use of the encoder part of the Transformer, and in order not to trivially "see" the words it must predict, it uses a special technique called masking. According to the authors, this masking technique was the paper's greatest contribution. Apart from the masking objective, to handle relationships between multiple sentences, the pre-training process includes an additional task: given two sentences (A and B), is B likely to be the sentence that follows A, or not?
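The masked-LM objective is easy to sketch: hide some input tokens and train the model to recover them from both left and right context. Real BERT masks about 15% of tokens at random (with an 80/10/10 replacement scheme); here the masked positions are fixed for illustration:

```python
# Toy sketch of BERT's masked-LM input construction.

def mask_tokens(tokens, positions, mask_token="[MASK]"):
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = masked[i]    # the model must predict the original token
        masked[i] = mask_token
    return masked, labels

tokens = "the man went to the store".split()
masked, labels = mask_tokens(tokens, positions=[2, 5])
# masked: ['the', 'man', '[MASK]', 'to', 'the', '[MASK]']
# to fill each [MASK], the model can use context on BOTH sides
```

Contrast this with GPT's causal setup: here nothing stops attention from flowing in both directions, because the tokens to be predicted are hidden in the input itself rather than by a mask on the attention scores.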

Well, if you are feeling burdened by some of the above terms and want deeper knowledge of them, just relax. All these terms are beautifully explained in the following blogs:

http://jalammar.github.io/illustrated-bert/
http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
https://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/
https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a

9. GPT-2, XLNet

GPT-2 is a successor to GPT with more than 10X the parameters, trained on more than 10X the amount of data. Due to concerns about malicious applications of the technology, the authors initially did not release the larger trained models, which became a much-debated decision.
XLNet is a generalized autoregressive model. It outperforms BERT on 20 tasks, often by a large margin, making it a new go-to technique for transfer learning in NLP.
For a broader understanding of GPT-2 and XLNet, please refer to the blogs below.

http://jalammar.github.io/illustrated-gpt2/
https://openai.com/blog/better-language-models/
https://towardsdatascience.com/openai-gpt-2-understanding-language-generation-through-visualization-8252f683b2f8
https://towardsdatascience.com/what-is-xlnet-and-why-it-outperformsbert-8d8fce710335

10. Summary

Finally, you have covered the entire journey of learning the NLP prerequisites as per our proposed plan. Kudos to you!!! Models keep evolving thanks to the large number of researchers working actively in this field, and almost every month you come across a new paper that beats the previous state of the art. So the only way to keep up with this fast-changing field is to stay updated by going through research papers on a regular basis.

If you want to recap the entire journey, go through the following blog:
https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html

Here, I am also listing some of the blogs I find most useful when learning about new topics in the field of NLP:

https://medium.com/huggingface
http://jalammar.github.io/
https://ruder.io/
https://mlexplained.com/
https://mccormickml.com/

I hope these resources were helpful to you. As said earlier, it's impossible for anyone to cover every topic, so suggestions citing better blogs or important topics that I have missed are always welcome. I hope anyone who is thorough with these prerequisites can work on any of the NLP tasks. Till then, cheers, enjoy!!!


A collection of must known pre-requisite resources for every Natural Language Processing (NLP)… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
