In natural language processing (NLP), word embeddings are fundamental: a good embedding can substantially improve the accuracy of downstream models. Let's discuss word embeddings in this post.

### Introduction

Consider files like images, sound files, or video files. They contain a lot of information about their contents. For example, a set of pixels in an image carries information, and the relationships between pixels can be used to analyze patterns in that image. The same applies to sound files, which carry correlated characteristics such as frequency and amplitude. This information can be represented efficiently in matrix form, and the relationships between these characteristics are preserved.

But in the case of text, how can we represent it efficiently? How can we capture the relationships between individual words and represent them without losing those relationships?

### Language Models

Anything that works with text needs a language model. A language model describes how individual words come together to form sequences.

A language model computes the probability of a sequence of words, P(w1, …, wT).
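For concreteness (this identity is standard and not from the original post), the chain rule factors this joint probability into one conditional term per word:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

A language model's job is to estimate each of these conditional factors.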

#### One-hot representation

One-hot representation is a vector representation for words: each word is represented by a vector with all elements 0 and exactly one element 1.

For example, consider this sentence: *I love deeplearning.*

**I** is represented by –> [1, 0, 0]

**love** is represented by –> [0, 1, 0]

**deeplearning** is represented by –> [0, 0, 1]

We can see that each of these vectors is sparse.

Problem: with one-hot representation, the length of each vector is equal to the size of the vocabulary.

The vocabulary is the set of all words we consider. As the vocabulary grows, the one-hot vectors grow with it.
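The one-hot scheme above can be sketched in a few lines. This is a minimal illustration: the vocabulary, ordering, and function names are my own assumptions, not part of the original post.

```python
import numpy as np

sentence = "I love deeplearning"
vocab = sentence.split()  # ["I", "love", "deeplearning"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("love", len(vocab)))  # [0. 1. 0.]
```

Note how the vector length is tied to `len(vocab)`: adding words to the vocabulary makes every vector longer, which is exactly the scaling problem described above.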

#### How to preserve the relation between words?

As we can see, one-hot vectors do not capture any kind of relationship between words. We need methods that extract this relationship.

If we can map the words to a lower-dimensional space, maybe we can extract some relationships. This is similar in spirit to PCA (principal component analysis). Based on this intuition, let's look at the possible techniques and ideas.

### Idea 1: Reduce Dimensionality

We can apply dimensionality reduction and map the words to a lower-dimensional space, where relationships between words become easier to find.

Using the SVD (singular value decomposition) of a word co-occurrence matrix, we can reduce the dimensionality.

But this does not scale to large datasets: the co-occurrence matrix alone grows quadratically with the vocabulary size, and computing its SVD is even more expensive.
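As a sketch of this count-then-decompose idea (the toy corpus, window size, and variable names below are my own illustration), we can build a small co-occurrence matrix and keep only the top singular directions as word vectors:

```python
import numpy as np

corpus = ["i love deep learning", "i love nlp", "deep learning is fun"]
tokens = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(tokens)}
V = len(tokens)

# Co-occurrence counts with a window of 1 word on each side.
X = np.zeros((V, V))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1

# Truncated SVD: keep the top k singular directions as embeddings.
U, S, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * S[:k]  # each row is a k-dimensional word vector

print(embeddings.shape)  # (V, 2): one dense vector per word
```

Even on this toy example the cost of the SVD is dominated by the V×V matrix, which is what makes the approach impractical for real vocabularies of hundreds of thousands of words.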

### Idea 2: Find the vectors without matrix decomposition

Instead of highly complex decomposition algorithms, we can learn an approximation of the lower-dimensional representation directly. This idea is based on the following hypothesis.

#### Distributional hypothesis

*"Words that are used and occur in the same contexts tend to have similar meanings."*

In other words, the words surrounding a center word share a meaningful relationship with that center word.

### Vectors for word representation: word2vec and GloVe

word2vec, GloVe, etc. are vector word representations that learn relationships between words.

These representations produce word vectors with interesting relationships, such as:

vec(king) − vec(man) + vec(woman) ≈ vec(queen)

### Word2vec

Let's dig deeper into word2vec. word2vec consists of two algorithms:

- **CBOW (continuous bag of words)** predicts the center word from the context words.
- **Skipgram** predicts the context words given the center word.

### CBOW (Continuous bag of words)

CBOW predicts the **center word** when the **context words** are given as input. In CBOW, a shallow neural network learns the word vectors while predicting the center word from its context.

### Center word and context words

In the first step we define a window size. The **window size** is the number of context words we consider on each side of the center word. For example, in the picture above the window size is set to 2.

The **window size** is a **hyperparameter**.

So 2 words on either side are taken. The blue-colored word is the center word, and the surrounding words in the window are the context words.

We can see above how the training data and labels are created; in that picture the window size is set to 1. (Note: the two pictures above are from completely different projects.)
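The pairing procedure described above can be sketched in a few lines. The sentence, window size, and function name here are illustrative assumptions, not from the original post:

```python
def make_cbow_pairs(words, window):
    """For each position, pair the surrounding words (the context)
    with the word at that position (the center word, i.e. the label)."""
    pairs = []
    for i, center in enumerate(words):
        context = [words[j] for j in range(i - window, i + window + 1)
                   if j != i and 0 <= j < len(words)]
        pairs.append((context, center))
    return pairs

sentence = "the quick brown fox jumps".split()
for context, center in make_cbow_pairs(sentence, window=1):
    print(context, "->", center)
# e.g. ['the', 'brown'] -> quick
```

For CBOW the context list is the input and the center word is the label; skipgram would simply flip each pair around.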

### CBOW formula deduction

To make it easy to deduce the formula for CBOW, let's consider a simple model with only one context word as input.

The input **x** is a one-hot representation vector: for a given input context word, only one of the V units {x1, ⋯, xV} will be 1, and all other units are 0. For example, x = [0, ⋯, 1, ⋯, 0].

The weights between the input layer and the hidden layer can be represented by a *V*×*N* matrix *W*. Each row of *W* is the *N*-dimensional vector representation **v_w** of the associated word of the input layer.

The hidden layer is then mapped to V output scores, and the softmax function turns these scores into the probability of each word being the center word. We need to define a cost function so that we can maximize this probability; we can use the **log-likelihood** for this.
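The single-context-word forward pass just described can be sketched as follows. This is a minimal illustration: the dimensions, random weights, and variable names (`W_out` for the hidden-to-output matrix) are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3                      # vocabulary size, embedding dimension
W = rng.normal(size=(V, N))      # input -> hidden weights (one word vector per row)
W_out = rng.normal(size=(N, V))  # hidden -> output weights

x = np.zeros(V)
x[2] = 1.0                       # one-hot input context word

h = W.T @ x                      # hidden layer: simply row 2 of W
u = h @ W_out                    # one score per word in the vocabulary
p = np.exp(u) / np.exp(u).sum()  # softmax: probability of each center word

print(p)                         # V probabilities summing to 1
```

Because `x` is one-hot, the multiplication `W.T @ x` is just a lookup of the context word's row in `W`, which is why the rows of `W` end up being the learned word vectors.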

In the next post, let's discuss the cost function in more detail and the mathematical steps to derive the weight updates.

We will also discuss the skipgram model, deduce its probability and cost function, and finally look at how these two algorithms work together in word2vec.