Word embedding and word2vec


In Natural Language Processing, word embeddings are extremely important: an efficient embedding can be the difference between mediocre and highly accurate results. Let’s discuss word embeddings a little in this post.

Introduction

Consider files like images, sound files, or video files. They contain a lot of information about their contents. For example, a set of pixels in an image carries a lot of information, and the relationships between pixels can be useful for analyzing patterns in that image. The same applies to sound files: they carry information like frequency, amplitude, etc., which can be correlated. This information can be efficiently represented in matrix form, and the relationships between these characteristics are preserved.

But in the case of text, how can we represent it efficiently? How can we find the relationships between individual words and represent them without losing those relations?

Language Models

Anything involving text needs a language model. A language model tells us how individual words come together.

Formally, a language model computes the probability of a sequence of words, P(w1, …, wT).

One-hot representation

One-hot representation is a vector representation for words. Each word is represented by a vector with all elements 0 and exactly one element 1.

For example, consider this sentence: I love deeplearning.

I is represented by –> [1, 0, 0]

love is represented by –> [0, 1, 0]

deeplearning is represented by –> [0, 0, 1]

We can see that each vector is a sparse vector.

Problem: The issue with one-hot representation is that the length of the vector equals the size of the vocabulary.

The vocabulary is the set of all words we consider. As the size of the vocabulary increases, the size of the one-hot vector increases with it.
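To make this concrete, here is a minimal Python sketch of one-hot encoding (the `one_hot` helper and the tiny vocabulary are only illustrative):

```python
import numpy as np

def one_hot(word, vocab):
    """Return the one-hot vector for `word`, given an ordered vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["I", "love", "deeplearning"]
print(one_hot("love", vocab))   # [0. 1. 0.]
```

Note how the vector length is tied to `len(vocab)`: a 100,000-word vocabulary means 100,000-dimensional vectors.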

How to preserve the relation between words?

As we can see, one-hot vectors do not capture any kind of relation between words. We need methods that extract this relation.

If we can map the words to a lower dimension, maybe we can extract some relation. This is similar in spirit to PCA (Principal Component Analysis). So, based on this intuition, let’s see the possible techniques and ideas.

Idea 1: Reduce Dimensionality


We can apply dimensionality reduction and map the words to a lower dimension. This way we can find the relationships.

Using the SVD (singular value decomposition) of a word co-occurrence matrix, we can reduce the dimensionality.

But this is not applicable to a large dataset, since the cost grows quadratically with the vocabulary size.
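As a rough sketch of the idea (the toy co-occurrence counts below are made up), numpy’s SVD can project each word onto a few dominant directions:

```python
import numpy as np

# Toy word-word co-occurrence matrix; rows and columns index the vocabulary
X = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X)
k = 2                               # keep only the top-k singular directions
word_vectors = U[:, :k] * s[:k]     # one k-dimensional vector per word
print(word_vectors.shape)           # (3, 2)
```

For a realistic vocabulary the matrix would be enormous, which is exactly why this decomposition becomes too expensive.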

Idea 2: Find the vectors without matrix decomposition

Without resorting to expensive matrix decompositions, we can learn an approximation of the lower-dimensional representation directly. This idea is based on the following hypothesis.

Distributional hypothesis

“Words that are used and occur in the same contexts tend to have similar meanings.”

In other words, the words surrounding a center word stand in a meaningful relation to that center word.

Vectors for word representation: word2vec and GloVe

word2vec, GloVe, etc. are vector word representations that learn relationships between words.

These representations produce word vectors with interesting relationships, such as

  vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)

Word2vec

Let’s dig deeper into word2vec. word2vec consists of two algorithms:

  1. CBOW(Continuous bag of words)
  2. Skipgram

CBOW predicts the center word from the context words

Skipgram predicts the context words given the center word.

CBOW (Continuous Bag of Words)


CBOW predicts the center word when the context words are given as input. In CBOW, a shallow neural net is used to learn the word vectors.

Center word and context words.


In the first step, we define a window size: the number of context words we consider on each side. For example, in the picture above the window size is set to 2.

Window size is a hyperparameter.

So 2 words on either side are taken. The blue-colored word is the center word, and the surrounding words inside the window are the context words.


We can see above how the training data and labels are created; here the window size is set to 1. (Note: the two pictures above come from completely different examples.)
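The pair-construction step above can be sketched in a few lines of Python (the `training_pairs` helper is illustrative, not from the original post):

```python
def training_pairs(tokens, window=1):
    """Build (context_words, center_word) pairs for CBOW training."""
    pairs = []
    for i, center in enumerate(tokens):
        # take up to `window` words on each side of the center word
        context = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
        pairs.append((context, center))
    return pairs

sentence = "the quick brown fox".split()
print(training_pairs(sentence, window=1))
```

For skipgram the same pairs are used in the opposite direction: the center word predicts each context word.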

CBOW formula deduction


To make it easy to deduce the formula for CBOW, let’s consider a simple model with only one context word as input.

The input x

x is a one-hot representation vector: for a given input context word, only one of the V units {x1, ⋯, xV} will be 1, and all other units are 0. For example,

x = [0, ⋯, 1, ⋯, 0]

The weights between the input layer and the hidden layer can be represented by a V×N matrix W. Each row of W is the N-dimensional vector representation v_w of the corresponding word of the input layer.

 


 

So we have found the probability of the output word using the softmax function. Next, we need to define a cost function so that we can maximize this probability; we can use the log-likelihood for that.
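Putting the pieces together, a minimal numpy sketch of this one-context-word forward pass might look as follows (the sizes V = 5, N = 3 and the random weights are arbitrary):

```python
import numpy as np

V, N = 5, 3                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, N))    # input -> hidden weights; rows are word vectors
W_out = rng.normal(size=(N, V))   # hidden -> output weights

x = np.zeros(V)
x[2] = 1.0                        # one-hot vector for the context word

h = W_in.T @ x                    # hidden layer: simply row 2 of W_in
scores = W_out.T @ h              # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
```

Because x is one-hot, the matrix product simply selects one row of W_in: that row is the learned word vector.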

In the next post, let’s discuss the cost function and the mathematical steps to derive the updates.

We will also discuss the skipgram model, deduce its probability and cost function, and finally look at how these two algorithms work together in word2vec.

 


What is ‘learning’ in ML algorithms?

Machine learning is a big trend in the industry now. The performance of these algorithms improves every day and is coming on par with human skills.

A machine learning algorithm is an algorithm that learns from data. But what does ‘learning’ really mean? How can we define it accurately? Only something that is well defined can be analysed and measured, right?

In the book Machine Learning by Tom Mitchell (1997), learning is defined as follows:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

 

Xavier Initialization

I was reading code implementations of various GAN models and something very beautiful struck me. It was a very simple function used in all of those implementations called “Xavier initialization”.

After some digging on Google I found that it is something worth writing about today. I first heard about Xavier initialization in Stanford’s cs231n lecture. I didn’t search for it then, but today I have finally found it.

So what is it?

Xavier initialization is a method for initializing the weights of deep neural networks. When we start to build deep nets, the initial values are always confusing; proper initialization can be the difference between a network that converges quickly and one that gets stuck.

  • If the initial values are too small, the signal shrinks as it passes through each layer and can fade away before it reaches deep into the network.
  • If the initial values are too large, the signal grows through the layers and arrives at the end with massive values, which makes it useless.

Xavier’s method of initialization avoids these problems by initializing the weights of each layer from a distribution with zero mean and a specific variance,

\text{Var}(W) = \frac{1}{n_\text{in}}

How it works?

As I said earlier, Xavier’s method keeps the variance the same across all layers. Let’s write a neuron’s output as

Y = W_1X_1 + W_2X_2 + \dotsb + W_n X_n + b

Taking the variance,

\text{Var}(Y) = \text{Var}(W_1X_1 + W_2X_2 + \dotsb + W_n X_n + b)

Since b is a constant, its variance is 0 and the term drops out.

 

Note: if two variables are independent, the variance of their product can be written as

\operatorname{Var}(XY) = [E(X)]^{2}\operatorname{Var}(Y) + [E(Y)]^{2}\operatorname{Var}(X) + \operatorname{Var}(X)\operatorname{Var}(Y)

Here E(·) is the expectation. If the weights come from a zero-mean Gaussian, and we assume the inputs are also zero-mean, the E(·) terms vanish:

\text{Var}(W_iX_i) = E[X_i]^2 \text{Var}(W_i) + E[W_i]^2 \text{Var}(X_i) + \text{Var}(W_i)\text{Var}(X_i)

\text{Var}(W_iX_i) = \text{Var}(W_i)\text{Var}(X_i)

 

Substituting this back into the expression for Var(Y), and assuming the W_i and X_i are independent and identically distributed, we get

\text{Var}(Y) = \text{Var}(W_1X_1 + W_2X_2 + \dotsb + W_n X_n) = n\,\text{Var}(W_i)\,\text{Var}(X_i)

 

Var(Y) is the variance of the output and Var(X) the variance of the input. The equation above shows that the output variance is the input variance scaled by a factor of n·Var(W).

 

To make the variance of the output and the input the same, n·Var(W) must equal 1:

\text{Var}(W_i) = \frac{1}{n} = \frac{1}{n_\text{in}}
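A quick numpy sketch can sanity-check this result (the layer sizes and seeds below are arbitrary):

```python
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    """Zero-mean Gaussian weights with Var(W) = 1 / n_in."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

n_in, n_out = 1000, 500
W = xavier_init(n_in, n_out)
x = np.random.default_rng(1).normal(size=n_in)   # unit-variance input

y = W.T @ x
# Var(y) ≈ n_in * Var(W) * Var(x) = 1, matching the input variance
print(np.var(y))
```

The empirical variance of y comes out close to 1, i.e. the signal neither fades nor explodes after the layer.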

 

Xavier initialization in Frameworks

 

Tensorflow

W = tf.get_variable("W", shape=[784, 256],
                    initializer=tf.contrib.layers.xavier_initializer())

Caffe

weight_filler { type: "xavier" }

GAN -Generative Adversarial Network

“(GANs), and the variations that are now being proposed is the most interesting idea in the last 10 years in ML.”

These are the words of Yann LeCun, one of the most famous researchers in deep learning. The GAN, or Generative Adversarial Network, is considered to be one of the most promising developments in ML last year.

What is GAN?

If you have some idea about deep learning networks, you will know about using them for various tasks like NLP, image recognition, and object detection. Neural networks work extremely well on all of these. But does that mean they can do well in all areas? Imagine this situation: an artist is creating a masterpiece. A neural network might be able to recognize the objects in the painting, but can we create a network that can produce a great painting like that itself? Can a machine write beautiful articles and publish them? This is considered very difficult, since we train the networks on existing data. The problem here is to make the network generate its own data. This is a generative task.

The issue with the Generative task.

Imagine you are learning chess. If you want to improve your game, you need a powerful opponent who makes innovative moves; a chess player improves by analyzing his moves and finding his mistakes. But if we create a model for chess, we train it only on moves we manually prepare — it has no such opponent.

Image Generation Problem.

The image generation problem is one where the machine learning model generates an image by itself. Imagine a machine creating a drawing as beautiful as a Picasso. We give it a set of images for training, and at test time the model generates images that are similar to the training images, but not the same.

Problem

Here the issue is that we need a way to score the output image. If there are two output images, how can we say which one is better?

GAN proposes to use a neural network for this process. So in addition to the model, there is another neural network that scores the image output. The neural net that generates the image is called Generator and the one that scores the image is called the Discriminator.

 

How does a GAN work?

As I said earlier, there are basically two neural nets: the Generator and the Discriminator.

Generator

The generator has the duty of generating a novel image based on the patterns in the training samples. It samples from a probability distribution to generate images with patterns similar to the training set. We can see the generator as a function G(z); it takes an input z sampled from a prior probability distribution p(z).

Discriminator

The discriminator takes input from two sources: real data, and generated data from the generator. Its duty is to score each image into one of two classes — real or generated; i.e., it tells whether its input is a real image or a generated one. It uses a sigmoid output to produce this score.

Training GAN

Training a GAN is similar to two agents playing against each other in a reinforcement learning setup. Here the generator tries to create images that the discriminator cannot distinguish from real ones, and the discriminator tries to tell the generated images apart from the real images. Like a game, right?

Mathematically, it is represented as

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

In the value function V(D, G), the first term is the expected log-likelihood the discriminator assigns to samples from the real data distribution; the discriminator tries to push D(x) toward 1. The second term concerns the generated (fake) images: the discriminator tries to push D(G(z)) toward 0, while the generator tries to push it toward 1.
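The value function can be illustrated with a toy numpy sketch (the scores below are invented; a real discriminator would produce them from images):

```python
import numpy as np

def value_fn(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A confident discriminator (real -> ~1, fake -> ~0) yields a high value;
# once the generator fools it (fake -> ~1), the value drops.
confident = value_fn(np.array([0.90, 0.95]), np.array([0.05, 0.10]))
fooled = value_fn(np.array([0.90, 0.95]), np.array([0.80, 0.90]))
print(confident > fooled)   # True
```

The discriminator is trained to increase this value, the generator to decrease it — exactly the min-max game in the formula.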

 

Applications of GAN

1. Increasing resolution of Image

2. Text to image generation

This is implemented with some modifications called Conditional GAN. Paper

3. Interactive Image Generation

Conclusion

Here I have tried to give a simple and brief idea of GANs. In the next post, let’s look at training a GAN more practically and implement one in TensorFlow.

Happy new year

It’s 2018. Happy new year to everyone. Last year has been amazing. Many memories and learnings. Planning to make a lot of crazy adventures this year. New year resolution of this year is to blog every day about the stuff I learn in the field of Artificial intelligence, Computing, and Mathematics. I have found that when I blog about a particular topic I am understanding it a lot better. I am extremely excited about all the new things I am gonna learn this year.

Wishing everyone a crazy and exciting year ahead.

Learning Convex Optimization

Starting to learn a very new topic. It’s been a while since my undergrad maths classes — a while since I learned any new maths concepts. After going through many lectures on ML and deep learning, I find that most of the problems are optimization problems. That’s how I got introduced to convex optimization, and found this amazing book. […]

Building a Neural Net to recognize digits

Let’s build a neural net to recognize handwritten digits. This was a very difficult problem about 5 years ago. I remember watching a program on the Discovery Channel that showed how a post office used high-tech cameras and software to sort mail by handwritten pin codes. Today the same recognition is possible with more than 95% accuracy, using a neural net of less than 100 lines of code.

Difficulty level –> beginner

This is a very simple neural net to build using a deep learning library; experts consider this the ‘Hello, world’ task of deep learning. Learning and understanding all the theory behind it is a little more difficult.

Tech Stack

I am using the Keras Python library to build a convolutional net fed with the MNIST dataset.

List of tools and libraries used:

  1. Keras (TensorFlow backend)
  2. numpy
  3. matplotlib
  4. Spyder IDE
  5. Conda for virtual environments

Task 1- Setup environment.

Let’s start by installing and setting up all the libraries needed for this task. Conda is used for this; it’s a powerful package and environment management tool. You can install all required packages inside a conda environment. See the conda documentation for installation instructions.

A new environment can be created using the following command:

conda create -n DL

Here DL is the name I have given to the environment; you can give any name. We can create more such environments as we need and keep our system clean of conflicting libraries. The environment then needs to be activated as follows:
activate DL                           (if you are on Windows)
source activate DL                    (if you are on Linux)

Install the required packages — TensorFlow, numpy, keras, matplotlib — using conda install:

conda install tensorflow numpy keras matplotlib

Installing Spyder IDE

Spyder IDE is best suited for data analysis, plotting, etc. You can also write the code in a plain editor like Notepad or gedit; I am using Spyder because of all the features it provides.

Adding the python packages to Spyder path.

Sometimes Spyder does not detect the path of the libraries installed inside the conda environment. You can add the path via Tools -> PYTHONPATH manager, then update modules via Tools -> Update module list. Sometimes you need to close and restart the IPython console at the bottom right for the changes to apply.

Start coding.

We have done lots of setup now. Let’s start coding.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(923)

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

As the first step, let’s import all the required libraries. We are using the MNIST dataset, which is already included with Keras; as highlighted, we import the dataset into our Python file as well.

Dataset

The MNIST dataset consists of images of handwritten digits. Each image is 28×28, and there are 60,000 of them in the training set. Each image is represented as a 2D matrix, so the complete dimension is 60000×28×28.

Importing Neural net layers from Keras

https://i2.wp.com/luizgh.github.io/assets/lasagne_basics/lenet5.png

The convnet we are developing is similar to the picture above. We need to import the layers from Keras. If we were building it in pure Python, we would need to build each layer from scratch; Keras comes preloaded with all the required layers. We import the Dense layer, Convolution2D layer, MaxPooling2D layer, and Dropout layer.


Setting up sizes

Here I am setting the batch size to 128; that is, backpropagation will run on batches of 128 samples each.

The nb_classes variable is the number of classes that the dataset images belong to. The neural net must predict one of these classes when given a digit image as input. Since our problem is to predict the digit, there are 10 classes (0–9).

The image dimensions are set to 28×28. Since the images are represented as rows and columns of a matrix, we set the img_rows and img_cols variables to 28.

batch_size = 128
nb_classes = 10   # for 10 digits

# input image dimensions
img_rows, img_cols = 28, 28

Loading the data

Now let’s load the data. As I said earlier, the image data is in matrix form. We need to split the dataset into two parts:

  1. Training data
  2. Testing data
Training Data

This is used to train our model. We define two variables: X_train holds the image matrices and y_train the corresponding digit classes.

Testing data

This is used to test the accuracy of our model. We have two variables for that as well: X_test and y_test.

# load the data,
(X_train, y_train) , (X_test,y_test) = mnist.load_data()

If you are using spyder ide then variable explorer can be used to view the X_test and y_test variables.

Adding channel information

Right now the dimension of the image data is 60000×28×28, which contains no channel information. Color images have a channel for each of (R, G, B). Here we reshape the image matrix to add the channel information; this is required because the library functions expect it, and leaving it out may cause errors. Since our images are grayscale, we add 1 as the channel.

Channels first and channels last

The channel dimension can be placed in two orders in Keras: either before the rows and columns, or at the end. This order can be configured in the Keras backend.

Keras stores this configuration in a JSON file. On Linux it is stored in

$HOME/.keras/keras.json

NOTE for Windows Users: Please replace $HOME with %USERPROFILE%.

The default configuration file looks like this:

{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

Here we can also set the backend for Keras to use: TensorFlow, CNTK, or Theano. image_data_format is what we are interested in now. It specifies the order in which the dimensions of input images must be given: channels_last means the channel info comes at the end, channels_first means the opposite.

if K.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

Categorize the classes

Each digit belongs to a class from 0 to 9, but our neural network cannot understand the classes as digits — this is categorical data. We need to one-hot encode these 10 classes: we make 10 columns, each indicating one class, and if an image belongs to a particular class, the model sets the corresponding column to one.

To categorize the data, Keras provides the to_categorical function.

#categorize 
Y_train = np_utils.to_categorical(y_train,nb_classes) 
Y_test = np_utils.to_categorical(y_test,nb_classes)

Creating the model

Now all the preprocessing is complete. Let’s start to build our model. Keras provides layers class with all standard layers we need to create a convolution network

model = Sequential()
model.add(Convolution2D(6,5,5,input_shape=input_shape,border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Convolution2D(16,5,5,border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Convolution2D(120,5,5,border_mode='same'))
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(84))
model.add(Activation('relu'))
model.add(Dropout(.5))
model.add(Dense(10))
model.add(Activation('softmax'))

A Sequential() object is initialized to create a sequential model. The add() function stacks new layers onto the sequential model.

model.add(Convolution2D(6,5,5,input_shape=input_shape,border_mode='same'))

  • We are using 6 filters of size 5×5 in this convolution layer; those are the first three parameters.
  • input_shape is the dimension of the images given as input. In Keras we only need to give the input size for the first layer; all other layers infer their dimensions automatically.
  • border_mode='same' pads the input so that the output feature map has the same spatial size as the input. More info can be found here.

model.add(Activation('relu'))

  • This is an activation layer using the ReLU function. We also use a softmax activation at the end of our model.
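Both activations are simple to write down; here is a minimal numpy sketch (purely illustrative — not how Keras implements them internally):

```python
import numpy as np

def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

print(relu(np.array([-2.0, 3.0])))                 # [0. 3.]
print(softmax(np.array([1.0, 2.0, 3.0])).argmax())  # 2
```

ReLU is used between layers, while softmax sits at the output so the 10 class scores become a probability distribution.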

model.add(MaxPooling2D(pool_size=(2,2)))

  • Max pooling downsamples each feature map by taking the maximum over 2×2 regions.

model.add(Dense(10))

  • A Dense (fully connected) layer; this final one has 10 units, one per digit class.

 

Compiling the model

Till now we have designed our model. We have not yet compiled it.

 model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'])

We use the compile() function to compile the model, passing the loss function and the optimizer algorithm as parameters. ‘categorical_crossentropy‘ is used here as the loss function, and ‘adadelta‘ is the optimizer used during backpropagation. We also request the accuracy metric so that evaluate() can report it later.
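For intuition, categorical crossentropy for a single one-hot-labeled sample can be sketched in numpy (illustrative only; Keras applies its own batched implementation):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred):
    """Loss for one sample: -sum(true * log(predicted))."""
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 0.0, 1.0])          # one-hot label: class 2
confident = np.array([0.05, 0.05, 0.90])    # prediction close to the label
uncertain = np.array([0.40, 0.30, 0.30])    # spread-out prediction

# The loss is lower for the confident, correct prediction
print(categorical_crossentropy(y_true, confident) <
      categorical_crossentropy(y_true, uncertain))   # True
```

The optimizer’s job during training is simply to drive this loss down across all batches.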

 

Fitting the model

Once compilation is done, the next step is to fit the model using Keras’s fit() function. We pass the training data (X_train, Y_train), the batch_size, the number of epochs, and the test data (X_test, Y_test) as input.

 model.fit(X_train,Y_train,batch_size=batch_size,epochs=nb_epoch,verbose=1,validation_data=(X_test,Y_test))

This is the output; we can see the model being trained.


Evaluation of our model

Our model is complete and trained. Let’s evaluate it.

score = model.evaluate(X_test,Y_test,verbose=0)

The evaluate() function returns the score — the loss, plus any metrics we requested — for our model.

Prediction

res =model.predict_classes(X_test[:9])


Now our model is complete. We can predict a handwritten digit image using the predict_classes() function. The output is plotted as above.

Final code

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 21 08:16:25 2017

@author: akshaynathr
"""

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
import keras.backend as K
batch_size = 128
nb_classes = 10 # for 10 digits

#input image dimensions
img_rows , img_cols = 28,28

# load the data,
(X_train, y_train) , (X_test,y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# convert to float before scaling pixel values to [0, 1]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

print("X_train shape:" , X_train.shape)
print("X_test_shape:" , X_test.shape)

#categorize
Y_train = np_utils.to_categorical(y_train,nb_classes)
Y_test = np_utils.to_categorical(y_test,nb_classes)

print("One hot encoding: {}".format(Y_train[0,:]))

for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(X_train[i].reshape(img_rows, img_cols), cmap='gray')
    plt.axis('off')

model =Sequential()
model.add(Convolution2D(6,5,5,input_shape =input_shape,border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Convolution2D(16,5,5,border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Convolution2D(120,5,5,border_mode='same'))
model.add(Activation('relu'))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(84))
model.add(Activation ('relu'))
model.add(Dropout(.5))
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'])

nb_epoch =2
model.fit(X_train,Y_train,batch_size=batch_size,epochs=nb_epoch,verbose=1,validation_data=(X_test,Y_test))
score = model.evaluate(X_test,Y_test,verbose=0)
print("Test score:",score[0])
print("Test accuracy:",score[1])

res =model.predict_classes(X_test[:9])
plt.figure(figsize=(10,10))

for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(X_test[i].reshape(img_rows, img_cols), cmap='gray')
    plt.gca().get_xaxis().set_ticks([])
    plt.gca().get_yaxis().set_ticks([])
    plt.ylabel("prediction=%d" % res[i],fontsize =10)