Xavier Initialization

I was reading code implementations of various GAN models and something very beautiful struck me. It was a very simple function used in all of those implementations called “Xavier Initialization”.

After some digging on Google I found out that it’s something worth writing about today. I first heard about Xavier initialization in the CS231n lectures by Stanford. I didn’t search for it then, but today I have finally looked it up.

So what’s it?

Xavier Initialization is a method used to initialize the weights of neurons in deep learning networks. When we start to model deep neural nets, choosing the initial values is always confusing. Proper initialization can help your network converge faster, while poor initialization can leave it stuck.

  • If the initial values are too small, the signal shrinks as it passes through each layer and may fade away before it reaches deep into the network.
  • If the initial values are too large, the signal grows as it passes through each layer and arrives at the end with massive values, which makes it useless.

Xavier’s method of initialization takes away these problems by initializing the weights in the network from a distribution with zero mean and a specific variance,

\text{Var}(W) = \frac{1}{n_\text{in}}

How does it work?

Xavier’s method keeps the variance of the signal the same from layer to layer. Consider a single neuron whose output is a linear combination of its n inputs plus a bias:

Y = W_1X_1 + W_2X_2 + \dotsb + W_n X_n + b

Let’s take the variance of the output:

\text{Var}(Y) = \text{Var}(W_1X_1 + W_2X_2 + \dotsb + W_n X_n + b)

Since b is a constant, its variance is 0 and it drops out.


Note: if two variables X and Y are independent, the variance of their product can be written as

\text{Var}(XY) = [E(X)]^2 \text{Var}(Y) + [E(Y)]^2 \text{Var}(X) + \text{Var}(X)\text{Var}(Y)

Here E() is the expectation. Applied to a single weight and its input:

\text{Var}(W_iX_i) = E[X_i]^2 \text{Var}(W_i) + E[W_i]^2 \text{Var}(X_i) + \text{Var}(W_i)\text{Var}(X_i)

The weights are drawn from a Gaussian distribution with zero mean, and if we assume the inputs also have zero mean, both E() terms turn out to be zero:


\text{Var}(W_iX_i) = \text{Var}(W_i)\text{Var}(X_i)


Substituting this into the variance of the output, and assuming that all the X_i and W_i are independent and identically distributed, the variance of the sum is the sum of the variances:

\text{Var}(Y) = \text{Var}(W_1X_1 + W_2X_2 + \dotsb + W_n X_n) = n\text{Var}(W_i)\text{Var}(X_i)


Var(Y) is the variance of the output and Var(X_i) is the variance of the input. From the above equation, the variance of the output is the variance of the input scaled by n*Var(W_i).


To make the variance of the output equal to the variance of the input, n*Var(W_i) must be equal to 1:

\text{Var}(W_i) = \frac{1}{n} = \frac{1}{n_\text{in}}
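
The result can be checked numerically with a small NumPy sketch. The layer sizes, the number of samples and the random seed below are assumptions chosen only for illustration; the inputs are drawn with zero mean and unit variance, which matches the assumptions of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

n_in, n_out = 512, 256          # hypothetical layer sizes
n_samples = 2000                # hypothetical number of input vectors

# Inputs with zero mean and unit variance
X = rng.standard_normal((n_samples, n_in))

# Xavier-style weights: zero mean, Var(W) = 1 / n_in
W_xavier = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)
# Unscaled weights for comparison: zero mean, Var(W) = 1
W_plain = rng.standard_normal((n_in, n_out))

var_in = X.var()
var_xavier = (X @ W_xavier).var()
var_plain = (X @ W_plain).var()

print("input variance:           %.3f" % var_in)
print("output variance (Xavier): %.3f" % var_xavier)
print("output variance (plain):  %.3f" % var_plain)
```

With the scaled weights the output variance should stay close to the input variance, while the unscaled weights should blow it up by roughly a factor of n_in.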


Xavier initialization in Frameworks

In TensorFlow (1.x):

W = tf.get_variable("W", shape=[784, 256],
                    initializer=tf.contrib.layers.xavier_initializer())

In Caffe:

weight_filler { type: "xavier" }

GAN - Generative Adversarial Network

“(GANs), and the variations that are now being proposed is the most interesting idea in the last 10 years in ML”

These are the words of Yann LeCun, one of the most famous researchers in deep learning. GAN, or Generative Adversarial Network, is considered to be one of the most promising developments in ML last year.

What is GAN?

If you have some idea about deep learning networks you will know about using them for various tasks like NLP, image recognition, object detection etc. Neural networks work extremely well in all these tasks. But does that mean they can do well in all areas? Imagine this situation. An artist is creating a masterpiece drawing. A neural network might be able to recognize objects in the image, but can we create a network which can create a great painting like that? Can a machine write beautiful articles and publish them? This is considered to be very difficult, since we train the networks with existing data. The problem here is to make the network generate its own data. This is a generative task.

The issue with the Generative task.

Imagine you are learning chess. If you want to improve your game you need a powerful opponent who makes innovative moves. A chess player improves his game by analyzing his moves and finding the mistakes. If we are creating a model for chess, we train it using many moves that we prepare manually.

Image Generation Problem.

The image generation problem is where the machine learning model generates an image by itself. Imagine a machine creating a drawing as beautiful as a Picasso. We give a set of images for training, and at test time the model generates images that are similar to the training images but not the same.


Here the issue is that we need a way to score the output image. If there are 2 output images, how can we say which one is better?

GAN proposes to use a neural network for this process. So in addition to the model, there is another neural network that scores the image output. The neural net that generates the image is called Generator and the one that scores the image is called the Discriminator.


Working of GAN?

As I said earlier there are 2 neural nets basically. Generator and Discriminator.


The Generator has the duty of generating a novel image based on the patterns in the training samples. It learns a probability distribution so that the output image has patterns similar to those of the training set. We can see the Generator as the function G(z). It takes the input z, which is a sample from the probability distribution p(z).


The Discriminator takes input from two sources: real data and generated data from the Generator. Its duty is to sort the images into two classes, Generated and Real, i.e. it tells whether the input image is a real image or a generated image. It uses a sigmoid classifier to score the images.

Training GAN

Training of a GAN is similar to two agents playing against each other in a reinforcement learning setup. Here, the Generator is trying to create an image that the Discriminator cannot distinguish from a real image, and the Discriminator is trying to tell the generated image apart from the real image. Like a game, right?

Mathematically it is represented as

\min_G \max_D V(D, G) = E_{x \sim p_\text{data}(x)}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))]

In the function V(D, G), the first term is the expected log-likelihood that the Discriminator assigns to samples from the probability distribution of real data. The Discriminator tries to maximize it by pushing D(x) towards 1. The second term is the corresponding quantity for generated (fake) images. The Discriminator tries to maximize it by pushing D(G(z)) towards 0, while the Generator tries to minimize it by pushing D(G(z)) towards 1.
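
To get a feel for the numbers, here is a small NumPy sketch that estimates V(D, G) as the average of log D(x) over real samples plus the average of log(1 - D(G(z))) over generated samples, which is the usual form of the GAN objective. The function name value_function and the discriminator outputs are hypothetical and made up for illustration; no network is trained here.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Estimate V(D, G) from discriminator outputs.

    d_real: D(x) for real samples, values strictly between 0 and 1
    d_fake: D(G(z)) for generated samples, values strictly between 0 and 1
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs, made up for illustration
confident = value_function(d_real=[0.9, 0.95, 0.99], d_fake=[0.05, 0.1, 0.01])
fooled = value_function(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print("V when D separates real from fake:", confident)
print("V when D cannot tell them apart:  ", fooled)
```

A Discriminator that separates real from fake gets a value close to 0, the maximum. When it outputs 0.5 for everything the value drops to 2*log(0.5), which is the point the Generator is pushing towards.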


Applications of GAN

1. Increasing resolution of Image

2. Text to image generation

This is implemented with some modifications called Conditional GAN. Paper

3. Interactive Image Generation


Here I have tried to give a simple and brief idea about GAN. In the next post, let’s look at training a GAN more practically and let’s implement one in TensorFlow.

Happy new year

It’s 2018. Happy new year to everyone. Last year has been amazing, with many memories and learnings. I am planning a lot of crazy adventures this year. My new year resolution is to blog every day about the stuff I learn in the fields of Artificial Intelligence, Computing, and Mathematics. I have found that when I blog about a particular topic I understand it a lot better. I am extremely excited about all the new things I am gonna learn this year.

Wishing everyone a crazy and exciting year ahead.

Learning Convex Optimization

I am starting to learn a very new topic. After my undergrad maths classes, it’s been a while since I learned any new maths concepts. After going through many lectures on ML and deep learning I found that most of the problems are optimization problems. That’s when I got introduced to convex optimization. I found this amazing book. […]

Building a Neural Net to recognize digits

Let’s build a neural net to recognize handwritten digits. This was a very difficult problem about 5 years back or so. I remember watching a programme on Discovery channel that showed how a post office used high tech cameras and software to sort mail with handwritten pin codes. Today the same recognition is possible with an accuracy of more than 95% using neural nets in less than 100 lines of code.

Difficulty level: beginner

This is a very simple neural net to make using a deep learning library. This task is also considered the Hello world of deep learning by experts. Learning and understanding all the theory behind it is a little bit more difficult.

Tech Stack

I am using the Keras Python library to build a convolutional net that is fed with the MNIST dataset.

List of tools and libraries used:

  1. Keras (TensorFlow backend)
  2. numpy
  3. matplotlib
  4. Spyder IDE
  5. Conda for virtual environment

Task 1- Setup environment.

Let’s start with installing and setting up all the libraries needed for this task. Conda is used for this. It’s a powerful package management and collaboration tool. You can install all required packages in a conda environment. See this documentation for installation instructions

A new environment can be created using the following command

conda create --name DL

Here DL is the name I have given to the environment. You can give any name. We can create as many such environments as we need and keep our system clean from conflicting libraries. The environment needs to be activated. It can be done as follows

activate DL                           (if you are on Windows)
source activate DL                    (if you are on Linux)

Install libraries using the command

conda install

We need to install these packages: TensorFlow, numpy, Keras, Matplotlib

Installing Spyder IDE

Spyder IDE is best suited for data analysis, plotting etc. You can also write the code in a plain editor like notepad or gedit. I am using Spyder because of all the features it provides.


Adding the python packages to Spyder path.

Sometimes the path for the libraries we installed inside the conda environment is not detected by Spyder. You can add the path in Spyder through Tools -> PYTHONPATH manager. Then update the modules through Tools -> Update module list. Sometimes you need to close and restart the IPython console at the bottom right to apply the changes.

Start coding.

We have done lots of setup now. Let’s start coding.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils

As the first step let’s import all the required libraries. We are using the MNIST dataset. It’s already included with the Keras library, so we import the dataset into our Python file as well.


The MNIST dataset consists of images of handwritten digits. Each image is of dimension 28×28 and there are 60000 of them in the training set. Each image is represented as a 2D matrix, so the complete training set has dimension 60000x28x28.

Importing Neural net layers from Keras


We need to import the layers for our convnet from Keras. If we were building it in pure Python we would need to build each layer from scratch, but Keras comes preloaded with all the required layers. We are importing the Dense layer, Convolution layer (2D), Max pooling layer and Dropout layer.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(923)

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils


Setting up sizes

Here I am setting the batch size to 128. That is, the training data is fed to the backpropagation algorithm in batches of 128 images each, and the weights are updated once per batch.

The nb_classes variable is the number of classes that the dataset images belong to. Given an image of a digit as input, the neural net must predict one of these classes at the end. Since our problem is to predict the digit, there are 10 classes (0-9).

The image dimensions are set as 28×28. Since the images are represented as rows and cols in a matrix, we set the img_rows and img_cols variables to 28.

batch_size = 128

nb_classes = 10 # for 10 digits

#input image dimensions

img_rows , img_cols = 28,28
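
As a quick illustration of what the batch size means in practice, this small sketch counts how many weight updates happen in one pass over the data. It assumes the 60000 images of the MNIST training set; the variable names batches_per_epoch and last_batch_size are only for illustration.

```python
import math

batch_size = 128
n_train = 60000  # number of images in the MNIST training set

# The last batch is smaller when batch_size does not divide n_train evenly
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)  # 469 96
```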

Loading the data

Now let’s load the data. As I said earlier, the image data is in matrix form. We need to split the dataset into 2 parts.

  1. Training data
  2. Testing data
Training Data

This is used to train our model. We define two variables, X_train and y_train. X_train will hold the image matrices and y_train will hold the corresponding digit classes.

Testing data

This is used to test the accuracy of our model. We have two variables for that as well: X_test and y_test.

# load the data,
(X_train, y_train) , (X_test,y_test) = mnist.load_data()

If you are using the Spyder IDE, then the variable explorer can be used to view the X_test and y_test variables.

Adding channel information

Here the dimension of the training images right now is 60000x28x28. This does not have the channel information of the image. Color images have channel information for (R,G,B). Here we are reshaping the image matrix to add the channel information. This is required because the library function is expecting it. If we don’t add that information it may cause an error. Since our images do not have color, let’s add 1 as the channel.

Channels first and channels last

The number of channels can be placed in two positions in Keras. Either the channel number is given before the rows and cols information, or it is given at the end. This order can be configured in the Keras backend.

Keras stores all this configuration in a JSON file. On Linux it is stored in

$HOME/.keras/keras.json

NOTE for Windows Users: Please replace $HOME with %USERPROFILE%.

The default configuration file looks like this:

{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

Here we can set the backend for Keras to use: TensorFlow, CNTK or Theano. The image_data_format is what we are interested in now. It specifies the order in which we must set the dimensions of the input images. channels_last means the channel info is given at the end, and channels_first means it is given before the rows and cols.

# K is the Keras backend module (import keras.backend as K)
if K.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)
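
The effect of the two orderings is easy to see on a dummy array. This is a minimal NumPy sketch that does not need Keras; the batch of 5 blank images is an assumption made only so the shapes can be printed.

```python
import numpy as np

img_rows, img_cols = 28, 28

# Stand-in for a small batch of grayscale images (5 is an arbitrary count)
images = np.zeros((5, img_rows, img_cols), dtype=np.uint8)

# channels_first: (samples, channels, rows, cols)
first = images.reshape(images.shape[0], 1, img_rows, img_cols)
# channels_last: (samples, rows, cols, channels)
last = images.reshape(images.shape[0], img_rows, img_cols, 1)

print(first.shape)  # (5, 1, 28, 28)
print(last.shape)   # (5, 28, 28, 1)
```

The pixel values are untouched in both cases. Only the position of the extra axis of length 1 changes.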

Categorize the classes

Each digit belongs to a class from 0 to 9. But our neural network cannot treat the class as a plain number. This kind of data is categorical data, so we need to encode these 10 classes. To do that we make 10 columns, each indicating one of the classes. If an image belongs to a particular class, that column is set to one and all the other columns are set to zero. This is called one hot encoding.

To categorize the data, Keras provides the to_categorical function

Y_train = np_utils.to_categorical(y_train,nb_classes) 
Y_test = np_utils.to_categorical(y_test,nb_classes)
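
To make the encoding concrete, here is a rough NumPy sketch of the same idea. The helper one_hot is a hypothetical stand-in written for illustration, not the Keras implementation, and the three labels are made-up examples.

```python
import numpy as np

def one_hot(labels, nb_classes):
    """Return a (len(labels), nb_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], nb_classes), dtype=np.float32)
    # Row i gets a 1 in the column given by labels[i]
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

y = [5, 0, 4]        # example digit labels
Y = one_hot(y, 10)

print(Y[0])          # the single 1 sits in column 5
```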

Creating the model

Now all the preprocessing is complete. Let’s start to build our model. Keras provides a layers module with all the standard layers we need to create a convolutional network.

 model =Sequential()
 model.add(Convolution2D(6,5,5,input_shape =input_shape,border_mode='same'))
 model.add(Activation ('relu'))

Sequential() object is initialized to create a sequential model. add() function is used to stack new layers to the sequential model

model.add(Convolution2D(6,5,5,input_shape =input_shape,border_mode='same'))

  • We are using six 5×5 filters in this convolution layer. That’s the first 3 parameters.
  • input_shape is the dimension of the images given as input. In Keras we only need to give the input size for the first layer. All other layers can work out their dimensions automatically.
  • border_mode='same' pads the input so that the output of the convolution has the same height and width as the input. More info can be found here

model.add(Activation ('relu'))

  • This is an activation layer. The activation function is ‘relu‘. We are also using a ‘softmax’ function at the end of our model.
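
For reference, both activation functions are only a few lines of NumPy. This is a sketch of the standard definitions, not the Keras code; the scores array is a made-up example.

```python
import numpy as np

def relu(x):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0.0, x)

def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

scores = np.array([-2.0, 0.0, 3.0])  # made-up raw scores

print(relu(scores))     # [0. 0. 3.]
print(softmax(scores))  # three probabilities, largest for the last score
```

relu is used between the layers, while softmax is used once at the very end so that the 10 outputs can be read as class probabilities.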




Compiling the model

Till now we have designed our model. We have not yet compiled it.


We need to use the compile() function to compile the model. We give the loss function and the optimizer as parameters. ‘categorical_crossentropy‘ is used here as the loss function, and ‘adadelta‘ is used as the optimizer that updates the weights during backpropagation.
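
To see what the loss measures, here is a small NumPy sketch of categorical cross-entropy on one hot labels. It follows the textbook definition; the clipping constant eps and the example predictions are assumptions for illustration, and this is not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, three classes, one hot labels
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)

# Made-up predictions: one set mostly right, one set mostly wrong
good = np.array([[0.05, 0.05, 0.9], [0.8, 0.1, 0.1]])
bad = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])

loss_good = categorical_crossentropy(y_true, good)
loss_bad = categorical_crossentropy(y_true, bad)

print("loss for good predictions:", loss_good)
print("loss for bad predictions: ", loss_bad)
```

The loss only looks at the probability given to the correct class, so confident correct predictions give a small loss and confident wrong ones give a large loss.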


Fitting the model

Once compilation is done, the next step is to fit the model. The fit() function is used for this in Keras. We need to give the training data (X_train, Y_train), batch_size, the number of epochs and the test data (X_test, Y_test) as input.


This is the output. We can see the model getting trained.


Evaluation of our model

Our model is complete and is trained. Let’s try to evaluate it.

score = model.evaluate(X_test,Y_test,verbose=0)

evaluate() function produces a score for our prediction


res =model.predict_classes(X_test[:9])


Now our model is complete. We can now predict a handwritten digit image using our model. predict_classes() function is used to predict. The output is plotted as above.

Final code

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
Created on Tue Nov 21 08:16:25 2017

@author: akshaynathr
"""

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense,Dropout,Activation,Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
import keras.backend as K
batch_size = 128
nb_classes = 10 # for 10 digits

#input image dimensions
img_rows , img_cols = 28,28

# load the data,
(X_train, y_train) , (X_test,y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    X_train = X_train.reshape(X_train.shape[0], 1, img_rows, img_cols)
    X_test = X_test.reshape(X_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)
    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# convert to float before scaling the pixel values to the range [0, 1]
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

print("X_train shape:" , X_train.shape)
print("X_test_shape:" , X_test.shape)

Y_train = np_utils.to_categorical(y_train,nb_classes)
Y_test = np_utils.to_categorical(y_test,nb_classes)

print("One hot encoding: {}".format(Y_train[0,:]))

for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(X_train[i].reshape(img_rows, img_cols), cmap ='gray')

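model =Sequential()
model.add(Convolution2D(6,5,5,input_shape =input_shape,border_mode='same'))
model.add(Activation ('relu'))
# Assumed minimal head (Flatten -> Dense -> softmax) so the model can be compiled
model.add(Flatten())
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adadelta',
              metrics=['accuracy'])

nb_epoch =2
model.fit(X_train, Y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test,Y_test,verbose=0)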
print("Test score:",score[0])
print("Test accuracy",score[1])

res =model.predict_classes(X_test[:9])

for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(X_test[i].reshape(img_rows, img_cols), cmap='gray')
    plt.ylabel("prediction=%d" % res[i],fontsize =10)

Resources to get started in Machine learning

I have come across many people asking how to get started in Machine learning. I will try to put down the resources that I have used and the ones that I  use now to learn more about ML.

Paid and Free

There are two types of resources available online: paid and free. I prefer the free materials and will mostly mention those here. Free resources get you the knowledge only, but no certificates to prove it. If you can spend money and you want certificates for the learning you did, you can also find paid courses on Udacity and Coursera. I personally found Coursera materials to be cheaper than those of Udacity.



Knowledge of maths is really important. I initially thought I could learn ML just by learning libraries and the APIs they provide. This was the way I learned web development. But in ML you will surely hit a blocker if you proceed without learning sufficient maths. A clear understanding of undergraduate maths is sufficient to get started. The main areas you need are as follows

  1. Linear algebra
  2. Calculus 
  3. Probability

These topics are enough for anyone to understand the basic stuff and start on ML. If you are not confident in those areas, there are many excellent tutorials on Youtube that cover them.

Resources: MIT Calculus lectures,  Probability Lectures,  NPTEL Linear algebra lectures, Khan academy Linear algebra

Getting started

You can learn about ML in two ways: either by implementing first and learning the theory later, or by learning the theory first and doing practicals later. When I started out there were very few practical examples available on the internet. So I started by learning the theory and later trying to implement it.

Andrew Ng is your superhero when it comes to beginner lectures in ML. You can find his lectures both on Coursera and Youtube. You can find the CS229 notes here too.

“An Introduction to Statistical Learning” is an excellent book to get started. But it’s very costly to get in India. If you have 3.5k rupees to spend on this book, you can buy it here

“Intro to machine learning” is a course offered by Udacity. It teaches the very basic ideas of ML. I recommend everyone to watch this course first. It’s pretty basic and also teaches the ideas well. You can find the course on Udacity or on Youtube


This lecture from MIT by Patrick Winston is one of the best lectures I have seen on SVM. He makes it crystal clear and easy to understand. You can find the complete playlist here


Neural networks.

You will surely learn about neural nets and deep learning. There are many good lectures on neural nets that you must watch. Hugo Larochelle has created this really nice series of tutorials on neural nets. It covers almost everything from basics to deep learning.

CS231n is a famous deep learning course by Stanford. It is taught by Andrej Karpathy, a very famous name like Andrew Ng. You can find excellent explanations of convnets, RNN, LSTM etc in this course.

There are Siraj Raval’s tutorials also. I have not watched many of his tutorials. He tries to teach ML in a fun and entertaining way.

Mathematicalmonk is a very famous maths Youtube channel. It also has an ML series. The specialty of this series is that it goes into a lot of mathematical detail on ML theory. You can also learn probability and other maths topics from here.

Deeplearning.ai is an initiative by Andrew Ng. It is a set of 4 courses on deep learning available on Coursera. The videos are available for free on Youtube.


Once you are comfortable with basics you can start implementing on your own.


Choosing a programming language for ML is a personal choice. Python and R are the most used languages. I prefer to use Python. R is also a very easy language to learn. Both of them have a ton of libraries that suit all needs for ML. Datacamp has a very nice R tutorial that is free. I learned R from there. Since I use Python, I only know about Python libraries for ML.


Numpy is a Python library that provides support for multi-dimensional arrays. It’s widely used while creating neural nets etc. Numpy is very easy to learn and is also very powerful.

Here is a nice tutorial for numpy from Dataquest.io 



Pandas is used to play with data. You can manipulate and retrieve information from different types of data with Pandas dataframe.

Datacamp provides a course for Pandas. Only the first chapter is free. Hackerearth has a nice tutorial on numpy and pandas here


Visualization is important to understand relations in data. Matplotlib is a fantastic python library for all visualization needs.

Datacamp matplotlib tutorial article


Tensorflow is a library created by Google for building neural net models, and it has great support. Datacamp has a tutorial blog on Tensorflow here. You can easily get started with basic concepts.

Here’s a nice set of videos on TF. There are also tutorials on the Tensorflow site.


Useful sites and Competitions


When it comes to competitions in data science Kaggle is the first name that comes up. Kaggle has connected the data scientists all over the world and comes up with very challenging data science competitions with heavy prizes. Also, there are many tutorials available in Kaggle about data science.

Analytics Vidhya

Analytics Vidhya is another useful website. It is an Indian website that offers a wide range of tutorials on data science, NLP, neural nets etc. They also conduct competitions on data science. I have used this website extensively to learn things.


Topcoder also provides many tutorials and competitions in data science. I cannot comment more since I have not tried any of them yet.


This is also a very useful site. I have found some interesting tutorials here.


A very famous site with many useful up-to-date articles on ML. You can also sign up for the newsletter.

Colah’s blog

Ruder’s blog

Gradient Descent Algorithm.

Gradient Descent is a very important algorithm used in machine learning. There are numerous variations of GD, and it is an active research area. This post is part of the notes that I wrote down while learning the different variations of Gradient Descent. It is not complete. I will be adding more details as I explore more interesting parts of Gradient Descent.


What’s Gradient Descent?


In simple terms, Gradient Descent is an algorithm that is used to find the minimum of a function.

You can imagine the function to be like the bowl here. We need to find the minimum of the function, which is the bottom of the bowl.

Imagine you roll a small ball from the top. It follows the steepest path and reaches the bottom. Gradient descent works the same way. It follows the negative gradient (slope) and tries to reach a minimum of the function.

You can see the algorithm moving along the path to reach the minimum in this picture.









Python Implementation.

Let’s try to find the minimum of a function using Gradient Descent. (This example is taken from wiki page for Gradient descent)

The gradient descent algorithm is applied to find a local minimum of the function f(x) = x^4 − 3x^3 + 2, with derivative f'(x) = 4x^3 − 9x^2

# From calculation, it is expected that the local minimum occurs at x=9/4

cur_x = 6 # The algorithm starts at x=6
gamma = 0.01 # step size multiplier
precision = 0.00001
previous_step_size = cur_x

def df(x):
    return 4 * x**3 - 9 * x**2

while previous_step_size > precision:
    prev_x = cur_x
    cur_x += -gamma * df(prev_x)
    previous_step_size = abs(cur_x - prev_x)

print("The local minimum occurs at %f" % cur_x)


Mathematical representation

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Here the parameter \alpha that scales the slope is known as the learning rate in neural network training.

J(\theta) is called the cost function. It measures the error of the output computed by the neural network, which we try to minimize.




 Stochastic gradient descent (SGD)

Vanilla gradient descent runs over the whole dataset to compute each update of the parameters. For large datasets this approach can become intractable on a single machine. Stochastic gradient descent is a slight modification of the vanilla gradient descent algorithm that addresses this.


Vanilla gradient descent estimates the expected gradient using all the items in the dataset. SGD uses a single sample to make each update.
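
Here is a minimal sketch of SGD fitting a straight line with NumPy. The synthetic data (y = 3x + 1 plus a little noise), the learning rate and the number of epochs are all assumptions picked for illustration; the point is only that each update looks at a single sample.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Synthetic data for y = 3x + 1 plus small noise (made up for illustration)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 1.0 + 0.05 * rng.standard_normal(200)

w, b = 0.0, 0.0
lr = 0.05  # learning rate, an arbitrary choice

for epoch in range(50):
    for i in rng.permutation(len(x)):   # visit the samples in random order
        error = (w * x[i] + b) - y[i]    # prediction error on ONE sample
        w -= lr * error * x[i]           # gradient of 0.5 * error**2 w.r.t. w
        b -= lr * error                  # gradient of 0.5 * error**2 w.r.t. b

print("w = %.2f, b = %.2f" % (w, b))
```

The estimates should end up close to 3 and 1, but they keep jittering a little because every update is based on one noisy sample.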


Mini batch Gradient Descent

Updates computed from single samples are very noisy. So a small subset of the samples is used as a mini-batch. A common mini-batch size is 256. Mini-batches tend to average a little of the noise out.

The gradient is calculated on a mini-batch of samples.
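
The same straight-line fit can be sketched with mini-batches. As before, the synthetic data, learning rate and epoch count are assumptions for illustration. The only real change from single-sample SGD is that the gradient is averaged over 256 samples before each update.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Synthetic data for y = 3x + 1 plus small noise (made up for illustration)
x = rng.uniform(-1.0, 1.0, size=1024)
y = 3.0 * x + 1.0 + 0.05 * rng.standard_normal(1024)

w, b = 0.0, 0.0
lr = 0.1          # learning rate, an arbitrary choice
batch_size = 256

for epoch in range(200):
    order = rng.permutation(len(x))               # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]     # indices of one mini-batch
        error = (w * x[idx] + b) - y[idx]
        w -= lr * np.mean(error * x[idx])         # gradient averaged over the batch
        b -= lr * np.mean(error)

print("w = %.2f, b = %.2f" % (w, b))
```

Averaging over the batch smooths out the noise of individual samples, so a larger learning rate can be used than in the single-sample version.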