
PERCEPTRON ALGORITHM

The Perceptron Algorithm is one of the most widely used algorithms for classifying data. It often seems a little tough when learning it for the first time.

In today's post, we'll try to understand it briefly but with a full explanation.

The Perceptron Algorithm is used whenever the given dataset can be classified into two parts, i.e., it has only two labels (which one can take to be +1 and -1).

There are three versions of the Perceptron Algorithm:

  1. Simple Perceptron
  2. Averaged Perceptron
  3. Pegasos Algorithm or Support Vector Machine

Relax! The terms sound dangerous, but they are really very easy. Let me explain.


1. Simple Perceptron:

The single target of the perceptron algorithm is to find a linear classifier (say, a linear equation) on one side of which lie all the positively labelled points, and on the other side all the negatively labelled points.

As we know, any linear equation can be written as

y = θ·x + θ₀

Using the perceptron, our aim is to find values of the vector θ and the offset θ₀ for which the resulting line divides our dataset into two parts, one positively labelled and the other negatively labelled (see the picture below).
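To make this concrete, once θ and θ₀ are known, a point is classified by the sign of θ·x + θ₀. A minimal Python sketch (the name classify is just illustrative):

import numpy as np

def classify(x, theta, theta_0):
    # Returns +1 or -1 depending on which side of the line
    # theta . x + theta_0 = 0 the point x lies.
    return 1 if np.dot(theta, x) + theta_0 > 0 else -1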


  • It is also an iterative method. We'll start with θ = 0 (the zero vector) and θ₀ = 0.
  • In each iteration, for every data point (xᵢ, yᵢ):
        • if yᵢ·(θ·xᵢ + θ₀) ≤ 0:
          • update θ = θ_old + yᵢ·xᵢ
          • update θ₀ = θ₀_old + yᵢ

The reason for the above updates can be derived using the Stochastic Gradient Descent algorithm; we'll derive them below. If you are not familiar with Stochastic Gradient Descent, click the link below.
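Putting the update rule together, here is a minimal NumPy sketch of the simple perceptron. The function name perceptron and the number of passes T are illustrative choices, not fixed by the algorithm:

import numpy as np

def perceptron(X, y, T=10):
    # X: (n, d) data matrix, y: labels in {+1, -1}
    n, d = X.shape
    theta = np.zeros(d)   # start with the zero vector
    theta_0 = 0.0         # and a zero offset
    for _ in range(T):        # make T passes over the data
        for i in range(n):    # visit every data point
            # misclassified (or exactly on the boundary)?
            if y[i] * (np.dot(theta, X[i]) + theta_0) <= 0:
                theta = theta + y[i] * X[i]   # theta = theta_old + yi * xi
                theta_0 = theta_0 + y[i]      # theta_0 = theta_0_old + yi
    return theta, theta_0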

2. Averaged Perceptron:

In the Averaged Perceptron, our function returns the average of the values of θ and θ₀ taken across every iteration, while the plain Perceptron Algorithm returns only the last updated values of θ and θ₀.

See the code of a sample implementation below.
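Here is a minimal NumPy sketch, assuming one common convention: the average is taken over every visited point, not only over the points that triggered an update (the name averaged_perceptron and the pass count T are illustrative):

import numpy as np

def averaged_perceptron(X, y, T=10):
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    theta_sum = np.zeros(d)   # running sums used to form the averages
    theta_0_sum = 0.0
    for _ in range(T):
        for i in range(n):
            if y[i] * (np.dot(theta, X[i]) + theta_0) <= 0:
                theta = theta + y[i] * X[i]
                theta_0 = theta_0 + y[i]
            # accumulate after every visit, not only after updates
            theta_sum += theta
            theta_0_sum += theta_0
    count = n * T
    return theta_sum / count, theta_0_sum / count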


 

3. Pegasos Algorithm/Support Vector Machine:

In the above two algorithms we took care to find a linear classifier that classifies our dataset well. But what if there are many classifiers that can separate the dataset into two parts? See the picture below.

In the picture above, it can be seen quite clearly that all of the classifier lines separate our training set well.

If you think about it a little, you can say that the classifier with the largest margin around its boundary is the best fit.


In such a case we should choose the classifier that has a large margin (d) on both sides (the positively labelled side and the negatively labelled side).

Now, there are two things:
  1. Loss: the loss for misclassifying a data point
  2. Regularization: responsible for maintaining a large margin
Remember, we are concerned only about correct classification. If our classifier classifies any point wrongly, we will give the classifier some penalty.
Also notice that if a point is classified correctly and does not lie on the decision boundary, then:

yᵢ·(θ·xᵢ + θ₀) > 0

Since we also want no point inside the margin boundary, we will give a penalty when a point falls inside the margin as well. So there will be no loss if:

yᵢ·(θ·xᵢ + θ₀) ≥ 1


Otherwise we'll give some penalty to our model. This is known as the Hinge Loss, and it is defined as:

H(z) = max{0, 1 − z}, where z = yᵢ·(θ·xᵢ + θ₀)

meaning the loss is zero if z ≥ 1, and 1 − z otherwise.
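As a quick check of this definition, the hinge loss is a one-liner in Python (hinge_loss is an illustrative name):

def hinge_loss(z):
    # H(z) = max{0, 1 - z}: zero when z >= 1, a linear penalty otherwise
    return max(0.0, 1.0 - z)

# hinge_loss(2.0)  -> 0.0  (correct and outside the margin)
# hinge_loss(0.5)  -> 0.5  (correct but inside the margin)
# hinge_loss(-1.0) -> 2.0  (misclassified)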

So, our objective function will be

J = Loss + Regularization

J(θ, θ₀) = (1/n)·Σᵢ₌₁ⁿ H(yᵢ·(θ·xᵢ + θ₀)) + (λ/2)·||θ||²

Our target is to find the values of θ and θ₀ for which the value of the objective function is least; that is, we have to minimize the objective function. We'll do it using the Stochastic Gradient Descent algorithm.

So, in each iteration we will consider only one point, calculate the cost for that point alone, and update.

Jᵢ(θ, θ₀) = H(yᵢ·(θ·xᵢ + θ₀)) + (λ/2)·||θ||²

Since ∂H(z)/∂θ = 0 if z > 1, and −∂z/∂θ otherwise, the iteration formulas from Stochastic Gradient Descent are:
if z ≤ 1:

θ_new = θ_old − η·[−yᵢ·xᵢ + λ·θ_old]
θ_new = θ_old + η·yᵢ·xᵢ − η·λ·θ_old
θ₀_new = θ₀_old + η·yᵢ

else:

θ_new = θ_old − η·λ·θ_old
θ₀_new = θ₀_old
We will make η decrease as the iterations increase.
Try to implement and understand the code below.
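Below is a minimal NumPy sketch of the update rules above. The hyperparameters λ and T, the random visiting order, and the step-size schedule η = 1/(λt) (one common choice of a decreasing η) are illustrative assumptions, not part of the derivation:

import numpy as np

def pegasos(X, y, lam=0.01, T=10):
    # Stochastic gradient descent on hinge loss + (lam/2)*||theta||^2
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    t = 0
    for _ in range(T):
        for i in np.random.permutation(n):   # one random point per update
            t += 1
            eta = 1.0 / (lam * t)            # decreasing step size
            z = y[i] * (np.dot(theta, X[i]) + theta_0)
            if z <= 1:   # inside the margin or misclassified
                theta = (1 - eta * lam) * theta + eta * y[i] * X[i]
                theta_0 = theta_0 + eta * y[i]
            else:        # correctly classified with enough margin
                theta = (1 - eta * lam) * theta
    return theta, theta_0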

If we remove the regularization term and take η = 1, it becomes the Simple Perceptron Algorithm.
