
Perceptron Algorithm | Pegasos Algorithm | Support Vector Machine | Linear Classifier Among Dataset | Deep Learning | Machine Learning

PERCEPTRON ALGORITHM

The Perceptron Algorithm is one of the most widely used algorithms for classifying data. It often seems a little tough when you learn it for the first time.

In today's post, we'll try to understand it well, in brief and with a full explanation.

The Perceptron Algorithm is used whenever the given dataset can be classified into two parts, i.e. it has only two labels (one can think of them as +1 and -1).

There are three versions of the Perceptron Algorithm:

  1. Simple Perceptron
  2. Averaged Perceptron
  3. Pegasos Algorithm or Support Vector Machine
Relax! The terms sound dangerous, but they are really very easy. Let me explain.


1. Simple Perceptron:

The single target of the perceptron algorithm is to find a linear classifier (say, a linear equation) such that all positively labelled points lie on one side of it and all negatively labelled points lie on the other.

As we know, any linear equation can be written as

Y=θ.X + θ0

Using the perceptron, our aim is to find the values of the θ vector and θ0 for which the resulting line divides our dataset into two parts, one positively labelled and the other negatively labelled.
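To make the decision rule concrete, here is a minimal sketch in Python (the function name and the use of NumPy are my own choices, not from this post) of how a point gets its label once θ and θ0 are known:

import numpy as np

def predict(x, theta, theta_0):
    # Label a point by which side of the line theta.x + theta_0 = 0 it falls on
    return 1 if np.dot(theta, x) + theta_0 > 0 else -1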


  • It is also an iterative method. We'll start with θ = 0 (the zero vector) and θ0 = 0.
  • In each iteration, for every data point (xi, yi):
        • if yi·(θ·xi + θ0) ≤ 0, update:
          • θnew = θold + yi·xi
          • θ0new = θ0old + yi

The reason for the above updates can be derived using the stochastic gradient descent algorithm; we will derive them below. If you are not familiar with stochastic gradient descent, check the link below.
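As an illustration of these updates, the following is a rough sketch of the simple perceptron training loop (Python/NumPy; the function and variable names are my own, not taken from this post):

import numpy as np

def perceptron(X, y, T):
    # X: (n, d) array of feature vectors, y: array of +1/-1 labels, T: number of passes
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    for _ in range(T):
        for i in range(n):
            # Update only when the point is misclassified (or sits exactly on the boundary)
            if y[i] * (np.dot(theta, X[i]) + theta_0) <= 0:
                theta = theta + y[i] * X[i]
                theta_0 = theta_0 + y[i]
    return theta, theta_0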

2. Averaged Perceptron:

In the averaged perceptron, the function returns the average of all the values of θ and θ0 produced across the iterations, whereas the plain perceptron algorithm returns only the last updated values of θ and θ0.

See the sample implementation sketched below.
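Since the original code does not appear here, below is one possible sketch of an averaged perceptron (again Python/NumPy, names my own): it runs the same updates as the simple perceptron but returns the average of θ and θ0 taken over every inner step.

import numpy as np

def average_perceptron(X, y, T):
    # Same updates as the simple perceptron; the returned parameters are averaged over all steps
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    theta_sum = np.zeros(d)
    theta_0_sum = 0.0
    for _ in range(T):
        for i in range(n):
            if y[i] * (np.dot(theta, X[i]) + theta_0) <= 0:
                theta = theta + y[i] * X[i]
                theta_0 = theta_0 + y[i]
            # Accumulate after every step, not only after an update
            theta_sum += theta
            theta_0_sum += theta_0
    count = n * T
    return theta_sum / count, theta_0_sum / count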


 

3. Pegasos Algorithm/Support Vector Machine:

In the above two algorithms we took care of finding a linear classifier that classifies our dataset well, but what if many classifiers exist that can split the dataset into two parts? See the picture below.

In the above picture it can be seen quite clearly that all of the classifier lines classify our training set well.

If you think about it a little, you can say that the classifier whose boundary has the largest margin is the best fit.


In such a case we should choose the classifier that has a large margin (d) on both sides (the positively labelled side and the negatively labelled side).

Now, there are two things:
  1. Loss: the loss for misclassifying a data point
  2. Regularization: responsible for maintaining a large margin
Remember, we are concerned only about correct classification: if our classifier classifies any point wrongly, we give it some penalty.
Also notice that if a point is classified correctly and does not lie on the decision boundary, then:

label · Ypredicted > 0

Since we also want no point inside the margin boundary, we give a penalty when a point falls inside the margin too. So there will be no loss if:

yi·(θ·Xi + θ0) ≥ 1


Otherwise we give some penalty to our model. This is known as the hinge loss, and it is defined as:

H(z) = max{0, 1 − z},   where z = yi·(θ·Xi + θ0)

This means the loss will be zero if z ≥ 1, and 1 − z otherwise.
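As a tiny illustration, the hinge loss for a single point can be computed with a helper like this (the function name is hypothetical, not from this post):

import numpy as np

def hinge_loss(x, y, theta, theta_0):
    # Zero loss when the point is correctly classified outside the margin (z >= 1), else 1 - z
    z = y * (np.dot(theta, x) + theta_0)
    return max(0.0, 1.0 - z)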

So, our objective function will be:

J = Loss + Regularization

J(θ,θ0) = (1/n) Σ(i=1 to n) H(yi·(θ·Xi + θ0)) + (λ/2)||θ||²

Our target is to find the values of θ and θ0 for which the value of the objective function is least, i.e. we have to minimize the objective function. We'll do it using the stochastic gradient descent algorithm.

So, in each iteration we will consider only one point, calculate the cost only for that point, and update.

J(θ,θ0) = H(yi·(θ·Xi + θ0)) + (λ/2)||θ||²

Since ∂H(z)/∂θ = 0 if z > 1, and −∂z/∂θ otherwise, the iteration formulas from stochastic gradient descent are:

if z ≤ 1:

θnew = θold − η[−yi·xi + λ·θold] = θold + η·yi·xi − η·λ·θold
θ0new = θ0old + η·yi

else:

θnew = θold − η·λ·θold
θ0new = θ0old

We will make η decrease as the iterations go on.
Try to implement and understand the code below.
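Since the code itself is not shown here, the following is one possible Pegasos-style implementation of the updates above (Python/NumPy; the names and the η = 1/√t schedule are my own choices):

import numpy as np

def pegasos(X, y, T, lam):
    # Stochastic gradient descent on the hinge loss with L2 regularization
    # X: (n, d) array, y: +1/-1 labels, T: number of passes, lam: the lambda parameter
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0
    t = 0
    for _ in range(T):
        for i in range(n):
            t += 1
            eta = 1.0 / np.sqrt(t)  # step size shrinks as the iterations increase
            z = y[i] * (np.dot(theta, X[i]) + theta_0)
            if z <= 1:
                # Inside the margin or misclassified: hinge-loss gradient plus regularization
                theta = theta + eta * y[i] * X[i] - eta * lam * theta
                theta_0 = theta_0 + eta * y[i]
            else:
                # Correctly classified outside the margin: only the regularization term acts
                theta = theta - eta * lam * theta
    return theta, theta_0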

If we remove the regularization term and take η = 1, it becomes the simple Perceptron Algorithm.












