Naive Bayes Classifier

A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is Bayes' theorem.

Bayes' Theorem:

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent: the presence of one particular feature does not affect the others. Hence the classifier is called naive.
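Written as a formula, Bayes' theorem is:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```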

Let's explore the parts of Bayes' theorem:

·         P(A|B) - Posterior Probability: the conditional probability that event A occurs given that event B has occurred.

·         P(A) - Prior Probability: the probability of event A.

·         P(B) - Evidence: the probability of event B.

·         P(B|A) - Likelihood: the conditional probability of B occurring given that event A has occurred.

Now, let's explore the parts of Bayes' theorem through the eyes of someone doing machine learning:

·         P(A|B) - Posterior Probability: the conditional probability of the response variable (target variable) given the training data inputs.

·         P(A) - Prior Probability: the probability of the response variable (target variable).

·         P(B) - Evidence: the probability of the training data.

·         P(B|A) - Likelihood: the conditional probability of the training data given the response variable.
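Putting these pieces together for a feature vector X = (x_1, ..., x_n) and a class y, and applying the naive independence assumption to the likelihood, the classifier scores each class as:

```latex
P(y \mid X) = \frac{P(y)\, P(X \mid y)}{P(X)} \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The class with the highest score is the prediction; the evidence P(X) can be dropped because it is the same for every class.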


 

Example 1:

Let us take an example to get some better intuition. Consider the problem of deciding whether to play tennis. The dataset is represented below.

 

| Day | Outlook  | Temperature | Humidity | Wind   | Play Tennis |
|-----|----------|-------------|----------|--------|-------------|
| D1  | Sunny    | Hot         | High     | Weak   | No          |
| D2  | Sunny    | Hot         | High     | Strong | No          |
| D3  | Overcast | Hot         | High     | Weak   | Yes         |
| D4  | Rain     | Mild        | High     | Weak   | Yes         |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes         |
| D6  | Rain     | Cool        | Normal   | Strong | No          |
| D7  | Overcast | Cool        | Normal   | Strong | Yes         |
| D8  | Sunny    | Mild        | High     | Weak   | No          |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes         |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes         |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes         |
| D12 | Overcast | Mild        | High     | Strong | Yes         |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes         |
| D14 | Rain     | Mild        | High     | Strong | No          |

We want to classify a new day with the features (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong).


Steps:

1.   Convert the data set into frequency tables to obtain the prior probabilities.

2.   Create likelihood tables by finding the conditional probabilities.

3.   Use the Naive Bayes equation to calculate the posterior probability for each class.

Step 1: Prior Probabilities

Out of 14 instances, 9 are Yes and 5 are No:

P(Play Tennis = Yes) = 9/14 ≈ 0.64

P(Play Tennis = No) = 5/14 ≈ 0.36
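As a quick sanity check, the priors can be computed in a few lines of Python (a minimal sketch; the `labels` list simply hard-codes the Play Tennis column from the table above):

```python
from collections import Counter

# Play Tennis column from the table above (D1 through D14)
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

counts = Counter(labels)                     # {'No': 5, 'Yes': 9}
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)                                # {'No': 0.357..., 'Yes': 0.642...}
```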

 

Step 2: Likelihood Tables

Outlook:

The first attribute, Outlook, has three categorical values (Sunny, Overcast, Rain). For each value we find its probability conditioned on the target attribute Play Tennis (Yes, No).

| Outlook  | Yes | No  |
|----------|-----|-----|
| Sunny    | 2/9 | 3/5 |
| Overcast | 4/9 | 0/5 |
| Rain     | 3/9 | 2/5 |


 

Temperature:

The second attribute, Temperature, has three categorical values (Hot, Mild, Cool). For each value we find its probability conditioned on the target attribute Play Tennis (Yes, No).

| Temperature | Yes | No  |
|-------------|-----|-----|
| Hot         | 2/9 | 2/5 |
| Mild        | 4/9 | 2/5 |
| Cool        | 3/9 | 1/5 |

 

Humidity:

The third attribute, Humidity, has two categorical values (High, Normal). For each value we find its probability conditioned on the target attribute Play Tennis (Yes, No).

| Humidity | Yes | No  |
|----------|-----|-----|
| High     | 3/9 | 4/5 |
| Normal   | 6/9 | 1/5 |

Wind:

The fourth attribute, Wind, has two categorical values (Strong, Weak). For each value we find its probability conditioned on the target attribute Play Tennis (Yes, No).

| Wind   | Yes | No  |
|--------|-----|-----|
| Strong | 3/9 | 3/5 |
| Weak   | 6/9 | 2/5 |
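The likelihood tables above can also be produced directly from the data, for example with pandas (a minimal sketch that re-enters the 14 rows of the dataset):

```python
import pandas as pd

# The Play Tennis dataset (rows D1 through D14 from the table above)
df = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One likelihood table per feature: P(feature value | class), one column per class
for feature in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(pd.crosstab(df[feature], df["PlayTennis"], normalize="columns"), "\n")
```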

Play:

The last attribute, Play Tennis, is the target and takes two values (Yes, No): Yes means play and No means do not play. Its probabilities are the prior probabilities computed in Step 1 from the 14 instances.

Step 3: Posterior Probabilities

We now classify the new instance:

| Outlook | Temperature | Humidity | Wind   | Play Tennis |
|---------|-------------|----------|--------|-------------|
| Sunny   | Cool        | High     | Strong | ?           |

Based on the Yes class:

P(X|Yes) P(Yes) = P(Sunny|Yes) * P(Cool|Yes) * P(High|Yes) * P(Strong|Yes) * P(Yes)

P(X|Yes) P(Yes) = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 ≈ 0.0053

Based on the No class:

P(X|No) P(No) = P(Sunny|No) * P(Cool|No) * P(High|No) * P(Strong|No) * P(No)

P(X|No) P(No) = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 ≈ 0.0206

In the end:

0.0206 > 0.0053

The score for No is higher than the score for Yes (normalizing the two scores so they sum to 1 gives roughly 0.80 for No and 0.20 for Yes), so the classifier predicts No: tennis should not be played.

Our predicted result is:

| Outlook | Temperature | Humidity | Wind   | Play Tennis |
|---------|-------------|----------|--------|-------------|
| Sunny   | Cool        | High     | Strong | No          |
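The same calculation in Python, plugging in the fractions read off the prior and likelihood tables above (a minimal sketch):

```python
from fractions import Fraction as F

# Scores for X = (Sunny, Cool, High, Strong)
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)   # ~0.0053
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)   # ~0.0206

print(float(score_yes), float(score_no))
print("Prediction:", "Yes" if score_yes > score_no else "No")   # No
```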


Example 2:

Here the features are Color, Type, and Origin, and the binary target takes the values Yes and No. The training set contains 10 instances, 5 of each class. We want to classify a new instance.

          

        

New Instance = (Red, SUV, Domestic)

Prior Probabilities:

P(Yes) = 5/10

P(No) = 5/10

P(Yes|New Instance) = P(Yes) * P(Color=Red|Yes) * P(Type=SUV|Yes) * P(Origin=Domestic|Yes)

P(Yes|New Instance) = 0.5 * 3/5 * 1/5 * 2/5 = 0.024

P(No|New Instance) = P(No) * P(Color=Red|No) * P(Type=SUV|No) * P(Origin=Domestic|No)

P(No|New Instance) = 0.5 * 2/5 * 3/5 * 3/5 = 0.072

P(Yes|New Instance) < P(No|New Instance)

The new instance is classified as No.
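The arithmetic can be checked in Python (a minimal sketch using the probabilities above):

```python
from fractions import Fraction as F

score_yes = F(1, 2) * F(3, 5) * F(1, 5) * F(2, 5)   # 0.024
score_no  = F(1, 2) * F(2, 5) * F(3, 5) * F(3, 5)   # 0.072

print(float(score_yes), float(score_no))              # 0.024 0.072
print("Prediction:", "Yes" if score_yes > score_no else "No")   # No
```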

Advantages of Naive Bayes Classifier:

      It is simple and easy to implement.

      It requires relatively little training data.

      It handles both continuous and discrete data.

      It is highly scalable with the number of predictors and data points.

      It is fast and can be used to make real-time predictions.

Pros:

·         It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.

·         When the assumption of independence holds, the classifier performs better than other machine learning models such as logistic regression or decision trees, and requires less training data.

·         It performs well with categorical input variables compared to numerical ones. For numerical variables, a normal distribution (bell curve) is assumed, which is a strong assumption.

Cons:

·         If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it zero probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (a short sketch follows this list).

·         Naive Bayes is also known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously.

·         Another limitation of this algorithm is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
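To illustrate Laplace (add-one) smoothing, here is a minimal sketch; the helper function and its counts are hypothetical, and in scikit-learn the same idea is controlled by the `alpha` parameter of MultinomialNB and related classes:

```python
def smoothed_likelihood(count, class_total, n_categories, alpha=1.0):
    """Laplace-smoothed estimate of P(feature value | class)."""
    return (count + alpha) / (class_total + alpha * n_categories)

# Hypothetical counts for Outlook given Play Tennis = No (3 possible values)
print(smoothed_likelihood(0, 5, 3))   # Overcast never seen with No: 1/8 instead of 0
print(smoothed_likelihood(3, 5, 3))   # Sunny seen 3 times with No: 4/8 instead of 3/5
```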

Applications of Naive Bayes Algorithms

·         Real-time Prediction: The Naive Bayes classifier is an eager learner and is very fast, so it can be used to make predictions in real time.

·         Multi-class Prediction: The algorithm is also well known for multi-class prediction; it can estimate the probability of each class of the target variable.

·         Text Classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are widely used in text classification (thanks to good results on multi-class problems and the independence assumption) and often achieve higher success rates than other algorithms. As a result, they are popular for spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment); a small text-classification sketch follows this list.

·         Recommendation System: A Naive Bayes classifier combined with collaborative filtering can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
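A minimal text-classification sketch with scikit-learn's CountVectorizer and MultinomialNB; the tiny corpus and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy spam/ham corpus (made up for illustration)
texts = ["win a free prize now",
         "lowest price guaranteed, click here",
         "meeting rescheduled to monday",
         "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts followed by a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize click now"]))          # likely 'spam'
print(model.predict(["see the report before monday"]))  # likely 'ham'
```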

Types of NB Models:

·         Gaussian Naive Bayes: GaussianNB is used in classification tasks and assumes that feature values follow a Gaussian (normal) distribution. A short scikit-learn sketch follows this list.

·         Multinomial Naive Bayes: It is used for discrete counts, for example in text classification. Instead of recording only whether a word occurs in a document (a Bernoulli trial), we count how often each word occurs in the document; you can think of it as the number of times outcome x_i is observed over n trials.

·         Bernoulli Naive Bayes: The Bernoulli model is useful if your feature vectors are boolean (i.e. zeros and ones). One application is text classification with a bag-of-words model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.

·         Complement Naive Bayes: It is an adaptation of Multinomial NB in which the complement of each class is used to calculate the model weights. It is suitable for imbalanced data sets and often outperforms MNB on text classification tasks.

·         Categorical Naive Bayes: It is useful when the features are categorically distributed. The categorical variables have to be encoded in numeric format, for example with an ordinal encoder, before using this algorithm.
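For instance, a minimal GaussianNB sketch on scikit-learn's built-in Iris dataset (continuous features, so the Gaussian variant is the appropriate one):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Class probabilities for one sample:", clf.predict_proba(X_test[:1]))
```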

 
