Naive Bayes Classifier
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on Bayes' theorem.
Bayes' Theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent, that is, the presence of one particular feature does not affect the others. Hence it is called naive.
Let's explore the parts of Bayes' theorem:
• P(A|B) - Posterior probability: the conditional probability that event A occurs given that event B has occurred.
• P(A) - Prior probability: the probability of event A.
• P(B) - Evidence: the probability of event B.
• P(B|A) - Likelihood: the conditional probability of B occurring given that event A has occurred.
Now, let's explore the parts of Bayes' theorem through the eyes of someone conducting machine learning:
• P(A|B) - Posterior probability: the conditional probability of the response variable (target variable) given the training data inputs.
• P(A) - Prior probability: the probability of the response variable (target variable).
• P(B) - Evidence: the probability of the training data.
• P(B|A) - Likelihood: the conditional probability of the training data given the response variable.
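To make the arithmetic concrete, here is a minimal Python sketch of Bayes' theorem; the numbers are purely illustrative, not taken from any data set:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# The values below are hypothetical, chosen only to illustrate the formula.
p_a = 0.3          # prior probability of the hypothesis A
p_b_given_a = 0.8  # likelihood of the evidence B given A
p_b = 0.5          # probability of the evidence B

posterior = p_b_given_a * p_a / p_b
print(posterior)   # 0.48
```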
Example 1:
Let us take an example to get some better intuition. Consider the classic problem of deciding whether to play tennis on a given day. The data set is shown below.
Day | Outlook | Temperature | Humidity | Wind | Play Tennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No
We want to classify the following new instance: (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong).
Steps:
1. Convert the data set into a frequency table (prior probabilities).
2. Create likelihood tables by finding the conditional probabilities.
3. Use the naive Bayes equation to calculate the posterior probability for each class.
Step 1: Prior Probabilities
For Yes: 9 of the 14 instances are Yes, so P(Play Tennis = Yes) = 9/14 = 0.64.
For No: 5 of the 14 instances are No, so P(Play Tennis = No) = 5/14 = 0.36.
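As a quick sketch in Python, the prior probabilities can be computed directly from the class column of the table above:

```python
# Play Tennis column for days D1..D14, copied from the data set above.
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# Prior probability of each class = class count / total instances.
priors = {c: labels.count(c) / len(labels) for c in set(labels)}
print(priors)  # {'Yes': 0.642..., 'No': 0.357...}
```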
Outlook:
In the first attribute, Outlook, we have three categorical values (Sunny, Overcast, Rain), and we find the probability of each value given the target attribute Play (Yes, No).
Outlook | Yes | No
Sunny | 2/9 | 3/5
Overcast | 4/9 | 0/5
Rain | 3/9 | 2/5
Temperature:
In the second attribute, Temperature, we have three categorical values (Hot, Mild, Cool), and we find the probability of each value given the target attribute Play (Yes, No).
Temperature | Yes | No
Hot | 2/9 | 2/5
Mild | 4/9 | 2/5
Cool | 3/9 | 1/5
Humidity:
In the third attribute, Humidity, we have two categorical values (High, Normal), and we find the probability of each value given the target attribute Play (Yes, No).
Humidity | Yes | No
High | 3/9 | 4/5
Normal | 6/9 | 1/5
Wind:
In the fourth attribute, Wind, we have two categorical values (Strong, Weak), and we find the probability of each value given the target attribute Play (Yes, No).
Wind | Yes | No
Strong | 3/9 | 3/5
Weak | 6/9 | 2/5
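As a sanity check, the likelihood tables above can be reproduced in code. The following sketch (assuming pandas is available) rebuilds the data set and normalizes the counts within each class:

```python
import pandas as pd

# Rebuild the play-tennis data set from the table above (days D1..D14).
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":        ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# For each feature, P(value | Play): counts per class, normalized column-wise.
for col in ["Outlook", "Temperature", "Humidity", "Wind"]:
    print(pd.crosstab(data[col], data["Play"], normalize="columns"), "\n")
```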
Play:
The last attribute, Play, is the target. It takes binary values (Yes, No): Yes means play and No means do not play. Its probabilities over the 14 instances were computed in Step 1 (9/14 for Yes and 5/14 for No).
We now classify the new instance:
Outlook | Temperature | Humidity | Wind | Play Tennis
Sunny | Cool | High | Strong | ?
Based on the Yes class:
P(X|Yes)*P(Play=Yes) = P(Sunny|Yes)*P(Cool|Yes)*P(High|Yes)*P(Strong|Yes)*P(Play=Yes)
P(X|Yes)*P(Play=Yes) = 2/9 * 3/9 * 3/9 * 3/9 * 9/14 ≈ 0.0053
Based on the No class:
P(X|No)*P(Play=No) = P(Sunny|No)*P(Cool|No)*P(High|No)*P(Strong|No)*P(Play=No)
P(X|No)*P(Play=No) = 3/5 * 1/5 * 4/5 * 3/5 * 5/14 ≈ 0.0206
Since 0.0206 > 0.0053, the probability of No is higher than the probability of Yes.
Our predicted result is:
Outlook | Temperature | Humidity | Wind | Play Tennis
Sunny | Cool | High | Strong | No
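The whole calculation for this query can also be written as a short Python sketch, plugging in the priors and the likelihoods read off the tables above:

```python
# Query: Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong.
prior = {"Yes": 9 / 14, "No": 5 / 14}
likelihood = {
    "Yes": {"Sunny": 2 / 9, "Cool": 3 / 9, "High": 3 / 9, "Strong": 3 / 9},
    "No":  {"Sunny": 3 / 5, "Cool": 1 / 5, "High": 4 / 5, "Strong": 3 / 5},
}

# Naive Bayes score for each class: prior times the product of the likelihoods.
score = {}
for c in ("Yes", "No"):
    p = prior[c]
    for value in ("Sunny", "Cool", "High", "Strong"):
        p *= likelihood[c][value]
    score[c] = p

print(score)                       # roughly {'Yes': 0.0053, 'No': 0.0206}
print(max(score, key=score.get))   # 'No'
```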
Example 2:
New instance = (Color=Red, Type=SUV, Origin=Domestic)
Prior Probabilities:
P(Yes) = 5/10 = 0.5, P(No) = 5/10 = 0.5
P(Yes|New instance) = P(Yes)*P(Color=Red|Yes)*P(Type=SUV|Yes)*P(Origin=Domestic|Yes)
P(Yes|New instance) = 0.5*3/5*1/5*2/5 = 0.024
P(No|New instance) = P(No)*P(Color=Red|No)*P(Type=SUV|No)*P(Origin=Domestic|No)
P(No|New instance) = 0.5*2/5*3/5*3/5 = 0.072
P(Yes|New instance) < P(No|New instance)
The new instance is classified as No.
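The same product can be checked in a couple of lines of Python using the likelihoods quoted above:

```python
# P(class) * P(Red|class) * P(SUV|class) * P(Domestic|class) for each class.
p_yes = 0.5 * (3 / 5) * (1 / 5) * (2 / 5)
p_no  = 0.5 * (2 / 5) * (3 / 5) * (3 / 5)
print(p_yes, p_no)  # 0.024 0.072 -> the larger score belongs to No
```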
Advantages of Naive Bayes Classifier:
• It is simple and easy to implement.
• It doesn't require as much training data.
• It handles both continuous and discrete data.
• It is highly scalable with the number of predictors and data points.
• It is fast and can be used to make real-time predictions.
Pros:
• It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
• When the assumption of independence holds, the classifier performs better than other machine learning models such as logistic regression or decision trees, and requires less training data.
• It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
Cons:
• If a categorical variable has a category in the test data set that was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. This is often known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation (see the sketch after this list).
• Naive Bayes is also known to be a poor probability estimator, so the probability outputs from predict_proba should not be taken too seriously.
• Another limitation of this algorithm is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
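As a small illustration of Laplace smoothing, scikit-learn's categorical naive Bayes takes an alpha parameter; the data below is a toy, hypothetical encoding used only to show the effect:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy, ordinally encoded categorical features (hypothetical values).
X = np.array([[0, 1], [1, 0], [2, 1], [0, 0]])
y = np.array([0, 1, 1, 0])

# alpha=1.0 is Laplace (add-one) smoothing: category/class combinations that
# never occur in training still receive a small nonzero probability, instead
# of zeroing out the whole product.
model = CategoricalNB(alpha=1.0)
model.fit(X, y)

# Feature 0 never takes the value 2 for class 0 in training, yet class 0 still
# gets a nonzero posterior here thanks to the smoothing.
print(model.predict_proba(np.array([[2, 0]])))
```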
Applications of Naive Bayes Algorithms:
• Real-time prediction: the naive Bayes classifier is an eager learning classifier and it is very fast, so it can be used for making predictions in real time.
• Multi-class prediction: this algorithm is also well known for its multi-class prediction capability; we can predict the probability of multiple classes of the target variable.
• Text classification / spam filtering / sentiment analysis: naive Bayes classifiers are widely used in text classification (due to good results in multi-class problems and the independence assumption) and often have a higher success rate than other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment); a minimal text-classification sketch follows this list.
• Recommendation systems: a naive Bayes classifier combined with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource.
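For the text classification use case, a minimal scikit-learn sketch might look like the following; the messages and labels are toy, made-up examples, not real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy messages and labels for illustration only.
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer click now", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts fed into a multinomial naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["claim your free prize"]))  # likely 'spam' on this toy data
```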
Types of NB Models:
• Gaussian Naive Bayes: GaussianNB is used in classification tasks and assumes that feature values follow a Gaussian distribution.
• Multinomial Naive Bayes: used for discrete counts. For example, in a text classification problem, instead of the Bernoulli trial "word occurs in the document", we go one step further and count how often the word occurs in the document; you can think of it as "the number of times outcome x_i is observed over the n trials".
• Bernoulli Naive Bayes: the binomial model is useful if your feature vectors are boolean (i.e. zeros and ones). One application is text classification with a bag-of-words model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.
• Complement Naive Bayes: an adaptation of multinomial NB in which the complement of each class is used to calculate the model weights. It is suitable for imbalanced data sets and often outperforms MNB on text classification tasks.
• Categorical Naive Bayes: useful if the features are categorically distributed. The categorical variables have to be encoded in numeric format, for example with an ordinal encoder, before using this algorithm.
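Here is a brief sketch of how these variants are instantiated in scikit-learn, with toy numeric data chosen purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import (GaussianNB, MultinomialNB, BernoulliNB,
                                 ComplementNB, CategoricalNB)

y = np.array([0, 0, 1, 1])

# Continuous features: GaussianNB assumes each feature is normally distributed per class.
X_num = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_num, y).predict(np.array([[1.1, 2.0]])))

# Count features: MultinomialNB / ComplementNB; binary features: BernoulliNB
# (it binarizes at 0 by default); ordinally encoded categories: CategoricalNB.
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
for Model in (MultinomialNB, ComplementNB, BernoulliNB, CategoricalNB):
    print(Model.__name__, Model().fit(X_counts, y).predict(np.array([[1, 0, 1]])))
```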