k-Nearest Neighbors Algorithm
• K-Nearest Neighbors (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
• During the training phase, the K-NN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
• Examples:
1. Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
2. Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories will this data point fall into? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
• The working of K-NN can be explained on the basis of the following algorithm:
• Step 1: Select the number K of neighbors.
• Step 2: Calculate the Euclidean distance between the new data point and the available data points.
• Step 3: Take the K nearest neighbors as per the calculated Euclidean distance.
• Step 4: Among these K neighbors, count the number of data points in each category.
• Step 5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step 6: Our model is ready.
Firstly, we will choose the number of neighbors; here we choose K = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
• By calculating the Euclidean distance we obtain the nearest neighbors of the new point; if, say, three of the five neighbors belong to Category A and two to Category B, the new point is assigned to Category A.
• There is no particular way to determine the best value for "K", so we need to try several values to find the best one. A commonly used default value for K is 5.
• A very low value of K, such as K = 1 or K = 2, can be noisy and make the model sensitive to outliers.
• Larger values of K are generally more robust to noise, but if K is too large the model can smooth over local patterns and become more expensive to compute.
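The following short Python sketch (not part of the original notes; the helper names euclidean_distance and knn_predict and the sample points are illustrative) shows how the six steps above can be implemented directly:

```python
# Minimal sketch of the K-NN classification steps described above,
# written in plain Python with no external libraries.
import math
from collections import Counter

def euclidean_distance(a, b):
    """Step 2: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training_points, training_labels, new_point, k=5):
    """Steps 1-6: find the k nearest neighbors and take a majority vote."""
    # Distance from the new point to every stored training point
    distances = [(euclidean_distance(p, new_point), label)
                 for p, label in zip(training_points, training_labels)]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Count the labels among the neighbors and pick the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Example usage with made-up 2-D points from "Category A" and "Category B"
X = [(1, 1), (2, 1), (1, 2), (6, 6), (7, 6), (6, 7)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, new_point=(5, 5), k=5))   # -> "B"
```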
Solved Example 1:
Let's look at an example of KNN classification based on two attributes, acid durability and strength. The goal is to classify a paper tissue into the GOOD/BAD quality class.

| Point | X1 (Acid Durability) | X2 (Strength) | Y = Classification |
|-------|----------------------|---------------|--------------------|
| P1    | 7                    | 7             | BAD                |
| P2    | 7                    | 4             | BAD                |
| P3    | 3                    | 4             | GOOD               |
| P4    | 1                    | 4             | GOOD               |
| P5    | 3                    | 7             | ?                  |
Euclidean distance of P5 (3, 7) from each point:

| From P5 (3, 7) to | P1 (7, 7) | P2 (7, 4) | P3 (3, 4) | P4 (1, 4) |
|-------------------|-----------|-----------|-----------|-----------|
| Distance          | 4         | 5         | 3         | 3.61      |
| Class             | BAD       | BAD       | GOOD      | GOOD      |
With K = 3, the three nearest neighbors of P5 are P3 (GOOD), P4 (GOOD), and P1 (BAD), so P5 is classified as GOOD.

Output:

| Point | X1 (Acid Durability) | X2 (Strength) | Y = Classification |
|-------|----------------------|---------------|--------------------|
| P1    | 7                    | 7             | BAD                |
| P2    | 7                    | 4             | BAD                |
| P3    | 3                    | 4             | GOOD               |
| P4    | 1                    | 4             | GOOD               |
| P5    | 3                    | 7             | GOOD               |
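As a small illustration (reusing the knn_predict helper sketched earlier; the dictionary names are illustrative), the prediction for P5 can be reproduced as follows:

```python
# Sketch reproducing Solved Example 1 with the knn_predict helper above.
points = {"P1": (7, 7), "P2": (7, 4), "P3": (3, 4), "P4": (1, 4)}
labels = {"P1": "BAD", "P2": "BAD", "P3": "GOOD", "P4": "GOOD"}
p5 = (3, 7)

prediction = knn_predict(list(points.values()),
                         [labels[name] for name in points], p5, k=3)
print(prediction)  # -> "GOOD" (nearest neighbors: P3, P4, P1)
```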
Example 2:

| Name | Age | Gender | Sport        |
|------|-----|--------|--------------|
| A    | 32  | M      | Football     |
| M    | 40  | M      | None         |
| S    | 36  | F      | Cricket      |
| Z    | 34  | F      | Cricket      |
| R    | 55  | M      | None         |
| P    | 40  | F      | Cricket      |
| S    | 20  | M      | None         |
| L    | 55  | F      | Cricket      |
| M    | 15  | M      | Football     |
| A    | 5   | F      | ? (Football) |
Calculate the distances and select K = 3:

| Name | Age | Gender | Sport        | Distance |
|------|-----|--------|--------------|----------|
| A    | 32  | M      | Football     | 27.02    |
| M    | 40  | M      | None         | 35.01    |
| S    | 36  | F      | Cricket      | 31       |
| Z    | 34  | F      | Cricket      | 29       |
| R    | 55  | M      | None         | 50       |
| P    | 40  | F      | Cricket      | 35       |
| S    | 20  | M      | None         | 15.02    |
| L    | 55  | F      | Cricket      | 45       |
| M    | 15  | M      | Football     | 10.04    |
| A    | 5   | F      | ? (Football) |          |
As K = 3, the three nearest neighbors are M (10.04, Football), S (15.02, None), and A (27.02, Football). Two of the three belong to Football, so the classification is Football.
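The notes do not say how the Gender column enters the distance. Below is a hedged sketch under the assumption that gender is encoded as 0/1 alongside age, which reproduces most (though not all) of the tabulated distances:

```python
# Hedged sketch of Example 2; the 0/1 gender encoding is an assumption
# suggested by distances such as 27.02 and 35.01 in the table.
import math

people = [  # (name, age, gender_code, sport); F = 1, M = 0 (assumed encoding)
    ("A", 32, 0, "Football"), ("M", 40, 0, "None"),    ("S", 36, 1, "Cricket"),
    ("Z", 34, 1, "Cricket"),  ("R", 55, 0, "None"),    ("P", 40, 1, "Cricket"),
    ("S", 20, 0, "None"),     ("L", 55, 1, "Cricket"), ("M", 15, 0, "Football"),
]
query_age, query_gender = 5, 1   # new point: Age 5, Female

# Distance of every person to the query, smallest first
dists = sorted(
    (math.hypot(age - query_age, g - query_gender), sport)
    for _, age, g, sport in people
)
k = 3
votes = [sport for _, sport in dists[:k]]
print(votes)                             # -> ['Football', 'None', 'Football']
print(max(set(votes), key=votes.count))  # -> 'Football'
```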
Example 3:

| Maths | CS | Result |
|-------|----|--------|
| 4     | 3  | Fail   |
| 6     | 7  | Pass   |
| 7     | 8  | Pass   |
| 5     | 5  | Fail   |
| 8     | 8  | Pass   |
| 6     | 8  | ?      |
With K = 3:

| Maths | CS | Result   | Distance |
|-------|----|----------|----------|
| 4     | 3  | Fail     | 5.39     |
| 6     | 7  | Pass     | 1        |
| 7     | 8  | Pass     | 1        |
| 5     | 5  | Fail     | 3.16     |
| 8     | 8  | Pass     | 2        |
| 6     | 8  | ? → Pass |          |

The three nearest neighbors (distances 1, 1, and 2) are all Pass, so the new point (6, 8) is classified as Pass.
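A small sketch reproducing this result with the knn_predict helper defined earlier (the variable names are illustrative):

```python
# Sketch of Example 3: predict the result for the new point (6, 8).
scores = [(4, 3), (6, 7), (7, 8), (5, 5), (8, 8)]
results = ["Fail", "Pass", "Pass", "Fail", "Pass"]
print(knn_predict(scores, results, new_point=(6, 8), k=3))  # -> "Pass"
```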
Example 4 (dataset):
Consider a dataset with the following data points, each with a single feature and a binary class label.

| Data Point | Feature | Class |
|------------|---------|-------|
| A          | 2       | 0     |
| B          | 3       | 0     |
| C          | 6       | 1     |
| D          | 8       | 1     |
Problem
Given a new data point with Feature =
5, we want to predict its class using the KNN algorithm
Load the Data: Load the dataset and split it into features (single feature) and target labels (class).
Choose the Value of "k":
Select the number of nearest neighbors to consider (value of "k").
For this example, let's choose "k" to be 3.
Calculate Distances: Calculate the
distance between the new data point (Feature = 5) and all data points in the
training set using a distance metric (e.g., absolute difference).
Select Nearest Neighbors: Identify the
3 data points with the smallest distances to the new data point.
Majority Voting: Determine the class
labels of the 3 nearest neighbors and perform majority voting to predict the
class of the new data point.
Prediction: Assign the class label that
received the most votes as the predicted class for the new data point.
Example:
Given the new data point with Feature =
5, let's calculate the distances to all data points:
Distance between new point and A: |5 -
2| = 3
Distance between new point and B: |5 -
3| = 2
Distance between new point and C: |5 -
6| = 1
Distance between new point and D: |5 -
8| = 3
The two closest neighbors are C (distance 1) and B (distance 2); A and D are tied at distance 3 for the third position. Breaking the tie in favor of D, the three nearest neighbors are C (Class 1), B (Class 0), and D (Class 1), giving two votes for Class 1 and one for Class 0. Therefore, we predict that the new data point with Feature = 5 belongs to Class 1.
Conclusion:
Using KNN with "k" set to 3,
we predict that the new data point with Feature = 5 belongs to Class 1.
This example showcases a basic use of
KNN with a single feature for classification. In practice, you'd typically work
with multiple features and a larger dataset, and you might need to fine-tune
the choice of "k" and consider additional aspects like feature
scaling and distance metrics.
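A small Python sketch of this example (variable names are illustrative); note the tie at distance 3 discussed above:

```python
# Sketch of Example 4: single feature, absolute-difference distance.
data = {"A": (2, 0), "B": (3, 0), "C": (6, 1), "D": (8, 1)}  # (feature, class)
new_feature = 5

distances = {name: abs(feat - new_feature) for name, (feat, _) in data.items()}
print(distances)  # -> {'A': 3, 'B': 2, 'C': 1, 'D': 3}

# Taking C, B and D as the three neighbors (tie broken toward D, as in the
# worked example) gives votes Class 1: 2, Class 0: 1 -> predict Class 1.
votes = [data[name][1] for name in ("C", "B", "D")]
print(max(set(votes), key=votes.count))  # -> 1
```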
Example 5: Dataset Description:
Consider a dataset with the following
four data points, each belonging to one of two classes (Class A and Class B).
The dataset is characterized by two numerical features, Feature 1 and Feature
2.
| Data Point | Feature 1 | Feature 2 | Class   |
|------------|-----------|-----------|---------|
| A          | 2         | 3         | Class A |
| B          | 4         | 2         | Class B |
| C          | 5         | 8         | Class B |
| D          | 7         | 10        | Class A |
Prediction Task:
Given a new data point with Feature 1 =
6 and Feature 2 = 5, we want to use KNN to classify it into either Class A or
Class B.
Algorithm Steps:
Choose the value of "k"
(number of nearest neighbors). Let's say we choose k = 3.
Calculate the distance between the new
data point and all the existing data points using a distance metric (e.g.,
Euclidean distance):
Distance between the new point and A: √((6−2)² + (5−3)²) = √20 ≈ 4.47
Distance between the new point and B: √((6−4)² + (5−2)²) = √13 ≈ 3.61
Distance between the new point and C: √((6−5)² + (5−8)²) = √10 ≈ 3.16
Distance between the new point and D: √((6−7)² + (5−10)²) = √26 ≈ 5.10
Select the "k" nearest
neighbors based on the calculated distances. For k = 3, the nearest neighbors
are B, C, and A.
Count the occurrences of each class
among the "k" nearest neighbors:
Class A: 1 (A)
Class B: 2 (B, C)
Assign the class label based on the
majority class among the nearest neighbors. In this case, the majority class is
Class B, so the new data point is predicted to belong to Class B.
Conclusion:
Using KNN with k = 3, we have predicted
that the new data point with Feature 1 = 6 and Feature 2 = 5 belongs to Class
B.
Keep in mind that this example is
simplified for illustrative purposes. In practice, you might need to consider
various factors such as scaling of features, different distance metrics, and
cross-validation to find the optimal value of "k."
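A short sketch reproducing these calculations with Python's math.dist (available in Python 3.8+); the variable names are illustrative:

```python
# Sketch of Example 5 with explicit Euclidean distances.
import math

train = {"A": ((2, 3), "Class A"), "B": ((4, 2), "Class B"),
         "C": ((5, 8), "Class B"), "D": ((7, 10), "Class A")}
query = (6, 5)

# Rank the training points by distance to the query
ranked = sorted(train.items(), key=lambda kv: math.dist(kv[1][0], query))
print([(name, round(math.dist(feat, query), 2)) for name, (feat, _) in ranked])
# -> [('C', 3.16), ('B', 3.61), ('A', 4.47), ('D', 5.1)]

# Majority vote among the 3 nearest neighbors
labels = [label for _, (_, label) in ranked[:3]]
print(max(set(labels), key=labels.count))  # -> 'Class B'
```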
Example 6: The Iris dataset.
The
Iris dataset contains samples of iris flowers with their sepal and petal
measurements, and the task is to classify the flowers into three species:
Setosa, Versicolor, and Virginica.
Dataset:
The Iris dataset has the following
features:
Sepal length (in cm)
Sepal width (in cm)
Petal length (in cm)
Petal width (in cm)
And the target variable is the species:
Setosa
Versicolor
Virginica
Here's a snippet of the Iris dataset:
| Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species    |
|-------------------|------------------|-------------------|------------------|------------|
| 5.1               | 3.5              | 1.4               | 0.2              | Setosa     |
| 4.9               | 3.0              | 1.4               | 0.2              | Setosa     |
| 7.0               | 3.2              | 4.7               | 1.4              | Versicolor |
| 6.4               | 3.2              | 4.5               | 1.5              | Versicolor |
| 6.3               | 3.3              | 6.0               | 2.5              | Virginica  |
| 5.8               | 2.7              | 5.1               | 1.9              | Virginica  |
| 5.8               | 3.1              | 4.7               | 1.5              | ?          |
Problem:
Given the sepal length, sepal width,
petal length, and petal width of an iris flower, we want to predict its species
using the KNN algorithm.
Steps:
Load the Data: Load the Iris dataset
and split it into features (sepal length, sepal width, petal length, petal
width) and target labels (species).
Preprocess Data: It's good practice to
scale the features since KNN is sensitive to the scale of the features. You can
use standardization or normalization.
Choose the Value of "k":
Select the number of nearest neighbors to consider (value of "k").
This can be chosen through techniques like cross-validation.
Calculate Distances: Calculate the
distance between the new data point and all data points in the training set
using a distance metric (e.g., Euclidean distance).
Select Nearest Neighbors: Identify the
"k" data points with the smallest distances to the new data point.
Majority Voting: Determine the class
labels of the "k" nearest neighbors and perform majority voting to
predict the class of the new data point.
Prediction: Assign the class label that
received the most votes as the predicted class for the new data point.
Example:
Let's say we have a new iris flower
with the following measurements:
Sepal length: 5.8 cm
Sepal width: 3.1 cm
Petal length: 4.7 cm
Petal width: 1.5 cm
Assuming we choose "k" to be
5, we perform the following steps:
Calculate distances to all training
data points.
Select the 5 nearest neighbors based on
the smallest distances.
Among these neighbors, if 3 of them
belong to species Setosa and 2 belong to species Versicolor, we predict the new
iris to be of species Setosa.
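A hedged sketch of this workflow using scikit-learn (assumed to be installed); it uses the library's built-in copy of the Iris dataset rather than the snippet above, so the exact accuracy depends on the train/test split:

```python
# Sketch of the Iris workflow: scale features, fit KNN, classify a new flower.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scale the features, then fit a 5-nearest-neighbor classifier
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Classify the new flower from the example (5.8, 3.1, 4.7, 1.5 cm)
print(load_iris().target_names[model.predict([[5.8, 3.1, 4.7, 1.5]])[0]])
```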
KNN as regression
Example: predict age from weight.

| Age | Weight |
|-----|--------|
| 20  | 50     |
| 24  | 54     |
| 25  | 52     |
| 30  | 60     |

What is the predicted age when the weight is 56, with K = 3? The three nearest data points are those with weights 54, 52, and 60.
The predicted age is the average of their ages: (24 + 25 + 30) / 3 ≈ 26.3.
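A minimal sketch of this regression example (variable names are illustrative):

```python
# KNN regression sketch: predict age by averaging the ages of the k
# training points whose weights are closest to the query weight.
data = [(20, 50), (24, 54), (25, 52), (30, 60)]   # (age, weight)
query_weight, k = 56, 3

nearest = sorted(data, key=lambda aw: abs(aw[1] - query_weight))[:k]
print(nearest)                          # -> [(24, 54), (25, 52), (30, 60)]
predicted_age = sum(age for age, _ in nearest) / k
print(round(predicted_age, 1))          # -> 26.3
```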
Mean
1. Arithmetic Mean
The arithmetic mean is calculated by summing all the numbers in a dataset and dividing by the total number of data points.
Mathematically, the arithmetic mean of a set of numbers x1, x2, …, xn is calculated as:
Mean = (x1 + x2 + … + xn) / n
Where:
x1, x2, …, xn are the individual values in the dataset.
n is the total number of data points.
2. Geometric Mean
The geometric mean of a set of positive numbers x1, x2, …, xn is calculated by taking the nth root of the product of the numbers:
Geometric Mean = (x1 × x2 × … × xn)^(1/n)
Where:
x1, x2, …, xn are the positive values in the dataset.
n is the total number of data points.
For example, consider a stock's annual growth rates over 5 years: 0.05 (5%), 0.08 (8%), 0.12 (12%), 0.09 (9%), and 0.10 (10%). The geometric mean of these growth rates is:
(0.05 × 0.08 × 0.12 × 0.09 × 0.10)^(1/5) ≈ 0.0845
So, the geometric mean of the annual growth rates is approximately 0.0845, or 8.45%.
One important property of the geometric mean is that it is less sensitive to extreme values (outliers) than the arithmetic mean. This can make it a more appropriate measure of central tendency when the data has a multiplicative relationship or when outliers could distort the arithmetic mean.
3. Harmonic Mean
The harmonic mean is another measure of central tendency used to find the average of a set of values. Unlike the arithmetic and geometric means, the harmonic mean is particularly suited to situations where the values are rates or ratios. It is defined as the reciprocal of the arithmetic mean of the reciprocals of a set of numbers.
Mathematically, the harmonic mean of a set of positive numbers x1, x2, …, xn is calculated as:
Harmonic Mean = n / (1/x1 + 1/x2 + … + 1/xn)
Where:
x1, x2, …, xn are the positive numbers in the dataset.
n is the total number of data points.
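A short sketch computing the three means with Python's statistics module (geometric_mean requires Python 3.8+), using the growth rates from the geometric-mean example:

```python
# Arithmetic, geometric and harmonic means of the same set of rates.
import statistics

growth_rates = [0.05, 0.08, 0.12, 0.09, 0.10]

print(statistics.mean(growth_rates))            # arithmetic mean: 0.088
print(statistics.geometric_mean(growth_rates))  # geometric mean:  ~0.0845
print(statistics.harmonic_mean(growth_rates))   # harmonic mean:   ~0.0807
```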
Advantages of KNN:
K-Nearest Neighbors (KNN) is a simple and intuitive machine learning algorithm that has several advantages in certain contexts. Here are some of them:
Simplicity and Ease of Implementation:
KNN is a relatively simple algorithm to understand and implement. It doesn't
require complex mathematical derivations or optimization techniques, making it
accessible for beginners and experts alike.
No Training Phase: KNN is a lazy
learning algorithm, which means it doesn't have an explicit training phase.
Instead, it stores the entire training dataset in memory, and classification or
regression decisions are made on-the-fly as new data points arrive.
Adaptability to Data: KNN can handle
non-linear data and doesn't assume any specific distribution of the data. This
makes it suitable for a wide range of applications, especially when the
underlying data distribution is complex or unknown.
Instance-Based Learning: KNN is an
instance-based learning algorithm, meaning it doesn't build a generalized model
from the data. It simply memorizes the training data and uses it for making
predictions, which can be advantageous when the underlying relationships are
difficult to capture with a fixed model.
Interpretable Results: KNN's
decision-making process is transparent and easy to understand. Predictions are
based on the actual data points in the training set, making it easier to
interpret why a particular prediction was made.
Handles Multiclass Classification: KNN
can naturally handle multiclass classification problems without requiring major
modifications. The decision process involves comparing a new data point to
multiple neighbors in the training set, allowing it to assign a class label
based on the most prevalent class among the k-nearest neighbors.
No Assumptions About Data: KNN doesn't
make strong assumptions about the data distribution or the relationship between
features. It is a non-parametric algorithm, which means it doesn't rely on
assumptions such as linearity, normality, etc.
Robustness to Noise: KNN can handle
noisy data because it relies on a voting mechanism based on a set of nearest
neighbors. Outliers or noisy data points can have less impact compared to
algorithms that use a global model.
Incremental Learning: KNN can easily
adapt to new data points without retraining the entire model. This can be
useful when you have a continuous stream of incoming data and you want to
update your predictions in real-time.
Parameter Tuning: The hyperparameter
"k" in KNN allows you to control the trade-off between bias and
variance. By adjusting the value of "k," you can make the algorithm
more sensitive to local patterns (low k) or more robust to noise (high k).
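As an illustration of this trade-off, here is a hedged sketch of choosing "k" by cross-validation with scikit-learn (assumed available); the Iris data and the odd values of k are illustrative choices:

```python
# Evaluate several values of k with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")
```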
However, it is important to note that while KNN has its advantages, it also has limitations, such as high computational cost for large datasets, sensitivity to the choice of distance metric, and difficulty handling high-dimensional data effectively. The choice to use KNN should be based on the specific characteristics of your data and problem domain.
Disadvantages of KNN:
While K-Nearest Neighbors (KNN) has
several advantages, it also comes with certain disadvantages and limitations
that should be taken into consideration when deciding whether to use this
algorithm:
Computational Complexity: KNN's
computational complexity increases as the size of the dataset grows, especially
with a larger value of "k." For each prediction, KNN requires
calculating the distances between the new data point and all training data
points, which can become inefficient for large datasets.
Memory Usage: KNN needs to store the
entire training dataset in memory since it relies on the training instances for
making predictions. This can be memory-intensive for large datasets, and memory
constraints might limit its applicability.
Curse of Dimensionality: KNN's
performance can deteriorate in high-dimensional feature spaces. As the number
of dimensions increases, the density of data points becomes more uniform,
making it difficult to identify meaningful neighbors. This phenomenon is often
referred to as the "curse of dimensionality."
Sensitive to Noise and Outliers: KNN
can be sensitive to noisy data or outliers in the training set. Noisy data can
lead to incorrect classification decisions if the nearest neighbors include
these noisy instances. Outliers can significantly affect the location of the
decision boundaries.
Choice of Distance Metric: KNN's
performance heavily depends on the choice of distance metric (e.g., Euclidean
distance, Manhattan distance, etc.). The choice of metric can have a
significant impact on the algorithm's behavior and effectiveness.
Local Optima: KNN might get stuck in
local optima or produce suboptimal results when the dataset contains isolated
clusters or regions of varying densities. It tends to follow the majority class
in those clusters, potentially missing minority classes or finer patterns.
Imbalanced Data: KNN can struggle with
imbalanced datasets where one class is much more prevalent than the others.
Since KNN treats all neighbors equally, a majority class might dominate the
predictions, leading to biased results.
Lack of Generalization: KNN doesn't
capture global patterns or relationships in the data, as it makes decisions
based solely on local neighborhoods. This can lead to poor generalization when
data points are not uniformly distributed.
Parameter Sensitivity: The choice of
the parameter "k" (number of neighbors) can significantly impact the
algorithm's performance. A small "k" might lead to noisy decisions,
while a large "k" can oversmooth the decision boundaries and miss
local patterns.
No Feature Importance: KNN treats all
features equally without considering their individual importance or relevance.
This can be problematic when some features are more informative than others for
making predictions.
Limited Feature Engineering: KNN
doesn't automatically learn or adapt to feature interactions. Feature
engineering, if necessary, needs to be done manually prior to using KNN.
Prediction Time: Since KNN doesn't
precompute a model, prediction time can be relatively slow, especially for
large datasets, as it needs to calculate distances for every prediction.
In summary, while KNN has its
strengths, such as simplicity and flexibility, it also has limitations that can
affect its performance in certain scenarios. It's important to carefully
consider your data characteristics and problem requirements before choosing KNN
as your algorithm of choice.