
This machine learning algorithm is easy and straightforward to understand, and you can use it to solve both regression and classification problems. To understand the concept of K-nearest neighbors, you must first know how a supervised machine learning technique works. In supervised learning, you provide the model with labeled data. The machine then analyzes the labeled data and derives a suitable output.

For instance, when children are young, they need supervision to learn the difference between colors, animals, the alphabet, and so on. You label all the items for them so they can quickly identify them. That is how supervised machine learning works. This kind of learning helps solve classification problems, in which the algorithm predicts a label for the input data and assigns each input to a class based on its features.

K-Nearest Neighbors

You can use this supervised machine learning algorithm to solve both regression and classification problems. KNN does not include a learning phase: unlike linear or logistic regression, it does not fit a predictive model. Instead, the algorithm finds the similarities between variables, measuring how close or distant they are based on the given data. In simple words, this algorithm assumes that the more closely things sit to each other, the more similar they are.
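
As a quick illustration of "closeness," here is a small sketch of the straight-line (Euclidean) distance between two hypothetical points, using math.dist from the standard library (Python 3.8+):

import math

# Two hypothetical points in the plane
a = [1, 2]
b = [4, 6]

# Euclidean distance: sqrt((4 - 1)**2 + (6 - 2)**2) = 5.0
print(math.dist(a, b))

The closer this number is to zero, the more similar KNN considers the two points to be.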

How the KNN Algorithm Works

You need to follow these steps to implement the K-Nearest Neighbor algorithm properly (in Python):

  • Load the data from the dataset into the machine.
  • Initialize K to the chosen number of neighbors.
  • Calculate the distance between the query point and each example in the data.
  • Add each distance, together with the index of its example, to an ordered collection.
  • Sort the ordered collection in ascending order by distance.
  • Pick the first K entries from the sorted collection.
  • Get the labels of those K entries.
  • If it is regression, return the mean of the K labels.
  • If it is classification, return the mode of the K labels.

Implementation of the Code in Python

from collections import Counter
import math

def knn(data, query, k, distance_fn, choice_fn):
    neighbor_distances_and_indices = []

    # 3. For each example in the data
    for index, example in enumerate(data):
        # 3.1 Calculate the distance between the query example and the current
        # example from the data.
        distance = distance_fn(example[:-1], query)

        # 3.2 Add the distance and the index of the example to an ordered collection
        neighbor_distances_and_indices.append((distance, index))

    # 4. Sort the ordered collection of distances and indices from
    # smallest to largest (in ascending order) by the distances
    sorted_neighbor_distances_and_indices = sorted(neighbor_distances_and_indices)

    # 5. Pick the first K entries from the sorted collection
    k_nearest_distances_and_indices = sorted_neighbor_distances_and_indices[:k]

    # 6. Get the labels of the selected K entries (the label is the last column)
    k_nearest_labels = [data[i][-1] for distance, i in k_nearest_distances_and_indices]

    # 7. If regression (choice_fn = mean), return the average of the K labels
    # 8. If classification (choice_fn = mode), return the mode of the K labels
    return k_nearest_distances_and_indices, choice_fn(k_nearest_labels)

def mean(labels):
    return sum(labels) / len(labels)

def mode(labels):
    return Counter(labels).most_common(1)[0][0]

def euclidean_distance(point1, point2):
    sum_squared_distance = 0
    for i in range(len(point1)):
        sum_squared_distance += math.pow(point1[i] - point2[i], 2)
    return math.sqrt(sum_squared_distance)

def main():
    '''
    # Regression Data
    #
    # Column 0: height (inches)
    # Column 1: weight (pounds)
    '''
    reg_data = [
        [65.75, 112.99],
        [71.52, 136.49],
        [69.40, 153.03],
        [68.22, 142.34],
        [67.79, 144.30],
        [68.70, 123.30],
        [69.80, 141.49],
        [70.01, 136.46],
        [67.90, 112.37],
        [66.49, 127.45],
    ]
    # Question:
    # Given the data we have, what's the best guess at someone's weight if they are 60 inches tall?
    reg_query = [60]
    reg_k_nearest_neighbors, reg_prediction = knn(
        reg_data, reg_query, k=3, distance_fn=euclidean_distance, choice_fn=mean
    )
    print(reg_prediction)

    '''
    # Classification Data
    #
    # Column 0: age
    # Column 1: likes pineapple
    '''
    clf_data = [
        [22, 1],
        [23, 1],
        [21, 1],
        [18, 1],
        [19, 1],
        [25, 0],
        [27, 0],
        [29, 0],
        [31, 0],
        [45, 0],
    ]
    # Question:
    # Given the data we have, does a 33-year-old like pineapple (1) or not (0)?
    clf_query = [33]
    clf_k_nearest_neighbors, clf_prediction = knn(
        clf_data, clf_query, k=3, distance_fn=euclidean_distance, choice_fn=mode
    )
    print(clf_prediction)

if __name__ == '__main__':
    main()
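
Run as-is, this script prints roughly 128.25 for the regression query (the mean weight of the three examples closest to 60 inches) and 0 for the classification query, i.e., a 33-year-old is predicted not to like pineapple.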

Understanding with an Example

Now let's understand the above steps in simple words. Imagine there are green and red M&Ms on a plate, along with one more M&M whose class you do not know. To find its class, you need to set a value of K. Let's say in this condition K = 4. You draw a circle centered on the unidentified M&M, just large enough that exactly four other M&Ms fall inside it. Now you check which class of M&Ms the circle contains more of. If all four are red, you classify the unidentified M&M as red.
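
To make the vote concrete, here is a minimal sketch of that majority vote, with a hypothetical set of neighbor classes standing in for the four M&Ms inside the circle:

from collections import Counter

# Classes of the K = 4 nearest M&Ms around the unidentified one (hypothetical)
votes = ['red', 'red', 'red', 'green']

# The prediction is the mode (most common class) of the neighbors
prediction = Counter(votes).most_common(1)[0][0]
print(prediction)  # red

This is exactly what the mode helper in the implementation above does with the K nearest labels.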

When Can You Use the KNN Algorithm?

You can use the K-Nearest Neighbor algorithm to solve regression or classification problems, and many industries use this supervised machine learning method for classification. Here are three important factors for evaluating any technique:

  • How easy is it to interpret the output?
  • How long does it take to calculate the output?
  • What is the predictive power?

KNN fares well on all three parameters. It finds frequent application because it is easy to interpret and its output is quick to compute.

How Can You Choose the K Value?

To find the best K value, we need to run the algorithm many times and check which value minimizes the errors while preserving the algorithm's ability to make accurate predictions. Here are some things to keep in mind:

  • Do not select K = 1. With a single neighbor, predictions are unstable. For instance, suppose the query point is surrounded by red M&Ms, but its single nearest neighbor happens to be green; the algorithm would then predict green, and the prediction would be wrong.
  • As we increase the K value, predictions become more stable because more neighbors vote. However, if the errors start to grow again, we have pushed K too far.
  • For a two-class problem, choose an odd K so the vote cannot end in a tie.

The best choice of K depends on the dataset you are using. Adding more neighbors makes predictions more stable, but beyond a point it washes out local structure and accuracy drops again, so K has to be tuned, as the sketch below shows.
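
One simple way to tune K is leave-one-out validation: hold each example out in turn, predict it from the remaining examples, and count the mistakes. Below is a minimal sketch on the classification data from earlier; loo_error is a helper name introduced here for illustration, not part of the implementation above:

from collections import Counter

def loo_error(data, k):
    # Leave-one-out error rate: classify each point using the rest of the data
    errors = 0
    for i, (age, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        # The K nearest neighbors by absolute age difference
        neighbors = sorted(rest, key=lambda row: abs(row[0] - age))[:k]
        prediction = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        errors += prediction != label
    return errors / len(data)

# The classification data from the example above (age, likes pineapple)
clf_data = [
    [22, 1], [23, 1], [21, 1], [18, 1], [19, 1],
    [25, 0], [27, 0], [29, 0], [31, 0], [45, 0],
]

# Try odd values of K to avoid ties in the two-class vote
for k in (1, 3, 5, 7):
    print(k, loo_error(clf_data, k))

Whichever K gives the lowest error rate on the held-out points is a reasonable choice for this dataset.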

Conclusion

In this article, we tried to provide a basic and easy-to-understand introduction to the K-Nearest Neighbor algorithm. We learned that KNN helps predict a classification (or a numeric value) from a given dataset, and that it is an easy and fast method to compute.
