Module 4: K-Nearest Neighbors in Python

Lesson 5: Introduction to K-Nearest Neighbors


Welcome to Module 4 of our journey through Machine Learning with Python! In this lesson, we're going to explore the K-Nearest Neighbors (KNN) algorithm, a simple yet powerful technique for classification and regression tasks. KNN is a non-parametric algorithm that makes predictions based on the majority class or average value of the K nearest data points. By the end of this lesson, you'll have a thorough understanding of how KNN works and how to implement it in Python.


How KNN Works


K-Nearest Neighbors is an instance-based learning algorithm that stores all available cases and classifies new cases based on a similarity measure. The key steps, illustrated in the from-scratch sketch after this list, are as follows:


  1. Calculate Distance: Compute the distance between the query instance and all the training samples (Euclidean distance is the most common choice).

  2. Find Nearest Neighbors: Select the K nearest neighbors based on the calculated distances.

  3. Majority Vote (Classification) / Average (Regression): For classification tasks, assign the class label that appears most frequently among the K nearest neighbors. For regression tasks, predict the average of the target values of the K nearest neighbors.
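
To make these steps concrete, here is a minimal from-scratch sketch of a KNN classifier using Euclidean distance and a majority vote. The names (`knn_predict`, the toy arrays) are illustrative, not from any library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 1: compute Euclidean distances from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: take the indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([2.5, 2.5]), k=3))  # -> 0
```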

Choosing the Value of K


The choice of K has a significant impact on the performance of the KNN algorithm. A small K makes the model sensitive to noise in the training data (overfitting), whereas a large K smooths over local structure (underfitting). Some common methods for choosing the value of K include:


- Rule of Thumb: Start with K = sqrt(n), where n is the number of samples in the training set; for binary classification, an odd K avoids tie votes.

- Cross-Validation: Use techniques like cross-validation to select the value of K that performs best on unseen data, as shown in the sketch after this list.
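
As a hedged sketch of the cross-validation approach, the snippet below scores a range of K values on the Iris dataset with scikit-learn's cross_val_score; the candidate range 1-20 and the 5-fold split are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate K
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with mean accuracy:", round(scores[best_k], 3))
```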


Pros and Cons of KNN


Like any algorithm, KNN has its strengths and weaknesses:


Pros:

- Simple and easy to understand.

- No explicit training phase; the algorithm simply stores the training data (it is a "lazy learner").

- Can be used for both classification and regression tasks.


Cons:

- Computationally expensive, especially for large datasets.

- Sensitive to irrelevant features and the scale of the data (feature scaling helps; see the sketch after this list).

- Prediction time increases with the size of the training set.
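
Because KNN's distances are dominated by features with large numeric ranges, it usually pays to standardize the inputs first. Here is a minimal sketch chaining StandardScaler with the classifier in a scikit-learn Pipeline; the Iris data and K=5 are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit on the training set only, so test-set statistics never leak in
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Scaled test accuracy:", model.score(X_test, y_test))
```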


Implementation of KNN in Python


Now, let's implement KNN in Python using the scikit-learn library:


```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the testing set
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

Conclusion

In this lesson, we've explored the K-Nearest Neighbors (KNN) algorithm, a simple yet effective technique for classification and regression tasks. We've learned how KNN works, how to choose the value of K, and the pros and cons of using KNN. By implementing KNN in Python, you now have the tools to apply this algorithm to your own datasets and solve real-world problems. Keep experimenting, exploring, and honing your skills in Machine Learning. 

