In this module, we will explore how the K-Nearest Neighbors (KNN) algorithm works and apply it to classify fruits using physical characteristics like mass, width, height, and color score.
KNN is a simple yet powerful algorithm for classification and regression that uses the proximity of data points to make predictions.
How KNN Works
📌 Core Steps
Choose the number of neighbors K (typically a small odd number such as 3, 5, or 7, which helps avoid tied votes in two-class problems).
Calculate distance (e.g., Euclidean) from the new point to all existing data points.
Identify the K closest neighbors.
Predict by majority vote (for classification) or average value (for regression).
Example:
If you’re given data on the size and color of known apples, oranges, lemons, etc., and are then shown a new “mystery fruit”, KNN can tell you which kind of fruit it most likely is, based on how similar it is to the fruits it has already seen.
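To make these steps concrete, here is a self-contained toy sketch of the voting logic in base R; the points, labels, and the choice K = 3 are invented purely for illustration and are not the module's fruit dataset:

```r
# toy training data: two features (say, width and height) with known labels
train_pts <- matrix(c(7.1, 7.3,   # apple
                      7.4, 7.0,   # apple
                      6.8, 9.0,   # lemon
                      6.5, 9.4),  # lemon
                    ncol = 2, byrow = TRUE)
train_lab <- c("apple", "apple", "lemon", "lemon")

new_pt <- c(7.0, 7.4)  # the "mystery fruit"
k <- 3

# Step 2: Euclidean distance from the new point to every training point
dists <- sqrt(rowSums(sweep(train_pts, 2, new_pt)^2))

# Steps 3-4: take the K closest neighbors and predict by majority vote
nearest <- order(dists)[1:k]
names(which.max(table(train_lab[nearest])))  # "apple" (2 of the 3 neighbors)
```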
Euclidean Distance
KNN relies heavily on distance. The Euclidean distance is most common:
For two points A = (x₁, y₁) and B = (x₂, y₂), \(d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\)
In multi-feature datasets, each feature is a dimension in the space.
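More generally, for two points \(A = (a_1, \dots, a_n)\) and \(B = (b_1, \dots, b_n)\) described by \(n\) features,

\[d(A, B) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2}\]

so features measured on larger scales contribute disproportionately to the distance unless the data are normalized first.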
KNN in Action: Example Code
Install the package once (if needed):

```r
install.packages("class")
```
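The heatmap code below assumes a fitted KNN model and a confusion-matrix object named `cm`. As a minimal sketch of how those could be produced (assuming the normalized feature matrices `train_x`/`test_x` and label vectors `train_y`/`test_y` prepared earlier in the module; the name `knn_pred` and the choice k = 5 are illustrative):

```r
library(class)    # knn()
library(caret)    # confusionMatrix()
library(ggplot2)  # heatmap plotting below

# knn() classifies each test point by majority vote among its k nearest
# training points (Euclidean distance on the supplied features)
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)

# confusion matrix object used by the heatmap code that follows
cm <- confusionMatrix(knn_pred, test_y)
cm$overall[c("Accuracy", "Kappa")]
```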
```r
# label map: fruit_label codes -> fruit_name strings
label_map <- levels(fruit_data$fruit_name)
names(label_map) <- levels(fruit_data$fruit_label)

# convert the confusion matrix into a data frame for the heatmap
cm_df <- as.data.frame(cm$table)
cm_df$Reference  <- factor(cm_df$Reference,  levels = names(label_map), labels = label_map)
cm_df$Prediction <- factor(cm_df$Prediction, levels = names(label_map), labels = label_map)

# heatmap visualization
ggplot(cm_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "KNN Confusion Matrix (Fruit Names)",
       x = "True Label", y = "Predicted Label") +
  theme_minimal()
```
🔀 Model Comparison: KNN vs Logistic Regression
```r
library(nnet)   # multinom()
library(caret)  # confusionMatrix()

# use the same features and train/test split as the KNN model
train_df <- data.frame(train_x)
train_df$label <- train_y
test_df <- data.frame(test_x)
test_df$label <- test_y

# multinomial logistic regression
logit_model <- multinom(label ~ ., data = train_df)
```
```
# weights: 24 (15 variable)
initial value 60.996952
iter 10 value 13.831555
iter 20 value 12.482299
iter 30 value 12.435228
final value 12.434969
converged
```
```r
# prediction
logit_pred <- predict(logit_model, newdata = test_df)
cm_logit <- confusionMatrix(logit_pred, test_y)

# comparison
cm_logit_df <- as.data.frame(cm_logit$table)
cm_logit_df$Reference  <- factor(cm_logit_df$Reference,  levels = names(label_map), labels = label_map)
cm_logit_df$Prediction <- factor(cm_logit_df$Prediction, levels = names(label_map), labels = label_map)

ggplot(cm_logit_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 6) +
  scale_fill_gradient(low = "white", high = "tomato") +
  labs(title = "Logistic Regression Confusion Matrix (Fruit Names)",
       x = "True Label", y = "Predicted Label") +
  theme_minimal()
```
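To summarize the comparison numerically, the two confusion-matrix objects above can be placed side by side (a small convenience snippet, not part of the module's original output):

```r
# accuracy and Kappa for both models in one small table
data.frame(
  model    = c("KNN", "Multinomial logistic regression"),
  Accuracy = c(cm$overall["Accuracy"], cm_logit$overall["Accuracy"]),
  Kappa    = c(cm$overall["Kappa"],    cm_logit$overall["Kappa"])
)
```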
In this module, we explored the K-Nearest Neighbors (KNN) algorithm from its theoretical foundation to practical implementation using a fruit classification task. Through both toy examples and real datasets, we gained a deeper understanding of how similarity-based classification works.
Key Takeaways:
KNN is a non-parametric, instance-based learning method that predicts new data points by comparing their distance to known labeled data.
Normalization of features was critical, especially since Euclidean distance is sensitive to scale.
Model performance varied with the value of K; K = 1 gave perfect classification on this particular dataset, although such a small K can overfit noisier or more complex data (see the K-sweep sketch after this list).
Visualization using confusion matrices helped us interpret model results at a glance, especially when labeled with meaningful class names (like apple, orange, etc.).
We compared KNN to Logistic Regression, a probabilistic model, and found that while KNN achieved perfect accuracy in this specific task, logistic regression showed slightly lower accuracy and Kappa. This demonstrates the potential of KNN for small, structured, and well-separated datasets.
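As a companion to the takeaway about K, here is a minimal sketch of such a K sweep (again assuming the `train_x`/`test_x` matrices and `train_y`/`test_y` labels from the modelling section; the candidate values of K are illustrative):

```r
library(class)

# test accuracy for several odd values of K
ks <- c(1, 3, 5, 7, 9)
accuracy <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
data.frame(K = ks, accuracy = accuracy)
```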
When to Use KNN:
KNN is particularly useful when:

- You want a simple baseline model
- You have a small to medium-sized dataset
- Interpretability and flexibility are more important than speed
However, keep its limitations in mind:

- Computational cost at prediction time (especially with large datasets)
- Sensitivity to irrelevant features and high dimensionality
Future Directions:
Try changing the distance metric (e.g., Manhattan, cosine) and evaluate the impact (a sketch follows this list).
Apply dimensionality reduction techniques like PCA to see how performance and interpretability are affected.
Compare KNN with other classifiers such as Decision Trees, Random Forests, or SVMs for larger datasets or more complex feature spaces.
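For the first of these directions, one possible starting point is the kknn package, whose `distance` argument is the Minkowski exponent (1 = Manhattan, 2 = Euclidean). This package is not used elsewhere in the module, and k = 5 is illustrative, so treat the snippet as a sketch rather than the module's method; cosine similarity would need to be computed separately, since kknn does not provide it.

```r
# install.packages("kknn")  # if not already installed
library(kknn)

# distance = 1 selects Manhattan distance; kernel = "rectangular" gives
# plain (unweighted) majority voting, matching standard KNN
fit_manhattan <- kknn(label ~ ., train = train_df, test = test_df,
                      k = 5, distance = 1, kernel = "rectangular")
mean(fitted(fit_manhattan) == test_df$label)  # test accuracy
```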
In summary, KNN provides a solid foundation for understanding proximity-based machine learning and offers valuable intuition for feature-space dynamics in classification tasks.