Logistic Regression in ML

Content Highlight:

Logistic regression is a classification algorithm used to predict categorical outcomes. It is categorized into three types: Binomial Logistic Regression, which deals with two possible outcomes (e.g., Yes/No, Pass/Fail); Multinomial Logistic Regression, which handles three or more unordered categories (e.g., Dog, Cat, Rabbit); and Ordinal Logistic Regression, which classifies ordered categories (e.g., Low, Medium, High). Each type is suited for different classification problems, helping in medical diagnosis, spam detection, risk assessment, and customer segmentation.
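
As a rough illustration of the first two types, the sketch below fits a binomial and a multinomial model with scikit-learn's LogisticRegression on tiny made-up datasets; the features, labels, and numbers are invented for the example, and ordinal logistic regression is not covered by scikit-learn itself and would need a separate library.

```python
# Minimal sketch: binomial vs. multinomial logistic regression (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Binomial: two possible outcomes (0 = Fail, 1 = Pass) from one made-up feature.
X_bin = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_bin = np.array([0, 0, 0, 1, 1, 1])
binomial = LogisticRegression().fit(X_bin, y_bin)
print(binomial.predict([[3.5]]))        # predicted class for a new value
print(binomial.predict_proba([[3.5]]))  # probability of each class

# Multinomial: three unordered categories (0 = Dog, 1 = Cat, 2 = Rabbit).
X_multi = np.array([[1, 2], [2, 1], [5, 5], [6, 5], [9, 1], [8, 2]])
y_multi = np.array([0, 0, 1, 1, 2, 2])
multinomial = LogisticRegression().fit(X_multi, y_multi)
print(multinomial.predict([[5, 4]]))    # predicted category for a new point
```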

What is the K-Nearest Neighbors (KNN) Algorithm in Machine Learning?

The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised learning technique used for classification and regression tasks. It operates on the principle of similarity, classifying new data points based on the majority class of their closest neighbors in the feature space.

Unlike traditional models that learn patterns during training, KNN is a non-parametric, instance-based algorithm, meaning it makes no prior assumptions about the data distribution and does not learn an explicit model. Instead, it stores the entire dataset and makes predictions by computing distances (such as the Euclidean distance) from a new data point to the stored examples and taking a vote among its ‘k’ nearest neighbors.

Due to its simplicity, interpretability, and effectiveness, KNN is widely used in various applications like image recognition, recommendation systems, and medical diagnosis. However, it can become computationally expensive for large datasets as it requires calculating distances for every new prediction.
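
To make this concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier on a tiny made-up dataset; the numbers and class labels are invented purely for illustration.

```python
# Minimal KNN classification sketch with scikit-learn (made-up 2-D data).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two features per sample, two classes (0 = Category A, 1 = Category B).
X = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# "Training" only stores the data; distances are computed at prediction time.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[2, 2]]))   # expected: class 0 (close to the first cluster)
print(knn.predict([[7, 6]]))   # expected: class 1 (close to the second cluster)
```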

Why Do We Need the K-Nearest Neighbors (KNN) Algorithm?

In machine learning, classification problems often involve determining which category a new data point belongs to, based on existing data. Suppose we have two categories, Category A and Category B, and we introduce a new data point x1. The challenge is to determine whether this new point should be assigned to Category A or Category B.

This is where the KNN algorithm comes into play. KNN helps in classifying a data point based on its nearest neighbors, leveraging similarity measures to make an informed decision. By analyzing the characteristics of nearby points, the algorithm effectively assigns the new data point to the most relevant category.

Classifying a New Data Point Using the K-Nearest Neighbors (KNN) Algorithm

Suppose we have a new data point, and we need to determine the category it belongs to. The KNN algorithm helps in this classification by analyzing the closest neighbors. Consider the diagram below:

How KNN works.

Step 1: Choosing the Number of Neighbors (K)

The first step in the KNN algorithm is selecting the number of neighbors (K). In this case, we set K = 5, meaning that we will consider the 5 nearest neighbors to classify the new data point.

Step 2: Calculating the Euclidean Distance

Next, we calculate the distance between the new data point and all existing data points. The most commonly used distance metric in KNN is the Euclidean distance, which measures the straight-line distance between two points in a multi-dimensional space.

The formula for the Euclidean distance between two points A(X₁, Y₁) and B(X₂, Y₂) is:

d(A, B) = √((X₂ - X₁)² + (Y₂ - Y₁)²)

Euclidean distance between two points.
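
As a quick sanity check, the small Python snippet below applies the same formula to two made-up points:

```python
# Euclidean distance between two 2-D points, matching the formula above.
import math

def euclidean_distance(a, b):
    """Straight-line distance between points a = (x1, y1) and b = (x2, y2)."""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

# Example with made-up coordinates.
A = (1.0, 2.0)
B = (4.0, 6.0)
print(euclidean_distance(A, B))  # sqrt(3^2 + 4^2) = 5.0
```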

Step 3: Identifying the Nearest Neighbors

After calculating the Euclidean distances, we determine the K nearest neighbors. As shown in the image below, the new data point has:

  • 3 nearest neighbors in Category A
  • 2 nearest neighbors in Category B

Step 4: Assigning the New Data Point to a Category

Since the majority of the nearest neighbors (3 out of 5) belong to Category A, the new data point is classified as part of Category A.
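
To make Steps 1-4 concrete, here is a small from-scratch sketch in Python: it computes Euclidean distances to a handful of made-up training points, keeps the 5 nearest, and assigns the majority class. The coordinates and labels are invented purely for illustration.

```python
# From-scratch KNN classification: distances -> K nearest -> majority vote.
import math
from collections import Counter

# Made-up training points: (x, y, label), where labels are 'A' or 'B'.
training_data = [
    (1, 1, 'A'), (2, 2, 'A'), (2, 3, 'A'), (3, 2, 'A'),
    (6, 6, 'B'), (7, 5, 'B'), (7, 7, 'B'), (8, 6, 'B'),
]

def classify(new_point, data, k=5):
    # Step 2: Euclidean distance from the new point to every stored point.
    distances = [(math.dist(new_point, (x, y)), label) for x, y, label in data]
    # Step 3: take the K closest neighbors.
    nearest = sorted(distances)[:k]
    # Step 4: majority vote among the K neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((3, 3), training_data, k=5))  # expected: 'A'
```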

How to Select the Best Value of K?

Choosing the right value of K is crucial for the performance of the KNN algorithm. Here are some important considerations:

  • There is no fixed rule for determining the best value of K. It is often selected through experimentation.
  • A common starting point is K = 5, which often gives a reasonable balance between underfitting and overfitting.
  • A very small value (e.g., K = 1 or K = 2) may be sensitive to outliers and noise, leading to incorrect classifications.
  • A larger value of K improves stability and reduces the effect of noise, but it can smooth over small, local patterns and blur the boundaries between classes.

By carefully selecting K, we can optimize the KNN model for accuracy and performance, ensuring reliable classifications in various machine learning applications.
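
One practical way to pick K is to try several candidate values with cross-validation and keep the one with the best average score. The sketch below does this on scikit-learn's bundled iris dataset; the list of candidate K values is an arbitrary choice for illustration.

```python
# Choosing K by cross-validation: try several values and compare mean accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)      # 5-fold cross-validation
    print(f"K = {k:2d}: mean accuracy = {scores.mean():.3f}")
```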

How Does the K-Nearest Neighbors (KNN) Algorithm Work?

The KNN algorithm follows a structured process to classify or predict a new data point based on distance-based similarity. The steps involved are:

Step 1: Choose the Value of K:

The first step is selecting the number of nearest neighbors (K) that will influence the classification decision. A smaller K value makes the model sensitive to noise, while a larger K provides smoother decision boundaries.

Step 2: Compute the Distance Between Data Points:

The most commonly used distance metric is the Euclidean distance, which measures the straight-line distance between two points in a multi-dimensional space. Other distance metrics, such as Manhattan distance or Minkowski distance, can also be used depending on the dataset.

Step 3: Identify the K Nearest Neighbors:

Based on the computed distances, select the K closest data points to the new data point.

Step 4: Count the Neighbors in Each Category:

Among the selected K nearest neighbors, count the number of data points belonging to each class.

Step 5: Assign the New Data Point to the Majority Class:

The new data point is classified into the category that has the highest number of neighbors among the selected K.

Step 6: Model is Ready for Predictions

The KNN model, which simply keeps the stored training data, can now classify or predict future data points based on the majority-voting principle.
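
Putting the six steps together, here is an end-to-end sketch using scikit-learn. It standardizes the features (distance-based methods are sensitive to feature scale), fits KNeighborsClassifier with K = 5 and the default Euclidean (Minkowski, p = 2) metric, and evaluates on a held-out split. The bundled iris dataset stands in for real data here.

```python
# End-to-end KNN workflow: split data, scale features, fit, predict, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Step 1: K = 5; Step 2: Euclidean distance (Minkowski with p = 2, the default).
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)
model.fit(X_train, y_train)        # Steps 3-5 happen at prediction time
y_pred = model.predict(X_test)     # Step 6: predictions by majority vote
print("Test accuracy:", accuracy_score(y_test, y_pred))
```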

Key Advantages of KNN:

  • Simple and Intuitive: Does not require a complex mathematical model to operate.
  • No Assumption on Data Distribution: KNN is a non-parametric algorithm, meaning it does not assume any underlying distribution of the dataset.
  • Versatile: Can be used for both classification and regression tasks.
  • Adaptable: Different distance metrics can be chosen to suit the data, though care is needed in very high-dimensional spaces, where distances become less informative.

However, KNN can be computationally expensive for large datasets since it requires calculating distances for each query. Efficient data structures like KD-Trees or Ball Trees can help optimize performance.
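
As a rough illustration of this optimization, scikit-learn exposes the choice of search structure through the algorithm parameter of KNeighborsClassifier. The snippet below compares brute-force search against KD-Tree and Ball Tree on randomly generated (made-up) data; actual timings will vary with the machine and the shape of the data.

```python
# Comparing brute-force neighbor search with KD-Tree and Ball Tree structures.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((20000, 3))                   # made-up 3-D feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # synthetic binary labels
X_query = rng.random((2000, 3))              # new points to classify

for algorithm in ["brute", "kd_tree", "ball_tree"]:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    start = time.perf_counter()
    knn.predict(X_query)
    print(f"{algorithm:10s}: {time.perf_counter() - start:.3f} s")
```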

KNN remains a fundamental and widely used algorithm for pattern recognition, recommendation systems, and image classification, making it a valuable tool in machine learning.

In logistic regression, the predicted probability y lies between 0 and 1, so the odds y / (1 - y) range from 0 to +∞. Since the linear predictor can take any value between -∞ and +∞, we take the logarithm of the odds, resulting in:

log(y / (1 - y)) = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ
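
The small numeric check below, using made-up probability values, shows how this log-odds transform maps values between 0 and 1 onto the whole range from -∞ to +∞:

```python
# Log-odds (logit) transform: probabilities in (0, 1) map to (-inf, +inf).
import math

for y in [0.01, 0.25, 0.5, 0.75, 0.99]:
    odds = y / (1 - y)
    print(f"y = {y:4.2f} -> odds = {odds:7.3f} -> log-odds = {math.log(odds):7.3f}")
```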