Classification techniques are an important part of machine learning and data mining applications. Approximately 70% of problems in Data Science are classification problems. Many classification methods are available, but logistic regression is a common and useful regression method for solving the binary classification problem. Another category of classification is multinomial classification, which handles problems where multiple classes are present in the target variable. For instance, the IRIS dataset is a very famous example of multi-class classification. Other examples include classifying an article/blog/document category.
Logistic Regression can be used for various classification problems such as spam detection, diabetes prediction, whether a given customer will purchase a particular product or churn to another competitor, whether the user will click on a given advertisement link or not, and many more.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms for two-class classification. It is easy to implement and can be used as the baseline for any binary classification problem. Its basic fundamental concepts are also constructive in deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and the independent variables.
Logistic regression is a statistical method for predicting binary classes. The outcome or target variable is dichotomous in nature. Dichotomous means there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurrence.
It is a special case of linear regression where the target variable is categorical in nature. It uses a log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of a binary event utilizing a logit function.
Linear Regression Equation:

y = β0 + β1x1 + β2x2 + … + βnxn

where y is the dependent variable and x1, x2, …, xn are the explanatory variables.
Apply the sigmoid function on the linear regression output:

p = σ(y) = 1 / (1 + e^(−y))
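The two equations above can be sketched in a few lines of NumPy. The coefficient and feature values here are made up purely for illustration:

```python
import numpy as np

# Illustrative coefficients: beta0 is the intercept, then beta1, beta2 (made up)
beta = np.array([-1.0, 0.5, 0.25])
# x0 = 1 pairs with the intercept, then the explanatory variables x1, x2
x = np.array([1.0, 2.0, 4.0])

# Linear regression part: y = beta0 + beta1*x1 + beta2*x2
z = np.dot(beta, x)   # -1.0 + 1.0 + 1.0 = 1.0

# Sigmoid maps the unbounded linear output onto a probability in (0, 1)
p = 1.0 / (1.0 + np.exp(-z))
print(round(p, 3))    # -> 0.731
```

Whatever real value the linear part produces, the sigmoid squashes it into the (0, 1) interval, which is what lets logistic regression output a probability.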
Properties of Logistic Regression:
The dependent variable in logistic regression follows a binomial distribution.
Estimation is done through maximum likelihood.
There is no R-squared; model fitness is calculated through Concordance and KS-statistics.
Linear Regression Vs. Logistic Regression
Linear regression gives you a continuous output, but logistic regression provides a discrete output. Examples of continuous output are house price and stock price. Examples of discrete output are predicting whether a patient has cancer or not, or predicting whether a customer will churn. Linear regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated using the Maximum Likelihood Estimation (MLE) approach.
Maximum Likelihood Estimation Vs. Least Square Method
MLE is a "likelihood" maximization method, while OLS is a distance-minimizing approximation method. Maximizing the likelihood function determines the parameters that are most likely to produce the observed data. From a statistical point of view, MLE sets the mean and variance as parameters in determining the specific parametric values for a given model. This set of parameters can be used for predicting the data needed in a normal distribution.
Ordinary least squares estimates are computed by fitting a regression line on given data points that has the minimum sum of squared deviations (least square error). Both are used to estimate the parameters of a linear regression model. MLE assumes a joint probability mass function, while OLS does not require any stochastic assumptions for minimizing distance.
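To make the OLS side concrete, here is a minimal sketch that fits a regression line to a tiny made-up dataset by minimizing the sum of squared deviations (via NumPy's least-squares solver):

```python
import numpy as np

# Toy data (made up): y is roughly 2*x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Stack a column of ones so the intercept is estimated too,
# then solve for the coefficients minimizing the squared error
X = np.column_stack([np.ones_like(x), x])
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
print(round(intercept, 2), round(slope, 2))   # -> 1.06 1.97
```

MLE for logistic regression has no such closed-form solution; it is found iteratively, which is what scikit-learn's solvers do under the hood.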
The sigmoid function, also called the logistic function, gives an 'S'-shaped curve that can take any real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity, the predicted y will become 1, and if the curve goes to negative infinity, the predicted y will become 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO. For example, if the output is 0.75, we can say in terms of probability: there is a 75 percent chance that the patient will suffer from cancer.
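The 0.5 decision rule described above can be sketched directly; the input scores below are made-up linear-model outputs:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative raw scores from a linear model (values are made up)
scores = np.array([-3.0, -0.5, 0.0, 1.1, 4.0])
probs = sigmoid(scores)   # sigmoid(1.1) is about 0.75, like the example above

# Decision rule from the text: probability > 0.5 -> class 1, otherwise class 0
labels = (probs > 0.5).astype(int)
print(labels)   # -> [0 0 0 1 1]
```

Large negative scores map near 0, large positive scores map near 1, and the threshold at 0.5 corresponds to a raw score of exactly 0.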
Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes, such as Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories, such as predicting the type of wine.
Ordinal Logistic Regression: The target variable has three or more ordinal categories, such as a restaurant or product rating from 1 to 5.
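As a quick illustration of the multinomial case, scikit-learn's LogisticRegression handles a three-class target such as IRIS out of the box (max_iter is raised here only so the solver converges on the raw features):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# IRIS has three nominal classes, so this is a multinomial problem
X, y = load_iris(return_X_y=True)

# scikit-learn detects the multi-class target automatically
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.classes_)   # -> [0 1 2]
```

The rest of this tutorial sticks to the binary case.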
Model building in Scikit-learn
Let’s build the diabetes prediction model.
Here, you are going to predict diabetes using a Logistic Regression classifier.
Let's first load the required Pima Indians Diabetes dataset using the pandas read_csv function. You can download the data from the following link: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Loading Data

#import pandas
import pandas as pd
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
Here, you need to divide the given columns into two types of variables: the dependent (or target) variable and the independent (or feature) variables.
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
To understand model performance, dividing the dataset into a training set and a test set is a good strategy.
Let's split the dataset by using the function train_test_split(). You need to pass 3 parameters: features, target, and test set size. Additionally, you can use random_state to select records randomly.
# split X and y into training and testing sets
# sklearn.cross_validation was deprecated and removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
Here, the dataset is broken into two parts in a ratio of 75:25. It means 75% of the data will be used for model training and 25% for model testing.
Model Development and Prediction
First, import the Logistic Regression module and create a Logistic Regression classifier object using the LogisticRegression() function.
Then, fit your model on the train set using fit() and perform prediction on the test set using predict().
# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(X_train, y_train)

# predict on the test set
y_pred = logreg.predict(X_test)
Model Evaluation using Confusion Matrix
A confusion matrix is a table that is used to evaluate the performance of a classification model. You can also visualize the performance of an algorithm. The fundamental idea of a confusion matrix is that the number of correct and incorrect predictions is summed up class-wise.
# import the metrics class
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
array([[119,  11],
       [ 26,  36]])
Here, you can see the confusion matrix in the form of an array object. The dimension of this matrix is 2*2 because this model is a binary classification: you have two classes, 0 and 1. Diagonal values represent accurate predictions, while non-diagonal elements are inaccurate predictions. In the output, 119 and 36 are accurate predictions, and 26 and 11 are incorrect predictions.
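The evaluation metrics discussed later can be read straight off these four counts; here is a small sketch using the matrix values shown above:

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cnf_matrix = np.array([[119, 11],
                       [ 26, 36]])

tn, fp = cnf_matrix[0]   # actual 0: true negatives, false positives
fn, tp = cnf_matrix[1]   # actual 1: false negatives, true positives

accuracy = (tn + tp) / cnf_matrix.sum()   # correct / all predictions
precision = tp / (tp + fp)                # correct 1s among predicted 1s
recall = tp / (tp + fn)                   # 1s caught among actual 1s
print(round(accuracy, 2), round(precision, 2), round(recall, 2))
```

This reproduces the roughly 80% accuracy, 76% precision, and 58% recall figures discussed in the evaluation section below.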
Visualizing Confusion Matrix using Heatmap
Let's visualize the results of the model in the form of a confusion matrix using matplotlib and seaborn.
Here, you will visualize the confusion matrix using a heatmap.
# import required modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
class_names=[0,1] # name of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Confusion Matrix Evaluation Metrics
Let’s evaluate the model using model evaluation metrics like accuracy, precision, and recall.
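The snippet computing these scores appears to be missing here. A sketch of the scikit-learn metrics calls is below; the label arrays are made up for illustration, whereas in the tutorial they would be y_test and y_pred from the model above:

```python
from sklearn import metrics

# Toy ground-truth and predicted labels (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

print("Accuracy:", metrics.accuracy_score(y_true, y_pred))
print("Precision:", metrics.precision_score(y_true, y_pred))
print("Recall:", metrics.recall_score(y_true, y_pred))
```

Accuracy counts all correct predictions, precision looks only at the predicted positives, and recall looks only at the actual positives.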
Well, you got a classification rate of 80%, which is considered good accuracy.
Precision: Precision is about being precise, i.e., how accurate your model is. In other words, when a model makes a prediction, how often is it correct? In your prediction case, when your Logistic Regression model predicted that patients were going to suffer from diabetes, those patients actually had diabetes 76% of the time.
Recall: If there are patients who have diabetes in the test set, your Logistic Regression model can identify them 58% of the time.
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity.
y_pred_proba = logreg.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr, tpr, label="data 1, auc=" + str(auc))
plt.legend(loc=4)
plt.show()
The AUC score for this case is 0.86. An AUC score of 1 represents a perfect classifier, and 0.5 represents a worthless classifier.
Advantages

Because of its efficient and straightforward nature, logistic regression does not require high computation power, is easy to implement and easily interpretable, and is used widely by data analysts and scientists. Also, it does not require scaling of features. Logistic regression provides a probability score for observations.
Disadvantages

Logistic regression is not able to handle a large number of categorical features/variables. It is vulnerable to overfitting. Also, it cannot solve non-linear problems, which is why it requires a transformation of non-linear features. Logistic regression will not perform well with independent variables that are not correlated to the target variable and that are very similar or correlated to each other.