Logistic regression (aka logit regression or logit model) was developed by statistician David Cox in 1958 and is a regression model where the response variable Y is categorical. Logistic regression allows us to estimate the probability of a categorical response based on one or more predictor variables (X). It allows one to say that the presence of a predictor increases (or decreases) the probability of a given outcome by a specific percentage. This tutorial covers the case when Y is binary, that is, where it can take only two values, “0” and “1”, which represent outcomes such as pass/fail, win/lose, alive/dead or healthy/sick. Cases where the response variable has more than two outcome categories may be analyzed with multinomial logistic regression or, if the multiple categories are ordered, with ordinal logistic regression. However, discriminant analysis has become a popular method for multi-class classification, so our next tutorial will focus on that technique for those instances.

Replication Requirements

This tutorial primarily leverages the Default data provided by the ISLR package. This is a simulated data set containing information on ten thousand customers, such as whether the customer defaulted, whether the customer is a student, the average balance carried by the customer, and the customer’s income. We’ll also use a few packages that provide data manipulation, visualization, pipeline modeling functions, and model output tidying functions.

# Packages

library(tidyverse)  # data manipulation and visualization

library(modelr)     # provides easy pipeline modeling functions

library(broom)      # helps to tidy up model outputs

# Load data 

(default <- as_tibble(ISLR::Default))

## # A tibble: 10,000 × 4
##    default student   balance    income
##     <fctr>  <fctr>     <dbl>     <dbl>
## 1       No      No  729.5265 44361.625
## 2       No     Yes  817.1804 12106.135
## 3       No      No 1073.5492 31767.139
## 4       No      No  529.2506 35704.494
## 5       No      No  785.6559 38463.496
## 6       No     Yes  919.5885  7491.559
## 7       No      No  825.5133 24905.227
## 8       No     Yes  808.6675 17600.451
## 9       No      No 1161.0579 37468.529
## 10      No      No    0.0000 29275.268
## # … with 9,990 more rows

Why Logistic Regression

Linear regression isn’t appropriate in the case of a qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

$$Y = \begin{cases} 1, & \text{if stroke;} \\ 2, & \text{if drug overdose;} \\ 3, & \text{if epileptic seizure.} \end{cases}$$

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors.

Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

$$Y = \begin{cases} 1, & \text{if epileptic seizure;} \\ 2, & \text{if stroke;} \\ 3, & \text{if drug overdose.} \end{cases}$$

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.

More relevant to our data, if we try to classify a customer as a high- vs. low-risk defaulter based on their balance, we could use linear regression; however, the left figure below illustrates how linear regression would predict the probability of defaulting. Unfortunately, for balances close to zero we predict a negative probability of defaulting; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of defaulting, regardless of credit card balance, must fall between 0 and 1.
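To see this concretely, here is a minimal sketch (assuming the default tibble loaded above; lm_fit is just an illustrative object name) of fitting a linear probability model to the binary-coded response. Its smallest fitted value should fall below zero for balances near zero, which is exactly the problem described above:

# Linear probability model: regress the 0/1-coded default response on balance
lm_fit <- lm(ifelse(default == "Yes", 1, 0) ~ balance, data = default)

# The minimum fitted value drops below zero, an impossible "probability"
min(fitted(lm_fit))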

To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this description. In logistic regression, we use the logistic function, which is defined in Eq. 1 and illustrated in the right figure above.

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{1}$$
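A quick way to convince yourself that this function stays between 0 and 1 is to evaluate it over a grid of balances; a minimal sketch, where the coefficient values are arbitrary placeholders for illustration rather than model estimates:

# Logistic function with illustrative (not estimated) coefficients
logistic <- function(x, beta0 = -10, beta1 = 0.005) {
  exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
}

# All outputs fall strictly between 0 and 1
range(logistic(seq(0, 3000, by = 100)))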

Preparing Our Data

As in the regression tutorial, we’ll split our data into a training (60%) and testing (40%) data set so we can assess how well our model performs on an out-of-sample data set.

set.seed(123)

sample <- sample(c(TRUE, FALSE), nrow(default), replace = T, prob = c(0.6,0.4))

train <- default[sample, ]

test <- default[!sample, ]

Simple Logistic Regression

We will fit a logistic regression model in order to predict the probability of a customer defaulting based on the average balance carried by the customer. The glm function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm function is similar to that of lm, except that we must pass the argument family = binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

model1 <- glm(default ~ balance, family = "binomial", data = train)

In the background the glm function uses maximum likelihood to fit the model. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for $\beta_0$ and $\beta_1$ such that the predicted probability $\hat{p}(x_i)$ of default for each individual, using Eq. 1, corresponds as closely as possible to the individual’s observed default status. In other words, we try to find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model for p(X), given in Eq. 1, yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right)$$

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to maximize this likelihood function. Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we will examine in future tutorials. What results is an S-shaped probability curve, illustrated below (note that to plot the logistic regression fit line we need to convert our response variable to a [0,1] binary coded variable).

default %>%
  mutate(prob = ifelse(default == "Yes", 1, 0)) %>%
  ggplot(aes(balance, prob)) +
  geom_point(alpha = .15) +
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  ggtitle("Logistic regression model fit") +
  xlab("Balance") +
  ylab("Probability of Default")

https://uc-r.github.io/public/images/analytics/logistic_regression/plot2-1.png

Similar to linear regression, we can assess the model using summary or glance. Note that the coefficient output format is similar to what we saw in linear regression; however, the goodness-of-fit details at the bottom of summary differ. We’ll get into this more later but just note that you see the word deviance. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. The null deviance represents the difference between a model with only the intercept (which means “no predictors”) and a saturated model (a model with a theoretically perfect fit). The goal is for the model deviance (noted as Residual deviance) to be lower; smaller values indicate better fit. In this respect, the null model provides a baseline upon which to compare predictor models.

summary(model1)

## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2905  -0.1395  -0.0528  -0.0189   3.3346  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.101e+01  4.887e-01  -22.52   <2e-16 ***
## balance      5.669e-03  2.949e-04   19.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1723.03  on 6046  degrees of freedom
## Residual deviance:  908.69  on 6045  degrees of freedom
## AIC: 912.69
## 
## Number of Fisher Scoring iterations: 8
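As a quick way to connect the deviance output above to the likelihood discussion: for a Bernoulli (0/1) response the saturated model’s log-likelihood is zero, so the residual deviance is simply -2 times the fitted model’s log-likelihood. A minimal sketch:

# Residual deviance equals -2 * log-likelihood for a 0/1-response logistic model
c(deviance = deviance(model1), minus_2_loglik = -2 * as.numeric(logLik(model1)))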

Assessing Coefficients

The below table shows the coefficient estimates and related information that result from fitting a logistic regression model in order to predict the probability of default = Yes using balance. Bear in mind that the coefficient estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale (see Ch. 3 of ISLR1 for more details). Thus, we see that $\hat{\beta}_1 = 0.0057$; this indicates that an increase in balance is associated with an increase in the probability of default. To be precise, a one-unit increase in balance is associated with an increase in the log odds of default by 0.0057 units.

tidy(model1)

##          term      estimate   std.error statistic       p.value
## 1 (Intercept) -11.006277528 0.488739437 -22.51972 2.660162e-112
## 2     balance   0.005668817 0.000294946  19.21985  2.525157e-82

We can further interpret the balance coefficient as follows: for every one dollar increase in monthly balance carried, the odds of the customer defaulting increase by a factor of 1.0057.

exp(coef(model1))

##  (Intercept)      balance 

## 1.659718e-05 1.005685e+00

Many aspects of the coefficient output are similar to those discussed in the linear regression output. For example, we can measure the confidence intervals and accuracy of the coefficient estimates by computing their standard errors. For instance, $\hat{\beta}_1$ has a p-value < 2e-16, suggesting a statistically significant relationship between balance carried and the probability of defaulting. We can also use the standard errors to get confidence intervals as we did in the linear regression tutorial:

confint(model1)

##                     2.5 %        97.5 %
## (Intercept) -12.007610373 -10.089360652
## balance       0.005111835   0.006269411

Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute the probability of default for any given credit card balance. Mathematically, using the coefficient estimates from our model we predict that the default probability for an individual with a balance of $1,000 is less than 0.5%:

$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-11.0063 + 0.0057 \times 1000}}{1 + e^{-11.0063 + 0.0057 \times 1000}} = 0.004785 \tag{3}$$
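The same number can be reproduced in R directly from the fitted coefficients; a minimal sketch (b is just a shorthand for the coefficient vector, and the result should match Eq. 3 up to rounding):

# Manually compute the predicted default probability for a $1,000 balance
b <- coef(model1)
exp(b[1] + b[2] * 1000) / (1 + exp(b[1] + b[2] * 1000))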

We can predict the probability of defaulting in R using the predict function (be sure to include type = "response"). Here we compare the probability of defaulting based on balances of $1000 and $2000. As you can see, as the balance moves from $1000 to $2000 the probability of defaulting increases significantly, from 0.5% to 58%!
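A minimal sketch of that comparison using predict (per the narrative above, the two predicted probabilities should be roughly 0.005 and 0.58):

# Predicted default probabilities at balances of $1,000 and $2,000
predict(model1, data.frame(balance = c(1000, 2000)), type = "response")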

One can also use qualitative predictors with the logistic regression model. As an example, we can fit a model that uses the student variable, as sketched below.
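A sketch of such a fit, which the discussion that follows refers to as model 2 (exact estimates will depend on your training sample):

# Fit default as a function of student status alone
model2 <- glm(default ~ student, family = "binomial", data = train)
tidy(model2)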

The coefficient associated with student = Yes is positive, and the associated p-value is statistically significant. This indicates that students tend to have higher default probabilities than non-students. In fact, this model suggests that a student has nearly twice the odds of defaulting compared to non-students. However, in the next section we will see why.
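To see the odds ratio directly, one could exponentiate the student coefficient from the sketch above (the exact value depends on the training sample):

# Odds ratio for student = Yes relative to student = No
exp(coef(model2))["studentYes"]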

Multiple Logistic Regression

We can also extend our model, as seen in Eq. 1, so that we can predict a binary response using multiple predictors, where $X = (X_1, \dots, X_p)$ are p predictors:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}$$

Let’s go ahead and fit a model that predicts the probability of default based on the balance, income (in thousands of dollars), and student status variables. There is a surprising result here. The p-values associated with balance and student = Yes status are very small, indicating that each of these variables is associated with the probability of defaulting. However, the coefficient for the student variable is negative, indicating that students are less likely to default than non-students. In contrast, the coefficient for the student variable in model 2, where we predicted the probability of default based only on student status, indicated that students have a greater probability of defaulting. What gives?

model3 <- glm(default ~ balance + income + student, family = "binomial", data = train)

tidy(model3)

##          term      estimate    std.error   statistic      p.value
## 1 (Intercept) -1.090704e+01 6.480739e-01 -16.8299277 1.472817e-63
## 2     balance  5.907134e-03 3.102425e-04  19.0403764 7.895817e-81
## 3      income -5.012701e-06 1.078617e-05  -0.4647343 6.421217e-01
## 4  studentYes -8.094789e-01 3.133150e-01  -2.5835947 9.777661e-03

The right-hand panel of the figure below provides an explanation for this discrepancy. The variables student and balance are correlated. Students tend to carry higher levels of debt, which is in turn associated with a higher probability of default. In other words, students are more likely to have large credit card balances, which, as we know from the left-hand panel of the figure below, tend to be associated with high default rates. Thus, even though an individual student with a given credit card balance will tend to have a lower probability of default than a non-student with the same credit card balance, the fact that students on the whole tend to have higher credit card balances means that, overall, students tend to default at a higher rate than non-students. This is an important distinction for a credit card company that is trying to determine to whom it should offer credit. A student is riskier than a non-student if no information about the student’s credit card balance is available. However, that student is less risky than a non-student with the same credit card balance!

https://uc-r.github.io/public/images/analytics/logistic_regression/plot3-1.png

This simple example illustrates the dangers and subtleties associated with performing regressions involving only a single predictor when other predictors may also be relevant. The results obtained using one predictor may be quite different from those obtained using multiple predictors, especially when there is correlation among the predictors. This phenomenon is known as confounding.

In the case of multiple predictor variables, sometimes we want to understand which variable is the most influential in predicting the response (Y) variable. We can do this with varImp from the caret package. Here, we see that balance is the most important by a large margin, whereas student status is less important, followed by income (which was found to be insignificant anyway; p = .64).

caret::varImp(model3)

##               Overall

## balance    19.0403764

## income      0.4647343

## studentYes  2.5835947

As before, we can easily make predictions with this model. For example, a student with a credit card balance of $1,500 and an income of $40,000 has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{-10.907 + 0.00591 \times 1{,}500 - 0.00001 \times 40 - 0.809 \times 1}}{1 + e^{-10.907 + 0.00591 \times 1{,}500 - 0.00001 \times 40 - 0.809 \times 1}} = 0.054$$

A non-student with the same balance and income has an estimated probability of default of

$$\hat{p}(X) = \frac{e^{-10.907 + 0.00591 \times 1{,}500 - 0.00001 \times 40 - 0.809 \times 0}}{1 + e^{-10.907 + 0.00591 \times 1{,}500 - 0.00001 \times 40 - 0.809 \times 0}} = 0.114$$

new.df <- tibble(balance = 1500, income = 40, student = c("Yes", "No"))

predict(model3, new.df, type = "response")

##          1          2 
## 0.05437124 0.11440288