Evaluation of Logistic Regression: The Hosmer-Lemeshow Test
Author
Jumbong Junior
Hosmer-Lemeshow Test
Model performance is commonly evaluated on two main criteria: discrimination and calibration. Discrimination describes the ability of the model to assign higher probabilities of the outcome to those observations that actually experience the outcome. A recognized metric for assessing a model's discrimination is the area under the receiver operating characteristic (ROC) curve. Calibration, or fit, on the other hand, captures how closely the predicted probabilities match the actual occurrence of the outcome. Several tests and graphical methods have been proposed to assess the calibration of a model, which is often referred to as "goodness of fit." Among the goodness-of-fit tests, the Hosmer-Lemeshow (HL) test is the most widely applied approach (Nattino, 2020).
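While this post focuses on calibration, discrimination can be quantified in one line. Below is a minimal sketch, assuming scikit-learn is available; the y_true and y_score arrays are illustrative placeholders, not drawn from the example developed later:

```python
from sklearn.metrics import roc_auc_score

# Illustrative data: observed binary outcomes and model-predicted probabilities
y_true = [1, 0, 1, 0, 1]
y_score = [0.8, 0.2, 0.9, 0.4, 0.6]

# AUC is 1.0 here because every event is ranked above every non-event
print(roc_auc_score(y_true, y_score))
```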
Central idea of the test: the main idea is to divide the observations into groups and compute a \(\chi^2\) statistic that reflects the overall mismatch between the observed and expected numbers of events in each group-outcome category.
As with most goodness of fit tests, the HL test is designed to decide between a null hypothesis of perfect fit, where the probabilities assumed by the model are hypothesized to coincide with the real probabilities, and a general alternative hypothesis of nonperfect fit. Let’s present the framework of the test.
The value of the test statistic is: \[
HL = \sum_{g=1}^{G} \left[ \frac{(O_{D,g} - E_{D,g})^2}{E_{D,g}} + \frac{(O_{ND,g} - E_{ND,g})^2}{E_{ND,g}} \right]
\]
Where:
\(O_{D,g}\) and \(E_{D,g}\) are respectively the number of observed events (defaults, for example) and the number of expected events in group g.
\(O_{ND,g}\) and \(E_{ND,g}\) are respectively the number of observed non-events (non-defaults, for example) and the number of expected non-events in group g.
G is the number of groups (typically 10).
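As a quick illustration with made-up numbers: take a group of \(n_g = 100\) observations with \(O_{D,g} = 44\) observed events and \(E_{D,g} = 45.19\) expected events, so that \(O_{ND,g} = 56\) and \(E_{ND,g} = 54.81\). Its contribution to the statistic is \[
\frac{(44 - 45.19)^2}{45.19} + \frac{(56 - 54.81)^2}{54.81} \approx 0.031 + 0.026 \approx 0.057.
\]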
Background: The Hosmer-Lemeshow goodness-of-fit test
To understand the Hosmer-Lemeshow test, it is important to identify key concepts and definitions:
Observed Proportions of Events: The proportion of observed successes or failures in the data.
Expected Event Rates: The predicted probabilities of success or failure derived from the logistic regression model.
Logistic Regression: Logistic regression analyzes the relationship between a dependent categorical variable and a set of independent explanatory variables.
Dependent Variable: The dependent variable has two categories:
Failure or Success
Default or Non-default
Goal of Logistic Regression:
Predict the probability of an event (“success” or “default”) for any given value of the predictor(s).
Logistic Regression Model
Let Y be the dependent variable: \(Y = \begin{cases}
1 & \text{if default} \\
0 & \text{if not default}
\end{cases}\)
Let \(X_1, \ldots, X_p\) represent the set of p explanatory variables.
The logistic regression equation can be written as: \[
\text{logit}(P(Y = 1 \mid X)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p
\]
Here, \(\beta_0, \beta_1, \ldots, \beta_p\) are parameters to be estimated.
Computing Probabilities
For a dataset with N individuals:
For each individual i, compute the probability of success: \[
P_i = P(Y_i = 1 \mid X_1^i, \ldots, X_p^i) = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 X_1^i + \ldots + \beta_p X_p^i)\}}
\]
\(P_i\) represents the expected probability for individual i .
Create a table of individuals with their observed outcomes Y and predicted probabilities \(P_i\).
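To make the formula above concrete, here is a minimal sketch of the inverse-logit computation; the coefficients and covariates below are hypothetical, not estimated from any data:

```python
import numpy as np

def inverse_logit(x, beta):
    # P(Y = 1 | x) = 1 / (1 + exp(-(beta_0 + beta_1 x_1 + ... + beta_p x_p)))
    linear_predictor = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical coefficients (beta_0, beta_1, beta_2) and one individual's covariates
beta = np.array([-0.5, 1.2, -0.8])
x_i = np.array([0.3, 1.1])
print(inverse_logit(x_i, beta))  # P_i for this individual, about 0.265
```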
Example Table
After computing \(P_i\) for all individuals, the results can be summarized in a table:
| Individual | Event (\(Y\)) | \(P_i\) |
|---|---|---|
| 1 | 1 | 0.8 |
| 2 | 0 | 0.2 |
| 3 | 1 | 0.9 |
| 4 | 0 | 0.1 |
| … | … | … |
| N | 1 | 0.95 |
If the logistic regression fits well, the predicted probability \(P_i\) for each individual should align closely with the observed outcome Y. Specifically:
When Y=1 (the event occurred), \(P_i\) should be close to 1, reflecting high confidence in predicting the event.
When Y = 0 (the event did not occur), \(P_i\) should be close to 0, reflecting high confidence in predicting non-occurrence.
Performing the Hosmer-Lemeshow Test
After this stage, it is not difficult to carry out the Hosmer-Lemeshow test; what remains is ordering and grouping the individuals. The test can be performed by dividing the predicted probabilities \(P_i\) into deciles (10 groups based on percentile ranks) and then computing the chi-square statistic that compares the predicted to the observed frequencies (Hyeoun-AE, 2013).
The value of the Hosmer-Lemeshow statistic is given by: \[
HL = \sum_{g=1}^{G} \left[ \frac{(O_g - E_g)^2}{E_g} + \frac{\big((n_g - O_g) - (n_g - E_g)\big)^2}{n_g - E_g} \right]
\]
where \(n_g\), \(O_g\), and \(E_g\) are respectively the number of individuals, the observed number of events, and the expected number of events in group g. Calculate each term and sum them to obtain the overall Hosmer-Lemeshow statistic.
Interpreting the Hosmer-Lemeshow Test
Under the null hypothesis (the observed default numbers correspond to the expected default numbers), the test statistic asymptotically follows the \(\chi^2\) distribution with G - 2 degrees of freedom, where G is the number of groups.
If the p-value is higher than 5%, it indicates a small statistic and thus a limited gap between observed defaults and expected defaults, suggesting a good model fit.
If the p-value is lower than 5%, it indicates a significant discrepancy between observed and expected values, suggesting a poor model fit.
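In code, this decision rule is a simple threshold check; a minimal sketch, with a placeholder p-value and the conventional 5% level:

```python
alpha = 0.05    # conventional significance level
p_value = 0.70  # placeholder; use the p-value computed from the test
if p_value > alpha:
    print("No significant discrepancy detected: the fit looks acceptable.")
else:
    print("Significant discrepancy between observed and expected events: poor fit.")
```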
Caution: Hosmer and Lemeshow recommend avoiding the use of this test when there is a small number of observations (less than 400), as the test may yield unreliable results.
Application in Python
To implement the Hosmer-Lemeshow test in Python, a dataset is essential. The dependent variable, or event, Y is generated from a Bernoulli distribution. To simplify, only two independent continuous variables \(X_1\) and \(X_2\) are drawn from the normal distribution. In this example, the dataset has 1000 individuals.
```python
# Entry: Python packages, the number of individuals and the parameter of the Bernoulli law: 0.5.
# Output: a dataset of 1000 individuals.
# Objective: generate a dataset of 1000 individuals, with Y drawn from a Bernoulli law
# and X1 and X2 drawn from a normal law.
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate data for 1000 individuals
n_individuals = 1000

# Generate X1 and X2 from a normal distribution
X1 = np.random.normal(loc=0, scale=1, size=n_individuals)
X2 = np.random.normal(loc=0, scale=1, size=n_individuals)

# Generate Y using a Bernoulli distribution with a fixed probability of 0.5
Y = np.random.binomial(1, 0.5, size=n_individuals)

# Create a DataFrame
data = pd.DataFrame({'X1': X1, 'X2': X2, 'Y': Y})

# Display the first few rows of the dataset
data.head(4)
```
| | X1 | X2 | Y |
|---|---|---|---|
| 0 | 0.496714 | 1.399355 | 0 |
| 1 | -0.138264 | 0.924634 | 0 |
| 2 | 0.647689 | 0.059630 | 0 |
| 3 | 1.523030 | -0.646937 | 0 |
The parameters of the logistic regression are estimated and the probability of the event, \(P(Y = 1 \mid X_1, X_2)\), is computed for each individual.
```python
# Entry: the dataset data.
# Output: a dataset data_2 with the event variable Y and the expected probability P_i
# computed for each individual.
# Objective: estimate the parameters and compute P_i using sm.Logit.
import statsmodels.api as sm

# Prepare the data for logistic regression
X = data[['X1', 'X2']]
X = sm.add_constant(X)  # Add an intercept term
y = data['Y']

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Predict the probability of Y = 1 for each individual
data['P(Y=1 | X1, X2)'] = result.predict(X)

# Add an Individual ID column
data['Individual'] = range(1, len(data) + 1)

data_2 = data[['Individual', 'Y', 'P(Y=1 | X1, X2)']]
data_2.head(5)
```
```
Optimization terminated successfully.
         Current function value: 0.691401
         Iterations 4
```
| | Individual | Y | P(Y=1 \| X1, X2) |
|---|---|---|---|
| 0 | 1 | 0 | 0.531302 |
| 1 | 2 | 0 | 0.508046 |
| 2 | 3 | 0 | 0.520704 |
| 3 | 4 | 0 | 0.537612 |
| 4 | 5 | 1 | 0.502806 |
To compute the Hosmer-Lemeshow test, individuals are divided into deciles (10 groups) based on their predicted probabilities \(P_i\). For each group g, compute the observed number of events in the group (\(O_g\)) and the expected number of events in the group (\(E_g\)).
```python
# Entry: the dataset data_2.
# Output: the decile groups with O_g, the observed number of events per group,
# and E_g, the expected number of events per group.

# Create decile groups based on the predicted probabilities
data_2['Decile'] = pd.qcut(data_2['P(Y=1 | X1, X2)'], q=10, labels=False) + 1

# Calculate the observed (O_g) and expected (E_g) numbers for each decile group
grouped = data_2.groupby('Decile').agg(
    n_g=('Y', 'size'),                # Number of individuals per group
    O_g=('Y', 'sum'),                 # Observed number of events
    E_g=('P(Y=1 | X1, X2)', 'sum')    # Expected number of events
).reset_index()

grouped.head(5)
```
| | Decile | n_g | O_g | E_g |
|---|---|---|---|---|
| 0 | 1 | 100 | 44 | 45.189216 |
| 1 | 2 | 100 | 44 | 47.344473 |
| 2 | 3 | 100 | 47 | 48.343571 |
| 3 | 4 | 100 | 52 | 49.132066 |
| 4 | 5 | 100 | 48 | 49.877603 |
Finally, it is straightforward to compute the value of the Hosmer-Lemeshow statistic.
```python
# Entry: the grouped dataset.
# Output: the Hosmer-Lemeshow statistic.
# Objective: use the grouped data to calculate the Hosmer-Lemeshow statistic HL.

# Calculate the HL term for each group
grouped['HL_term'] = ((grouped['O_g'] - grouped['E_g']) ** 2) / grouped['E_g'] + \
    ((grouped['n_g'] - grouped['O_g']) - (grouped['n_g'] - grouped['E_g'])) ** 2 / (grouped['n_g'] - grouped['E_g'])

# Calculate the total HL statistic
HL_statistic = grouped['HL_term'].sum()

# Degrees of freedom: number of groups - 2
degrees_of_freedom = len(grouped) - 2

# Display the Hosmer-Lemeshow test results
HL_statistic, degrees_of_freedom
```
```
(np.float64(5.49669887032658), 8)
```
Under the null hypothesis (the observed default numbers correspond to the expected default numbers), the test statistic asymptotically follows a \(\chi^2\) distribution with 8 degrees of freedom. The p-value is given by the code below.
```python
# Entry: HL_statistic and degrees_of_freedom.
# Output: the p-value of the test.
# Objective: knowing that HL follows the chi2 distribution with 8 degrees of freedom,
# compute the p-value.
from scipy.stats import chi2

# Calculate the p-value for the HL test
p_value = 1 - chi2.cdf(HL_statistic, degrees_of_freedom)

# Display the p-value
p_value
```
```
np.float64(0.7034057045508038)
```
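Equivalently, the p-value can be obtained from scipy's survival function, which is numerically more stable than 1 - cdf when p-values are very small:

```python
# Equivalent, numerically stabler computation of the p-value
p_value = chi2.sf(HL_statistic, degrees_of_freedom)
```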
In this example, the p-value is 0.70, higher than 5%, which indicates a small statistic and thus a limited gap between observed and expected defaults, and so a good model fit. In a nutshell, two functions can be used to compute the Hosmer-Lemeshow test: the first one is hosmer_lemeshow_test and the second one is hosmer_lemeshow_test2.
```python
# Entry: a dataset containing the dependent variable Y, the predicted probabilities P
# (or the discretized probabilities / rating classes if available), and the number of
# groups when rating classes are not provided.
# Output: the Hosmer-Lemeshow test.
# Objective: compute the Hosmer-Lemeshow test.
def hosmer_lemeshow_test(data, depend_variable, pred_prb, rating_class=None, n_groups=10):
    """
    Compute the Hosmer-Lemeshow test.

    Parameters:
        data (DataFrame): The dataset containing the dependent variable Y and the predicted probabilities P.
        depend_variable (str): The name of the dependent variable Y.
        pred_prb (str): The name of the predicted probabilities P.
        rating_class (str): The name of the rating classes if available.
        n_groups (int): The number of groups for the test.

    Returns:
        HL_statistic (float): The Hosmer-Lemeshow test statistic.
        degrees_of_freedom (int): The degrees of freedom for the test.
        p_value (float): The p-value for the test.
    """
    # Create decile groups based on the predicted probabilities
    if rating_class is None:
        data['Decile'] = pd.qcut(data[pred_prb], q=n_groups, labels=False) + 1
    else:
        data['Decile'] = data[rating_class]

    # Calculate the observed (O_g) and expected (E_g) numbers for each group
    grouped = data.groupby('Decile').agg(
        n_g=(depend_variable, 'size'),   # Number of individuals per group
        O_g=(depend_variable, 'sum'),    # Observed number of events
        E_g=(pred_prb, 'sum')            # Expected number of events
    ).reset_index()

    # Compute the HL term for each group
    grouped['HL_term'] = ((grouped['O_g'] - grouped['E_g']) ** 2) / grouped['E_g'] + \
        ((grouped['n_g'] - grouped['O_g']) - (grouped['n_g'] - grouped['E_g'])) ** 2 / (grouped['n_g'] - grouped['E_g'])

    # Calculate the total HL statistic
    HL_statistic = grouped['HL_term'].sum()

    # Degrees of freedom: number of groups - 2
    degrees_of_freedom = len(grouped) - 2

    # Calculate the p-value for the HL test
    p_value = 1 - chi2.cdf(HL_statistic, degrees_of_freedom)

    return HL_statistic, degrees_of_freedom, p_value


# Test the function with the example data
HL_statistic, degrees_of_freedom, p_value = hosmer_lemeshow_test(data_2, 'Y', 'P(Y=1 | X1, X2)')
HL_statistic, degrees_of_freedom, p_value
```
```python
def hosmer_lemeshow_test2(data, depend_variable, pred_prb, rating_class=None, n_groups=10):
    """
    Compute the Hosmer-Lemeshow test.

    Parameters:
        data (DataFrame): The dataset containing the dependent variable Y and the predicted probabilities P.
        depend_variable (str): The name of the dependent variable Y.
        pred_prb (str): The name of the predicted probabilities P.
        rating_class (str): The name of the rating classes if available.
        n_groups (int): The number of groups for the test.

    Returns:
        HL_statistic (float): The Hosmer-Lemeshow test statistic.
        degrees_of_freedom (int): The degrees of freedom for the test.
        p_value (float): The p-value for the test.
    """
    # Create decile groups based on the predicted probabilities
    if rating_class is None:
        data['Decile'] = pd.qcut(data[pred_prb], q=n_groups, labels=False) + 1
    else:
        data['Decile'] = data[rating_class]

    # Compute the observed and expected numbers of events and non-events per group
    obsevents_1 = data[depend_variable].groupby(data['Decile']).sum()
    obsevents_0 = data[depend_variable].groupby(data['Decile']).count() - obsevents_1
    expevents_1 = data[pred_prb].groupby(data['Decile']).sum()
    expevents_0 = data[pred_prb].groupby(data['Decile']).count() - expevents_1

    hosmer = ((obsevents_1 - expevents_1) ** 2 / expevents_1).sum() + \
        ((obsevents_0 - expevents_0) ** 2 / expevents_0).sum()

    # Degrees of freedom: number of groups - 2
    degrees_of_freedom = len(obsevents_1) - 2

    # Calculate the p-value for the HL test
    p_value = 1 - chi2.cdf(hosmer, degrees_of_freedom)

    return hosmer, degrees_of_freedom, p_value


# Test the function with the example data
hosmer, degrees_of_freedom, p_value = hosmer_lemeshow_test2(data_2, 'Y', 'P(Y=1 | X1, X2)')
hosmer, degrees_of_freedom, p_value
```
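The two functions implement the same statistic and should return identical results on this dataset; they differ only in bookkeeping. The first assembles the group counts in a single aggregated DataFrame and sums a per-group term, while the second keeps the observed and expected counts of events and non-events as separate grouped series and sums the two chi-square components directly.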
The Hosmer-Lemeshow test is a valuable tool for evaluating the goodness-of-fit of a logistic regression model. By comparing the observed and expected event rates in decile groups, the test provides insights into the model’s predictive performance. A high p-value suggests a good model fit, while a low p-value indicates a significant discrepancy between observed and expected values. However, caution should be exercised when using the test with small sample sizes, as it may yield unreliable results.