Evaluation of Logistic Regression: The Hosmer-Lemeshow Test
Author
Jumbong Junior
Hosmer-Lemeshow Test
Model performance is commonly evaluated on two main criteria: discrimination and calibration. Discrimination describes the ability of the model to assign higher probabilities of the outcome to those observations that actually experience the outcome. A recognized metric for assessing a model's discrimination is the area under the receiver operating characteristic (ROC) curve. Calibration, or fit, on the other hand, captures how closely the predicted probabilities match the actual occurrence of the outcome. Several tests and graphical methods have been proposed to assess the calibration of a model, which is often referred to as "goodness of fit." Among the goodness-of-fit tests, the Hosmer-Lemeshow (HL) test is the most widely applied approach (Nattino, 2020).
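While this post focuses on calibration, discrimination can be quantified in one line. Below is a minimal sketch, assuming scikit-learn is available; the y_true and y_score arrays are illustrative placeholders, not drawn from the example developed later:

```python
from sklearn.metrics import roc_auc_score

# Illustrative data: observed binary outcomes and model-predicted probabilities
y_true = [1, 0, 1, 0, 1]
y_score = [0.8, 0.2, 0.9, 0.4, 0.6]

# AUC is 1.0 here because every event is ranked above every non-event
print(roc_auc_score(y_true, y_score))
```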
Central idea of the test: the main idea is to divide the observations into groups and compute a \(\chi^2\) statistic that reflects the overall mismatch between the observed and expected numbers of events in each group-outcome category.
As with most goodness of fit tests, the HL test is designed to decide between a null hypothesis of perfect fit, where the probabilities assumed by the model are hypothesized to coincide with the real probabilities, and a general alternative hypothesis of nonperfect fit. Let’s present the framework of the test.
The value of the test statistic is: \[
HL = \sum_{g=1}^{G} \left[ \frac{(O_{D,g} - E_{D,g})^2}{E_{D,g}} + \frac{(O_{ND,g} - E_{ND,g})^2}{E_{ND,g}} \right]
\]
Where:
\(O_{D,g}\) and \(E_{D,g}\) are respectively the number of observed events (defaults, for example) and the number of expected events in group g.
\(O_{ND,g}\) and \(E_{ND,g}\) are respectively the number of observed non-events (non-defaults, for example) and the number of expected non-events in group g.
G is the number of groups (typically 10).
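As a quick illustration with made-up numbers: take a group of \(n_g = 100\) observations with \(O_{D,g} = 44\) observed events and \(E_{D,g} = 45.19\) expected events, so that \(O_{ND,g} = 56\) and \(E_{ND,g} = 54.81\). Its contribution to the statistic is \[
\frac{(44 - 45.19)^2}{45.19} + \frac{(56 - 54.81)^2}{54.81} \approx 0.031 + 0.026 \approx 0.057.
\]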
Background: The Hosmer-Lemeshow goodness-of-fit test
To understand the Hosmer-Lemeshow test, it is important to identify key concepts and definitions:
Observed Proportions of Events: The proportion of observed successes or failures in the data.
Expected Event Rates: The predicted probabilities of success or failure derived from the logistic regression model.
Logistic Regression: Logistic regression analyzes the relationship between a dependent categorical variable and a set of independent explanatory variables.
Dependent Variable: The dependent variable has two categories:
Failure or Success
Default or Non-default
Goal of Logistic Regression:
Predict the probability of an event (“success” or “default”) for any given value of the predictor(s).
Logistic Regression Model
Let Y be the dependent variable: \(Y = \begin{cases}
1 & \text{if default} \\
0 & \text{if not default}
\end{cases}\)
Let \(X_1, \ldots, X_p\) represent the set of p explanatory variables.
The logistic regression equation can be written as: \[
\text{logit}(P(Y = 1 \mid X)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p
\]
Here, \(\beta_0, \beta_1, \ldots, \beta_p\) are parameters to be estimated.
Computing Probabilities
For a dataset with N individuals:
For each individual i, compute the probability of success: \[
P_i = P(Y_i = 1 \mid X_1^i, \ldots, X_p^i) = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 X_1^i + \ldots + \beta_p X_p^i)\}}
\]
\(P_i\) represents the expected probability for individual i .
Create a table of individuals with their observed outcomes Y and predicted probabilities \(P_i\).
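To make the formula above concrete, here is a minimal sketch of the inverse-logit computation; the coefficients and covariates below are hypothetical, not estimated from any data:

```python
import numpy as np

def inverse_logit(x, beta):
    # P(Y = 1 | x) = 1 / (1 + exp(-(beta_0 + beta_1 x_1 + ... + beta_p x_p)))
    linear_predictor = beta[0] + np.dot(beta[1:], x)
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical coefficients (beta_0, beta_1, beta_2) and one individual's covariates
beta = np.array([-0.5, 1.2, -0.8])
x_i = np.array([0.3, 1.1])
print(inverse_logit(x_i, beta))  # P_i for this individual, about 0.265
```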
Example Table
After computing \(P_i\) for all individuals, the results can be summarized in a table:
| Individual | Event (\(Y\)) | \(P_i\) |
|---|---|---|
| 1 | 1 | 0.8 |
| 2 | 0 | 0.2 |
| 3 | 1 | 0.9 |
| 4 | 0 | 0.1 |
| … | … | … |
| N | 1 | 0.95 |
If the logistic regression fits well, the predicted probability \(P_i\) for each individual should align closely with the observed outcome Y. Specifically:
When Y=1 (the event occurred), \(P_i\) should be close to 1, reflecting high confidence in predicting the event.
When Y = 0 (the event did not occur), \(P_i\) should be close to 0, reflecting high confidence in predicting non-occurrence.
Performing the Hosmer-Lemeshow Test
After this stage, it is not difficult to carry out the Hosmer-Lemeshow test; what remains is ordering and grouping the individuals. The test can be performed by dividing the predicted probabilities \(P_i\) into deciles (10 groups based on percentile ranks) and then computing the chi-square statistic that compares the predicted to the observed frequencies (Hyeoun-AE, 2013).
The value of the Hosmer-Lemeshow statistic is given by: \[
HL = \sum_{g=1}^{G} \left[ \frac{(O_g - E_g)^2}{E_g} + \frac{\big((n_g - O_g) - (n_g - E_g)\big)^2}{n_g - E_g} \right]
\]
where \(n_g\), \(O_g\), and \(E_g\) are respectively the number of individuals, the observed number of events, and the expected number of events in group g. Calculate each term and sum them to obtain the overall Hosmer-Lemeshow statistic.
Interpreting the Hosmer-Lemeshow Test
Under the null hypothesis (the observed default numbers correspond to the expected default numbers), the test statistic asymptotically follows the \(\chi^2\) distribution with G - 2 degrees of freedom, where G is the number of groups.
If the p-value is higher than 5%, it indicates a small statistic and thus a limited gap between observed defaults and expected defaults, suggesting a good model fit.
If the p-value is lower than 5%, it indicates a significant discrepancy between observed and expected values, suggesting a poor model fit.
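In code, this decision rule is a simple threshold check; a minimal sketch, with a placeholder p-value and the conventional 5% level:

```python
alpha = 0.05    # conventional significance level
p_value = 0.70  # placeholder; use the p-value computed from the test
if p_value > alpha:
    print("No significant discrepancy detected: the fit looks acceptable.")
else:
    print("Significant discrepancy between observed and expected events: poor fit.")
```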
Caution: Hosmer and Lemeshow recommend avoiding the use of this test when there is a small number of observations (less than 400), as the test may yield unreliable results.
Application in Python
To implement the Hosmer-Lemeshow test in Python, a dataset is essential. The dependent variable, or event, Y is generated from a Bernoulli distribution. To simplify, only two independent continuous variables \(X_1\) and \(X_2\) are drawn from the normal distribution. In this example, the dataset has 1000 individuals.
```python
# Entry: Python packages, the number of individuals and the parameter of the Bernoulli law: 0.5.
# Output: a dataset of 1000 individuals.
# Objective: generate a dataset of 1000 individuals, with Y drawn from a Bernoulli law
# and X1 and X2 drawn from a normal law.
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Generate data for 1000 individuals
n_individuals = 1000

# Generate X1 and X2 from a normal distribution
X1 = np.random.normal(loc=0, scale=1, size=n_individuals)
X2 = np.random.normal(loc=0, scale=1, size=n_individuals)

# Generate Y using a Bernoulli distribution with a fixed probability of 0.5
Y = np.random.binomial(1, 0.5, size=n_individuals)

# Create a DataFrame
data = pd.DataFrame({'X1': X1, 'X2': X2, 'Y': Y})

# Display the first few rows of the dataset
data.head(4)
```
| | X1 | X2 | Y |
|---|---|---|---|
| 0 | 0.496714 | 1.399355 | 0 |
| 1 | -0.138264 | 0.924634 | 0 |
| 2 | 0.647689 | 0.059630 | 0 |
| 3 | 1.523030 | -0.646937 | 0 |
The parameters of the logistic regression are estimated and the probability of the event, \(P(Y = 1 \mid X_1, X_2)\), is computed for each individual.
```python
# Entry: the dataset data.
# Output: a dataset data_2 with the event variable Y and the expected probability P_i
# computed for each individual.
# Objective: estimate the parameters and compute P_i using sm.Logit.
import statsmodels.api as sm

# Prepare the data for logistic regression
X = data[['X1', 'X2']]
X = sm.add_constant(X)  # Add an intercept term
y = data['Y']

# Fit the logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Predict the probability of Y = 1 for each individual
data['P(Y=1 | X1, X2)'] = result.predict(X)

# Add an Individual ID column
data['Individual'] = range(1, len(data) + 1)

data_2 = data[['Individual', 'Y', 'P(Y=1 | X1, X2)']]
data_2.head(5)
```
```
Optimization terminated successfully.
         Current function value: 0.691401
         Iterations 4
```
| | Individual | Y | P(Y=1 \| X1, X2) |
|---|---|---|---|
| 0 | 1 | 0 | 0.531302 |
| 1 | 2 | 0 | 0.508046 |
| 2 | 3 | 0 | 0.520704 |
| 3 | 4 | 0 | 0.537612 |
| 4 | 5 | 1 | 0.502806 |
To compute the Hosmer-Lemeshow test, individuals are divided into deciles (10 groups) based on their predicted probabilities \(P_i\). For each group g, compute the observed number of events in the group (\(O_g\)) and the expected number of events in the group (\(E_g\)).
```python
# Entry: the dataset data_2.
# Output: the decile groups with O_g, the observed number of events per group,
# and E_g, the expected number of events per group.

# Create decile groups based on the predicted probabilities
data_2['Decile'] = pd.qcut(data_2['P(Y=1 | X1, X2)'], q=10, labels=False) + 1

# Calculate the observed (O_g) and expected (E_g) numbers for each decile group
grouped = data_2.groupby('Decile').agg(
    n_g=('Y', 'size'),                # Number of individuals per group
    O_g=('Y', 'sum'),                 # Observed number of events
    E_g=('P(Y=1 | X1, X2)', 'sum')    # Expected number of events
).reset_index()

grouped.head(5)
```
| | Decile | n_g | O_g | E_g |
|---|---|---|---|---|
| 0 | 1 | 100 | 44 | 45.189216 |
| 1 | 2 | 100 | 44 | 47.344473 |
| 2 | 3 | 100 | 47 | 48.343571 |
| 3 | 4 | 100 | 52 | 49.132066 |
| 4 | 5 | 100 | 48 | 49.877603 |
Finally, it is straightforward to compute the value of the Hosmer-Lemeshow statistic.
```python
# Entry: the grouped dataset.
# Output: the Hosmer-Lemeshow statistic.
# Objective: use the grouped data to calculate the Hosmer-Lemeshow statistic HL.

# Calculate the HL term for each group
grouped['HL_term'] = ((grouped['O_g'] - grouped['E_g']) ** 2) / grouped['E_g'] + \
    ((grouped['n_g'] - grouped['O_g']) - (grouped['n_g'] - grouped['E_g'])) ** 2 / (grouped['n_g'] - grouped['E_g'])

# Calculate the total HL statistic
HL_statistic = grouped['HL_term'].sum()

# Degrees of freedom: number of groups - 2
degrees_of_freedom = len(grouped) - 2

# Display the Hosmer-Lemeshow test results
HL_statistic, degrees_of_freedom
```
```
(np.float64(5.49669887032658), 8)
```
Under the null hypothesis (the observed default numbers correspond to the expected default numbers), the test statistic asymptotically follows a \(\chi^2\) distribution with 8 degrees of freedom. The p-value is given by the code below.
```python
# Entry: HL_statistic and degrees_of_freedom.
# Output: the p-value of the test.
# Objective: knowing that HL follows the chi2 distribution with 8 degrees of freedom,
# compute the p-value.
from scipy.stats import chi2

# Calculate the p-value for the HL test
p_value = 1 - chi2.cdf(HL_statistic, degrees_of_freedom)

# Display the p-value
p_value
```
```
np.float64(0.7034057045508038)
```
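Equivalently, the p-value can be obtained from scipy's survival function, which is numerically more stable than 1 - cdf when p-values are very small:

```python
# Equivalent, numerically stabler computation of the p-value
p_value = chi2.sf(HL_statistic, degrees_of_freedom)
```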
In this example, the p-value is 0.70, higher than 5%, which indicates a small statistic and thus a limited gap between observed and expected defaults, and so a good model fit. In a nutshell, two functions can be used to compute the Hosmer-Lemeshow test: the first one is hosmer_lemeshow_test and the second one is hosmer_lemeshow_test2.
```python
# Entry: a dataset containing the dependent variable Y, the predicted probabilities P
# (or the discretized probabilities / rating classes if available), and the number of
# groups when rating classes are not provided.
# Output: the Hosmer-Lemeshow test.
# Objective: compute the Hosmer-Lemeshow test.
def hosmer_lemeshow_test(data, depend_variable, pred_prb, rating_class=None, n_groups=10):
    """
    Compute the Hosmer-Lemeshow test.

    Parameters:
        data (DataFrame): The dataset containing the dependent variable Y and the predicted probabilities P.
        depend_variable (str): The name of the dependent variable Y.
        pred_prb (str): The name of the predicted probabilities P.
        rating_class (str): The name of the rating classes if available.
        n_groups (int): The number of groups for the test.

    Returns:
        HL_statistic (float): The Hosmer-Lemeshow test statistic.
        degrees_of_freedom (int): The degrees of freedom for the test.
        p_value (float): The p-value for the test.
    """
    # Create decile groups based on the predicted probabilities
    if rating_class is None:
        data['Decile'] = pd.qcut(data[pred_prb], q=n_groups, labels=False) + 1
    else:
        data['Decile'] = data[rating_class]

    # Calculate the observed (O_g) and expected (E_g) numbers for each group
    grouped = data.groupby('Decile').agg(
        n_g=(depend_variable, 'size'),   # Number of individuals per group
        O_g=(depend_variable, 'sum'),    # Observed number of events
        E_g=(pred_prb, 'sum')            # Expected number of events
    ).reset_index()

    # Compute the HL term for each group
    grouped['HL_term'] = ((grouped['O_g'] - grouped['E_g']) ** 2) / grouped['E_g'] + \
        ((grouped['n_g'] - grouped['O_g']) - (grouped['n_g'] - grouped['E_g'])) ** 2 / (grouped['n_g'] - grouped['E_g'])

    # Calculate the total HL statistic
    HL_statistic = grouped['HL_term'].sum()

    # Degrees of freedom: number of groups - 2
    degrees_of_freedom = len(grouped) - 2

    # Calculate the p-value for the HL test
    p_value = 1 - chi2.cdf(HL_statistic, degrees_of_freedom)

    return HL_statistic, degrees_of_freedom, p_value


# Test the function with the example data
HL_statistic, degrees_of_freedom, p_value = hosmer_lemeshow_test(data_2, 'Y', 'P(Y=1 | X1, X2)')
HL_statistic, degrees_of_freedom, p_value
```
```python
def hosmer_lemeshow_test2(data, depend_variable, pred_prb, rating_class=None, n_groups=10):
    """
    Compute the Hosmer-Lemeshow test.

    Parameters:
        data (DataFrame): The dataset containing the dependent variable Y and the predicted probabilities P.
        depend_variable (str): The name of the dependent variable Y.
        pred_prb (str): The name of the predicted probabilities P.
        rating_class (str): The name of the rating classes if available.
        n_groups (int): The number of groups for the test.

    Returns:
        HL_statistic (float): The Hosmer-Lemeshow test statistic.
        degrees_of_freedom (int): The degrees of freedom for the test.
        p_value (float): The p-value for the test.
    """
    # Create decile groups based on the predicted probabilities
    if rating_class is None:
        data['Decile'] = pd.qcut(data[pred_prb], q=n_groups, labels=False) + 1
    else:
        data['Decile'] = data[rating_class]

    # Compute the observed and expected numbers of events and non-events per group
    obsevents_1 = data[depend_variable].groupby(data['Decile']).sum()
    obsevents_0 = data[depend_variable].groupby(data['Decile']).count() - obsevents_1
    expevents_1 = data[pred_prb].groupby(data['Decile']).sum()
    expevents_0 = data[pred_prb].groupby(data['Decile']).count() - expevents_1

    hosmer = ((obsevents_1 - expevents_1) ** 2 / expevents_1).sum() + \
        ((obsevents_0 - expevents_0) ** 2 / expevents_0).sum()

    # Degrees of freedom: number of groups - 2
    degrees_of_freedom = len(obsevents_1) - 2

    # Calculate the p-value for the HL test
    p_value = 1 - chi2.cdf(hosmer, degrees_of_freedom)

    return hosmer, degrees_of_freedom, p_value


# Test the function with the example data
hosmer, degrees_of_freedom, p_value = hosmer_lemeshow_test2(data_2, 'Y', 'P(Y=1 | X1, X2)')
hosmer, degrees_of_freedom, p_value
```
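The two functions implement the same statistic and should return identical results on this dataset; they differ only in bookkeeping. The first assembles the group counts in a single aggregated DataFrame and sums a per-group term, while the second keeps the observed and expected counts of events and non-events as separate grouped series and sums the two chi-square components directly.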
The Hosmer-Lemeshow test is a valuable tool for evaluating the goodness-of-fit of a logistic regression model. By comparing the observed and expected event rates in decile groups, the test provides insights into the model’s predictive performance. A high p-value suggests a good model fit, while a low p-value indicates a significant discrepancy between observed and expected values. However, caution should be exercised when using the test with small sample sizes, as it may yield unreliable results.