regression Module

Regression models for epidemiological analysis.

This module provides functions for fitting and interpreting regression models commonly used in epidemiology, including logistic regression for binary outcomes and Poisson regression for count data.

Classes

class episia.stats.regression.RegressionType(value)

Bases: Enum

Types of regression models.

LINEAR = 'linear'
LOGISTIC = 'logistic'
POISSON = 'poisson'

class episia.stats.regression.ModelSelection(value)

Bases: Enum

Model selection criteria.

AIC = 'aic'
BIC = 'bic'
LIKELIHOOD_RATIO = 'likelihood_ratio'
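
For reference, the two information criteria are conventionally defined from the maximized log-likelihood as sketched below. These are the textbook formulas only; the helper names `aic` and `bic` are hypothetical and not part of episia, the input numbers are made up for illustration, and the library may count parameters differently.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_observations: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*logL (lower is better)."""
    return n_params * math.log(n_observations) - 2 * log_likelihood

# Made-up values: log-likelihood, number of parameters, sample size
ll, k, n = -120.5, 3, 200
print(f"AIC={aic(ll, k):.2f}  BIC={bic(ll, k, n):.2f}")
```

BIC penalizes each extra parameter by ln(n) rather than 2, so it tends to favor smaller models once n exceeds about 7.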

class episia.stats.regression.RegressionResult(coefficients, odds_ratios, ci_lower, ci_upper, p_values, variable_names, model_type, n_observations, log_likelihood, aic, bic, convergence, iterations)

Bases: object

Result object for regression analysis.

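Parameters:
  • coefficients (ndarray)

  • odds_ratios (ndarray)

  • ci_lower (ndarray)

  • ci_upper (ndarray)

  • p_values (ndarray)

  • variable_names (List[str])

  • model_type (str)

  • n_observations (int)

  • log_likelihood (float)

  • aic (float)

  • bic (float)

  • convergence (bool)

  • iterations (int)

__repr__()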

Return repr(self).

Return type:

str

aic: float
bic: float
ci_lower: ndarray
ci_upper: ndarray
coefficients: ndarray
convergence: bool
iterations: int
log_likelihood: float
model_type: str
n_observations: int
odds_ratios: ndarray
p_values: ndarray

predict(X)

Predict probabilities or expected counts.

Parameters:

X (ndarray) – Design matrix; include an intercept column if the model was fitted with one

Returns:

Predicted values

Return type:

ndarray
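
As a rough illustration of what "probabilities or expected counts" means, the sketch below applies the standard inverse link to the linear predictor. It assumes the usual links (logistic function, exponential, identity); `predict_from_coefficients` is a hypothetical stand-in and not episia's implementation, and the coefficient values are made up.

```python
import numpy as np

def predict_from_coefficients(X, coefficients, model_type):
    """Apply the inverse link to the linear predictor X @ coefficients."""
    eta = np.asarray(X, dtype=float) @ np.asarray(coefficients, dtype=float)
    if model_type == "logistic":
        return 1.0 / (1.0 + np.exp(-eta))   # probability in (0, 1)
    if model_type == "poisson":
        return np.exp(eta)                  # expected count, always positive
    return eta                              # linear model: identity link

# Intercept column first, then one binary predictor (illustrative values)
X = np.array([[1.0, 0.0], [1.0, 1.0]])
beta = np.array([0.0, np.log(2.0)])
probs = predict_from_coefficients(X, beta, "logistic")
counts = predict_from_coefficients(X, beta, "poisson")
print(probs, counts)
```

The same coefficient (log 2) doubles the odds in the logistic case and doubles the expected count in the Poisson case.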

summary(decimal_places=3)

Create summary table of regression results.

Parameters:

decimal_places (int) – Number of decimal places for display

Returns:

pandas DataFrame with results

Return type:

DataFrame

variable_names: List[str]

Functions

episia.stats.regression.logistic_regression(X, y, variable_names=None, add_intercept=True, method='irls', max_iter=100, tol=1e-06)

Fit logistic regression model for binary outcomes.

Parameters:
  • X (ndarray) – Design matrix (n_samples, n_features)

  • y (ndarray) – Binary outcome (0 or 1)

  • variable_names (List[str] | None) – Names of predictor variables

  • add_intercept (bool) – Whether to add intercept term

  • method (str) – Fitting method ('irls' or 'newton')

  • max_iter (int) – Maximum iterations

  • tol (float) – Convergence tolerance

Returns:

RegressionResult object

Return type:

RegressionResult

Example

>>> import numpy as np
>>> from episia.stats.regression import logistic_regression
>>> X = np.array([[1, 25], [1, 30], [1, 35], [0, 40]])
>>> y = np.array([1, 1, 0, 0])
>>> result = logistic_regression(X, y, variable_names=['exposed', 'age'])
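
To make the `max_iter` and `tol` parameters concrete, here is a minimal sketch of iteratively reweighted least squares for a logistic model. It is a generic textbook version, not episia's code: `irls_logistic` is a hypothetical name, the design matrix is assumed to already contain an intercept column, convergence is judged by the largest coefficient change (an assumption), and the data are made up.

```python
import numpy as np

def irls_logistic(X, y, max_iter=100, tol=1e-6):
    """Fit logistic regression by iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for iteration in range(1, max_iter + 1):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # fitted probabilities
        w = mu * (1.0 - mu)                      # IRLS weights
        # Newton step: solve (X' W X) delta = X' (y - mu)
        delta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:
            return beta, True, iteration
    return beta, False, max_iter

# Made-up data: intercept column plus one binary exposure
# (7 of 10 exposed and 3 of 10 unexposed have the outcome)
X = np.array([[1, 1]] * 10 + [[1, 0]] * 10)
y = np.array([1] * 7 + [0] * 3 + [1] * 3 + [0] * 7)
beta, converged, n_iter = irls_logistic(X, y)
odds_ratio = np.exp(beta[1])
print(converged, n_iter, round(odds_ratio, 3))
```

Because this model is saturated for a 2x2 table, the fitted odds ratio should match the cross-product ratio (7/3)/(3/7), which gives a quick sanity check on the solver.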

episia.stats.regression.poisson_regression(X, y, offset=None, variable_names=None, add_intercept=True, max_iter=100, tol=1e-06)

Fit Poisson regression model for count data.

Parameters:
  • X (ndarray) – Design matrix

  • y (ndarray) – Count outcome (non-negative integers)

  • offset (ndarray | None) – Offset term (e.g., log(person-time))

  • variable_names (List[str] | None) – Names of predictor variables

  • add_intercept (bool) – Whether to add intercept term

  • max_iter (int) – Maximum iterations

  • tol (float) – Convergence tolerance

Returns:

RegressionResult object

Return type:

RegressionResult
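
The role of the offset can be seen in a small sketch: with log(person-time) as offset, the exponentiated coefficient is a rate ratio rather than a ratio of raw counts. The code below is a generic Newton/IRLS fit under that assumption, not episia's implementation; `irls_poisson` is a hypothetical name, the design matrix is assumed to include its own intercept column, and the counts and person-time are made up.

```python
import numpy as np

def irls_poisson(X, y, offset=None, max_iter=100, tol=1e-6):
    """Fit a log-link Poisson model; `offset` is added to the linear predictor."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    offset = np.zeros(len(y)) if offset is None else np.asarray(offset, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta + offset)           # expected counts
        # Newton step: solve (X' diag(mu) X) delta = X' (y - mu)
        delta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
        beta = beta + delta
        if np.max(np.abs(delta)) < tol:
            return beta, True
    return beta, False

# Made-up cohort: 30 cases in 1000 person-years (exposed),
# 20 cases in 2000 person-years (unexposed)
X = np.array([[1, 1], [1, 0]])
y = np.array([30, 20])
person_time = np.array([1000.0, 2000.0])
beta, converged = irls_poisson(X, y, offset=np.log(person_time))
rate_ratio = np.exp(beta[1])
print(converged, round(rate_ratio, 3))
```

Without the offset the same data would yield a count ratio of 30/20 = 1.5; accounting for person-time gives the rate ratio (30/1000)/(20/2000) = 3.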

episia.stats.regression.likelihood_ratio_test(model_full, model_reduced)

Perform likelihood ratio test between nested models.

Parameters:
  • model_full – Fitted full model

  • model_reduced – Fitted reduced model, nested within the full model

Returns:

Dictionary with test statistics

Return type:

Dict[str, float]
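
The test itself is short enough to sketch from two log-likelihoods. The version below uses only the standard library and assumes the usual definition (twice the log-likelihood difference, compared with a chi-square distribution whose degrees of freedom equal the number of extra parameters). The function names and the `df` key are hypothetical; only `lr_statistic` and `p_value` appear in the usage example in this document. The log-likelihood values are made up.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        term = math.exp(-z)                  # k = 0 term of the finite series
        total = term
        for k in range(1, df // 2):
            term *= z / k
            total += term
        return total
    total = math.erfc(math.sqrt(z))          # exact tail for df = 1
    term = math.exp(-z) * math.sqrt(z) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):    # add terms for df = 3, 5, ...
        total += term
        term *= z / (k + 0.5)
    return total

def likelihood_ratio(ll_full: float, ll_reduced: float, df: int) -> dict:
    """LR statistic 2*(logL_full - logL_reduced) with its chi-square p-value."""
    stat = 2.0 * (ll_full - ll_reduced)
    return {"lr_statistic": stat, "df": df, "p_value": chi2_sf(stat, df)}

# Made-up log-likelihoods; the full model has two extra parameters
lrt = likelihood_ratio(ll_full=-100.0, ll_reduced=-103.0, df=2)
print(f"LR={lrt['lr_statistic']:.3f}, p={lrt['p_value']:.4f}")
```

The comparison is only meaningful when the reduced model is nested in the full model and both were fitted to the same observations.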

episia.stats.regression.hosmer_lemeshow_test(y_true, y_pred, n_groups=10)

Hosmer-Lemeshow goodness-of-fit test for logistic regression.

Parameters:
  • y_true (ndarray) – True binary outcomes

  • y_pred (ndarray) – Predicted probabilities

  • n_groups (int) – Number of groups to form

Returns:

Dictionary with test statistics

Return type:

Dict[str, float]
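
A minimal sketch of the statistic, assuming the common convention of sorting by predicted risk, splitting into `n_groups` equal-sized bins, and comparing observed with expected events in each bin. `hosmer_lemeshow_statistic` is a hypothetical name, the library may bin or handle ties differently, and the data are made up.

```python
def hosmer_lemeshow_statistic(y_true, y_pred, n_groups=10):
    """Return (statistic, degrees of freedom) for equal-sized risk groups."""
    pairs = sorted(zip(y_pred, y_true))          # order by predicted risk
    n = len(pairs)
    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        observed = sum(y for _, y in group)      # events seen in the group
        expected = sum(p for p, _ in group)      # events the model predicts
        variance = expected * (1.0 - expected / len(group))
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic, n_groups - 2

# Made-up data: three risk strata of five subjects each
y_pred = [0.2] * 5 + [0.5] * 5 + [0.8] * 5
y_true = [1, 1, 0, 0, 0] + [1, 1, 0, 0, 0] + [1, 1, 1, 1, 0]
stat, df = hosmer_lemeshow_statistic(y_true, y_pred, n_groups=3)
print(round(stat, 3), df)
```

The statistic is usually referred to a chi-square distribution with `n_groups - 2` degrees of freedom; a small p-value suggests poor calibration.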

episia.stats.regression.calculate_vif(X)

Calculate Variance Inflation Factors for multicollinearity detection.

Parameters:

X (ndarray) – Design matrix (without intercept)

Returns:

Dictionary with VIF for each variable

Return type:

Dict[str, float]
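
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below implements that definition directly with NumPy; `vif_from_r_squared` is a hypothetical helper that returns a plain list, whereas `calculate_vif` returns a dictionary whose key naming is not shown here. The data are made up.

```python
import numpy as np

def vif_from_r_squared(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Auxiliary regression includes its own intercept column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        total_ss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - (residual @ residual) / total_ss
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Two made-up predictors with correlation 0.8
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 4.0], [4.0, 3.0]])
vifs = vif_from_r_squared(X)
print([round(v, 3) for v in vifs])
```

With two predictors correlated at 0.8, R² is 0.64 and both VIFs equal 1 / 0.36. Values above roughly 5 to 10 are often treated as a warning sign, though that threshold is a rule of thumb.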

episia.stats.regression.stepwise_selection(X, y, model_type=RegressionType.LOGISTIC, direction='both', criterion=ModelSelection.AIC, max_vars=None)

Perform stepwise variable selection.

Parameters:
  • X (ndarray) – Design matrix

  • y (ndarray) – Outcome

  • model_type (RegressionType) – Type of regression model

  • direction (str) – 'forward', 'backward', or 'both'

  • criterion (ModelSelection) – Selection criterion

  • max_vars (int | None) – Maximum number of variables to include

Returns:

Dictionary with selected model and steps

Return type:

Dict
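
The forward direction can be sketched independently of any model-fitting code: start from the empty model and repeatedly add the variable that most lowers the criterion, stopping when nothing improves it or `max_vars` is reached. Everything below is illustrative: `forward_selection`, the returned keys, and the toy criterion table are hypothetical, and the numbers are made up to stand in for AIC values of fitted models.

```python
def forward_selection(candidates, score, max_vars=None):
    """Greedy forward selection; `score(subset)` returns a criterion to minimize."""
    selected, remaining = [], list(candidates)
    best_score = score(tuple(selected))
    steps = []
    while remaining and (max_vars is None or len(selected) < max_vars):
        trial = {v: score(tuple(selected + [v])) for v in remaining}
        best_var = min(trial, key=trial.get)
        if trial[best_var] >= best_score:
            break                                # no candidate improves the criterion
        selected.append(best_var)
        remaining.remove(best_var)
        best_score = trial[best_var]
        steps.append((best_var, best_score))
    return {"selected": selected, "score": best_score, "steps": steps}

# Toy lookup table standing in for the AIC of each fitted model (made-up numbers)
toy_aic = {
    (): 120.0,
    ("age",): 110.0, ("exposed",): 105.0, ("sex",): 121.0,
    ("exposed", "age"): 101.0, ("exposed", "sex"): 106.5,
    ("exposed", "age", "sex"): 102.4,
}
result = forward_selection(["age", "exposed", "sex"], lambda s: toy_aic[s])
print(result["selected"], result["score"])
```

In practice `score` would fit a model on the chosen columns and return its AIC or BIC. Greedy selection does not guarantee the globally best subset.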

episia.stats.regression.roc_auc_from_logistic(model, X, y)

Calculate ROC AUC from logistic regression model.

Parameters:
  • model (RegressionResult) – Fitted logistic regression model

  • X (ndarray) – Design matrix (with intercept if model has it)

  • y (ndarray) – True outcomes

Returns:

AUC value

Return type:

float
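
The AUC has a direct probabilistic reading: the chance that a randomly chosen case receives a higher predicted probability than a randomly chosen non-case, with ties counted as one half. The sketch below computes it that way from outcomes and scores; `roc_auc` is a hypothetical helper, not the library function, and it uses a simple pairwise comparison that is fine for small samples.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random case outranks a random non-case."""
    positives = [s for s, y in zip(y_score, y_true) if y == 1]
    negatives = [s for s, y in zip(y_score, y_true) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0   # ties count as half a win
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))

# Made-up outcomes and predicted probabilities
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of cases from non-cases.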

episia.stats.regression.interaction_term(X1, X2, center=True)

Create interaction term for regression.

Parameters:
  • X1 (ndarray) – First variable

  • X2 (ndarray) – Second variable

  • center (bool) – Whether to center variables before multiplying

Returns:

Interaction term

Return type:

ndarray
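
The effect of `center` is easiest to see numerically. The sketch below assumes centering means subtracting each variable's mean before taking the elementwise product; `centered_interaction` is a hypothetical stand-in for the library function, and the values are made up.

```python
import numpy as np

def centered_interaction(x1, x2, center=True):
    """Elementwise product of two variables, optionally mean-centered first."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    if center:
        x1 = x1 - x1.mean()
        x2 = x2 - x2.mean()
    return x1 * x2

age = np.array([20.0, 30.0, 40.0])
dose = np.array([1.0, 2.0, 3.0])
centered = centered_interaction(age, dose)               # product of deviations
raw = centered_interaction(age, dose, center=False)      # plain product
print(centered, raw)
```

Centering typically lowers the correlation between the product term and its component variables, which helps keep VIFs down, and it makes the main-effect coefficients interpretable at the mean of the other variable.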

Examples

Logistic regression:

import numpy as np
from episia.stats.regression import logistic_regression

# Data: exposure, age, outcome
X = np.array([[1, 25], [1, 30], [1, 35], [0, 40], [0, 45], [0, 50]])
y = np.array([1, 1, 0, 0, 0, 1])

result = logistic_regression(
    X, y,
    variable_names=['exposed', 'age'],
    add_intercept=True
)

print(result.summary())

# Extract odds ratios
for i, var in enumerate(result.variable_names):
    print(f"{var}: OR={result.odds_ratios[i]:.2f} "
          f"(95% CI: {result.ci_lower[i]:.2f}-{result.ci_upper[i]:.2f})")
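
The odds ratios and interval bounds printed above are conventionally obtained by exponentiating a coefficient and its Wald limits. The sketch below shows that relationship under the assumption of a Wald interval; `odds_ratio_with_ci` is a hypothetical helper, the coefficient and standard error are made up, and whether episia computes its bounds this way is not stated here.

```python
import math

def odds_ratio_with_ci(coefficient, std_error, z=1.959964):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    lower = math.exp(coefficient - z * std_error)
    upper = math.exp(coefficient + z * std_error)
    return math.exp(coefficient), lower, upper

# Made-up estimate: log-odds coefficient ln(2) with standard error 0.25
or_, lower, upper = odds_ratio_with_ci(math.log(2.0), 0.25)
print(f"OR={or_:.2f} (95% CI: {lower:.2f}-{upper:.2f})")
```

The interval is symmetric on the log scale, so it is asymmetric around the odds ratio itself.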

Poisson regression:

import numpy as np
from episia.stats.regression import poisson_regression

# Count data with an offset for person-time at risk
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([5, 12, 3, 8])
offset = np.log([100, 100, 100, 100])  # log(person-time)

result = poisson_regression(
    X, y, offset=offset,
    variable_names=['exposed', 'age']
)
print(result.summary())

Likelihood ratio test:

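from episia.stats.regression import likelihood_ratio_test

# full_model and reduced_model are previously fitted models (for example,
# results from logistic_regression); the reduced model must be nested in the full one
lrt = likelihood_ratio_test(full_model, reduced_model)
print(f"LR test: χ²={lrt['lr_statistic']:.3f}, p={lrt['p_value']:.4f}")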

Multicollinearity check:

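from episia.stats.regression import calculate_vif

# X is the design matrix of predictors, without an intercept column
vif = calculate_vif(X)
for var, value in vif.items():
    print(f"{var}: VIF={value:.2f}")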