Logistic Regression: The Definitive Guide (2021)¶

This guide will cover

What is logistic regression
How to fit logistic regression models
How to apply the latest research in hyperparameter optimization to prevent overfitting
How to use logistic regression with more than two classes

Contents

Introduction
- Why Not Linear Regression?
- The Logistic Function
Fitting Logistic Regression
Overfitting and Regularization
Handling More Than Two Classes
- Example: Iris Species

Introduction ¶

Logistic regression is a supervised learning algorithm for classification tasks.

In classification tasks, we’re given a sample training data set consisting of feature vectors

\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_n, \quad \boldsymbol{x}_i \in \mathbb{R}^p,

and categorical target values

y_1, y_2, \dots, y_n, \quad y_i \in \left\{0, 1, \dots, K-1\right\}.

We seek to fit a model so that given new out-of-sample feature vectors $\boldsymbol{x}_1', \dots, \boldsymbol{x}_m'$ we can predict their associated target values.

Real-world examples of classification tasks include

The Wisconsin Breast Cancer Data Set where the task is to predict whether a mass of skin is malignant or not given physical properties of the mass such as thickness and shape.
The Iris Data Set where the task is to predict what species of iris a flower is given its septal and petal length and width.

What form could a categorical prediction from a model take? The simplest form is for a model to predict which categorical value is most likely; but more useful, is for a model to predict a probability distribution, where for each category $C_k$ the model assigns a probability $p_k$ such that

0 \leq p_k \leq 1\quad \textrm{and} \quad \sum_k p_k = 1.

Let us consider how we might build a simple model that provides probabilistic predictions for the case of binary classification.

With binary classification, we need only predict the probability of one of the categories, $p_1$ , since $p_0 = 1 - p_1$ . For simplicity, we will use $-1$ and $+1$ to denote the two possible target values so that

y_i \in \left\{-1, +1\right\}.

Why Not Linear Regression?¶

First, let’s address: Why can’t we use linear regression?

Given a feature vector $\boldsymbol{x}$ , a linear regression model with weights $\boldsymbol{w}$ predicts

p_1 = \boldsymbol{x}^\top \boldsymbol{w}.

But predictions from this linear regression model don’t meet the requirements of a probability distribution. If the feature vector $\boldsymbol{x}$ has large enough magnitude, then $p_1$ will be outside of the zero to one range required by a probability distribution.

How can we fix this?

The Logistic Function ¶

Suppose $g\colon \mathbb{R} \to \mathbb{R}$ is a function where

g(u) \leq g(u'), \quad \textrm{if}\; u \leq u',

and

0 \leq g(u) \leq 1, \quad \textrm{for all}\; u\in\mathbb{R}.

Then given $\boldsymbol{w}\in\mathbb{R}^p$ , a model that predicts

p_1 = g(\boldsymbol{x}^\top \boldsymbol{w})

meets the requirements of a probability distribution.

Logistic regression uses the function

g(u) = \frac{1}{1 + \exp(-u)}

called the logistic function.

Note

There are other possible functions that are suitable for binary classification. For example, the probit function

g(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^u \exp\left(-\frac{1}{2} Z^2\right)dZ

is also sometimes used.

Fitting Logistic Regression ¶

Now that we’ve described what logistic regression is, how do we go about fitting a model to the training data?

Given weights $\boldsymbol{w}$ , we can compute the negative log likelihood of the training data, called a cost function (or sometimes loss function), as

J(\boldsymbol{w}) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right),

where

\begin{align*} \ell_i(u) &= \log\left[1 + \exp(-y_i u)\right]. \end{align*}

Let’s see how we can find weights that minimize this cost function; or, equivalently, maximize the log likelihood of the training data.

Caution

We’ll see later why maximizing the log likehood of the training data is a bad idea.

We’re going to find optimal weights by using a trust-region optimizer.

Trust-region Optimization ¶

Trust-region optimizers are a powerful class of iterative, second-order optimization algorithms.

Given a twice-differentiable objective function, $f\colon \mathbb{R}^p \to \mathbb{R}$ , a trust-region optimizer produces a sequence $\{\boldsymbol{x}_k\}$ that converges to an optimum of the objective. The kth iteration of the sequence is generated by updating the previous iteration with a solution to the subproblem

\begin{align*} \boldsymbol{x}_k &= \boldsymbol{x}_{k-1} + \hat{\boldsymbol{s}} \quad\textrm{and} \\ \hat{\boldsymbol{s}} &= \operatorname*{argmin}_{\boldsymbol{s}} \left\{ f\left(\boldsymbol{x}_{k-1}\right) + \nabla f\left(\boldsymbol{x}_{k-1}\right)^\top \boldsymbol{s} + \frac{1}{2} \boldsymbol{s}^\top \nabla^2 f\left(\boldsymbol{x}_{k-1}\right) \boldsymbol{s} \right\} \\ &\phantom{=}\quad\mathrm{s.t.} \quad \| \boldsymbol{s} \| \le \delta_k. \end{align*}

The subproblem minimizes the second-order approximation of $f$ at $\boldsymbol{x}_{k-1}$ within the neighborhood $\| \boldsymbol{s} \| \le \delta_k$ , called the trust-region. Using the trust-region, the optimizer restricts the second-order approximation to areas where it models $f$ well, and it continually expands or shrinks the trust-region to adapt to the accuracy of the local quadratic approximation as the algorithm progresses.

Trust-region optimizers are both efficient and robust, being able to solve convex and non-convex optimization problems.

Cost Function Gradient and Hessian ¶

Let’s start with the gradient. Differentiating the cost function with respect to one of the weights, we get

\begin{align*} \frac{\partial}{\partial w_s} J(\boldsymbol{w}) &= \sum_{i=1}^n \frac{\partial}{\partial w_s} \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) \\ &= \sum_{i=1}^n x_{is} \dot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right), \end{align*}

where

\dot{\ell}_i(u) = \frac{-y_i}{1 + \exp(y_i u)}.

Note

We use the notation $\dot{f}(u)$ to denote the first derivative of a single variable function:

\dot{f}(u) = \frac{d}{du} f(u).

Similarly, $\ddot{f}(u)$ denotes the second derivative.

Now, for the hessian of $J$ , we derive

\begin{align*} \frac{\partial^2}{\partial w_s \partial w_t} J(\boldsymbol{w}) &= \frac{\partial}{\partial w_s}\left( \sum_{i=1}^n x_{it} \dot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) \right) \\ &= \sum_{i=1}^n x_{is} x_{it} \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right), \end{align*}

where

\begin{align*} \ddot{\ell}_i(u) &= q p, \\ p &= \frac{1}{1 + \exp(-u)}, \quad\textrm{and} \\ q &= 1 - p. \end{align*}

Example: Breast Cancer Diagonostic (w/o Regularization)¶

Let’s see how this works on an example data set.

Note

We use the bbai package to fit logistic regression models. See Getting Started for instructions on how to install it.

We’ll look at the Wisconsin Breast Cancer Data Set where the task is to determine whether a mass of skin is malignant or not from physical properties of the mass.

The data set is bundled with sklearn and we start by loading and normalizing the data.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

Next, we fit a logistic regression model to the data and print out the weights.

import numpy as np
import bbai.glm

model = bbai.glm.LogisticRegression(
    C = np.inf, # This specifies to use no regularization.
                # We'll explain later what this means.
)
model.fit(X, y)

print(model.intercept_[0], "intercept")
for wi, name in zip(model.coef_[0], data['feature_names']):
  print(wi, name)

Overfitting and Regularization ¶

Question: What’s the probability of events A, B, and C if we know the probability of event B is zero?

It’s zero – and this leads us to the problem with fitting models to maximize the likelihood of the training data.

If the training data set lacks less likely feature-category combinations, our model may overconfidently predict a category probability to be close to zero when it isn’t. Then when the combination occurs in out-of-sample data, our model erroneously predicts the data to have near zero probability.

This problem is known as overfitting.

One way to control overfitting is to use a technique called regularization. With regularization, we change the cost function so that instead of minimizing the negative log likelihood of the training data, we minimize

J(\boldsymbol{w}) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) + \frac{1}{2C} \| \boldsymbol{w} \|^2.

$C$ is called a hyperparameter, and it acts as a knob that controls the complexity of the model. When the knob is dialed up to full regularization strength (i.e. $C$ is close to zero), it forces the model to take its simplest form, $\boldsymbol{w} = 0.$ On the other hand, when $C$ is at its weakest (i.e. $C$ is large), it’s as if regularization isn’t there.

By setting $C$ to a good value, we can prevent overfitting while still keeping the predictive strength of our model.

Cross-validation ¶

How do we find a good value of $C$ ?

This is the basic idea: We construct a function $f(C)$ that estimates the out-of-sample prediction error for a given $C$ . Then we select $C$ so as to minimize $f$ . This process is called hyperparameter optimization.

One way to estimate out-of-sample prediction error is with a technique called cross-validation. Cross-validation partitions the training data set into folds. For each fold, we form two sets of data: one consisting of the training data points outside of the fold and one consisting of the training data points in the fold. We fit a model to the first set and measure the prediction error on the second and then average all the fold prediction errors to form the out-of-sample prediction error estimate.

Cross-validation comes in different forms. K-fold cross-validation partitions the training data set into $k$ different folds. The larger the value of $k$ , the more accurate the estimate; and largest value of $k$ we can use is $k = n-1$ , called leave-one-out cross-validation.

While leave-one-out cross-validation provides the best estimate for out-of-sample prediction error, it is expensive to compute. It requires that we train a different model for each data point in the training data set. Fortunately, for logistic regression, there exists a different form of cross-validation that is nearly as accurate as leave-one-out, but much more efficient to compute.

Approximate Leave-one-out Cross-validation ¶

Enter Approximate Leave-one-out Cross-validation (ALO). It has much of the benefits of leave-one-out cross-validation, while being efficient to compute and optimize. Here’s how it works: Given $C$ we first fit logistic regression by minimizing the regularized cost function of the whole training data set. Let $\hat{\boldsymbol{w}}$ denote the optimal weights

\hat{\boldsymbol{w}} = \operatorname*{argmin}_{\boldsymbol{w}} J\left(\boldsymbol{w}\right),

and $\boldsymbol{H}$ denote the hessian of the cost function at the optimum

\boldsymbol{H} = \nabla^2 J(\hat{\boldsymbol{w}}).

Note: The gradient will be zero.

Now, suppose we modify the cost function by removing the ith training data point. We call this new cost function $J_{/i}$

J_{/i}\left(\boldsymbol{w}\right) = J\left(\boldsymbol{w}\right) - \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right),

and we can compute its gradient and hessian at $\hat{\boldsymbol{w}}$ :

\begin{align*} \nabla J_{/i}\left(\hat{\boldsymbol{w}}\right) &= -\dot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i, \\ \nabla^2 J_{/i}\left(\hat{\boldsymbol{w}}\right) &= \boldsymbol{H} - \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top. \end{align*}

Using the gradient and hessian of $J_{/i},$ we can take a single step of Newton’s optimization algorithm and get a close approximation for what the weights would be without the ith data point

\tilde{\boldsymbol{w}}_{/i} = \hat{\boldsymbol{w}} - \left[ \boldsymbol{H} - \ddot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \boldsymbol{x}_i^\top \right]^{-1} \left( -\dot{\ell}_i\left(\boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\right) \boldsymbol{x}_i \right),

and then estimate the leave-one-out cross-validation with

\frac{1}{n}\sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \tilde{\boldsymbol{w}}_{/i}\right).

Note

Here’s a quick refresher on Newton’s method.

Let $f(\boldsymbol{x})$ denote the function we’re minimizing. The second-order Taylor series approximation cenerted at point $\boldsymbol{x}_0$ is

f(\boldsymbol{x}_0 + \boldsymbol{h}) \approx f(\boldsymbol{x}_0) + \nabla f(\boldsymbol{x}_0)^\top \boldsymbol{h} + \frac{1}{2} \boldsymbol{h}^\top \nabla^2 f(\boldsymbol{x}_0) \boldsymbol{h}.

If $\nabla^2 f(\boldsymbol{x}_0)$ is positive-definite, then the approximation is minimized when the gradient with respect to $\boldsymbol{h}$ is zero, or equivalently

\boldsymbol{h} = -\left(\nabla^2 f(\boldsymbol{x}_0)\right)^{-1} \nabla f(\boldsymbol{x}_o).

Using this formula we avoid having to fit a model for each data point, but we still need to invert a $p$ -by- $p$ matrix $n$ times. We can achieve an even more efficient formula that only requires a single matrix inversion by applying the matrix inversion lemma to get

\frac{1}{n}\sum_{i=1}^n \ell_i\left( u_i + \frac{\dot{\ell_i}(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i} \right),

where

u_i = \boldsymbol{x}_i^\top \hat{\boldsymbol{w}}\quad\textrm{and}\quad h_i = \boldsymbol{x}_i^\top \boldsymbol{H}^{-1} \boldsymbol{x}_i.

Note

If $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is an invertible square matrix, $\boldsymbol{u}, \boldsymbol{v} \in \mathbb{R}^n$ are vectors, and $\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top$ is invertible, then the matrix inversion lemma says

$\left(\boldsymbol{A} + \boldsymbol{u} \boldsymbol{v}^\top\right)^{-1} = \boldsymbol{A}^{-1} - \frac{\boldsymbol{A}^{-1} \boldsymbol{u} \boldsymbol{v}^\top \boldsymbol{A}^{-1}} {1 + \boldsymbol{v}^\top \boldsymbol{A}^{-1} \boldsymbol{u}}.$
Thus,

$\begin{align*} \boldsymbol{x}_i^\top \tilde{\boldsymbol{w}}_{/i} &= \boldsymbol{x}_i^\top\left( \hat{\boldsymbol{w}} - \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top \right]^{-1} \left(-\dot{\ell}_i(u_i) \boldsymbol{x}_i\right) \right) \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H} - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{x}_i^\top\right]^{-1} \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \boldsymbol{x}_i^\top \left[\boldsymbol{H}^{-1} + \frac{\ddot{\ell}_i(u_i) \boldsymbol{H}^{-1} \boldsymbol{x}_i \boldsymbol{x}_i^\top \boldsymbol{H}^{-1}} {1 - \ddot{\ell}_i(u_i) \boldsymbol{x}_i \boldsymbol{H}^{-1} \boldsymbol{x}_i} \right] \boldsymbol{x}_i \\ &= u_i + \dot{\ell}_i(u_i) \left[ h_i + \frac{\ddot{\ell}_i(u_i) h_i^2}{1 - \ddot{\ell}_i(u_i) h_i}\right] \\ &= u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i}. \end{align*}$

Optimizing Approximate Leave-one-out Cross-validation ¶

But accurately estimating out-of-sample prediction error is just one part of hyperparameter optimization. Now that we have our efficient ALO estimator, how do we find the hyperparameters that minimize the estimated prediction error?

Fortunately, ALO isn’t just efficient to compute. We can also compute exact derivatives of ALO with respect to the hyperparameter, making it extremely efficient to dial in to the parameters that minimizes estimated prediction error.

First, we make a minor change to the way the cost function is parameterized. Define

J_\alpha\left(\boldsymbol{w}\right) = \sum_{i=1}^n \ell_i\left(\boldsymbol{x}_i^\top \boldsymbol{w}\right) + \alpha^2 \| \boldsymbol{w} \|^2,

let $\hat{\boldsymbol{w}}_\alpha$ denote the weights that minimize $J_\alpha$

\hat{\boldsymbol{w}}_\alpha = \operatorname*{argmin}_{\boldsymbol{w}} J_\alpha\left(\boldsymbol{w}\right),

and define

f(\alpha) = \frac{1}{n} \sum_{i=1}^n \ell_i\left( u_i + \frac{\dot{\ell}_i(u_i) h_i}{1 - \ddot{\ell}_i(u_i) h_i} \right).

$f$ is our ALO measurement and what we need to minimize.

To optimize $f,$ we can again apply trust-region optimization using the first and second derivatives

\frac{d f(\alpha)}{d \alpha} \quad\textrm{and}\quad\frac{d^2 f(\alpha)}{d \alpha^2}.

Note

While efficient to compute, the derivative formulas for $f$ are quite involved, so we won’t derive them in this guide. But, if you want all the details, check out the paper

Optimizing Approximate Leave-one-out Cross-validation to Tune Hyperparameters.

Example: Breast Cancer Diagonostic (with Regularization)¶

Let’s now revisit the Breast Cancer Data Set and use regularization.

We again start by loading and normalizing the data set.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

While a lot happens behind the scenes, from the user point of view fitting a model to minimize ALO is easy. We’ll fit a model and print out the value of $C$ found.

import bbai.glm

model = bbai.glm.LogisticRegression()
# Note: bbai.glm.LogisticRegression defaults to find the C that
# minimizes ALO if none is specified.

model.fit(X, y)

print("C =", model.C_)

This prints out

C = 0.6655139682151275.

The Breast Cancer Data Set is small enough that we can compute the leave-one-out cross-validation across a range of values of $C$ by brute force. Let’s plot it out, compare it to ALO, and see where the value we found for $C$ lies along the curve.

First, we write the function to compute leave-one-out cross-validation by brute force.

import numpy as np

def compute_loocv(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    n = len(y)
    loo_likelihoods = []
    for i in range(n):
        train_indexes = [i_p for i_p in range(n) if i_p != i]
        test_indexes = [i]
        X_train, X_test = X[train_indexes], X[test_indexes]
        y_train, y_test = y[train_indexes], y[test_indexes]
        model.fit(X_train, y_train)
        pred = model.predict_proba(X_test)[0]
        loo_likelihoods.append(pred[y_test[0]])
    return sum(np.log(loo_likelihoods))

Next, we write the function to compute ALO using the efficient formula.

import scipy

def fit_logistic_regression(X, y, C):
    model = bbai.glm.LogisticRegression(C=C)
    model.fit(X, y)
    return np.array(list(model.coef_[0]) + list(model.intercept_))

def compute_hessian(p_vector, X, alpha):
    n, k = X.shape
    a_vector = np.sqrt((1 - p_vector)*p_vector)
    R = scipy.linalg.qr(a_vector.reshape((n, 1))*X, mode='r')[0]
    H = np.dot(R.T, R)
    for i in range(k-1):
        H[i, i] += alpha
    return H

def compute_alo(X, y, C):
    alpha = 1.0 / C
    w = fit_logistic_regression(X, y, C)
    X = np.hstack((X, np.ones((X.shape[0], 1))))
    n = X.shape[0]
    y = 2*y - 1
    u_vector = np.dot(X, w)
    p_vector = scipy.special.expit(u_vector*y)
    H = compute_hessian(p_vector, X, alpha)
    L = np.linalg.cholesky(H)
    T = scipy.linalg.solve_triangular(L, X.T, lower=True)
    h_vector = np.array([np.dot(ti, ti) for pi, ti in zip(p_vector, T.T)])
    loo_u_vector = u_vector - \
        y * (1 - p_vector)*h_vector / (1 - p_vector*(1 - p_vector)*h_vector)
    loo_likelihoods = scipy.special.expit(y*loo_u_vector)
    return sum(np.log(loo_likelihoods))

Now, let’s plot out leave-one-out cross-validation and ALO. We’ll draw a vertical line to indicate the optimal value of $C$ we found.

import matplotlib.pyplot as plt

Cs = np.arange(0.1, 2.0, 0.1)
loocvs = [compute_loocv(X, y, C) for C in Cs]
alos = [compute_alo(X, y, C) for C in Cs]

fig, ax = plt.subplots()
ax.plot(Cs, loocvs, label='LOOCV', marker='o')
ax.plot(Cs, alos, label='ALO', marker='x')
ax.axvline(model.C_, color='tab:green', label='C_opt')
ax.set_xlabel('C')
ax.set_ylabel('Log-Likelihood')
ax.set_title("Breast Cancer Dataset")
ax.legend()

This displays

Handling More Than Two Classes ¶

The logistic function we presented only works with two classes. So what do we do when we have more than two classes?

Let $K$ denote the number of classes and let $\boldsymbol{W}$ represent a $K\times p$ matrix of weights. We compute the probability of the kth class for the ith data entry as

\frac{\exp\left(\boldsymbol{e}_k^\top \boldsymbol{W} \boldsymbol{x}_i\right)} {\sum_{k=1}^K \exp\left(\boldsymbol{e}_k^\top \boldsymbol{W}\boldsymbol{x}_i\right)}.

This is called the softmax function, and we call the version of logistic regression that uses it multinomial logistic regression. Put

\ell_i(\boldsymbol{u}) = -\log \left[ \boldsymbol{e}_{y_i}^\top \boldsymbol{u}\right] + \log\left[\sum_{k=1}^K \exp\left(\boldsymbol{e}_k^\top \boldsymbol{u}\right)\right].

The cost function for multinomial logistic regression is

J\left(\boldsymbol{W}\right) = \sum_{i=1}^n \ell_i\left(\boldsymbol{W} \boldsymbol{x}_i\right) + \frac{1}{2 C} \sum_{k=1}^K \sum_{j=1}^p W_{kj}^2.

We compute its gradient as

\frac{\partial}{\partial W_{kj}} J\left(\boldsymbol{W}\right) = \sum_{i=1}^n x_{ij} \left(\frac{\partial \ell_i }{\partial u_k}\right) \left(\boldsymbol{W} \boldsymbol{x}_i\right) + \frac{1}{C} W_{kj},

and we compute the hessian as

\begin{align*} \frac{\partial^2}{\partial W_{kj} \partial W_{k'j'}} J\left(\boldsymbol{W}\right) &= \sum_{i=1}^n x_{ij} x_{ij'} \left(\frac{\partial^2 \ell_i }{\partial u_k \partial u_{k'}}\right) \left(\boldsymbol{W} \boldsymbol{x}_i\right) \\ &\phantom{=}\quad + \frac{1}{C} \delta_{kk'} \delta_{jj'}. \end{align*}

We can similarly adopt ALO for multinomial logistic regression.

Example: Iris Species ¶

We’ll apply multinomial logistic regression to fit a model to the Iris Data Set.

Load and normalize the data set.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris()
X = data['data']
X = StandardScaler().fit_transform(X)
y = data['target']

Fit multinomial logistic regression with $C$ set so as to minimize the error on the approximate leave-one-out cross-validation.

import bbai.glm
model = bbai.glm.LogisticRegression()
model.fit(X, y)
print(model.C_)

Prints

C = 43.70957582240895.