GUNH003/logistic_regression_model

logistic_regression_model

Logistic Regression
In general linear regression, the values of the dependent variable $\boldsymbol{y}$ are modeled as a linear combination of the coefficients and the observed values of the independent variables, and there are normally no restrictions on the values that $\boldsymbol{y}$ can take. Here $\boldsymbol{y}$ is an $n \times 1$ vector, $\boldsymbol{X}$ is an $n \times k$ matrix, $\boldsymbol{\beta}$ is a $k \times 1$ vector, and $\boldsymbol{e}$ is an $n \times 1$ vector. $$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e}$$ However, if the model needs to predict a probability, the value of the dependent variable has to lie between $0$ and $1$, by the definition of probability. To achieve this, logistic regression applies the sigmoid function as a non-linear transformation of the result of the linear combination. $$\sigma(z) = \frac{1}{1 + e^{-z}}$$ Replace $z$ with the linear predictor $\boldsymbol{X}\boldsymbol{\beta}$, applied element-wise: $$\sigma(\boldsymbol{X}\boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{X}\boldsymbol{\beta}}}$$ Let $p_i$ denote the probability of success for the $\textit{i}$ th observation, then: $$p_i = \frac{1}{1 + e^{-\boldsymbol{x_i}^{T}\boldsymbol{\beta}}}$$
It is also worth noting that the output of the linear combination in logistic regression equals the log of the odds, since: $$\log\left(\frac{p_i}{1 - p_i}\right) = \boldsymbol{x_i}^{T}\boldsymbol{\beta}$$ where $\boldsymbol{x_i}$ is the vector of observed values for the $\textit{i}$ th observation. Consequently, the difference in this output between two observations equals the log of their odds ratio.

The odds and odds ratio are defined as: $$\text{odds}_i = \frac{p_i}{1 - p_i}$$ $$\text{odds ratio}_{ij} = \frac{\frac{p_i}{1 - p_i}}{\frac{p_j}{1 - p_j}}$$ This property of logistic regression affects how a model coefficient is interpreted compared with the general linear model: a one-unit increase in the $\textit{j}$ th independent variable, holding the others fixed, changes $\log(\text{odds})$ by $\beta_j$ and therefore multiplies the odds by $e^{\beta_j}$.
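
The relationship between the linear combination, the log-odds, and the odds ratio can be checked numerically. The following is a minimal sketch in Python with NumPy; the coefficient values and the two observations are made up for illustration and do not come from this project.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued linear predictors to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients; the first entry plays the role of an intercept.
beta = np.array([-1.0, 0.5])

# Two hypothetical observations that differ by one unit in the predictor.
x_i = np.array([1.0, 3.0])
x_j = np.array([1.0, 2.0])

p_i = sigmoid(x_i @ beta)
p_j = sigmoid(x_j @ beta)

# The log-odds recovers the linear combination.
log_odds_i = np.log(p_i / (1.0 - p_i))

# The odds ratio for a one-unit difference equals exp(beta_j).
odds_ratio_ij = (p_i / (1.0 - p_i)) / (p_j / (1.0 - p_j))

print(log_odds_i, x_i @ beta)
print(odds_ratio_ij, np.exp(beta[1]))
```

Each printed pair should agree up to floating-point error, which is the interpretation of the coefficient described above.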

Maximum Likelihood Estimation
The likelihood function for the logistic model is: $$\textit{l}(\boldsymbol{\beta}) = \prod_{i=1}^{n}p_i^{y_i}(1-p_i)^{1-y_i}$$ The log-likelihood is: $$\textit{ll}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\left[{y_i}\log(p_i) + {(1-y_i)}\log(1-p_i)\right]$$ The gradient can be obtained by taking the partial derivative with respect to $\boldsymbol{\beta}$ using the chain rule: $$\nabla_{\boldsymbol{\beta}}\textit{ll}(\boldsymbol{\beta}) = \boldsymbol{X}^{T}(\boldsymbol{y} - \boldsymbol{p})$$ Because $\boldsymbol{p}$ depends on $\boldsymbol{\beta}$ through the exponential in the sigmoid function, setting this gradient to zero has no closed-form solution. The gradient ascent method can instead be used to obtain the $\boldsymbol{\beta}$ that maximizes the log-likelihood function (where $\nabla_{\boldsymbol{\beta}}\textit{ll}(\boldsymbol{\beta}) = \boldsymbol{0}$).
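
As an illustration of gradient ascent on the log-likelihood, here is a minimal sketch in Python with NumPy. The function names, the learning rate, the iteration count, and the synthetic data are assumptions made for this example and are not taken from the project code. The gradient is divided by $n$ only to make the step size less sensitive to the sample size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def fit_mle(X, y, learning_rate=0.1, n_iter=5000):
    """Gradient ascent on the log-likelihood (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)            # X^T (y - p)
        beta += learning_rate * gradient / len(y)
    return beta

# Synthetic data generated from assumed "true" coefficients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.uniform(size=n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_mle(X, y)
print(beta_hat)
print(np.linalg.norm(X.T @ (y - sigmoid(X @ beta_hat))))  # gradient norm, near 0
```

A fixed iteration count is the simplest stopping rule; checking the gradient norm against a tolerance would be a reasonable alternative.
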
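The second partial derivative of the log-likelihood function with respect to $\boldsymbol{\beta}$ gives the Hessian matrix, which can be used to derive the estimated covariance matrix of the parameters $\boldsymbol{\beta}$. The Hessian matrix can be expressed as: $$\nabla_{\boldsymbol{\beta}}^{2}\textit{ll}(\boldsymbol{\beta}) = \boldsymbol{H} = -\boldsymbol{X}^{T}\boldsymbol{W}\boldsymbol{X}$$ where $\boldsymbol{W}$ is the $n \times n$ diagonal matrix with entries $p_i(1 - p_i)$. The estimated covariance matrix is the inverse of the negative Hessian evaluated at the estimate $\hat{\boldsymbol{\beta}}$: $$\text{Cov}(\hat{\boldsymbol{\beta}}) = (-\boldsymbol{H})^{-1} = (\boldsymbol{X}^{T}\boldsymbol{W}\boldsymbol{X})^{-1}$$ The Wald test can then be used to test the significance of each coefficient by comparing the estimate divided by its standard error (the square root of the corresponding diagonal entry of the covariance matrix) with the standard normal distribution.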
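
The covariance matrix and the Wald statistics can be computed directly from these expressions. The sketch below, in Python with NumPy, is an illustration under assumptions: `wald_test` is a hypothetical helper name, the data are synthetic, and the coefficients are fitted with a plain gradient ascent loop. The two-sided p-value uses the identity $2(1 - \Phi(|z|)) = \text{erfc}(|z| / \sqrt{2})$.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wald_test(X, beta_hat):
    """Covariance, standard errors, z statistics and two-sided p-values."""
    p = sigmoid(X @ beta_hat)
    w = p * (1.0 - p)                       # diagonal entries of W
    information = X.T @ (w[:, None] * X)    # X^T W X, the negative Hessian
    cov = np.linalg.inv(information)
    std_err = np.sqrt(np.diag(cov))
    z = beta_hat / std_err
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return cov, std_err, z, p_values

# Synthetic data from assumed coefficients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-0.5, 1.5]))).astype(float)

# Fit by gradient ascent on the log-likelihood.
beta_hat = np.zeros(X.shape[1])
for _ in range(5000):
    beta_hat += 0.1 * X.T @ (y - sigmoid(X @ beta_hat)) / n

cov, std_err, z, p_values = wald_test(X, beta_hat)
print(std_err)
print(p_values)
```
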

Maximum A Posteriori Estimation
The MLE approach is a frequentist approach that does not take into account any prior knowledge about the parameters. Maximum A Posteriori (MAP) estimation, on the other hand, incorporates prior knowledge about the parameters through the prior distribution $P(\beta)$. Using Bayes' theorem, the posterior distribution of the regression coefficients, $P(\beta|data)$, can be expressed as follows: $$P(\beta | data) = \frac{P(data | \beta) \cdot P(\beta)}{P(data)}$$ The prior distribution $P(\beta)$ represents the belief about the coefficients before observing the data. When no specific knowledge about the parameters exists, a non-informative prior, such as a normal distribution with zero mean and large variance, can be assumed, so that the posterior distribution is influenced mainly by the likelihood $P(data | \beta)$. The posterior distribution $P(\beta | data)$ represents the updated belief about the coefficients after observing the data. The MAP estimate is the value of $\boldsymbol{\beta}$ that maximizes this posterior distribution. Since $P(data)$ does not depend on $\boldsymbol{\beta}$, the posterior distribution is proportional to the product of $P(data | \beta)$ and $P(\beta)$, so the MAP estimate can also be acquired by maximizing the right-hand side of the expression: $$P(\beta | data) \propto P(data | \beta) \cdot P(\beta)$$
The following assumptions are made in this project for MAP estimation:

  • The prior distribution of the coefficients $\boldsymbol{\beta}$ is a non-informative multivariate normal distribution (zero mean and large variance).
  • The coefficients $\boldsymbol{\beta}$ are independent under the prior (diagonal covariance matrix).

The same gradient ascent technique can be applied to MAP estimation to acquire the $\boldsymbol{\beta}$ that maximizes $P(data | \beta) \cdot P(\beta)$, which is equivalent to maximizing the sum of the log-likelihood and the log of the prior density. Based on the previous assumptions, the log of the multivariate normal prior density is, up to an additive constant: $$\log P(\boldsymbol{\beta}) = -\frac{1}{2}\log|\boldsymbol{\Sigma}| -\frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu})$$
where $\boldsymbol{\mu} = \boldsymbol{0}$ and $\boldsymbol{\Sigma} = \sigma^{2}\boldsymbol{I}$. The gradient of the log-prior is: $$\nabla_{\boldsymbol{\beta}}\log P(\boldsymbol{\beta}) = -\boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu})$$ The gradient used to update $\boldsymbol{\beta}$ with respect to the total log-posterior in MAP estimation is then the sum of the log-likelihood gradient and the log-prior gradient: $$\nabla_{\boldsymbol{\beta}}\textit{ll}_{total}(\boldsymbol{\beta}) = \boldsymbol{X}^{T}(\boldsymbol{y} - \boldsymbol{p}) - \boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu})$$
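
The MAP update differs from the MLE update only by the prior term in the gradient. The following is a minimal sketch in Python with NumPy, assuming a zero-mean prior with covariance $\sigma^{2}\boldsymbol{I}$. The name `fit_map`, the prior variances, the learning rate, and the synthetic data are illustrative assumptions, not values from the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient(beta, X, y, prior_variance):
    """X^T (y - p) - Sigma^{-1} (beta - mu) with mu = 0, Sigma = variance * I."""
    return X.T @ (y - sigmoid(X @ beta)) - beta / prior_variance

def fit_map(X, y, prior_variance=100.0, learning_rate=0.1, n_iter=5000):
    """Gradient ascent on log-likelihood plus log-prior (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        beta += learning_rate * map_gradient(beta, X, y, prior_variance) / len(y)
    return beta

# Synthetic data from assumed coefficients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.uniform(size=n) < sigmoid(X @ np.array([-0.5, 1.5]))).astype(float)

beta_wide = fit_map(X, y, prior_variance=100.0)   # large variance, weak prior
beta_tight = fit_map(X, y, prior_variance=0.1)    # small variance, strong prior
print(beta_wide)
print(beta_tight)
```

With a large prior variance the estimate stays close to the MLE, while a small prior variance pulls the coefficients toward the prior mean of zero.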

Posterior Distribution Estimation
Markov chain Monte Carlo (MCMC) algorithms use a Markov chain to produce random samples (e.g. samples of the logistic regression parameters, since in the Bayesian setting the parameters themselves are treated as random variables) and then use those samples for Monte Carlo estimation, which relies on the law of large numbers. A Markov chain models the probabilities of transitions between states, and its stationary distribution is a probability distribution over the states that is left unchanged by the transitions. Running a Markov chain after it has reached the stationary distribution can then be considered a process of random sampling from the stationary distribution, although consecutive samples are correlated.
One sufficient but not necessary condition for a distribution $P$ to be a stationary distribution of the chain is the “detailed balance” condition, which states that, under $P$, the probability of being in state $x$ and transitioning to state $x'$ is equal to the probability of being in state $x'$ and transitioning to state $x$. $$T(x'|x)P(x) = T(x|x')P(x')$$
where $T(x' | x)$ is the probability of transitioning from state $x$ to state $x'$, $P(x)$ is the probability of being at state $x$, and vice versa. The Metropolis algorithm uses this property to construct a probabilistic way to propose the next state of the Markov chain (with a specified probability of acceptance), so that the stationary distribution of the Markov chain corresponds to the posterior distribution. In practice, the Metropolis algorithm deliberately separates $T(x' | x)$ as follows: $$T(x' | x) = Q(x' | x)A(x' | x)$$ where $Q(x' | x)$ is the probability of proposing the next state $x'$ given the current state $x$, and $A(x' | x)$ is the probability of accepting this candidate state $x'$ given the current state $x$. The detailed balance condition can then be rearranged into the following expression: $$\frac{A(x' | x)}{A(x | x')} = \frac{P(x')Q(x | x')}{P(x)Q(x' | x)}$$ $A(x' | x)$ is designed so that the detailed balance condition is maintained, which makes the target distribution a stationary distribution of the Markov chain. $$A(x' | x) = \min\left(1, \frac{P(x')Q(x | x')}{P(x)Q(x' | x)}\right)$$ In this project, the arbitrarily chosen proposal distribution $Q(x' | x)$ is a multivariate normal distribution with mean $x$ and a covariance matrix with $0.001$ on the diagonal and $0$ for all the off-diagonal elements. The role of this proposal distribution is to propose the next state for the Markov chain, while $A(x' | x)$ determines whether the proposed state is accepted. As the proposal distribution is a symmetric multivariate normal distribution, $Q(x | x') = Q(x' | x)$ and $A(x' | x)$ can be further simplified as follows: $$A(x' | x) = \min\left(1, \frac{P(x')}{P(x)}\right)$$ where $P(x')$ is the target density evaluated at the proposed parameter and $P(x)$ is the target density evaluated at the current parameter. The target only needs to be known up to its normalizing constant, because the constant cancels in the ratio; under a flat prior the ratio reduces to a ratio of likelihoods. The steps of the algorithm are as follows:

  1. Initialize an empty list of samples.
  2. Define the initial state $\boldsymbol{\beta}$ and add it to the samples.
  3. If the sample size has been reached, stop. Otherwise, propose the next candidate $\boldsymbol{\beta'}$ by drawing a random sample from the proposal distribution $MVN(\boldsymbol{\beta}, \sigma^{2}\boldsymbol{I})$.
  4. Calculate the acceptance probability $A(\boldsymbol{\beta'} | \boldsymbol{\beta}) = \min\left(1,\frac{\textit{l}(\boldsymbol{\beta'})}{\textit{l}(\boldsymbol{\beta})}\right)$, which in terms of the log-likelihood is $\min(1, \exp(\textit{ll}(\boldsymbol{\beta'}) - \textit{ll}(\boldsymbol{\beta})))$; when the prior is not flat, each likelihood is multiplied by the corresponding prior density. Let $u$ be a random sample from the uniform distribution $U(0,1)$. If $u \le A(\boldsymbol{\beta'} | \boldsymbol{\beta})$, accept $\boldsymbol{\beta'}$, add $\boldsymbol{\beta'}$ to the samples list, set $\boldsymbol{\beta} = \boldsymbol{\beta'}$, and go to step 3 to propose the next candidate from $MVN(\boldsymbol{\beta'}, \sigma^{2}\boldsymbol{I})$; if $u > A(\boldsymbol{\beta'} | \boldsymbol{\beta})$, reject $\boldsymbol{\beta'}$, add $\boldsymbol{\beta}$ to the samples list again, and go to step 3 to propose the next candidate from $MVN(\boldsymbol{\beta}, \sigma^{2}\boldsymbol{I})$.

After running the Metropolis algorithm, the initial samples drawn before the chain reached the stationary distribution (the burn-in period) are discarded. The rest of the samples are then used to estimate the expectation and variance of the stationary distribution, which is the posterior distribution in the Bayesian regression.
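
The steps above can be sketched in Python with NumPy as follows. This is an illustration under assumptions rather than the project's implementation: the function names, the prior variance of 100, the burn-in length, and the synthetic data are all made up for the example, while the proposal variance of 0.001 follows the text. The acceptance test is carried out on the log scale, comparing $\log u$ with the difference of log-posteriors, which is equivalent to comparing $u$ with the ratio of densities and avoids numerical underflow.

```python
import numpy as np

def log_posterior(beta, X, y, prior_variance):
    """Unnormalized log-posterior: log-likelihood plus zero-mean normal log-prior."""
    z = X @ beta
    # sum of y*log(p) + (1-y)*log(1-p), rewritten as y*z - log(1 + e^z)
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * (beta @ beta) / prior_variance
    return log_lik + log_prior

def metropolis(X, y, n_samples=5000, proposal_variance=0.001,
               prior_variance=100.0, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])                  # step 2: initial state
    current = log_posterior(beta, X, y, prior_variance)
    samples = [beta.copy()]                      # steps 1 and 2
    step = np.sqrt(proposal_variance)
    while len(samples) < n_samples:              # step 3: stop at sample size
        candidate = beta + step * rng.normal(size=beta.shape)
        cand = log_posterior(candidate, X, y, prior_variance)
        # step 4: accept with probability min(1, P(candidate) / P(beta))
        if np.log(rng.uniform()) <= cand - current:
            beta, current = candidate, cand
        samples.append(beta.copy())              # on rejection, beta is repeated
    return np.array(samples)

# Synthetic data from assumed coefficients.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.5]))))
y = (rng.uniform(size=n) < p_true).astype(float)

samples = metropolis(X, y)
burn_in = 1000                                   # assumed burn-in length
posterior_mean = samples[burn_in:].mean(axis=0)
posterior_var = samples[burn_in:].var(axis=0)
print(posterior_mean)
print(posterior_var)
```

The burn-in length here is a guess; in practice it is usually chosen by inspecting trace plots of the sampled coefficients.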
