Maximum Likelihood Estimation

Maximum Likelihood Estimation

Purpose

Notes about Maximum Likelihood Cost Function

Cost Function

$\displaystyle J(m, c) = \sum_{i=1}^N(y_i - yp_i)^2 = \sum_{i=1}^N(y_i - (mx_i+c))^2$

Closed Form

  • Take partial derivate and equate to 0

  • $\displaystyle \frac{\partial j}{\partial m} = 0$

  • $\displaystyle \frac{\partial j}{\partial c} = 0$

  • Solve for Two equations, Two unknowns

Iterative Method

  • Assume a starting point for m, c and update them by taking negative gradient values iteratively till the values of m & c don't change much.
  • $

\begin{bmatrix} m \ \ c \end{bmatrix}^{new} $=$ \begin{bmatrix} m \ \ c \end{bmatrix}^{old} $ $-\eta \begin{bmatrix} \frac{\partial J}{\partial m} \ \ \frac{\partial J}{\partial c} \ \end{bmatrix}_{\begin{bmatrix}m\c\end{bmatrix}^{old}} $

Random Variables

A random variable is a variable whose possible values are numerical outcomes of a random phenomenon or event.

There are two types of random variables

  • Discrete like numbers on a dice
  • Continuous like length of hair

Maximum Likelihood Cost Function

Suppose we have n samples from independent and identically distributed observations, coming from an unknown probability density function f(x|theta), where theta is unknown.

So, how to arrive at the log-likelihood function? Here are the steps to sum it up:

  1. Making a Joint Density Function.

    $\displaystyle J(\theta) = P(x_1, x_2, \ldots, x_n; \theta)$

  2. Finding the Likelihood function, x samples are fixed “parameters” and theta will be the function’s variable.

    $\displaystyle L(\theta) = \prod_{i=1}^{N}P(x_1;\theta) \bigg|$ Independence Assumption

  3. For further simplification, we make it Log-Likelihood function, as it's easier to deal with log.

    $\displaystyle l(\theta) = log(L(\theta)) = \sum_{i=1}^NlogP(x_i;\theta) \bigg|$ taking log

We can take log because log is a monotonic function. We observe that the max/min values obtained through log is also applicable to the product form. Moreover, it helps in simplifying the equation.

What is MLE

MLE is basically a technique to find the parameters that maximise the likelihood of observing the data points assuming they were generated through a given distribution.

Finding Parameters of Gaussian Distribution through MLE

  • Parameters for Gaussian Distribution are $\mu$ and $\sigma$
  • PDF for Normal Distribution:

$\displaystyle f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$\displaystyle P(x_i;\theta) = p(x_i; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$\displaystyle l(\theta) = \ln(p(x_i;\theta)) = J(\mu, \sigma) = \sum_{i=1}^N\bigg[\ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2}\bigg]$

Now we have a cost function, we can use iterative or closed form method to solve the equation.

  • $\displaystyle J(\mu, \sigma) = \sum_{i=1}^N\bigg[\ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2}\bigg]$

    $\displaystyle =-\sum_{i=1}^{N}ln(\sigma\sqrt{2\pi}) - \frac{(x_i-\mu)^2}{2\sigma^2}$

  • $\displaystyle \frac{\partial J}{\partial \mu} = 0 \implies \frac{\partial J}{\partial \mu}(-\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}) = 0$

    $\displaystyle = \frac{-1}{2\sigma^2}\sum_{i=1}^{N}2(x_i-\mu)(-1) = 0$

  • $\displaystyle \sum_{i=1}^{N}x_i = \sum_{i=1}^{N}\hat \mu \implies \boxed{\hat \mu = \frac{\sum x_i}{N}}$

Similarly, we can find out the value for standard deviation ($\sigma$)

  • $\boxed{\displaystyle \hat \sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\hat \mu)^2}{N}}}$

Note

Ordinary Least Squares is the Maximum Likelihood for a Linear Model.

Maximum Likelihood Estimate for Discrete Distributions

Bernoulli Distribution

The Bernoulli distribution models events with two possible outcomes: either success or failure.

  • PMF of Bernoulli Distribution
    • p, if x = 1
    • 1-p, if x = 0
    • 0, if x does not belong in $R_x$ (Support)

The likelihood for p based on X is defined as the joint probability distribution of X1, X2, . . . , Xn. Since X1, X2, . . . , Xn are i.i.d random variables, the joint distribution is

$\displaystyle L(p;x) \approx f(x;p) = \prod_{i=1}^{n}f(x_i;p) = \prod_{i=1}^{n}p^x(1-p)^{1-x}$

Taking log,

$\displaystyle \ln f = \sum xi \ln p + (n - \sum x_i)\ln(1-p)$

Differentiating the log of L(p; x) with respect to p and setting the derivative to zero shows that this function achieves a maximum at

$\displaystyle \hat p = \sum_{i=1}^{N}\frac{x_i}{n}$

Logistic Regression

1. Starting with log form of MLE

  • $\displaystyle p(y;p) = \prod_{i=1}^{N}(y_i;\underline p)$
  • $l(p) = \sum[y_i \ln \underline p + (1-y_i)\ln (1-\underline p)]$

2. Sigmoid Function

  • $\displaystyle P(y_{i}|x_{i}) = \sigma(\beta_0 + \beta_1x_{i,1})$

    • Here $x_{i,j}: i \implies i^{th}\text{ sample and } j^{th} \text{feature}$
  • $\displaystyle \sigma(\beta_0+\beta_1x_{i,1})$ can be written as $\sigma \beta^{T}x_{i}$

  • $\displaystyle \sigma(\beta^{T}x_{i}) = \frac{1}{1+e^{-\beta^{T}x_{i}}} = \frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}$

  • $\displaystyle P(1-y_{i}|x_{i}) = 1-\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}} = \frac{1}{1+e^{\beta^{T}x_{i}}}$

  • Note that we found out:

    • $\displaystyle \Large \boxed {P(y_{i}|x_{i}) = \frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}}$

    • $\displaystyle \Large \boxed {P(1-y_{i}|x_{i}) = \frac{1}{1+e^{\beta^{T}x_{i}}}}$

3. Substituting the values in $\log$ expression

  • $\displaystyle l(\beta) = - \sum_{i=1}^{N}\ln(1+e^{\beta^{T}x_{i}}) + \sum_{i=1}^{N}(y_{i}\beta^{T}x_{i})$

4. We can take partial derivate w.r.t $\beta_0 ;& \beta_1$ and solve the equations using gradient descent

  • $\displaystyle \frac{\partial l(\beta)}{\partial \beta_1} = -\sum_{i=1}^{N}\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}x_{i,1} + \sum_{i=1}^{N}y_{i}x_{i,1}$

  • $\large \displaystyle \frac{\partial l(\beta)}{\partial \beta_j} = \sum_{i=1}^{N}(y_{i} - p(y_{i}|x_{i}))x_{i,1}$

    • $y_{i}$ - actual value
    • $p(y_{i}|x_{i})$ - predicted value
    • $y_{i} - p(y_{i}|x_{i})$ - error term
  • After simplification, it will turn into simple matrix multiplication, where we will take transpose of the $\beta$ matrix and multiply it with the error matrix.

  • $\begin{bmatrix}e_1, e_2,\ldots,e_n\end{bmatrix} \begin{bmatrix}x_1\x_2\\ldots\xn\end{bmatrix}

$

  • $\begin{bmatrix}x_1,x_2,\ldots,xn\end{bmatrix}\begin {bmatrix}e_1\ e_2\\ldots\e_n\end{bmatrix}

$

  • $\large \triangledown_{\beta}l(\beta) =

\begin{bmatrix}\frac{\partial l(\beta)}{\partial \beta_0} \ \ \frac{\partial l(\beta)}{\partial \beta_1} \end{bmatrix} = \begin{bmatrix}x_{1,0}, x_{2,0},\ldots,x_{n,0} \ x_{1,1}, x_{2,1},\ldots,x_{n,1} \end{bmatrix} \begin{bmatrix}e_1\e_2\\ldots\e_n\end{bmatrix} =X^{T}e $

  • The matrix multiplication can be carried by np.matmul in python
  • Shape of X will be $N \times (d+1)$ where
    • d is the number of features
    • N is the number of observations
  • Shape of $X^{T}$ (X-Transpose) will be $(d+1) \times N$ where d is the number of features
    • d is the number of features
    • N is the number of observations

Obtaining Parameters using Newton Raphson Method

  • It is a second order method
  • Gradient Descent: $\displaystyle \beta_j^{new} = \beta_j^{old} - \eta \frac{\partial l(\beta)}{\partial \beta_j}$
  • Newton Raphson: $\displaystyle \beta_j^{new} = \beta_j^{old} - H^{-1} \frac{\partial l(\beta)}{\partial \beta_j}$

Elements of Hessian Matrix

  • $\displaystyle \large \frac{\partial l(\beta)}{\partial \beta_k \partial \beta_j}$
  • $\large \displaystyle \frac{\partial l(\beta)}{\partial \beta_j} = \sum_{i=1}^{N}(y_{i} - p(y_{i}|x_{i}))x_{i,1}$
  • $\displaystyle \large \frac{\partial l(\beta)}{\partial \beta_k \partial \beta_j} = \sum_{i=1}^{N}\frac{(1+e^{\beta^{T}x_{i}})e^{\beta^{T}x_{i}} x_{i,k} - e^{\beta^{T}x_{i}}e^{\beta^{T}x_{i}}x_{i,k}}{(1+e^{\beta^{T}x_{i}})^2}x_{i,j}$
  • $\displaystyle \sum_{i=1}^{N}P(y_{i}|x_{i}) + (P(y_{i}|x_{i}))^2x_{i,k}.x_{i,j}$
  • $\displaystyle -\sum_{i=1}^{N}(P(y_{i}|x_{i})(1-P(y_{i}|x_{i})))$

Application of MLE: Naive Bayes Classifier

Two class classifier:

  • $\displaystyle P(C=C_1|x)$
  • $\displaystyle P(C=C_2|x)$
  • $\displaystyle P(C=C_1|x) + P(C=C_2|x) = 1$

$\displaystyle \text{posterior probability} = \frac{\text{conditional probability.prior probability}}{evidence}$

  • The posterior probability, in the context of a classification problem, can be interpreted, “given the feature vector $x_i$ what is the probability of sample "$i$" belonging to class "$c_j$”?

  • $\displaystyle P(\text{Female}|x) = \frac{P(x|\text{Female}).P(\text{Female})}{P(x)}$

  • $\displaystyle P(\text{Male}|x) = \frac{P(x|\text{Male}).P(\text{Male})}{P(x)}$

  • If $P(\text{Female}|x) > P(\text{Male}|x)$, we say that the person is female

  • We don't need to care about P(x) when doing comparison

  • Learning: from the data, we can find out $\mu_m, \mu_f, \sigma_m, \sigma_f$ by fitting Gaussian distributions over each class

Assumptions

  • Distribution is Gaussian
    • if the distribution is not Gaussian, then the predictions will be bad
  • Naive Assumption
    • features are conditionally independent
    • $\displaystyle P(X|Y,Z) = P(X|Z)$
    • lets us break down features into products
    • $\displaystyle P(x_1,x_2|c) = P(x_1|c).P(x_2|c)$

References


Tagged in pythonmachine-learningpredictive-analysis