Maximum Likelihood Estimation

Purpose

Notes about Maximum Likelihood Cost Function

Cost Function

\displaystyle J(m, c) = \sum_{i=1}^N(y_i - yp_i)^2 = \sum_{i=1}^N(y_i - (mx_i+c))^2

Closed Form

  • Take the partial derivatives and equate them to 0
  • \displaystyle \frac{\partial J}{\partial m} = 0
  • \displaystyle \frac{\partial J}{\partial c} = 0
  • Solve the two equations for the two unknowns (see the sketch below)
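A minimal closed-form sketch (assuming NumPy; the synthetic x, y and the true slope/intercept are illustrative):

```python
import numpy as np

# Synthetic data (assumed for illustration): y ≈ 2x + 1 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Closed-form solution of dJ/dm = 0 and dJ/dc = 0 (two equations, two unknowns)
N = x.size
m_hat = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
c_hat = np.mean(y) - m_hat * np.mean(x)
print(m_hat, c_hat)  # should be close to 2 and 1
```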

Iterative Method

  • Assume a starting point for m and c, then update them by repeatedly stepping along the negative gradient until the values of m and c stop changing much (see the sketch after the update rule).
  • \displaystyle \begin{bmatrix} m \\ c \end{bmatrix}^{new} = \begin{bmatrix} m \\ c \end{bmatrix}^{old} - \eta \begin{bmatrix} \frac{\partial J}{\partial m} \\ \\ \frac{\partial J}{\partial c} \end{bmatrix}_{\begin{bmatrix}m\\c\end{bmatrix}^{old}}
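A minimal gradient-descent sketch for the same cost (assuming NumPy and the same kind of synthetic data; the learning rate and iteration count are arbitrary choices for this toy problem):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

m, c = 0.0, 0.0                             # starting point
eta = 1e-4                                  # learning rate (assumed)
for _ in range(20_000):
    residual = y - (m * x + c)
    dJ_dm = -2 * np.sum(residual * x)       # partial J / partial m
    dJ_dc = -2 * np.sum(residual)           # partial J / partial c
    m -= eta * dJ_dm
    c -= eta * dJ_dc
print(m, c)                                 # should approach the closed-form values
```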

Random Variables

A random variable is a variable whose possible values are numerical outcomes of a random phenomenon or event.

There are two types of random variables

  • Discrete, like the number shown on a die
  • Continuous, like the length of a hair

Maximum Likelihood Cost Function

Suppose we have n independent and identically distributed (i.i.d.) observations drawn from a probability density function f(x|theta), where the parameter theta is unknown.

So, how do we arrive at the log-likelihood function? Here are the steps:

  1. Write the joint density function.

    \displaystyle J(\theta) = P(x_1, x_2, \ldots, x_n; \theta)

  2. Find the likelihood function: the samples x are treated as fixed “parameters” and theta becomes the function’s variable.

    \displaystyle L(\theta) = \prod_{i=1}^{N}P(x_i;\theta) \;\bigg|\; \text{independence assumption}

  3. For further simplification, we take the logarithm to get the log-likelihood function, which is easier to work with.

    \displaystyle l(\theta) = \log(L(\theta)) = \sum_{i=1}^N \log P(x_i;\theta) \;\bigg|\; \text{taking log}

We can take the log because log is a monotonic function: the value of \theta that maximises (or minimises) the log-likelihood also maximises (or minimises) the product form. Moreover, it simplifies the equation, as the quick numerical check below illustrates.
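A quick numerical illustration (a sketch assuming NumPy) of why the log form is preferred in practice: a product of many small probabilities underflows in floating point, while the sum of logs stays well behaved:

```python
import numpy as np

probs = np.full(1000, 0.01)       # 1000 small per-sample probabilities
print(np.prod(probs))             # 0.0 — the product underflows
print(np.sum(np.log(probs)))      # ≈ -4605.17, still perfectly usable
```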

What is MLE

MLE is a technique to find the parameters that maximise the likelihood of observing the data points, assuming they were generated by a given distribution.

Finding Parameters of Gaussian Distribution through MLE

  • Parameters for the Gaussian distribution are \mu and \sigma
  • PDF for the Normal distribution:
    \displaystyle f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}

\displaystyle P(x_i;\theta) = p(x_i; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}

\displaystyle l(\theta) = \sum_{i=1}^N \ln\big(p(x_i;\theta)\big) = J(\mu, \sigma) = \sum_{i=1}^N\bigg[\ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2}\bigg]

Now that we have a cost function, we can use the iterative or the closed-form method to solve for the parameters.

  • \displaystyle J(\mu, \sigma) = \sum_{i=1}^N\bigg[\ln\frac{1}{\sqrt{2\pi}\sigma} - \frac{(x_i-\mu)^2}{2\sigma^2}\bigg]

    \displaystyle = \sum_{i=1}^{N}\bigg[-\ln(\sigma\sqrt{2\pi}) - \frac{(x_i-\mu)^2}{2\sigma^2}\bigg]

  • \displaystyle \frac{\partial J}{\partial \mu} = 0 \implies \frac{\partial}{\partial \mu}\bigg(-\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}\bigg) = 0

    \displaystyle = \frac{-1}{2\sigma^2}\sum_{i=1}^{N}2(x_i-\mu)(-1) = 0

  • \displaystyle \sum_{i=1}^{N}x_i = \sum_{i=1}^{N}\hat\mu \implies \boxed{\hat\mu = \frac{\sum x_i}{N}}

Similarly, we can find the value for the standard deviation (\sigma); a quick numeric check of both estimators follows below.

  • \boxed{\displaystyle \hat\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\hat\mu)^2}{N}}}
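A quick numeric check of the boxed estimators (a sketch assuming NumPy; the true \mu and \sigma below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)           # true mu = 5, sigma = 2

mu_hat = np.sum(x) / x.size                               # same as np.mean(x)
sigma_hat = np.sqrt(np.sum((x - mu_hat) ** 2) / x.size)   # MLE divides by N, not N-1
print(mu_hat, sigma_hat)                                  # ≈ 5.0 and ≈ 2.0
```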

Note

Ordinary Least Squares is the maximum likelihood estimator for a linear model with Gaussian noise.

Maximum Likelihood Estimate for Discrete Distributions

Bernoulli Distribution

The Bernoulli distribution models events with two possible outcomes: either success or failure.

  • PMF of the Bernoulli distribution:

    \displaystyle f(x;p) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \\ 0 & \text{if } x \notin R_x \text{ (support)} \end{cases}

The likelihood for p based on X is defined as the joint probability distribution of X_1, X_2, \ldots, X_n. Since X_1, X_2, \ldots, X_n are i.i.d. random variables, the joint distribution is

\displaystyle L(p;x) = f(x;p) = \prod_{i=1}^{n}f(x_i;p) = \prod_{i=1}^{n}p^{x_i}(1-p)^{1-x_i}

Taking log,

\displaystyle \ln f = \sum_{i=1}^{n} x_i \ln p + \Big(n - \sum_{i=1}^{n} x_i\Big)\ln(1-p)

Differentiating the log of L(p; x) with respect to p and setting the derivative to zero shows that this function achieves a maximum at

\displaystyle \hat p = \frac{\sum_{i=1}^{n}x_i}{n}
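A small sketch (assuming NumPy; the true p below is illustrative) estimating p from simulated Bernoulli trials:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.binomial(n=1, p=0.3, size=5_000)   # 5000 Bernoulli(0.3) samples

p_hat = np.sum(x) / x.size                 # MLE: the sample mean
print(p_hat)                               # ≈ 0.3
```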

Logistic Regression

1. Starting with the log form of MLE

  • \displaystyle P(y;\underline p) = \prod_{i=1}^{N}P(y_i;\underline p_i)
  • \displaystyle l(\underline p) = \sum_{i=1}^{N}\big[y_i \ln \underline p_i + (1-y_i)\ln (1-\underline p_i)\big]

2. Sigmoid Function

  • \displaystyle P(y_{i}|x_{i}) = \sigma(\beta_0 + \beta_1x_{i,1})

    • Here x_{i,j} denotes the j^{th} feature of the i^{th} sample
  • \displaystyle \sigma(\beta_0+\beta_1x_{i,1}) can be written as \sigma(\beta^{T}x_{i}), where x_i is augmented with a constant first component x_{i,0}=1
  • \displaystyle \sigma(\beta^{T}x_{i}) = \frac{1}{1+e^{-\beta^{T}x_{i}}} = \frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}
  • \displaystyle 1 - P(y_{i}|x_{i}) = 1-\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}} = \frac{1}{1+e^{\beta^{T}x_{i}}}
  • Note that we have found (a quick numeric check follows this list):

    • \displaystyle \Large \boxed{P(y_{i}|x_{i}) = \frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}}
    • \displaystyle \Large \boxed{1 - P(y_{i}|x_{i}) = \frac{1}{1+e^{\beta^{T}x_{i}}}}
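A quick numeric check (a sketch assuming NumPy; the z values are arbitrary) that the two algebraic forms of the sigmoid agree and that the two boxed probabilities sum to one:

```python
import numpy as np

z = np.array([-3.0, -0.5, 0.0, 2.0])   # arbitrary values of beta^T x_i
p1 = 1 / (1 + np.exp(-z))              # sigma(z) = 1 / (1 + e^{-z})
p2 = np.exp(z) / (1 + np.exp(z))       # sigma(z) = e^{z} / (1 + e^{z})
p0 = 1 / (1 + np.exp(z))               # 1 - P(y_i|x_i) = 1 / (1 + e^{z})
print(np.allclose(p1, p2))             # True: both forms agree
print(np.allclose(p1 + p0, 1.0))       # True: the probabilities sum to 1
```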

3. Substituting these values in the \log-likelihood expression (a small code sketch follows)

  • \displaystyle l(\beta) = - \sum_{i=1}^{N}\ln(1+e^{\beta^{T}x_{i}}) + \sum_{i=1}^{N}y_{i}\beta^{T}x_{i}
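A small sketch of this expression (assuming NumPy; X is the design matrix with a leading column of ones and y holds the 0/1 labels):

```python
import numpy as np

def log_likelihood(beta, X, y):
    """l(beta) = -sum_i ln(1 + e^{beta^T x_i}) + sum_i y_i * beta^T x_i."""
    z = X @ beta                                        # beta^T x_i for every sample
    return -np.sum(np.log1p(np.exp(z))) + np.sum(y * z)
```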

4. We can take partial derivatives w.r.t. \beta_0 and \beta_1 and solve the equations using gradient descent

  • \displaystyle \frac{\partial l(\beta)}{\partial \beta_1} = -\sum_{i=1}^{N}\frac{e^{\beta^{T}x_{i}}}{1+e^{\beta^{T}x_{i}}}x_{i,1} + \sum_{i=1}^{N}y_{i}x_{i,1}
  • In general, \large \displaystyle \frac{\partial l(\beta)}{\partial \beta_j} = \sum_{i=1}^{N}(y_{i} - p(y_{i}|x_{i}))x_{i,j}

    • y_{i}: actual value
    • p(y_{i}|x_{i}): predicted value
    • y_{i} - p(y_{i}|x_{i}): error term
  • After simplification, it turns into a simple matrix multiplication: we take the transpose of the design matrix X and multiply it by the error vector.
  • \begin{bmatrix}e_1 & e_2 & \ldots & e_n\end{bmatrix} \begin{bmatrix}x_1\\x_2\\\vdots\\x_n\end{bmatrix}
  • \begin{bmatrix}x_1 & x_2 & \ldots & x_n\end{bmatrix}\begin{bmatrix}e_1\\ e_2\\\vdots\\e_n\end{bmatrix}
  • \large \nabla_{\beta}l(\beta) = \begin{bmatrix}\frac{\partial l(\beta)}{\partial \beta_0} \\ \\ \frac{\partial l(\beta)}{\partial \beta_1} \end{bmatrix} = \begin{bmatrix}x_{1,0} & x_{2,0} & \ldots & x_{n,0} \\ x_{1,1} & x_{2,1} & \ldots & x_{n,1} \end{bmatrix} \begin{bmatrix}e_1\\e_2\\\vdots\\e_n\end{bmatrix} = X^{T}e
  • The matrix multiplication can be carried out with np.matmul in Python (see the sketch after this list)
  • Shape of X will be N \times (d+1), where

    • d is the number of features
    • N is the number of observations
  • Shape of X^{T} (X-transpose) will be (d+1) \times N
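A minimal gradient-update sketch using np.matmul (assuming NumPy and synthetic data; true_beta, the learning rate, and the iteration count are illustrative choices, and the step adds the gradient because we maximise l(\beta)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 1
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # shape N x (d+1), x_{i,0} = 1
true_beta = np.array([-0.5, 2.0])                           # illustrative parameters
y = (rng.random(N) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

beta = np.zeros(d + 1)
eta = 0.01                                                  # learning rate (assumed)
for _ in range(5_000):
    p = 1 / (1 + np.exp(-np.matmul(X, beta)))               # predicted probabilities
    e = y - p                                               # error vector
    grad = np.matmul(X.T, e)                                # gradient = X^T e
    beta = beta + eta * grad                                # ascent on the log-likelihood
print(beta)                                                 # should land near true_beta
```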

Obtaining Parameters using the Newton-Raphson Method

  • It is a second-order method
  • Gradient Descent: \displaystyle \beta_j^{new} = \beta_j^{old} - \eta \frac{\partial l(\beta)}{\partial \beta_j}
  • Newton-Raphson: \displaystyle \beta^{new} = \beta^{old} - H^{-1} \nabla_{\beta}l(\beta)

Elements of Hessian Matrix

  • \displaystyle \large \frac{\partial^2 l(\beta)}{\partial \beta_k \partial \beta_j}
  • \large \displaystyle \frac{\partial l(\beta)}{\partial \beta_j} = \sum_{i=1}^{N}(y_{i} - p(y_{i}|x_{i}))x_{i,j}
  • \displaystyle \large \frac{\partial^2 l(\beta)}{\partial \beta_k \partial \beta_j} = -\sum_{i=1}^{N}\frac{(1+e^{\beta^{T}x_{i}})e^{\beta^{T}x_{i}} x_{i,k} - e^{\beta^{T}x_{i}}e^{\beta^{T}x_{i}}x_{i,k}}{(1+e^{\beta^{T}x_{i}})^2}x_{i,j}
  • \displaystyle = -\sum_{i=1}^{N}\big[P(y_{i}|x_{i}) - (P(y_{i}|x_{i}))^2\big]x_{i,k}\,x_{i,j}
  • \displaystyle = -\sum_{i=1}^{N}P(y_{i}|x_{i})\big(1-P(y_{i}|x_{i})\big)x_{i,k}\,x_{i,j}, i.e. in matrix form H = -X^{T}WX with W = \text{diag}\big(P(y_{i}|x_{i})(1-P(y_{i}|x_{i}))\big), used in the sketch below
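A sketch of the Newton-Raphson update built from this Hessian (assuming NumPy; X is the augmented design matrix and y the 0/1 labels, as in the previous sketch):

```python
import numpy as np

def newton_raphson(X, y, n_iter=10):
    """Newton-Raphson for logistic regression: beta <- beta - H^{-1} * gradient."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))         # predicted probabilities P(y_i|x_i)
        grad = X.T @ (y - p)                    # gradient of l(beta), i.e. X^T e
        W = np.diag(p * (1 - p))                # diagonal weight matrix
        H = -X.T @ W @ X                        # Hessian of l(beta)
        beta = beta - np.linalg.solve(H, grad)  # beta - H^{-1} grad
    return beta
```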

Application of MLE: Naive Bayes Classifier

Two class classifier:

  • \displaystyle P(C=C_1|x)
  • \displaystyle P(C=C_2|x)
  • \displaystyle P(C=C_1|x) + P(C=C_2|x) = 1

\displaystyle \text{posterior probability} = \frac{\text{conditional probability} \cdot \text{prior probability}}{\text{evidence}}

  • The posterior probability, in the context of a classification problem, can be interpreted as: “given the feature vector x_i, what is the probability that sample i belongs to class c_j?”
  • \displaystyle P(\text{Female}|x) = \frac{P(x|\text{Female}) \cdot P(\text{Female})}{P(x)}
  • \displaystyle P(\text{Male}|x) = \frac{P(x|\text{Male}) \cdot P(\text{Male})}{P(x)}
  • If P(\text{Female}|x) > P(\text{Male}|x), we say that the person is female
  • We don't need to compute P(x) for the comparison, since it is the same for both classes
  • Learning: from the data, we can estimate \mu_m, \mu_f, \sigma_m, \sigma_f by fitting a Gaussian distribution to each class

Assumptions

  • Distribution is Gaussian

    • if the true class-conditional distribution is not Gaussian, the predictions will be poor
  • Naive Assumption

    • features are conditionally independent
    • \displaystyle P(X|Y,Z) = P(X|Z)
    • this lets us factor the joint likelihood of the features into a product (see the sketch after this list)
    • \displaystyle P(x_1,x_2|c) = P(x_1|c)\cdot P(x_2|c)
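A minimal Gaussian Naive Bayes sketch for this two-class setting (assuming NumPy; the feature values, labels, and priors below are illustrative, not from the source):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit per-class, per-feature Gaussians using the MLE estimates derived above."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),            # mu_hat for each feature
                     Xc.std(axis=0),             # sigma_hat for each feature (MLE, divides by N)
                     Xc.shape[0] / X.shape[0])   # prior P(c)
    return params

def predict(params, x):
    """Pick the class with the largest posterior; P(x) is ignored (same for every class)."""
    best_c, best_score = None, -np.inf
    for c, (mu, sigma, prior) in params.items():
        # log P(x|c) + log P(c), using the naive (conditional independence) assumption
        log_lik = np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - (x - mu) ** 2 / (2 * sigma ** 2))
        score = log_lik + np.log(prior)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Illustrative usage: one feature (e.g. height); labels 0 = male, 1 = female
X = np.array([[180.0], [175.0], [170.0], [160.0], [155.0], [165.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(predict(fit_gaussian_nb(X, y), np.array([158.0])))   # -> 1
```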


