A random variable is a variable whose possible values are numerical outcomes of a random phenomenon or event.
There are two types of random variables:
Discrete, e.g., the number rolled on a die
Continuous, e.g., the length of a hair
Maximum Likelihood Cost Function
Suppose we have N independent and identically distributed (i.i.d.) observations drawn from a probability density function f(x|θ), where the parameter θ is unknown.
So, how do we arrive at the log-likelihood function? Here are the steps:
Making a Joint Density Function.
$J(\theta) = P(x_1, x_2, \ldots, x_N; \theta)$
Finding the likelihood function: the samples x are treated as fixed “parameters” and θ becomes the function’s variable.
$L(\theta) = \prod_{i=1}^{N} P(x_i; \theta) \quad$ (independence assumption)
For further simplification, we take the log to obtain the log-likelihood function, which is easier to work with.
$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log P(x_i; \theta) \quad$ (taking the log)
We can take the log because log is a monotonic function: any θ that maximizes (or minimizes) the log-likelihood also maximizes (or minimizes) the original product form. Moreover, turning the product into a sum simplifies the equation.
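A quick sketch of these steps in code (the Gaussian density and all values here are assumed purely for illustration):

```python
import numpy as np
from scipy.stats import norm

# Toy i.i.d. sample; the Gaussian model is only an illustrative choice.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)

def log_likelihood(mu, sigma, x):
    # l(theta) = sum_i log P(x_i; theta)
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))

# Because log is monotonic, parameters that maximize this sum also maximize
# the raw product of densities, while avoiding numerical underflow.
print(log_likelihood(2.0, 1.5, x), log_likelihood(0.0, 1.5, x))
```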
What is MLE?
MLE is a technique for finding the parameters that maximise the likelihood of observing the data points, assuming they were generated by a given distribution.
Finding Parameters of Gaussian Distribution through MLE
The parameters of the Gaussian distribution are μ and σ.
PDF of the normal distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Now that we have a cost function, we can use either an iterative or a closed-form method to solve it.
$J(\mu, \sigma) = \sum_{i=1}^{N}\left[\ln\frac{1}{\sigma\sqrt{2\pi}} - \frac{(x_i - \mu)^2}{2\sigma^2}\right]$
$= -\sum_{i=1}^{N}\ln(\sigma\sqrt{2\pi}) - \sum_{i=1}^{N}\frac{(x_i - \mu)^2}{2\sigma^2}$
$\frac{\partial J}{\partial \mu} = 0 \implies \frac{\partial}{\partial \mu}\left(-\sum_{i=1}^{N}\frac{(x_i - \mu)^2}{2\sigma^2}\right) = 0$
$\frac{-1}{2\sigma^2}\sum_{i=1}^{N} 2(x_i - \mu)(-1) = 0$
$\sum_{i=1}^{N} x_i = \sum_{i=1}^{N}\hat{\mu} \implies \hat{\mu} = \frac{\sum x_i}{N}$
Similarly, we can derive the estimate for the standard deviation (σ):
$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \hat{\mu})^2}{N}}$
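A minimal sketch of these closed-form estimates on synthetic data (the true values 5.0 and 2.0 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # hypothetical sample

mu_hat = x.sum() / len(x)                                # sum(x_i) / N
sigma_hat = np.sqrt(((x - mu_hat) ** 2).sum() / len(x))  # divide by N, not N-1

print(mu_hat, sigma_hat)  # should land close to (5.0, 2.0)
```

Note that the MLE for σ divides by N, unlike the unbiased sample estimate, which divides by N−1.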
Note
Ordinary Least Squares is the maximum likelihood estimator for a linear model with Gaussian noise.
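A quick illustration of this note on synthetic data (the linear model and noise level are assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.uniform(-1, 1, size=200)])  # intercept + feature
y = X @ np.array([1.0, 3.0]) + rng.normal(0, 0.5, size=200)        # Gaussian noise

# The OLS solution via the normal equations; with Gaussian noise this is
# also the maximum likelihood estimate of the coefficients.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.0, 3.0]
```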
Maximum Likelihood Estimate for Discrete Distributions
Bernoulli Distribution
The Bernoulli distribution models events with two possible outcomes: either success or failure.
PMF of the Bernoulli distribution:
$f(x; p) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \\ 0 & \text{if } x \notin R_X \text{ (support)} \end{cases}$
The likelihood for $p$ based on $X$ is defined as the joint probability distribution of $X_1, X_2, \ldots, X_n$. Since $X_1, X_2, \ldots, X_n$ are i.i.d. random variables, the joint distribution is
$L(p; x) = f(x; p) = \prod_{i=1}^{n} f(x_i; p) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i}$
Taking the log,
$\ln f = \sum x_i \ln p + \left(n - \sum x_i\right)\ln(1 - p)$
Differentiating the log of L(p; x) with respect to p and setting the derivative to zero shows that this function achieves a maximum at
$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$
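A short sketch confirming that the Bernoulli MLE is simply the sample mean (the true p = 0.3 is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(n=1, p=0.3, size=1_000)  # 0/1 trials with an assumed true p

p_hat = x.sum() / len(x)  # p_hat = sum(x_i) / n, the sample mean
print(p_hat)              # close to 0.3
```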
Logistic Regression
1. Starting with the log form of the MLE
$P(y; p) = \prod_{i=1}^{N} P(y_i; p)$
$\ell(p) = \sum\left[y_i \ln p + (1 - y_i)\ln(1 - p)\right]$
2. Sigmoid Function
$P(y_i \mid x_i) = \sigma(\beta_0 + \beta_1 x_{i,1})$
Here $x_{i,j}$ denotes the $j$-th feature of the $i$-th sample.
$\sigma(\beta_0 + \beta_1 x_{i,1})$ can be written as $\sigma(\beta^T x_i)$.
$\sigma(\beta^T x_i) = \frac{1}{1 + e^{-\beta^T x_i}} = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$
$1 - P(y_i \mid x_i) = 1 - \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} = \frac{1}{1 + e^{\beta^T x_i}}$
Note that we have found:
$P(y_i \mid x_i) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}$
$1 - P(y_i \mid x_i) = \frac{1}{1 + e^{\beta^T x_i}}$
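A quick numerical check of these two identities (a sketch using NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# sigma(z) = e^z / (1 + e^z)  and  1 - sigma(z) = 1 / (1 + e^z)
assert np.allclose(sigmoid(z), np.exp(z) / (1 + np.exp(z)))
assert np.allclose(1 - sigmoid(z), 1 / (1 + np.exp(z)))
```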
3. Substituting the values into the log expression
$\ell(\beta) = -\sum_{i=1}^{N}\ln\left(1 + e^{\beta^T x_i}\right) + \sum_{i=1}^{N} y_i \beta^T x_i$
4. We can take the partial derivatives w.r.t. $\beta_0$ and $\beta_1$ and solve the equations using gradient descent.
After simplification, the gradient reduces to a simple matrix multiplication: the transpose of the feature matrix multiplied by the error vector, $\nabla_\beta \ell = X^T\left(y - \sigma(X\beta)\right)$.
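A minimal gradient-ascent sketch of this step (the data, learning rate, and true coefficients are all assumptions for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: an intercept column plus one feature, labels drawn from
# a logistic model with hypothetical coefficients [-1.0, 2.0].
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = (rng.uniform(size=500) < sigmoid(X @ np.array([-1.0, 2.0]))).astype(float)

beta = np.zeros(2)
for _ in range(5_000):
    error = y - sigmoid(X @ beta)          # y_i - P(y_i | x_i)
    beta += 0.1 * (X.T @ error) / len(y)   # ascend l(beta) along X^T (y - p)

print(beta)  # approaches the hypothetical true coefficients
```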
The posterior probability, in the context of a classification problem, can be interpreted as: “given the feature vector $x_i$, what is the probability that sample $i$ belongs to class $c_j$?”
$P(\text{Female} \mid x) = \frac{P(x \mid \text{Female}) \cdot P(\text{Female})}{P(x)}$
$P(\text{Male} \mid x) = \frac{P(x \mid \text{Male}) \cdot P(\text{Male})}{P(x)}$
If $P(\text{Female} \mid x) > P(\text{Male} \mid x)$, we say that the person is female.
We do not need to compute $P(x)$ when making the comparison, since it is the same denominator on both sides.
Learning: from the data, we can estimate $\mu_m, \mu_f, \sigma_m, \sigma_f$ by fitting a Gaussian distribution to each class.
Assumptions
The class-conditional distribution is Gaussian.
If the distribution is not Gaussian, the predictions will be poor.
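A minimal end-to-end sketch of this learn-then-compare procedure (the height data, priors, and test point are all hypothetical):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical heights in cm; real data would replace these draws.
rng = np.random.default_rng(5)
female = rng.normal(162, 6, size=500)
male = rng.normal(176, 7, size=500)

# Learning: fit a Gaussian to each class.
mu_f, sigma_f = female.mean(), female.std()
mu_m, sigma_m = male.mean(), male.std()
p_f = p_m = 0.5  # priors, taken here from the (equal) class counts

# Prediction: compare P(x | class) * P(class); the shared P(x) can be skipped.
x = 170.0
score_f = norm.pdf(x, mu_f, sigma_f) * p_f
score_m = norm.pdf(x, mu_m, sigma_m) * p_m
print("Female" if score_f > score_m else "Male")
```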