Important Topics from Stats and ML

Random Variable

It is difficult to study the behavior of the complete population because of time and cost constraints
We can use inferential statistics, where we infer information about the population from a sample
Example: Lead content in Maggi Packets

Confidence interval for the population mean $\mu$, centered on the sample mean $\bar{x}$: $\bar{x} \pm \frac{Z^*\sigma}{\sqrt n}$
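The interval formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the population standard deviation is known and the interval is centered on the sample mean; the function name `z_confidence_interval` and the packet numbers are hypothetical, not taken from real data.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for a z-based confidence interval of the mean."""
    # Z* is the two-sided critical value, roughly 1.96 for 95% confidence.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: 100 packets, mean lead content 2.5 ppm, sigma 0.5 ppm.
low, high = z_confidence_interval(2.5, 0.5, 100)
print(round(low, 3), round(high, 3))
```

A larger sample size `n` shrinks the margin, since the margin scales with `1 / sqrt(n)`.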

Hypothesis: A claim or an assumption that we make about one or more population parameters
Types of Hypothesis:
- Null Hypothesis ($H_0$)
  - makes an assumption about the status quo
  - always contains the symbols $=, \leq, \geq$
- Alternate Hypothesis ($H_1$)
  - challenges and complements the null hypothesis
  - always contains the symbols $\neq, <, >$
$H_0$ and $H_1$ are disjoint (mutually exclusive) events

In a two-tailed test, the critical region lies on both sides of the distribution
The alternate hypothesis contains the $\neq$ symbol
If alpha is 0.05, the cumulative probability boundaries of the sections are: [0, 0.025, 0.5, 0.975, 1]
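As a rough sketch of how a two-tailed test works in practice, the snippet below runs a z-test at a 5% significance level. It assumes a known population standard deviation; the function name `two_tailed_z_test` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Test H0: mu == mu0 against H1: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-tailed: count the probability mass in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Critical value that cuts off alpha/2 in each tail (about 1.96 for alpha = 0.05).
critical = NormalDist().inv_cdf(1 - 0.05 / 2)

# Hypothetical sample: mean 2.6 from 100 observations, claimed mean 2.5, sigma 0.5.
z, p, reject = two_tailed_z_test(2.6, 2.5, 0.5, 100)
print(round(z, 2), round(p, 4), reject)
```

Rejecting when `p_value < alpha` is equivalent to rejecting when `abs(z)` exceeds the critical value.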

Types of Data
- Public Data: data collected by the government or other public agencies and made public for research purposes
- Private Data: data generated by private sectors such as banking, telecom, retail and media. This data cannot be used freely for analysis
Univariate
- Analyzing one variable at a time
- histogram, countplot, boxplot
Bivariate
- Analyzing two variables at a time
- scatterplot, barplot, boxplot, pairplot
Multivariate
- Analyzing more than two variables at a time
- barplot, scatterplot, boxplot
Outlier Analysis
Categorical Data
- discrete values
Continuous Data
- range of values

Linear regression is the simplest form of regression. It is a technique in which the dependent variable is continuous. The relationship between the dependent variable and the independent variables is assumed to be linear.
Simple Linear Regression - one independent variable
Multiple Linear Regression - more than one independent variable
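For the simple (one-variable) case, the fitted line can be computed directly with ordinary least squares. The sketch below is an illustration only, using the closed-form slope and intercept; the function name `fit_simple_linear_regression` and the toy data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

Multiple linear regression follows the same idea with one coefficient per independent variable, which is usually solved with a linear algebra library rather than by hand.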

$\text{Adj } R^2 = 1 - \frac{(1-R^2)(N-1)}{(N-P-1)}$, where $N$ is the number of observations and $P$ is the number of independent variables
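The adjusted $R^2$ formula translates directly into code. This is a small sketch, assuming `n` is the number of observations and `p` the number of independent variables; the function name `adjusted_r2` and the example values are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from plain R^2, n observations and p independent variables."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than independent variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model: R^2 of 0.90 with 50 observations and 3 predictors.
adj = adjusted_r2(0.90, 50, 3)
print(round(adj, 4))
```

Unlike $R^2$, the adjusted value drops when an added variable does not improve the fit enough to justify the extra parameter.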

Naive Bayes is based on Bayes' theorem and assumes the features are independent of each other given the class
$\displaystyle P(A|G) = \frac{P(G|A)\times P(A)}{P(G)}$
$\displaystyle P(y|(x_1,x_2,\ldots,x_n)) = \frac{\prod_{i=1}^{n}P(x_i|y)\times P(y)}{\prod_{i=1}^{n}P(x_i)}$
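The product rule above can be illustrated with a tiny categorical classifier. This is a sketch under stated assumptions, not a production implementation: it uses raw frequency counts with no smoothing, and it normalizes the scores across classes instead of dividing by the product of feature probabilities (the denominator is the same for every class, so the ranking is unchanged). The function names and the weather-style toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(i, y)][value] = rows of class y whose feature i equals value
    feature_counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, y)][value] += 1
    return class_counts, feature_counts

def predict_proba(model, row):
    """Return {class: posterior probability} for one row of categorical features."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for y, count in class_counts.items():
        score = count / total  # prior P(y)
        for i, value in enumerate(row):
            score *= feature_counts[(i, y)][value] / count  # likelihood P(x_i | y)
        scores[y] = score
    evidence = sum(scores.values())
    if evidence == 0:
        raise ValueError("row has zero likelihood under every class (no smoothing)")
    return {y: s / evidence for y, s in scores.items()}

# Toy data: (outlook, temperature) -> label.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "hot"), ("sunny", "mild"), ("sunny", "hot")]
labels = ["no", "no", "yes", "yes", "yes", "no"]

model = train_naive_bayes(rows, labels)
proba = predict_proba(model, ("sunny", "mild"))
print(proba)
```

In practice a smoothing term is usually added to the counts so that an unseen feature value does not force a probability of zero.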

Expected Values
Mean, Median, Mode
Probability
CLT
$H_0, H_1$ hypothesis formulation
EDA graph based questions
$R^2$, Adj $R^2$, $y = mx + c$
odds: $\displaystyle \frac{y}{1-y} = e^{\beta_0+\beta_1x_1}$, where $\displaystyle y = \frac{1}{1+e^{-(\beta_0+\beta_1x_1)}}$
sigmoid
recall, precision, specificity, sensitivity, confusion matrix
precision: taking 0 as the positive class, out of all the 0s you predicted (TP+FP), how many are actually 0 (TP)
recall: out of all the actual 0s (TP+FN), how many you predicted correctly (TP)
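The sigmoid and the confusion-matrix metrics listed above can be tied together in one short sketch: the logistic function turns a linear score into a probability, a 0.5 threshold turns it into a class, and the counts of TP, FP, FN and TN give the metrics. The coefficients, data and function names are hypothetical, and the positive class defaults to 1 here; pass `positive=0` to match the convention used in the notes.

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1 / (1 + exp(-z))

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    # Assumes every denominator is non-zero for the data at hand.
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # recall is also called sensitivity
        "specificity": tn / (tn + fp),
    }

# Hypothetical coefficients and data for a one-variable logistic model.
beta0, beta1 = -3.0, 1.0
xs = [0, 1, 2, 4, 5, 6, 2.5, 3.5]
y_true = [0, 0, 0, 1, 1, 1, 1, 0]

probs = [sigmoid(beta0 + beta1 * x) for x in xs]
y_pred = [1 if p >= 0.5 else 0 for p in probs]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, fp, fn, tn)
print((tp, fp, fn, tn), metrics)
```

Moving the threshold away from 0.5 trades precision against recall, which is why the two are usually reported together.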