Random variable: a numerical characteristic that takes on different values due to chance
Probability distribution: A table, graph, or formula that gives the probability of a given outcome's occurrence
What if we flipped a fair coin four times? What are the possible outcomes and what is the probability of each?
Heads | 0 | 1 | 2 | 3 | 4 |
Probability | 0.0625 | 0.25 | 0.375 | 0.25 | 0.0625 |
Cumulative Probability | 0.0625 | 0.3125 | 0.6875 | 0.9375 | 1 |
\[ P(X=0)=0.0625 \\ P(X<3)= P(X=0)+P(X=1)+P(X=2)=0.6875 \]
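The table can be reproduced in R. This is a minimal sketch, assuming the number of heads in four fair flips follows a binomial distribution with n = 4 and p = 0.5; the variable names are illustrative only.

```r
heads <- 0:4
prob <- dbinom(heads, size = 4, prob = 0.5)  # P(X = k) for k = 0..4
cum.prob <- cumsum(prob)                     # P(X <= k)

# Rebuild the table row by row
coin.table <- rbind(Heads = heads, Probability = prob, Cumulative = cum.prob)
print(coin.table)

# P(X < 3) = P(X = 0) + P(X = 1) + P(X = 2)
p.less.than.3 <- sum(prob[heads < 3])
print(p.less.than.3)
```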
A census was conducted at a university. All students were asked how many tattoos they had.
Tattoos | 0 | 1 | 2 | 3 | 4 |
Probability | 0.85 | 0.12 | 0.015 | 0.01 | 0.005 |
\( P(X=0)=.85, \quad P(X=1)=.12, \quad P(X=2)=.015, \) etc.
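A distribution given as a table can be stored as two vectors and queried directly. The sketch below uses the tattoo probabilities from the table; the variable names are hypothetical.

```r
tattoos <- 0:4
tattoo.prob <- c(0.85, 0.12, 0.015, 0.01, 0.005)

# A valid probability distribution sums to 1
total.prob <- sum(tattoo.prob)

# P(X >= 1): complement of having no tattoos
p.at.least.one <- 1 - tattoo.prob[tattoos == 0]

# P(X >= 2): add up the matching outcomes
p.two.or.more <- sum(tattoo.prob[tattoos >= 2])

print(c(total = total.prob, at.least.one = p.at.least.one, two.or.more = p.two.or.more))
```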
The Binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure.
Suppose there are twelve multiple-choice questions in an English class quiz, each with five possible answers of which only one is correct. The probability distribution for the number of correct answers when every question is answered at random.
The Poisson distribution is the probability distribution of independent event occurrences in an interval.
If there are 12 cars crossing a bridge per minute on average, the probability distribution for the number of cars crossing the bridge in a minute.
Eight-week-old babies smile for between 0 and 23 seconds, and any smiling time between 0 and 23 seconds is equally likely. The probability distribution for a randomly selected baby's smiling time.
The probability distribution for the height of a randomly selected person.
Probability Mass Functions for Discrete distributions
Probability Density Functions for Continuous Distributions
\( f(k)=\binom{n}{k}\, p^k (1-p)^{n-k} \quad \) where
n = number of trials,
k = number of successes,
p = probability of success on each trial
\( f(k)=\frac{\lambda^k e^{-\lambda}}{k!} \quad \) where
\( \lambda \) = average number of events per interval (event rate),
k = number of occurrences
\[ f(x)=\begin{cases} \frac{1}{b - a} & \text{for } x \in [a,b] \\ 0 & \text{otherwise} \end{cases} \]
\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \]
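Each of the four formulas can be checked against the matching built-in density function. The sketch below evaluates the formulas by hand at arbitrarily chosen points (the parameter values are taken from the examples above, the evaluation points are assumptions) and compares them with `dbinom`, `dpois`, `dunif` and `dnorm`.

```r
# Binomial: 12 trials, success probability 1/5, evaluated at k = 4
n.trials <- 12; k.succ <- 4; p.succ <- 1/5
binom.manual <- choose(n.trials, k.succ) * p.succ^k.succ * (1 - p.succ)^(n.trials - k.succ)

# Poisson: rate 12 per interval, evaluated at k = 17
lambda <- 12; k.events <- 17
pois.manual <- lambda^k.events * exp(-lambda) / factorial(k.events)

# Uniform on [a, b] = [0, 23], evaluated at x = 10
a <- 0; b <- 23
unif.manual <- 1 / (b - a)

# Normal with mu = 150, sigma = 5, evaluated at x = 155
mu <- 150; sigma <- 5; x0 <- 155
norm.manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x0 - mu)^2 / (2 * sigma^2))

# Each comparison should print TRUE
print(all.equal(binom.manual, dbinom(k.succ, n.trials, p.succ)))
print(all.equal(pois.manual, dpois(k.events, lambda)))
print(all.equal(unif.manual, dunif(10, a, b)))
print(all.equal(norm.manual, dnorm(x0, mu, sigma)))
```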
P(x= k)
P(x < k)
To get a full list of the distributions available in R you can use the following command:
For every distribution there are four commands. The commands for each distribution are prepended with a letter to indicate the functionality:
- “d” returns the height of the probability density function (or the probability mass function for discrete distributions)
- “p” returns the cumulative distribution function
- “q” returns the inverse cumulative distribution function (quantiles)
- “r” returns randomly generated numbers
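As a quick illustration of the four prefixes, here is a sketch using the uniform distribution on [0, 23] from the baby smiling-time example. The evaluation points and the seed are arbitrary choices.

```r
smile.density <- dunif(10, min = 0, max = 23)   # "d": height of the density, 1/23
smile.cdf <- punif(10, min = 0, max = 23)       # "p": P(X <= 10) = 10/23
smile.median <- qunif(0.5, min = 0, max = 23)   # "q": value with half the area below it
set.seed(1)                                     # only to make the draws repeatable
smile.draws <- runif(3, min = 0, max = 23)      # "r": three random smiling times

print(c(density = smile.density, cdf = smile.cdf, median = smile.median))
print(smile.draws)
```

Note that "q" undoes "p": `qunif(punif(10, 0, 23), 0, 23)` returns 10.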
\( N(150,5) \) is a normal distribution where \( \mu \)= 150 and \( \sigma \)= 5
for density function
x = seq(120, 180, 1)
props = dnorm(x, mean = 150, sd = 5) # density at every value of x
d150 = dnorm(150, mean = 150, sd = 5) # density at the mean
plot(x, props, type = 'l')
points(150, d150)
\( N(150,5) \) is a normal distribution where \( \mu \)= 150 and \( \sigma \)= 5
for cumulative distribution function
x=seq(120,180, 1)
(p150= pnorm(150, mean = 150, sd=5))
[1] 0.5
# Area under the curve to the left of 150
\( N(150,5) \) is a normal distribution where \( \mu \)= 150 and \( \sigma \)= 5
to get the quantiles - inverse of pnorm
quartile.p=seq(0.25, .75, 0.25)
(quantiles=qnorm(quartile.p, mean=150, sd = 5))
[1] 146.6276 150.0000 153.3724
props=dnorm(x, 150, 5)
plot(x, props, type = 'l')
points(quantiles, dnorm(quantiles, 150, 5))
to generate random values from a distribution
(random.vals=rnorm(5, 150,5))
[1] 161.5864 150.6535 154.6724 148.0559 143.6746
Try these functions with the binomial distribution:
dbinom, pbinom, qbinom, rbinom
y = 0:30 # possible numbers of heads in 30 flips
plot(y, dbinom(y, 30, 0.5), type = "h", ylab = 'Binomial Dist', main = '# of heads in 30 coin flips')
Suppose there are twelve multiple-choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
vals = 0:4
props = dbinom(vals, size = 12, prob = 1/5) # P(X = 0), ..., P(X = 4)
sum(props)
[1] 0.9274445
pbinom(4, size = 12, prob = 1/5) # same result from the cumulative function
[1] 0.9274445
Assume that the test scores of a college entrance exam fits a normal distribution. Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring 84 or more in the exam?
pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.2149176
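The same tail probability can be obtained by first converting the score to a z-score and then using the standard normal distribution. A small sketch, with illustrative variable names:

```r
score <- 84; score.mean <- 72; score.sd <- 15.2

# Distance from the mean in standard deviations
z.score <- (score - score.mean) / score.sd

# Upper-tail area under the standard normal curve
p.from.z <- pnorm(z.score, lower.tail = FALSE)

# Direct calculation on the original scale, for comparison
p.direct <- pnorm(score, mean = score.mean, sd = score.sd, lower.tail = FALSE)

print(c(z = z.score, from.z = p.from.z, direct = p.direct))
```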
If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute.
P(X >= 17) = 1 - P(X <= 16)
(p.at.most.sixteen = ppois(16, lambda = 12))
[1] 0.898709
1 - p.at.most.sixteen
[1] 0.101291
(p.greater.than.sixteen = ppois(16, lambda = 12, lower.tail = FALSE))
[1] 0.101291
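Because the Poisson distribution is discrete, the upper tail can also be checked by summing the individual probabilities for 0 through 16 cars and taking the complement. A sketch with illustrative names:

```r
lambda.cars <- 12

# P(X >= 17) via the cumulative function
p.upper <- ppois(16, lambda = lambda.cars, lower.tail = FALSE)

# P(X >= 17) = 1 - [P(X = 0) + ... + P(X = 16)]
p.by.sum <- 1 - sum(dpois(0:16, lambda = lambda.cars))

print(c(ppois = p.upper, by.sum = p.by.sum))
```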
library(MASS) # provides the survey data set
names(survey)
[1] "Sex" "Wr.Hnd" "NW.Hnd" "W.Hnd" "Fold" "Pulse" "Clap"
[8] "Exer" "Smoke" "Height" "M.I" "Age"
dim(survey)
[1] 237 12
height.survey = na.omit(survey$Height)
mean(height.survey)
[1] 172.3809
A sample randomly drawn from the student body has an average height of 172.4. Since this is not the average of the entire school, the population mean \( \mu \) (the true mean) is most likely not exactly 172.4.
\[ \mu_{\bar{x}} = \mu, \ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
According to the central limit theorem, the average height for samples of 209 observations (the number of non-missing heights) follows a normal distribution N(\( \mu, \sigma/\sqrt{209} \)).
Our sample mean (172.4) is one of the outcomes in this distribution; we just don't know where it actually falls.
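The behaviour of sample means can be seen in a small simulation. The sketch below draws repeated samples from a deliberately skewed, hypothetical population (a gamma distribution chosen only for illustration, not the height data) and compares the spread of the sample means with \( \sigma/\sqrt{n} \).

```r
set.seed(42)                        # for repeatable results
sample.size <- 209
pop.mean <- 2                       # mean of Gamma(shape = 2, rate = 1)
pop.sd <- sqrt(2)                   # standard deviation of that gamma

# 5000 sample means, each from a fresh sample of 209 observations
sample.means <- replicate(5000, mean(rgamma(sample.size, shape = 2, rate = 1)))

print(mean(sample.means))           # close to the population mean
print(sd(sample.means))             # close to the theoretical standard error
print(pop.sd / sqrt(sample.size))   # theoretical standard error

hist(sample.means, breaks = 40, main = 'Sample means look bell-shaped')
```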
Area under the curve can be calculated in reference to the distance in standard deviation from the mean
qnorm(.975) # same as qnorm(.975, mean=0, sd=1)
[1] 1.959964
qnorm(.95)
[1] 1.644854
Assume the population standard deviation \( (\sigma) \) of the student height in survey is 9.48. Find the margin of error and interval estimate at 95% confidence level.
n = length(height.survey)
sd = 9.48
(S.E= sd/sqrt(n)) # standard error of the mean
[1] 0.6557453
(M.E = qnorm(.975)*S.E) # margin of error
[1] 1.285237
xbar = mean(height.survey) # point estimate
xbar + c(-M.E, M.E)
[1] 171.0956 173.6661
The 95% confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, \( z_{\alpha/2} \) = 1.96 is given by qnorm(.975).
\( t_{\alpha/2, n-1}=qt(1-\alpha/2, n-1 ) \)
Without assuming the population standard deviation of the student height in survey, find the margin of error and interval estimate at 95% confidence level.
n = length(height.survey)
sd = sd(height.survey, na.rm = T) # sample standard deviation
(S.E= sd/sqrt(n)) # standard error of the mean
[1] 0.6811677
(M.E = qt(.975, n-1 )*S.E) # margin of error
[1] 1.342878
xbar = mean(height.survey, na.rm = T) # point estimate
xbar + c(-M.E, M.E)
[1] 171.0380 173.7237
The 95% confidence level would imply the 97.5th percentile of the t-distribution where the degrees of freedom is n-1 at the upper tail.
Therefore, \( t_{\alpha/2,n-1}=1.97 \) calculated by qt(.975, 209-1)
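To see how the t critical value relates to the normal one, the sketch below compares `qt(.975, df)` for a few degrees of freedom with `qnorm(.975)`. The particular df values are arbitrary, apart from 208, which matches the height sample.

```r
df.vals <- c(5, 30, 208, 1000)
t.crit <- qt(0.975, df = df.vals)   # t critical values at 95% confidence
z.crit <- qnorm(0.975)              # normal critical value

# t critical values shrink toward the normal value as df grows
print(rbind(df = df.vals, t.crit = round(t.crit, 4)))
print(z.crit)
```

With few degrees of freedom the t interval is noticeably wider; at df = 208 the difference from the normal value is small.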
Use the built-in t-test function
t.test(height.survey, conf.level = .95)
One Sample t-test
data: height.survey
t = 253.0667, df = 208, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
171.0380 173.7237
sample estimates:
mean of x
What proportion of the population is Female?
gender.response = na.omit(survey$Sex)
n = length(gender.response) # number of non-missing responses
pbar = sum(gender.response=='Female') / n
#or (pbar=prop.table(table(gender.response))['Female'])
(S.E = sqrt(pbar*(1-pbar)/n)) # standard error
[1] 0.03254723
(M.E = qnorm(.975)*S.E ) # margin of error
[1] 0.06379139
pbar + c(-M.E,M.E)
[1] 0.4362086 0.5637914
Central Limit Theorem
The range for the true proportion at the (\( 1-\alpha \)) confidence level is
\( \bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p} (1- \bar{p})}{n}} \)
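The formula can be wrapped in a small helper. `prop.ci` below is a hypothetical function written for illustration, not part of base R; the counts 118 out of 236 are assumed here because they are consistent with the proportion of 0.5 and the standard error shown above.

```r
# Normal-approximation confidence interval for a proportion
prop.ci <- function(successes, trials, conf.level = 0.95) {
  p.hat <- successes / trials
  alpha <- 1 - conf.level
  margin <- qnorm(1 - alpha / 2) * sqrt(p.hat * (1 - p.hat) / trials)
  c(lower = p.hat - margin, upper = p.hat + margin)
}

female.ci <- prop.ci(118, 236)
print(female.ci)
```

This interval differs slightly from the one reported by `prop.test`, which uses a different (score-based) method.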
x=sum(gender.response=='Female') # number of successes
k= length(gender.response) # number of trials
prop.test( x, k , conf.level = .95)
1-sample proportions test without continuity correction
data: x out of k, null probability 0.5
X-squared = 0, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4367215 0.5632785
sample estimates:
Proportion Mean
Type I and Type II errors
The average height for the sample data is 172.4. About 10 years ago, the average student height was known to be 170.5. Can we say (with some degree of confidence) that the average student height has increased since 10 years ago?
The sample average is slightly higher than the average height 10 years ago. There are two competing hypotheses for the difference:
Null hypothesis \( H_O \): The claim that defends the status quo. Nothing has changed or is different. The sample difference is due to sampling error.
Alternative hypothesis \( H_A \): The claim that there is a real difference. If it is true, then the answer to your research question is 'yes'.
To support the alternative hypothesis, we need to show that the observed sample mean (172.4) would be extremely unlikely if \( H_O \) were true.
\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)
Has the average height increased?
\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)
\( \mu=170.5 \) and
\( \bar{x}=172.4 \)
t.test(height.survey,mu = 170.5, alternative = 'greater', conf.level = .95)
One Sample t-test
data: height.survey
t = 2.7612, df = 208, p-value = 0.003137
alternative hypothesis: true mean is greater than 170.5
95 percent confidence interval:
171.2554 Inf
sample estimates:
mean of x
The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set if the null hypothesis were true.
If the p-value is low, we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence reject \( H_O \).
If the p-value is high, we say that it is likely to observe the data even if the null hypothesis were true, and hence do not reject \( H_O \).
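The p-value reported by `t.test` can be recomputed from the test statistic and the degrees of freedom. The sketch below uses the rounded values t = 2.7612 and df = 208 from the output above; the 0.05 significance level is an assumed threshold.

```r
t.stat <- 2.7612
df.t <- 208

# One-sided (greater): area to the right of the observed t
p.greater <- pt(t.stat, df = df.t, lower.tail = FALSE)

# One-sided (less): area to the left of the observed t
p.less <- pt(t.stat, df = df.t)

# Two-sided: both tails beyond |t|
p.two.sided <- 2 * pt(abs(t.stat), df = df.t, lower.tail = FALSE)

print(c(greater = p.greater, less = p.less, two.sided = p.two.sided))

alpha <- 0.05
print(p.greater < alpha)   # TRUE means: reject the null hypothesis
```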
Has the average height increased?
\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)
t.test(height.survey,mu = 170.5, alternative = 'greater', conf.level = .95)
Has the average height decreased?
\( H_O: \mu=170.5 \)
\( H_A: \mu<170.5 \)
t.test(height.survey, mu = 170.5, alternative = 'less', conf.level = .95)
Has the average height changed?
\( H_O: \mu=170.5 \)
\( H_A: \mu\neq 170.5 \)
t.test(height.survey, mu = 170.5, alternative = 'two.sided', conf.level = .95)
If the population variance is known and the sample size is large enough, we can assume that sample means are normally distributed. So instead of a t-test we perform a z-test.
A z-test is not in the base library of R, but there are packages such as BSDA that provide a z.test function.
library(BSDA) # provides z.test
z.test(height.survey, mu = 170.5, sigma.x=9.84, alternative = 'two.sided', conf.level = .95)
One-sample z-Test
data: height.survey
z = 2.7633, p-value = 0.005721
alternative hypothesis: true mean is not equal to 170.5
95 percent confidence interval:
171.0468 173.7149
sample estimates:
mean of x
Only 40% of the students were female in the past. Has the female proportion increased?
\( H_O: p=0.4 \)
\( H_A: p > 0.4 \)
x=sum(gender.response=='Female') # number of successes
n= length(gender.response) # number of trials
prop.test(x, n, p=0.4, alternative = 'greater',conf.level = .95)
1-sample proportions test with continuity correction
data: x out of n, null probability 0.4
X-squared = 9.4211, df = 1, p-value = 0.001073
alternative hypothesis: true p is greater than 0.4
95 percent confidence interval:
0.4446747 1.0000000
sample estimates:
Has the female proportion increased?
\( H_O: p=0.4 \)
\( H_A: p > 0.4 \)
prop.test(x, n, p=0.4, alternative = 'greater',conf.level = .95)
Has the female proportion decreased?
\( H_O: p=0.4 \)
\( H_A: p < 0.4 \)
prop.test(x, n, p=0.4, alternative = 'less',conf.level = .95)
Has the female proportion changed?
\( H_O: p=0.4 \)
\( H_A: p \neq 0.4 \)
prop.test(x, n, p=0.4, alternative = 'two.sided',conf.level = .95)
In a hypothesis test, we make a decision about which hypothesis might be true, but our choice might be incorrect.
A Type I Error is rejecting the null hypothesis when \( H_0 \) is true.
A Type II Error is failing to reject the null hypothesis when \( H_A \) is true.
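Both error rates can be estimated by simulation. The sketch below is an illustration under assumed values (normal data, sample size 50, standard deviation 10, a true mean of 175 for the alternative, and a 0.05 significance level); none of these come from the survey data.

```r
set.seed(123)          # for repeatable results
alpha <- 0.05
n.sims <- 2000

# Null hypothesis true: data really have mean 170.5
p.null <- replicate(n.sims,
  t.test(rnorm(50, mean = 170.5, sd = 10), mu = 170.5)$p.value)
type1.rate <- mean(p.null < alpha)    # rejected although H0 is true

# Alternative true: data really have mean 175
p.alt <- replicate(n.sims,
  t.test(rnorm(50, mean = 175, sd = 10), mu = 170.5)$p.value)
type2.rate <- mean(p.alt >= alpha)    # failed to reject although HA is true

print(c(type1 = type1.rate, type2 = type2.rate))
```

The estimated Type I error rate should land near the chosen significance level, while the Type II error rate depends on the sample size and on how far the true mean is from 170.5.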