Inferential Statistics

BUAN 327
Yegin Genc

Agenda

Probability Distributions

  • The Normal Distribution
  • The t Distribution
  • The Binomial Distribution

Random Variables

Random variable: a numerical characteristic that takes on different values due to chance

  • A discrete random variable has a countable set of distinct possible values.
  • A continuous random variable is such that any value (to any number of decimal places) within some interval is a possible value.

Probability distribution: A table, graph, or formula that gives the probability of a given outcome's occurrence

Probability Distributions

Discrete Probability Distributions

  • Binomial Distributions
  • Poisson Distributions

Continuous Probability Distributions

  • Normal Distributions
  • Uniform Distributions
  • Exponential Distributions

Discrete Probability Distributions

What if we flipped a fair coin four times? What are the possible outcomes and what is the probability of each?

Heads 0 1 2 3 4
Probability 0.0625 0.25 0.375 0.25 0.0625
Cumulative Probability 0.0625 0.3125 0.6875 0.9375 1

\[ P(X=0)=0.0625 \\ P(X<3)= P(X=0)+P(X=1)+P(X=2) \]
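The coin-flip table above can be reproduced with dbinom() (a quick sketch; four flips of a fair coin follow a binomial distribution with n = 4, p = 0.5):

```r
# P(X = k) for k heads in 4 flips of a fair coin
probs <- dbinom(0:4, size = 4, prob = 0.5)
round(probs, 4)            # 0.0625 0.2500 0.3750 0.2500 0.0625
cumsum(probs)              # matches the cumulative-probability row
sum(dbinom(0:2, 4, 0.5))   # P(X < 3) = 0.6875
```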

A census was conducted at a university. All students were asked how many tattoos they had.

Tattoos 0 1 2 3 4
Probability 0.85 0.12 0.015 0.01 0.005

\( P(X=0)=.85, \quad P(X=1)=.12, \quad P(X=2)=.015, \) etc.

Two common discrete distributions

  • The Binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure.

    Suppose a twelve-question multiple choice quiz in an English class, where each question has five choices and only one is correct. The probability distribution for the number of correct answers, when a student answers at random, is binomial.

  • The Poisson distribution is the probability distribution of independent event occurrences in an interval.

    If there are 12 cars crossing a bridge per minute on average, the probability distribution for the number of cars crossing the bridge in a minute.

Common continuous distributions

  • Uniform Distribution is a continuous distribution that has constant probability (every possible outcome has an equal chance, or likelihood, of occurring).

Eight-week-old babies smile between 0 and 23 seconds at a time, and any smiling time between 0 and 23 is equally likely. The probability distribution for a randomly selected baby's smiling time is uniform.

  • Normal Distribution is a continuous distribution that follows a commonly observed, natural probability pattern.

The probability distribution for the height of a randomly selected person.

Probability Density Functions

Probability Mass Functions for Discrete distributions


Probability Density Functions for Continuous Distributions


Probability Mass Functions

Binomial

\( f(k)={n \choose k}\, p^k (1-p)^{n-k} \quad \text{where} \)
n= number of trials,
k= number of successes

Poisson

\( f(k)=\frac{\lambda^k e^{-\lambda}}{k!} \quad \text{where} \)
\( \lambda \) = the average number of events per interval (event rate)
k= number of occurrences
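Both PMFs can be checked against their closed forms in R. A sketch, plugging in n = 12, p = 1/5 (the quiz example) and \( \lambda \) = 12 (the bridge example), with k = 3 chosen arbitrarily:

```r
# binomial PMF by formula vs dbinom()
n <- 12; p <- 0.2; k <- 3
choose(n, k) * p^k * (1 - p)^(n - k)   # formula
dbinom(k, size = n, prob = p)          # built-in; both ≈ 0.2362

# Poisson PMF by formula vs dpois()
lambda <- 12
lambda^k * exp(-lambda) / factorial(k) # formula
dpois(k, lambda = lambda)              # built-in; identical values
```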

Probability Density Functions

Uniform

\[ f(x)=\begin{cases} \frac{1}{b - a} & \text{for } x \in [a,b] \\ 0 & \text{otherwise} \end{cases} \]


Normal

\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \]

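These density formulas can likewise be evaluated with dunif() and dnorm(), and a valid PDF integrates to 1 over its support. A sketch, using the smiling-time interval [0, 23] and an illustrative N(150, 5):

```r
# uniform on [0, 23]: constant height 1/(b - a) = 1/23
dunif(10, min = 0, max = 23)                            # ≈ 0.0435

# normal density by formula vs dnorm(), at x = 155
mu <- 150; sigma <- 5; x <- 155
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
dnorm(x, mean = mu, sd = sigma)                         # both ≈ 0.0484

# total area under the uniform PDF is 1
integrate(dunif, 0, 23, min = 0, max = 23)$value
```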

Probabilities in Distributions


P(X = k)

  • Discrete distributions: given by the PMF (Probability Mass Function)
  • Continuous distributions: P(X = k) = 0

P(X < k)

  • Discrete distributions:
    P(X = k-1) + P(X = k-2) + … + P(X = min)
  • Continuous distributions:
    the area under the PDF between min(x) and k

Distributions in R

To get a full list of the distributions available in R you can use the following command:

  • help(Distributions)

For every distribution there are four commands. The commands for each distribution are prepended with a letter to indicate the functionality:

  • “d” returns the height of the probability density (or mass) function
  • “p” returns the cumulative distribution function
  • “q” returns the inverse cumulative distribution function (quantiles)
  • “r” returns randomly generated numbers
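The four prefixes fit together: p* accumulates d*, q* inverts p*, and r* draws from the same distribution. A sketch with the standard normal:

```r
set.seed(1)            # make the r* draws reproducible
dnorm(0)               # density height at 0 (≈ 0.3989)
pnorm(1.96)            # P(X <= 1.96) ≈ 0.975
qnorm(pnorm(1.96))     # q* undoes p*: returns 1.96
rnorm(3)               # three random standard-normal draws
```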

Example - Normal Distributions

\( N(150,5) \) is a normal distribution where \( \mu \)= 150 and \( \sigma \)= 5

dnorm() for density function

x=seq(120,180, 1)
props= dnorm(x, mean = 150, sd=5)   # density at each x value
d150= dnorm(150, mean = 150, sd=5)
plot(x, props, type = 'l')
points(150, d150)



pnorm() for the cumulative distribution function

x=seq(120,180, 1)
(p150= pnorm(150, mean = 150, sd=5))
[1] 0.5
#Area under the curve between min(x) and 150


qnorm() to get the quantiles (the inverse of pnorm)

quartile.p=seq(0.25, .75, 0.25)
(quantiles=qnorm(quartile.p, mean=150, sd = 5))
[1] 146.6276 150.0000 153.3724
props=dnorm(x, 150, 5)
plot(x, props, type = 'l')
points(quantiles, dnorm(quantiles, 150, 5))


rnorm() to generate random values from a distribution

(random.vals=rnorm(5, 150,5))
[1] 161.5864 150.6535 154.6724 148.0559 143.6746

Try these functions with the binomial distribution:
dbinom, pbinom, qbinom, rbinom

y<-1:30
plot(y, dbinom(y, 30, 0.5), type = "h", ylab = 'Binomial Dist', main = '# of heads in 30 coin flips')


Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

vals=0:4
props= dbinom(vals, 12, 1/5)   # 12 questions, P(correct) = 1/5
sum(props)
[1] 0.9274445
#or
pbinom(4, 12, 1/5)
[1] 0.9274445

Assume that the test scores of a college entrance exam fits a normal distribution. Furthermore, the mean test score is 72, and the standard deviation is 15.2. What is the percentage of students scoring 84 or more in the exam?

pnorm(84, mean=72, sd=15.2, lower.tail=FALSE) 
[1] 0.2149176

If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute.

P(X >= 17) = 1 - P(X <= 16)

(p.less.than.sixteen = ppois(16 , lambda = 12))
[1] 0.898709
1- p.less.than.sixteen
[1] 0.101291
#or
(p.greater.than.sixteen = ppois(16 , lambda = 12, lower.tail = F))
[1] 0.101291

Confidence Intervals

  • Interval Estimate of Population Mean ( \( \mu \) )
    – with Known Variance (\( \sigma^2 \))
    – with Unknown Variance
  • Interval Estimate of Population Proportion ( p)

Running example

library(MASS) 
#head(survey)
names(survey)
 [1] "Sex"    "Wr.Hnd" "NW.Hnd" "W.Hnd"  "Fold"   "Pulse"  "Clap"  
 [8] "Exer"   "Smoke"  "Height" "M.I"    "Age"   
dim(survey)
[1] 237  12
height.survey = na.omit(survey$Height)
(x_bar=mean(height.survey))
[1] 172.3809

A sample randomly drawn from the student body has an average height of 172.4. Since this is a sample average rather than the average of the entire school, the population mean \( \mu \) (the true mean) is most likely not exactly 172.4.

The Central Limit Theorem (reminder)

  • The Central Limit Theorem (CLT) is one of the most important theorems in statistics
  • A useful way to think about the CLT is that \( \bar X_n \) is approximately \( N(\mu, \sigma^2 / n) \). That is, the sample mean \( \bar X \) is approximately normal with mean \( \mu \) and sd \( \sigma / \sqrt{n} \)


\[ \mu_{\bar{x}} = \mu, \ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
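A quick simulation illustrates these two identities. This sketch draws repeated samples from an exponential population (chosen arbitrarily; \( \mu = 1 \), \( \sigma = 1 \)) and checks that the sample means cluster around \( \mu \) with spread \( \sigma/\sqrt{n} \):

```r
set.seed(42)                 # illustrative seed
n <- 50                      # sample size
# 10,000 sample means from an exponential(rate = 1) population
xbar <- replicate(10000, mean(rexp(n, rate = 1)))
mean(xbar)                   # close to mu = 1
sd(xbar)                     # close to sigma / sqrt(n) = 1/sqrt(50) ≈ 0.141
```

A histogram of xbar would also look approximately normal, even though the exponential population itself is strongly skewed.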

Central Limit Theorem

According to the Central Limit Theorem, the average height for samples of 237 observations follows a normal distribution \( N(\mu,\ \sigma^2/237) \), i.e. with sd \( \sigma/\sqrt{237} \).


Our sample mean (172.4) is one outcome from this distribution; we just don't know where it actually falls.

Confidence Intervals

  • \( \mu + 2 \sigma /\sqrt{n} \) is pretty far out in the right tail (only a 2.5% chance of a normal value being larger than 2 sds above the mean)
  • Similarly, \( \mu - 2 \sigma /\sqrt{n} \) is pretty far out in the left tail (only a 2.5% chance of a normal value being smaller than 2 sds below the mean)
  • So the probability that \( \bar X \) is bigger than \( \mu + 2 \sigma / \sqrt{n} \) or smaller than \( \mu - 2 \sigma / \sqrt{n} \) is 5%
    • Or equivalently, the probability of being between these limits is 95%
  • The quantity \( \bar X \pm 2 \sigma /\sqrt{n} \) is called a 95% confidence interval for \( \mu \)
  • The 95% refers to the fact that if one were to repeatedly draw samples of size \( n \), about 95% of the intervals obtained would contain \( \mu \)
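The "about 95% of intervals contain \( \mu \)" interpretation can be checked by simulation. A sketch assuming a known \( \sigma \) and illustrative values \( \mu = 150 \), \( \sigma = 5 \), n = 30:

```r
set.seed(7)                                 # illustrative seed
mu <- 150; sigma <- 5; n <- 30
covered <- replicate(10000, {
  x  <- rnorm(n, mu, sigma)                 # one sample
  me <- qnorm(.975) * sigma / sqrt(n)       # margin of error (known sigma)
  mean(x) - me <= mu && mu <= mean(x) + me  # does the interval cover mu?
})
mean(covered)                               # close to 0.95
```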

Area under the curve can be calculated in reference to the distance, in standard deviations, from the mean.

Confidence Intervals (Contd.)


  • For a confidence level C (say 95%), the confidence interval should reach z standard errors (\( \sigma/\sqrt{n} \)) into each tail.
    • z is the 1 - (1-C)/2 quantile of the standard normal.
    • (1-C) is also known as \( \alpha \), so z = qnorm(1 - \( \alpha/2 \))

  • The 97.5th quantile is 1.96 (so rounded to 2 above)
qnorm(.975) #means: qnorm(.975, mean=0, sd=1)
[1] 1.959964
  • For a 90% interval you want (100 - 90) / 2 = 5% in each tail, so you want the 95th percentile (1.645)
qnorm(.95) 
[1] 1.644854

C.I. with known variance (z-test)

Assume the population standard deviation \( (\sigma) \) of the student height in survey is 9.48. Find the margin of error and interval estimate at 95% confidence level.

n = length(height.survey) 
sd = 9.48
(S.E= sd/sqrt(n))               # standard error of the mean 
[1] 0.6557453
(M.E = qnorm(.975)*S.E)         # margin of error 
[1] 1.285237
xbar = mean(height.survey)      # point estimate
xbar + c(-M.E, M.E)
[1] 171.0956 173.6661

The 95% confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, 1.96 (\( z_{\alpha/2} \)) is given by qnorm(.975)

Confidence intervals with unknown variance (or small n)

  • If the population standard deviation \( (\sigma) \) is not known, we use the sample standard deviation (s) instead. However, sample averages standardized with s are not normally distributed, especially when the sample size is small; their distribution is called the 't-distribution'
  • It is indexed by degrees of freedom and gets closer to a standard normal as df gets larger
  • Interval for a confidence level C (or \( 1-\alpha \)) is \( \bar X \pm t_{\alpha/2, n-1} S/\sqrt{n} \) where \( t_{n-1} \) is the relevant quantile


\( t_{\alpha/2,\, n-1} = \text{qt}(1-\alpha/2,\ n-1) \)
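The t quantiles are always wider than the normal ones, and converge to them as df grows — a quick sketch:

```r
qnorm(.975)            # 1.9600, the normal benchmark
qt(.975, df = 5)       # 2.5706: much wider tails for small n
qt(.975, df = 30)      # 2.0423: closer
qt(.975, df = 1000)    # ≈ 1.9623: nearly normal
```

This is why small samples get wider confidence intervals than the z-based formula would give.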

t-test

Without assuming the population standard deviation of the student height in survey, find the margin of error and interval estimate at 95% confidence level.

n = length(height.survey) 
sd = sd(height.survey, na.rm = T)  # sample standard deviation 
(S.E= sd/sqrt(n))                  # standard error of the mean 
[1] 0.6811677
(M.E = qt(.975, n-1 )*S.E)               # margin of error 
[1] 1.342878
xbar = mean(height.survey, na.rm = T)      # point estimate
xbar + c(-M.E, M.E)
[1] 171.0380 173.7237

The 95% confidence level would imply the 97.5th percentile of the t-distribution where the degrees of freedom is n-1 at the upper tail.

Therefore, \( t_{\alpha/2,n-1}=1.97 \) calculated by qt(.975, 209-1)

t-test (shortcut)

Use the built-in t-test function

t.test(height.survey, conf.level = .95)

    One Sample t-test

data:  height.survey
t = 253.0667, df = 208, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 171.0380 173.7237
sample estimates:
mean of x 
 172.3809 

proportion test

What proportion of the population is Female?

  • First sample proportion \( \bar{p}_{female} \)
gender.response = na.omit(survey$Sex)
n=length(gender.response)
pbar=sum(gender.response=='Female') / n
#or (pbar=prop.table(table(gender.response))['Female']) 
  • Then the confidence interval
(S.E = sqrt(pbar*(1-pbar)/n))    # standard error
[1] 0.03254723
(M.E = qnorm(.975)*S.E )         # margin of error
[1] 0.06379139
pbar + c(-M.E,M.E)
[1] 0.4362086 0.5637914

By the Central Limit Theorem, the range for the true proportion at the (\( 1-\alpha \)) confidence level is

\( \bar{p} \pm z_{\alpha/2}*\sqrt{\frac{\bar{p} (1- \bar{p})}{n}} \)

proportion test (shortcut)

x=sum(gender.response=='Female') # number of successes
k= length(gender.response)  # number of trials
prop.test( x, k , conf.level = .95) 

    1-sample proportions test without continuity correction

data:  x out of k, null probability 0.5
X-squared = 0, df = 1, p-value = 1
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4367215 0.5632785
sample estimates:
  p 
0.5 

Hypothesis Test

  • Population Mean
    • One-tailed vs. two-tailed
  • Population Proportion
    • One-tailed vs. two-tailed
  • Type I and Type II errors

Hypothesis Testing

The average height for the sample data is 172.4. About 10 years ago, the average student height was known to be 170.5. Can we say (with some degree of confidence) that the average student height has increased since then?

Research question: is \( (\mu > 170.5) \) ?

Our sample average is slightly higher than the average height 10 years ago. There are two competing hypotheses for the difference:

  • The current population average is still 170.5 (\( \mu=170.5 \)); random selection happened to pick students who are taller than average.
  • Students today are on average taller than 170.5 (\( \mu>170.5 \)); therefore the sample average is greater than 170.5.

Null hypothesis \( H_O \): The claim that defends the status quo; nothing has changed or is different, and the sample difference is due to sampling error.
Alternative hypothesis \( H_A \): The claim that there is a real difference. If it is true, then the answer to your research question is 'yes'.

\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)
To support the alternative hypothesis, we need to show that the observed sample mean (172.4) would be extremely unlikely if \( H_O \) were true.

Has the average height increased ?

\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)

\( \mu=170.5 \) and
\( \bar{x}=172.4 \) (from mean(height.survey))

t.test(height.survey,mu = 170.5, alternative = 'greater', conf.level = .95)

    One Sample t-test

data:  height.survey
t = 2.7612, df = 208, p-value = 0.003137
alternative hypothesis: true mean is greater than 170.5
95 percent confidence interval:
 171.2554      Inf
sample estimates:
mean of x 
 172.3809 
  • t: how many standard errors 172.4 is from the hypothesized 170.5
    \( t=(\bar{x} - \mu)/ (s/\sqrt{n}) \)
  • p-value: the probability of observing what we observed, or something more extreme, if the null hypothesis were true.
    \( P(\bar{x}>172.4 \quad | \quad \mu=170.5) \)
    pt(t, n-1, lower.tail = F)
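The t statistic and one-sided p-value reported by t.test() can be recomputed by hand, a useful sanity check. A sketch using the height.survey vector built earlier from MASS::survey:

```r
library(MASS)                               # for the survey data set
height.survey <- na.omit(survey$Height)

n      <- length(height.survey)
xbar   <- mean(height.survey)
s      <- sd(height.survey)
t.stat <- (xbar - 170.5) / (s / sqrt(n))    # standard errors above 170.5
p.val  <- pt(t.stat, df = n - 1, lower.tail = FALSE)  # upper-tail p-value
c(t = t.stat, p = p.val)                    # t ≈ 2.76, p ≈ 0.0031
```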

Decisions based on the p value

  • The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set if the null hypothesis were true.

  • If the p-value is low we say that it would be very unlikely to observe the data if the null hypothesis were true, and hence reject \( H_O \).

  • If the p-value is high we say that it is likely to observe the data even if the null hypothesis were true, and hence do not reject \( H_O \).

One sided vs Two sided

One-Sided

Has the average height increased ?

\( H_O: \mu=170.5 \)
\( H_A: \mu>170.5 \)

t.test(height.survey,mu = 170.5, alternative = 'greater', conf.level = .95)

Has the average height decreased ?

\( H_O: \mu=170.5 \)
\( H_A: \mu<170.5 \)

t.test(height.survey, mu = 170.5, alternative = 'less', conf.level = .95)

Two-sided

Has the average height changed ?

\( H_O: \mu=170.5 \)
\( H_A: \mu\neq 170.5 \)

t.test(height.survey, mu = 170.5, alternative = 'two.sided', conf.level = .95)

Hypothesis Tests with Known Variance (\(\sigma\))

  • If the population variance is known and the sample size is large enough, then we assume that sample means are normally distributed, so instead of a t-test we perform a z-test

  • The z-test is not in base R, but packages such as {BSDA} provide a z.test() function

require(BSDA)
z.test(height.survey, mu = 170.5,  sigma.x=9.84, alternative = 'two.sided', conf.level = .95)

    One-sample z-Test

data:  height.survey
z = 2.7633, p-value = 0.005721
alternative hypothesis: true mean is not equal to 170.5
95 percent confidence interval:
 171.0468 173.7149
sample estimates:
mean of x 
 172.3809 

Hypothesis test for Proportions

In the past, only 0.4 of the students were female. Has the female proportion increased?

\( H_O: p=0.4 \)
\( H_A: p > 0.4 \)

x=sum(gender.response=='Female') # number of successes
n= length(gender.response)  # number of trials
prop.test(x, n, p=0.4, alternative = 'greater',conf.level = .95)

    1-sample proportions test with continuity correction

data:  x out of n, null probability 0.4
X-squared = 9.4211, df = 1, p-value = 0.001073
alternative hypothesis: true p is greater than 0.4
95 percent confidence interval:
 0.4446747 1.0000000
sample estimates:
  p 
0.5 

One-sided vs Two-sided for proportions

One-Sided

Has the female proportion increased ?

\( H_O: p=0.4 \)
\( H_A: p > 0.4 \)

prop.test(x, n, p=0.4, alternative = 'greater',conf.level = .95)

Has the female proportion decreased ?

\( H_O: p=0.4 \)
\( H_A: p < 0.4 \)

prop.test(x, n, p=0.4, alternative = 'less',conf.level = .95)

Two-sided

Has the female proportion changed ?

\( H_O: p=0.4 \)
\( H_A: p \neq 0.4 \)

prop.test(x, n, p=0.4, alternative = 'two.sided',conf.level = .95)

Type I and Type II errors

  • There are two competing hypotheses: the null and the alternative.
  • In a hypothesis test, we make a decision about which might be true, but our choice might be incorrect.

  • A Type I Error is rejecting the null hypothesis when \( H_0 \) is true.

  • A Type II Error is failing to reject the null hypothesis when \( H_A \) is true.

Type I Error Rate

  • As a general rule we reject \( H_0 \) when the p-value is less than 0.05, i.e. we use a significance level of 0.05 (\( \alpha = 0.05 \)).
  • This means that, for those cases where \( H_0 \) is actually true, we do not want to incorrectly reject it more than 5% of those times.
  • In other words, when using a 5% significance level there is about 5% chance of making a Type I error if the null hypothesis is true.
    P(Type I error | \( H_0 \) true) = \( \alpha \)
  • This is why we prefer small values of \( \alpha \): increasing \( \alpha \) increases the Type I error rate.
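The equality P(Type I error | \( H_0 \) true) = \( \alpha \) can be seen directly by simulation: generate many samples under a true null and count how often the t-test rejects. A sketch with illustrative values (\( \mu_0 = 170.5 \), n = 25):

```r
set.seed(1)                                  # illustrative seed
alpha  <- 0.05
reject <- replicate(10000, {
  x <- rnorm(25, mean = 170.5, sd = 10)      # H0 is true for these data
  t.test(x, mu = 170.5)$p.value < alpha      # do we (wrongly) reject?
})
mean(reject)                                 # close to alpha = 0.05
```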