Hypothesis Testing
In scientific research, we often want to prove that a certain statistical phenomenon is true. For example, we might want to prove (or rather show that it is likely) that a coin is unfair (that it lands on heads and tails with unequal probabilities). We do this by flipping the coin $N$ times and recording the number of heads $N_h$ and the number of tails $N_t$. For example, we may get $N_h = 40$ and $N_t = 60$, which sum to $N = 100$. How do we proceed to prove that the coin is unfair? We can do this by testing a hypothesis. A hypothesis is a statistical model of the system we are studying. In our examples, we might hypothesize that the coin is fair (that is, that it lands on heads and tails with equal probabilities). The idea is that we can calculate how likely it is that we would have gotten the observed data given that the hypothesis is true, using the hypothesised model. If it turns out to be very unlikely for us to observe the data we have observed, then we can reject the hypothesis.
Now let us make the hypothesis that the coin is fair. To calculate the observed data in the example of a coin flip, we can use the binomial distribution. The probability of observing $N_h$ heads and $N_t$ tails in $N$$ coin flips is given by the binomial distribution:
$$ P(N_h, N_t) = \frac{N!}{N_h! N_t!} p^{N_h} (1-p)^{N_t} $$where $p$ is the probability of observing heads in a single coin flip. Since we are hypothesising that the coin is fair, we have $p = 0.5$.
Plugging in the observed data, we get:
$$ P(40, 60) = \frac{100!}{40! 60!} 0.5^{40} 0.5^{60} \approx 0.0108 $$This is a fairly small probability. That means if we were to repeat the experiment many times, we would expect to observe the data we have observed only about 1% of the time. In hypothesis testing, we are not interested in the exact probability of the observed data, but rather in the probability of observing data that are as likely or even more unlikely than the observed data. Why? you might ask. This is a very good question that I do not have a good answer to at the moment. I will try to find out and update this post.
The probability of observing data that are as likely or even more unlikely than the observed data is defined as the p-value of the observed data under the hypothesis. The p-value of the example is:
$$ \begin{align*} \text{p-value} &= \sum_{N_h \leq 40} P(N_h, N_t = N - N_h) + \sum_{60 \leq N_h} P(N_h, N_t = N - N_h)k \\ &\approx 0.02844 + 0.02844 = 0.05688 \end{align*} $$This p-value of 0.05688 is the probability of observing data that are as likely or even more unlikely than the observed data ($N_h = 40$, $N_t = 60$) under the hypothesis that the coin is fair.
Usually, an observation with a p-value under a hypothesis of less than 0.05 is considered to be statistically significant, which means that we can reject the hypothesis.
Composite hypothesis
In the above example, we had only one hypothesis, namely that the coin is fair. What if we want to show that the coin has a greater probability of landing on heads than tails? We can do this by testing a composite hypothesis which can be collectively written as a set of hypotheses:
$$ \{H_p \mid p \leq 0.5\}. $$Now we calculate the p-value $\text{p-value}(H_p)$ of the observed data under each of these hypotheses $H_p$.
Because the goal of doing all these is to rule out the possibility that the coin lands on tails with a greater probability, we need every hypothesis $H_p$ to have a p-value less than the predetermined threshold, e.g. 0.5. Thus we can define the p-value of the composite hypothesis as the maximum of the p-values of the individual hypotheses:
$$ \text{p-value} = \max \text{p-value}(H_p) = \max \{H_p \mid p \leq 0.5\}. $$For some spaces of hypotheses, there might not be a maximum p-value. In these cases, we can define the p-value as the supremum of the p-values of the individual hypotheses:
$$ \text{p-value} = \sup \text{p-value}(H_p) = \sup \{H_p \mid p \leq 0.5\}. $$Beware that the maximum or supremum might not be easy to find.
Bootstrapping
In the above example, we have a complete analytical formula for the probability distribution of the observations. Well, what if we don’t? What if we have a complicated system that we cannot model analytically? Or worse, what if we have a gigantic composite null hypothesis which makes the maximum or supremum of the p-values of the individual hypotheses difficult to find?
One way would be to apply Monte Carlo to the problem. We can sample a large number of hypotheses from the space of composite hypotheses and calculate the p-value of each of the sampled hypotheses. However, sometimes it is not even easy to calculate the p-value of a single hypothesis.
If this is the case, we can forget about p-value and turn to confidence intervals. Usually, we calculate a number summarising the data, such as the mean or the standard deviation. Such a number is called sample statistic. This number tells you something interesting about the system you are studying. For example, the sample mean of coin flips reveals how biased the coin is. Suppose the sample mean is $0.4$. Does it mean that the coin is biased? Well we can use the p-value to answer this question. But there is another way to answer this question. Imagine we have the luxury of repeating the coin flip experiment many times. That way we can calculate a sample mean for each of the experiments and plot a histogram of the sample means. From the histogram, we can see how spread out the sample means are.
However, we do not usually have the luxury of repeating the experiment many times. Instead, we can sample from the data we already have to form a bootstrap sample. The bootstrap sample is a sample of the data that is the same size as the original data. Essentially, we are selecting data points from the original data with replacements. This way we can create out of thin air new data sets, with which we can calculate the sample statistic and plot a histogram of the sample statistic.