Is Everything... Normal?
Understanding How the Central Limit Theorem Simplifies Stats
Visit my GitHub page (link) to learn more about the Python code that made this article possible.
Do you know what really grinds my gears, riles me up, bothers me, and frustrates me? It may sound dramatic, but my biggest pet peeve is when people are 100% certain about something, only to be proven wrong. Here are two statements that are, in my very firm opinion, entirely different:
"I am 100% sure the tennis courts close at 10PM, Mat"
"I am pretty sure the tennis courts close at 10PM, Mat"
Don't get me wrong, people are free to use whichever statement they like. However, if the first statement is used and the tennis courts actually close at 11PM, that person will lose all credibility, forever.
To maintain credibility, statisticians use various tools to make statements that are accurate and reliable. For example, it is not uncommon to hear things like the following:
A researcher is studying the effect of a new drug on blood pressure. They conduct a study on a sample of patients and find that the new drug reduces blood pressure by between 6 mmHg and 10 mmHg, 95% of the time.
A polling organization surveys a sample of voters to estimate the support for a political candidate. They find there is a 90% chance that the candidate will receive between 49% and 55% of the votes.
In this article, we will explore a key concept that enables statisticians to make precise and reliable statements like those mentioned above. Widely regarded as the backbone of inferential statistics, this concept is the Central Limit Theorem (CLT).
Background
So, what is the Central Limit Theorem, and why is it so important that it deserves to be the 3rd topic of this blog series? Here are some examples of what we would have to do if the CLT did not exist:
If we wanted to know which politician was ahead in the election, we would have to survey every single Canadian, instead of asking a smaller sample of people.
If we wanted to determine whether a newly developed medication is safe and effective, we would have to test it on every single human being, instead of running a clinical trial and testing it on a smaller subset of people.
These situations would cost an absurd amount of money and take an insane amount of time to conduct. So, how does the CLT allow us to bypass this? First, let's look at the definition (paraphrasing Investopedia): given a sufficiently large sample size, the distribution of sample means will approximate a normal distribution, regardless of the shape of the population's own distribution.
In other words, the CLT gives us the ability to draw very firm conclusions about a population by analyzing only a much smaller sample of that population!
To illustrate this point, in the sections below, we will create a population and try to come to conclusions about that population by only analyzing a sample of it.
Practical Example
Let's imagine a very simple world, where there are 1,000,000 different people. Each person is assigned a number between 0 and 10. So one person might have the number 2.1423, and another might have 6.3245. In the form of a table, we would have something like this:
| Person    | Value  |
|-----------|--------|
| 1         | 2.1423 |
| 2         | 6.3245 |
| 3         | 3.2345 |
| ...       | ...    |
| 999,999   | 4.4152 |
| 1,000,000 | 0.9412 |
If we were to plot the histogram representing this situation, we would obtain the following:
This histogram shows that there are about 100,000 people with values between 0 and 1, another 100,000 people with values between 1 and 2, and so on for each range. Since we currently have access to all the data that makes up the population, we can easily compute the true mean and standard deviation:
| Type of Distribution | Uniform |
|----------------------|---------|
| Mean                 | 4.999   |
| Standard Deviation   | 2.886   |
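If you'd like to recreate this setup yourself, here is a minimal NumPy sketch (a hypothetical stand-in, not the exact code behind this article) that builds such a uniform population and reproduces the numbers in the table above:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen only for reproducibility

# 1,000,000 people, each assigned a value uniformly between 0 and 10
population = rng.uniform(low=0, high=10, size=1_000_000)

print(f"True mean: {population.mean():.3f}")               # roughly 5.0
print(f"True standard deviation: {population.std():.3f}")  # roughly 10 / sqrt(12) ≈ 2.89
```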
However, let's say we are interested in finding the mean, but with one limitation: we do not have access to the entire population. Therefore, we need to find a way to make a good guess at what the true average is, by only looking at a sample of the population.
One way we could do that is by following these simple steps (a short code sketch of these steps follows the list):
1. Pick 100 people at random
2. Record the average of the group
3. Repeat steps 1 and 2 a total of 1000 times
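In code, these three steps might look something like the following sketch (reusing the hypothetical population from the previous block, and sampling with replacement for simplicity):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = rng.uniform(0, 10, size=1_000_000)  # the hypothetical population from before

sample_size = 100   # step 1: how many people we pick each time
n_repeats = 1000    # step 3: how many times we repeat steps 1 and 2

# Pick 100 people at random, record their average, and repeat 1000 times
sample_means = np.array([
    rng.choice(population, size=sample_size).mean()
    for _ in range(n_repeats)
])

print(f"Mean of the 1000 averages: {sample_means.mean():.3f}")  # close to 5
print(f"Std of the 1000 averages:  {sample_means.std():.3f}")   # close to 2.886 / sqrt(100)
# Plotting sample_means as a histogram gives the bell-shaped curve discussed below.
```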
After performing steps 1 to 3, we will be left with 1000 averages. We can plot each of these averages in another histogram, which looks like the following:
Let's take some time to understand what's going on here.
- The distribution looks rather symmetric and resembles a bell curve.
- The average is close to 5.0.
- We never see an average above 6 or below 4.2.
These 3 points all represent the Central Limit Theorem in action.
First Point: The CLT says that the distribution of the sample means will resemble a normal distribution (bell curve). This is great news since statisticians are very familiar with this type of distribution, and can therefore easily extract information from it.
Second Point: The CLT says that the distribution of the sample means will be centred around the true population mean. Since we already know that the true population mean is 4.999, we can see that the sampling distribution's mean is very close to it.
Third Point: The CLT states that as the sample size increases, the variance of the distribution of the sample means becomes much smaller. To be more specific, the standard deviation of the distribution of the sample means will be equal to the following (where s is the standard deviation of the distribution of sample means, often called the standard error, σ is the true population standard deviation, and n is the sample size):
$$s=\frac{\sigma}{\sqrt{n}}$$
This essentially says that as your sample size (n) gets larger, the standard deviation of the sample means (s) gets smaller, and so the distribution gets narrower and narrower around the true population mean. To illustrate this point, let's compare the same situation above, except with n = 3, n = 10, and n = 100.
In the above graph, we can see that the three distributions are all centred around 5 (the true population mean). However, as the sample size (n) goes from 3 to 10 to 100, the distributions get narrower and narrower, concentrating more and more tightly around the true population mean.
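To sanity-check the formula numerically, the sketch below (same assumed setup as before) compares the empirical standard deviation of the sample means against σ/√n for n = 3, 10, and 100:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = rng.uniform(0, 10, size=1_000_000)
sigma = population.std()  # true population standard deviation, roughly 2.886

for n in (3, 10, 100):
    # Standard deviation of 1000 sample means, each computed from n random people
    means = np.array([rng.choice(population, size=n).mean() for _ in range(1000)])
    print(f"n = {n:>3}: empirical std of means = {means.std():.3f}, "
          f"sigma / sqrt(n) = {sigma / np.sqrt(n):.3f}")
```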
Learnings
This is fantastic because we only had to analyze a much smaller portion of the true population data to figure out that we're pretty confident that the true population mean has to be somewhere near 5. But, how sure can we be? Can we be 100% sure that the average is 5? Or 95% sure? How SURE are we? Let's introduce one more concept: confidence intervals.
A confidence interval is a range of values that, at a chosen confidence level (say, 95%), is expected to contain an unknown population parameter. In this case, the unknown population parameter would be the mean. Let's create one more plot to help illustrate this point. Before creating the plot, we will create 100 histograms like the one above, with each histogram having a different value for n (from 1 to 100). We will then plot the estimated mean from each histogram, as well as our confidence intervals.
When the sample size is low, we can see that the estimated mean is still rather close to 5.0 (as shown by the solid green line); however, our confidence interval is quite wide, spanning 3.5 to 6.5 (as shown by the blue shaded area). This essentially means that we are 95% sure the true population mean is somewhere between 3.5 and 6.5. Unfortunately, this is quite a large range and is not very meaningful. However, if we look at the confidence interval when n = 100, we are 95% sure that the true population mean must be somewhere between 4.92 and 5.08, which is a lot more precise than before! In practice, this means that the larger the sample size, the more precise our findings will be.
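For readers who want to compute such an interval themselves, here is one possible sketch using SciPy's t-distribution; the exact method used to build the plot above may differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
population = rng.uniform(0, 10, size=1_000_000)

for n in (3, 10, 100):
    sample = rng.choice(population, size=n)   # one random sample of size n
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)     # standard error of the mean
    # 95% confidence interval based on the t-distribution
    low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
    print(f"n = {n:>3}: mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Notice how the interval tightens dramatically as n grows, mirroring the shaded band in the plot.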
Another interesting thing to note is that as the sample size increases, the standard deviation initially drops significantly but then begins to plateau once the sample size becomes sufficiently large. This behaviour is explained by the formula shared above, where the standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size. This relationship results in a curve that initially declines sharply and then levels off as the sample size continues to grow. From this, we can see that while increasing the sample size does improve the precision of our estimates, there are diminishing returns after a certain point. In other words, beyond a certain sample size, the benefit of adding more data points becomes minimal.
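A quick back-of-the-envelope calculation makes the diminishing returns concrete: each tenfold increase in sample size only shrinks σ/√n by a factor of √10 ≈ 3.2.

```python
import numpy as np

sigma = 2.886  # true population standard deviation from earlier
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: sigma / sqrt(n) = {sigma / np.sqrt(n):.3f}")
# Output: 0.913, 0.289, 0.091, 0.029 -- each 10x increase in n only
# shrinks the standard deviation of the sample means by a factor of ~3.2
```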
Conclusion
In summary, the Central Limit Theorem is a statistical powerhouse that lets us make reliable conclusions about entire populations by examining just a small sample. This theorem ensures that, with a large enough sample size, our sample means will dance around the true population mean in a familiar bell-shaped curve.
So, next time someone tells you they are 100% certain about something, you can gently remind them of the beauty of the CLT and the importance of confidence intervals. After all, in statistics and in life, it's not just about being sure, but about knowing how sure you are.
By embracing the principles of the Central Limit Theorem, we can save time, money, and a whole lot of effort while maintaining credibility and making well-informed decisions. Now, armed with this knowledge, you can approach data analysis with confidence and precision, knowing that the CLT has got your back. And remember, always be a little skeptical of anyone who is 100% certain—they probably haven't met the CLT yet!
Visit my GitHub page (link) to learn more about the Python code that made this article possible.