<![CDATA[Mathieu's Blog]]>https://mathieutorchia.comRSS for NodeSat, 02 Nov 2024 04:57:34 GMT60<![CDATA[The Cost of Simplification]]>https://mathieutorchia.com/the-cost-of-simplificationhttps://mathieutorchia.com/the-cost-of-simplificationMon, 21 Oct 2024 23:30:26 GMT<![CDATA[<p>Numbers are <strong>everywhere</strong>. Whether we hear them from U.S. presidential candidates, find them in articles during our searches, or see them in financial reports, they form a significant part of the information we consume. We live in an era where "Big Data" is everywhere, and the only way to make use of it is by simplifying it with data analytics. It's not just in today's world; we have always tried to express complex ideas in simpler ways. I'm reminded of the time in high school when we were told</p><blockquote><p>You cannot take the square root of a negative number!</p></blockquote><p>Only to find out years later that this was a lie, and that the square root of -1 is equal to the imaginary number <em>i</em>. Sometimes, lies like that are needed. If they had tried to teach us everything about math during those years, without cutting corners, we would have been completely overwhelmed. However, simplifying things has its downsides. In this article, we'll explore a straightforward example that vividly illustrates how oversimplifying data insights can lead to misleading conclusions.</p><hr /><p>As you know (or, in case you didn't know), I started this blog for many reasons, one being to improve my coding and visualization skills. With that being said, let's take a look at a seemingly busy (but colourful) scatter plot.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539568228/f9c5ce90-f2f2-4ed7-8faa-ba1cbe043c75.png" alt class="image--center mx-auto" /></p><p>At first glance, this may seem like a random collection of points in a graph. 
However, after paying attention to each colour, it's clear that they all follow some sort of pattern:</p><ul><li><p>The blue dots are following an upwards trend</p></li><li><p>The red squares seem to be forming an arc</p></li><li><p>The green triangles mostly follow a straight upwards line</p></li><li><p>The yellow diamonds almost all have the same x value of 8.</p></li></ul><p>Nonetheless, they are all clearly very different from one another.</p><p>In practice, few useful data sets have only 44 data points. Data sets can have thousands, millions, or even billions of rows, which often makes a plot of the data impossible to produce or extremely difficult to read. Because of this, data analysts try to simplify the data with measures such as the average, the standard deviation, the correlation coefficient, etc. But, <strong>is there a cost to these simplifications</strong>? Let's dive a bit deeper into the blue dots to help answer that question.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539580440/ed2f25a8-c4e3-417d-87a8-66550f7d58a1.png" alt class="image--center mx-auto" /></p><p>In the graph above, we are plotting only the blue points. Additionally, we added a line that best fits the data (a linear OLS regression), and some common descriptive statistics at the bottom right-hand side of the graph. From here, without even plotting the dots, we can get access to a lot of interesting information:</p><ul><li><p>The average y value is 7.50</p></li><li><p>The standard deviation of the y values is 1.94</p></li><li><p>x and y have a correlation coefficient of 0.82</p></li><li><p>The line of best fit has the following equation:</p></li></ul><p>$$y=0.5x+3.0$$</p><p>Great! We've gained a lot of insights from this data by finding key statistics like the average and the correlation coefficient. 
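</p><p>For readers who want to check these figures, here is a minimal sketch in plain Python. It assumes the blue dots are the first of the four data sets as published by Anscombe; the variable names are illustrative, and this is not necessarily how the plots in this article were produced. The standard deviation is the population version (dividing by the number of points), which is the one that gives 1.94.</p>

```python
import math

# First data set of Anscombe's quartet (assumed here to be the blue dots).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-deviations from the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

std_y = math.sqrt(syy / n)           # population standard deviation of y
corr = sxy / math.sqrt(sxx * syy)    # Pearson correlation coefficient
slope = sxy / sxx                    # slope of the OLS line
intercept = mean_y - slope * mean_x  # intercept of the OLS line

print(f"mean of y          = {mean_y:.2f}")
print(f"std. dev. of y     = {std_y:.2f}")
print(f"correlation (x, y) = {corr:.2f}")
print(f"best-fit line      : y = {slope:.1f}x + {intercept:.1f}")
```

<p>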
We can now use these in our discussions or when forming opinions on certain topics, without needing to spend more time exploring the data or checking for anything potentially misleading.</p><p>Right?</p><p>Right?</p><p>Unfortunately, no.</p><p>While simplifying data with key statistics like those above is very useful, it's also important to be cautious. These simplifications can sometimes hide important details or lead to incorrect conclusions. To illustrate this point, let's plot the same graph as above, but this time include all the different data points from the first graph.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539593539/859355da-5add-4bfe-9d0e-08cf11cfbbfd.png" alt class="image--center mx-auto" /></p><p>As shown above, if we draw the line of best fit for each data set and compute the mean, standard deviation, and correlation coefficient, <strong>we get nearly the exact same result</strong>. This is called Anscombe's quartet, which was constructed in 1973 by Francis Anscombe:</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Anscombe's quartet</strong> comprises four datasets that have nearly identical simple <strong>descriptive statistics</strong>, yet have very different <strong>distributions</strong> and appear very different when graphed.</div></div><p>I was amazed by this illustration, and had to post an article about it. It's one of those things that can be understood just by looking at the picture above, without needing many words to explain it.</p><p>The key takeaway here is to be careful whenever we attempt to simplify <strong>anything</strong>. There are benefits to simplification, but to enjoy these advantages, it's important to examine the raw data first to make sure nothing unusual is overlooked. This is more philosophical than mathematical: it's always better to get information from the source than from a (potentially broken) telephone. 
If the source is unavailable, then make sure the telephone is a reliable one.</p><h1 id="heading-conclusion">Conclusion</h1><p>To wrap up, this example resonates beyond mathematics and data analysis; it extends to how we approach real-world issues. I was reminded of this during a talk at the ALLIN AI conference in Montreal last September, where a representative from the "Conseil du statut de la femme" discussed AI's potential risks concerning gender equality. When I asked about her thoughts on mandating companies to hire equal numbers of men and women, she highlighted a critical flaw: while a 50/50 gender ratio may appear balanced on the surface, it may obscure deeper issues, such as women being concentrated in lower-growth roles or positions with limited decision-making power.</p><p>The lesson here is clear: <strong>simplified metrics can paint an incomplete picture</strong>. Whether analyzing data or tackling social challenges, we must look beyond surface-level statistics to ensure we're not missing important details.</p><blockquote><p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/anscombe-quartet"><strong><em>link</em></strong></a><strong><em>) to learn more about the Python code that made this article possible.</em></strong></p></blockquote>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1729523243477/fe992c71-d89c-4265-bc2e-33f9e8549c5a.jpeg<![CDATA[Is Everything... Normal?]]>https://mathieutorchia.com/is-everything-normalhttps://mathieutorchia.com/is-everything-normalMon, 05 Aug 2024 03:55:47 GMT<![CDATA[<blockquote><p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project"><strong><em>)</em></strong></a> <strong><em>to learn more about the Python code that made this article possible.</em></strong></p></blockquote><p>Do you know what really grinds my gears, riles me up, bothers me, and frustrates me? It may sound dramatic, but my biggest pet peeve is when people are 100% certain about something, only to be proven wrong. Here are two statements that are, in my very firm opinion, entirely different:</p><ul><li><p>"I am 100% sure the tennis courts close at 10PM, Mat"</p></li><li><p>"I am <em>pretty sure</em> the tennis courts close at 10PM, Mat"</p></li></ul><p>Don't get me wrong, people are free to use whichever statement they like. However, if the first statement is used and the tennis courts actually close at <em>11PM</em>, that person will lose all credibility, <strong>forever</strong>.</p><p>To maintain credibility, statisticians use various tools to make statements that are accurate and reliable. 
For example, it is not uncommon to hear things like the following:</p><ul><li><p>A researcher is studying the effect of a new drug on blood pressure. They conduct a study on a sample of patients and find that the new drug reduces blood pressure by between 6 mmHg and 10 mmHg, 95% of the time.</p></li><li><p>A polling organization surveys a sample of voters to estimate the support for a political candidate. They find there is a 90% chance that the candidate will receive between 49% and 55% of the votes.</p></li></ul><p>In this article, we will explore a key concept that enables statisticians to make precise and reliable statements like those mentioned above. Widely regarded as the backbone of inferential statistics, this concept is the <strong>Central Limit Theorem</strong> (CLT).</p><h3 id="heading-background">Background</h3><p>So, <em>what is the Central Limit Theorem, and why is it so important that it deserves to be the 3rd topic of this blog series?</em> Here are some examples of what we would have to do if the CLT <strong>did not exist</strong>:</p><ul><li><p>If we wanted to know which politician was ahead of the election, we would have to survey every single Canadian, instead of asking a smaller sample of people.</p></li><li><p>If we wanted to categorize a newly developed medication as <em>safe</em> or <em>effective</em>, we would have to test it on every single human being, instead of running a clinical trial and testing it on a smaller subset of people.</p></li></ul><p>These situations would cost an absurd amount of money, and take an insane amount of time to conduct. So, how does the CLT allow us to bypass this? 
First, let's look at the definition from <a target="_blank" href="https://www.investopedia.com/terms/c/central_limit_theorem.asp">Investopedia</a>:</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">The <strong>Central Limit Theorem (CLT)</strong> is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the <strong>mean of all sampled variables</strong> from the same population will be <strong>approximately equal to the mean of the whole population</strong>.</div></div><p>In other words, <strong>the CLT gives us the ability to draw very firm conclusions about a population, by only doing analyses on a much smaller sample of that population!</strong></p><p>To illustrate this point, in the sections below, we will create a population and try to come to conclusions about that population by only analyzing a sample of it.</p><h3 id="heading-practical-example">Practical Example</h3><p>Let's imagine a very simple world, where there are 1,000,000 different people. Each person is assigned a number between 0 and 10. So one person might have the number 2.1423, and another might have 6.3245. 
In the form of a table, we would have something like this:</p><div class="hn-table"><table><thead><tr><td>Person</td><td>Value</td></tr></thead><tbody><tr><td>1</td><td>2.1423</td></tr><tr><td>2</td><td>6.3245</td></tr><tr><td>3</td><td>3.2345</td></tr><tr><td>...</td><td>...</td></tr><tr><td>999,999</td><td>4.4152</td></tr><tr><td>1,000,000</td><td>0.9412</td></tr></tbody></table></div><p>If we were to plot the histogram representing this situation, we would obtain the following:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694728286/7f74d939-d059-4a52-942d-948060d9d087.png" alt class="image--center mx-auto" /></p><p>This histogram shows that there are about 100,000 people with values between 0 and 1, another 100,000 people with values between 1 and 2, and so on for each range. Since we currently have access to all the data that makes up the population, we can easily compute the true mean and standard deviation:</p><div class="hn-table"><table><thead><tr><td><strong>Type of Distribution</strong></td><td>Uniform</td></tr></thead><tbody><tr><td><strong>Mean</strong></td><td>4.999</td></tr><tr><td><strong>Standard Deviation</strong></td><td>2.886</td></tr></tbody></table></div><p>However, let's say we are interested in finding the mean, but with one limitation: <strong>we do not have access to the entire population</strong>. Therefore, we need to find a way to make a good guess at what the true average is, by only looking at a <em>sample</em> of the population.</p><p>One way we could do that is by following these simple steps:</p><ol><li><p>Pick 100 people at random</p></li><li><p>Record the average of the group</p></li><li><p>Repeat steps 1-2 1000 times</p></li></ol><p>After performing steps 1 to 3, we will be left with 1000 averages. 
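</p><p>Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under some assumptions: the population is drawn with <code>random.uniform</code>, the seed is fixed so the run is repeatable, and the variable names are made up for this example. It is not necessarily the code behind the plots in this article.</p>

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers every run

POPULATION_SIZE = 1_000_000
SAMPLE_SIZE = 100       # step 1: how many people we pick each time
NUM_REPETITIONS = 1000  # step 3: how many times we repeat steps 1-2

# The imaginary world: one value between 0 and 10 per person.
population = [random.uniform(0, 10) for _ in range(POPULATION_SIZE)]

sample_means = []
for _ in range(NUM_REPETITIONS):
    group = random.sample(population, SAMPLE_SIZE)  # step 1: pick at random
    sample_means.append(sum(group) / SAMPLE_SIZE)   # step 2: record the average

mean_of_means = sum(sample_means) / len(sample_means)
spread_of_means = statistics.pstdev(sample_means)

print(f"number of averages   : {len(sample_means)}")
print(f"average of averages  : {mean_of_means:.3f}")
print(f"std. dev. of averages: {spread_of_means:.3f}")
```

<p>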
We can plot each of these averages in another histogram, which looks like the following:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694756042/cbc625fd-1336-491c-b186-850060754937.png" alt class="image--center mx-auto" /></p><p>Let's take some time to understand what's going on here.</p><ol><li><p>The distribution looks rather symmetric and resembles a bell curve</p></li><li><p>The average is close to 5.0.</p></li><li><p>We never see an average above 6, or below 4.2.</p></li></ol><p>These 3 points all represent the Central Limit Theorem in action.</p><p><strong>First Point:</strong> The CLT says that the distribution of the sample means will resemble a normal distribution (bell curve). This is great news since statisticians are very familiar with this type of distribution, and can therefore easily extract information from it.</p><p><strong>Second Point:</strong> The CLT says that the distribution of the sample means will create a normal distribution around the true population mean. Since we already know that the true population mean is 4.999, we can see that the sampling distribution's mean is very close to it.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">In fact, the CLT states that (given a few requirements) the distribution of the sample means will approach a normal distribution as the sample size grows, <strong>regardless of the initial distribution</strong>! In this example, the initial distribution was uniform, but it could be anything (as long as it has finite variance). This is another incredible property of the CLT.</div></div><p><strong>Third Point:</strong> The CLT states that as the sample size increases, the variance of the distribution of the sample mean becomes much smaller. 
To be more specific, the standard deviation of the distribution of the sample mean will be equal to the following (where <em>s</em> is the standard deviation of the <strong>sample means</strong>, also known as the standard error, <em>σ</em> is the true <strong>population</strong> standard deviation, and <em>n</em> is the <strong>sample size</strong>):</p><p>$$s=\frac{\sigma}{\sqrt{n}}$$</p><p>This essentially says that as your sample size (<em>n</em>) gets larger, the standard deviation of the sample means (<em>s</em>) gets smaller, and so the distribution gets narrower and narrower around the true population mean. To illustrate this point, let's compare the same situation above, except with <em>n=3</em>, <em>n=10</em>, and <em>n=100.</em></p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694764951/aa4ea4d0-c239-4850-95dc-60ccadc30d1d.png" alt class="image--center mx-auto" /></p><p>In the above graph, we can see that the three distributions are all centred around 5 (the true population mean). However, we can see that when the number of people in each sample (<em>n</em>) goes from 3 to 10 to 100, the distributions get narrower and narrower, and get a lot more concentrated around the true population mean.</p><h3 id="heading-learnings">Learnings</h3><p>This is fantastic because we only had to analyze a much smaller portion of the true population data to become pretty confident that the true population mean has to be somewhere near 5. But, how sure can we be? Can we be 100% sure that the average is 5? Or 95% sure? How SURE are we? Let's introduce one more concept: confidence intervals.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">A confidence interval is <strong>a range of values, bounded above and below the statistic's mean</strong>, that likely would contain an <em>unknown population parameter</em>. 
(Source: <a target="_blank" href="https://www.investopedia.com/terms/c/confidenceinterval.asp">Investopedia</a>)</div></div><p>In this case, the <em>unknown population parameter</em> would be the mean. Let's create one more plot to help illustrate this point. Before creating the plot, we will create 100 histograms like the one above, with each histogram having a different value for <em>n</em> (from 1 to 100). We will then plot the estimated mean from each histogram as well as our confidence intervals.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694776120/442990c1-3e45-4f8c-a5f2-dcef34be57e3.png" alt class="image--center mx-auto" /></p><p>When the sample size is low, we can see that the estimated mean is still rather close to 5.0 (as shown by the solid green line). However, our confidence interval is pretty <em>thick</em>, from 3.5 to 6.5 (as shown by the blue shaded area). This essentially means that we are 95% sure the true population mean is somewhere between 3.5 and 6.5. Unfortunately, <strong>this is quite a large gap</strong> and is <strong>not very meaningful</strong>. However, if we look at the confidence interval when <em>n</em> = 100, we are 95% sure that the true population mean must be somewhere between 4.92 and 5.08, which is <strong>a lot more precise than before</strong>! In practice, this means that the larger the sample size, the more precise our findings will be.</p><p>Another interesting thing to note is that <strong>as the sample size increases, the standard deviation initially drops significantly but then begins to plateau once the sample size becomes sufficiently large</strong>. This behaviour is explained by the formula shared above, where the standard deviation of the sample means is equal to the population standard deviation <strong>divided by the square root of the sample size</strong>. This relationship results in a curve that initially declines sharply and then levels off as the sample size continues to grow. 
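</p><p>To put rough numbers on that curve, here is a small sketch that evaluates the formula for a few sample sizes. It assumes the theoretical standard deviation of a uniform distribution between 0 and 10 (the range divided by the square root of 12, which is close to the 2.886 measured earlier); the function name and the chosen sample sizes are only for illustration.</p>

```python
import math

# Theoretical standard deviation of a uniform distribution on [0, 10].
sigma = 10 / math.sqrt(12)

def standard_error(sample_size):
    """Standard deviation of the sample means: sigma / sqrt(n)."""
    return sigma / math.sqrt(sample_size)

sample_sizes = [1, 3, 10, 100, 1_000, 10_000]
errors = {n: standard_error(n) for n in sample_sizes}

for n, error in errors.items():
    print(f"n = {n:>6}: standard deviation of the sample means = {error:.4f}")
```

<p>Going from 1 to 100 people per sample removes most of the spread, while going from 100 to 10,000 only shaves off a little more.</p><p>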
From this, we can take that while increasing the sample size does improve the precision of our estimates, there are <strong>diminishing returns after a certain point</strong>. This means that beyond a certain sample size, the benefit of adding more data points becomes minimal.</p><h3 id="heading-conclusion">Conclusion</h3><p>In summary, the Central Limit Theorem is a statistical powerhouse that lets us make reliable conclusions about entire populations by examining just a small sample. This theorem ensures that, with a large enough sample size, our sample means will dance around the true population mean in a familiar bell-shaped curve.</p><p>So, next time someone tells you they are 100% certain about something, you can <em>gently</em> remind them of the beauty of the CLT and the importance of confidence intervals. After all, in statistics and in life, it's not just about being sure, but about knowing how sure you are.</p><p>By embracing the principles of the Central Limit Theorem, we can save time, money, and a whole lot of effort while maintaining credibility and making well-informed decisions. Now, armed with this knowledge, you can approach data analysis with confidence and precision, knowing that the CLT has got your back. 
And remember, always be a little skeptical of anyone who is 100% certain: they probably haven't met the CLT yet!</p><blockquote><p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project"><strong><em>)</em></strong></a> <strong><em>to learn more about the Python code that made this article possible.</em></strong></p></blockquote>]]><![CDATA[<blockquote><p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project"><strong><em>)</em></strong></a> <strong><em>to learn more about the Python code that made this article possible.</em></strong></p></blockquote><p>Do you know what really grinds my gears, riles me up, bothers me, and frustrates me? It may sound dramatic, but my biggest pet peeve is when people are 100% certain about something, only to be proven wrong. Here are two statements that are, in my very firm opinion, entirely different:</p><ul><li><p>"I am 100% sure the tennis courts close at 10PM, Mat"</p></li><li><p>"I am <em>pretty sure</em> the tennis courts close at 10PM, Mat"</p></li></ul><p>Don't get me wrong, people are free to use whichever statement they like. However, if the first statement is used and the tennis courts actually close at <em>11PM</em>, that person will lose all credibility, <strong>forever</strong>.</p><p>To maintain credibility, statisticians use various tools to make statements that are accurate and reliable. For example, it is not uncommon to hear things like the following:</p><ul><li><p>A researcher is studying the effect of a new drug on blood pressure. 
They conduct a study on a sample of patients and find that the new drug reduces blood pressure by between 6 mmHg and 10 mmHg, 95% of the time.</p></li><li><p>A polling organization surveys a sample of voters to estimate the support for a political candidate. They find there is a 90% chance that the candidate will receive between 49% and 55% of the votes.</p></li></ul><p>In this article, we will explore a key concept that enables statisticians to make precise and reliable statements like those mentioned above. Widely regarded as the backbone of inferential statistics, this concept is the <strong>Central Limit Theorem</strong> (CLT).</p><h3 id="heading-background">Background</h3><p>So, <em>what is the Central Limit Theorem, and why is it so important that it deserves to be the 3rd topic of this blog series?</em> Here are some examples of what we would have to do if the CLT <strong>did not exist</strong>:</p><ul><li><p>If we wanted to know which politician was ahead of the election, we would have to survey every single Canadian, instead of asking a smaller sample of people.</p></li><li><p>If we wanted to categorize a newly developed medication as <em>safe</em> or <em>effective</em>, we would have to test it on every single human being, instead of running a clinical trial and testing it on a smaller subset of people.</p></li></ul><p>These situations would cost an absurd amount of money, and take an insane amount of time to conduct. So, how does the CLT allow us to bypass this? 
First, let's look at the definition from <a target="_blank" href="https://www.investopedia.com/terms/c/central_limit_theorem.asp">Investopedia</a>:</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">The <strong>Central Limit Theorem (CLT)</strong> is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the <strong>mean of all sampled variables</strong> from the same population will be <strong>approximately equal to the mean of the whole population</strong>.</div></div><p>In other words, <strong>the CLT gives us the ability to draw very firm conclusions about a population, by only doing analyses on a much smaller sample of that population!</strong></p><p>To illustrate this point, in the sections below, we will create a population and try to come to conclusions about that population by only analyzing a sample of it.</p><h3 id="heading-practical-example">Practical Example</h3><p>Let's imagine a very simple world, where there are 1,000,000 different people. Each person is assigned a number between 0 and 10. So one person might have the number 2.1423, and another might have 6.3245. 
In the form of a table, we would have something like this:</p><div class="hn-table"><table><thead><tr><td>Person</td><td>Value</td></tr></thead><tbody><tr><td>1</td><td>2.1423</td></tr><tr><td>2</td><td>6.3245</td></tr><tr><td>3</td><td>3.2345</td></tr><tr><td>...</td><td>...</td></tr><tr><td>999,999</td><td>4.4152</td></tr><tr><td>1,000,000</td><td>0.9412</td></tr></tbody></table></div><p>If we were to plot the histogram representing this situation, we would obtain the following:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694728286/7f74d939-d059-4a52-942d-948060d9d087.png" alt class="image--center mx-auto" /></p><p>This histogram shows that there are about 100,000 people with values between 0 and 1, another 100,000 people with values between 1 and 2, and so on for each range. Since we currently have access to all the data that makes up the population, we can easily compute the true mean and standard deviation:</p><div class="hn-table"><table><thead><tr><td><strong>Type of Distribution</strong></td><td>Uniform</td></tr></thead><tbody><tr><td><strong>Mean</strong></td><td>4.999</td></tr><tr><td><strong>Standard Deviation</strong></td><td>2.886</td></tr></tbody></table></div><p>However, let's say we are interested in finding the mean, but with one limitation: <strong>we do not have access to the entire population</strong>. Therefore, we need to find a way to make a good guess at what the true average is, by only looking at a <em>sample</em> of the population.</p><p>One way we could do that is by following these simple steps:</p><ol><li><p>Pick 100 people at random</p></li><li><p>Record the average of the group</p></li><li><p>Repeat steps 1-2 1000 times</p></li></ol><p>After performing steps 1 to 3, we will be left with 1000 averages. 
We can plot each of these averages in another histogram, which looks like the following:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694756042/cbc625fd-1336-491c-b186-850060754937.png" alt class="image--center mx-auto" /></p><p>Let's take some time to understand what's going on here.</p><ol><li><p>The distribution looks rather symmetric and resembles a bell curve</p></li><li><p>The average is close to 5.0.</p></li><li><p>We never see an average above 6, or below 4.2.</p></li></ol><p>These 3 points all represent the Central Limit Theorem in action.</p><p><strong>First Point:</strong> The CLT says that the distribution of the sample means will resemble a normal distribution (bell curve). This is great news since statisticians are very familiar with this type of distribution, and can therefore easily extract information from it.</p><p><strong>Second Point:</strong> The CLT says that the distribution of the sample means will create a normal distribution around the true population mean. Since we already know that the true population mean is 4.999, we can see that the sampling distribution's mean is very close to it.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">In fact, the CLT states that (given a few requirements) the distribution of the sample means will always form a normal distribution, <strong>regardless of the initial distribution</strong>! In this example, the initial distribution was uniform, but it could be anything (as long as it has finite variance). This is another incredible property of the CLT.</div></div><p><strong>Third Point:</strong> The CLT states that as the sample size increases, the variance of the distribution of the sample mean becomes much smaller. 
To be more specific, the standard deviation of the distribution of the sample mean will be equal to the following (where <em>s</em> is the standard deviation of the <strong>sample means</strong>, <em>σ</em> is the true <strong>population</strong> standard deviation, and <em>n</em> is the <strong>sample size</strong>):</p><p>$$s=\frac{\sigma}{\sqrt{n}}$$</p><p>This essentially says that as your sample size gets larger (<em>n</em>), the standard deviation of the sample means (<em>s</em>) gets smaller, and so the distribution gets narrower and narrower around the true population mean. To illustrate this point, let's compare the same situation above, except with <em>n=3</em>, <em>n=10</em>, and <em>n=100.</em></p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694764951/aa4ea4d0-c239-4850-95dc-60ccadc30d1d.png" alt class="image--center mx-auto" /></p><p>In the above graph, we can see that the three distributions are all centred around 5 (the true population mean). However, we can see that when the sample size (<em>n</em>) goes from 3 to 10 to 100, the distributions get narrower and narrower, and get a lot more concentrated around the true population mean.</p><h3 id="heading-learnings">Learnings</h3><p>This is fantastic because we only had to analyze a much smaller portion of the true population data to be pretty confident that the true population mean has to be somewhere near 5. But, how sure can we be? Can we be 100% sure that the average is 5? Or 95% sure? How SURE are we? Let's introduce one more concept: confidence intervals.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">A confidence interval is <strong>a range of values, bounded above and below the statistic's mean</strong>, that likely would contain an <em>unknown population parameter</em>. 
(Source: <a target="_blank" href="https://www.investopedia.com/terms/c/confidenceinterval.asp">Investopedia</a>)</div></div><p>In this case, the <em>unknown population parameter</em> would be the mean. Let's create one more plot to help illustrate this point. Before creating the plot, we will create 100 histograms like the one above, with each histogram having a different value for <em>n</em> (from 1 to 100). We will then plot the estimated mean from each histogram as well as our confidence intervals.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694776120/442990c1-3e45-4f8c-a5f2-dcef34be57e3.png" alt class="image--center mx-auto" /></p><p>When the sample size is low, we can see that the estimated mean is still rather close to 5.0 (as is shown in the solid green line); however, our confidence interval is pretty <em>thick</em>, from 3.5 to 6.5 (as is shown in the blue shaded area). This essentially means that we are 95% sure the true population mean is somewhere between 3.5 and 6.5. Unfortunately, <strong>this is quite a large gap</strong> and is <strong>not very meaningful</strong>. However, if we look at the confidence interval when <em>n</em> = 100, we are 95% sure that the true population mean must be somewhere between 4.92 and 5.08, which is <strong>a lot more precise than before</strong>! In practice, this means that the larger the sample size, the more precise our findings will be.</p><p>Another interesting thing to note is that <strong>as the sample size increases, the standard deviation initially drops significantly but then begins to plateau once the sample size becomes sufficiently large</strong>. This behaviour is explained by the formula shared above, where the standard deviation of the sample means is equal to the population standard deviation <strong>divided by the square root of the sample size</strong>. This relationship results in a curve that initially declines sharply and then levels off as the sample size continues to grow. 
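</p><p>To make the plateau concrete, here is a small sketch that plugs a few sample sizes into the formula above. The sample sizes are picked purely for illustration, and <code>std_of_sample_means</code> is a hypothetical helper name; the only number taken from this article is the population standard deviation of 2.886.</p>

```python
import math

POPULATION_STD = 2.886  # population standard deviation reported earlier in the article


def std_of_sample_means(population_std, sample_size):
    """Spread of the sample means: population std divided by sqrt(n)."""
    return population_std / math.sqrt(sample_size)


# Sample sizes chosen for illustration only.
sample_sizes = [1, 10, 100, 1000, 10000]
spreads = [std_of_sample_means(POPULATION_STD, n) for n in sample_sizes]

for n, spread in zip(sample_sizes, spreads):
    print(f"n = {n:6d}: spread of the sample means = {spread:.4f}")
```

<p>Every tenfold increase in <em>n</em> divides the spread by the square root of 10 (about 3.16), so the absolute gain from each extra batch of data keeps shrinking.</p><p>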
From this, we can conclude that while increasing the sample size does improve the precision of our estimates, there are <strong>diminishing returns after a certain point</strong>. This means that beyond a certain sample size, the benefit of adding more data points becomes minimal.</p><h3 id="heading-conclusion">Conclusion</h3><p>In summary, the Central Limit Theorem is a statistical powerhouse that lets us draw reliable conclusions about entire populations by examining just a small sample. This theorem ensures that, with a large enough sample size, our sample means will dance around the true population mean in a familiar bell-shaped curve.</p><p>So, next time someone tells you they are 100% certain about something, you can <em>gently</em> remind them of the beauty of the CLT and the importance of confidence intervals. After all, in statistics and in life, it's not just about being sure, but about knowing how sure you are.</p><p>By embracing the principles of the Central Limit Theorem, we can save time, money, and a whole lot of effort while maintaining credibility and making well-informed decisions. Now, armed with this knowledge, you can approach data analysis with confidence and precision, knowing that the CLT has got your back. 
And remember, always be a little skeptical of anyone who is 100% certain; they probably haven't met the CLT yet!</p><blockquote><p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><strong><em>)</em></strong> <strong><em>to learn more about the Python code that made this article possible.</em></strong></p></blockquote>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1722694270612/07bfe787-2e17-4618-9989-5d867807b6b3.png<![CDATA[Exponential Growth: Investing and Inflation]]>https://mathieutorchia.com/exponential-growth-investing-and-inflationhttps://mathieutorchia.com/exponential-growth-investing-and-inflationMon, 01 Jul 2024 16:00:55 GMT<![CDATA[<blockquote><p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the Python code that made this article possible.</p></blockquote><p>Isn't exponential growth a mind-blowing concept? I remember first learning about it and not thinking much of it. However, the more I thought about it, the more I realized how common it is around us, and how powerful it can be. It's easy to say things like: "<em>Only 42 people have COVID in the United States, what's the big deal?</em>" or "<em>Sure I'll sign the mortgage, the 5% fixed interest rate is not that bad.</em>" or "<em>I'll keep my money under my mattress, I don't want to risk it in the stock market</em>". At first glance, these statements seem reasonable. However, when we consider their compounding nature, they take on a completely different meaning:</p><ul><li><p>In March 2020 (in the US), the number of COVID cases jumped from 42 at the start of the month to 185,000 by the end (a 4400x increase!).</p></li><li><p>Even if the fixed interest rate is set at 5%, mortgages usually last for many decades. 
This means that 5% is compounded yearly and could end up growing to over 100% of the home's value.</p></li><li><p>If you kept your money under your mattress from 1983 to 2024, it would be worth a third of its initial value (in real terms).</p></li></ul><p>While there are countless examples of exponential growth that affect us in our lives, we will be focusing on it in the context of investing in the stock market. In this article, <strong>we will explore the long-term effects of periodically investing in the stock market</strong>, as well as <strong>how inflation affects your bottom line</strong>.</p><h3 id="heading-inflation">Inflation</h3><p>Before getting into the importance/benefits of investing in the stock market, let's explore the <em>worst</em> thing that was ever invented: inflation.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Inflation</strong> is the term used to describe how the money we have today will be worth less in the future.</div></div><p>One of the ways we track inflation is by looking at the consumer price index (CPI), which <strong>measures the average change over time in the prices paid by consumers for a basket of goods and services</strong>. For example, if inflation was 3% last year, this means that, on average, something that used to cost $100 will now cost $103. In other words, your money is losing value... every. day.</p><p>So, how bad is it? We can plot the CPI in the United States from 1947 to 2024 to help answer that question.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719626639204/c1158a85-9580-4a36-a607-66f4ccdc87c3.png" alt class="image--center mx-auto" /></p><p>This shows the CPI to be 100 in 1983, in contrast to roughly 300 in 2024. This means that $100 of typical expenses in 1983 would cost 3x that amount in 2024!</p><p>At first, this is terrifying. 
Thankfully, there is a way to bypass some of the negative impacts that come with inflation, and that is for investors to place their money somewhere it will appreciate in value: real estate, private lending, <s>loansharking</s>, or investing in the stock market.</p><h3 id="heading-the-stock-market">The Stock Market</h3><p>There are many ways to invest in the stock market. For the purpose of this article, we will be looking at investing in an index fund (like <code>TSE: XSP</code>) that mimics the Standard & Poor's 500. The S&P 500 tracks the performance of the 500 largest companies in the United States. In fact, these top 500 companies make up roughly <strong>80% of the total U.S. equity market capitalization</strong> (source: <a target="_blank" href="https://www.morningstar.ca/ca/news/185437/sp-500-or-total-stock-market-index-for-us-exposure.aspx#:~:text=Stocks%20in%20the%20S%26P%20500,the%20presence%20of%20smaller%20stocks.">Morningstar</a>). Investing in an index fund that includes all the companies in the S&P 500 is common advice, even from people like Warren Buffett:</p><blockquote><p>"In my view, for most people, the best thing to do is own the S&P 500 index fund. The trick is not to pick the right company. 
The trick is to essentially buy all the big companies through the S&P 500 and to do it consistently and to do it in a very, very low-cost way".</p></blockquote><p>Now, let's look into the outcome of investing in the S&P 500 for 40 years, from 1983 to the end of 2023.</p><h3 id="heading-investing-for-40-years-increasing-monthly-investments">Investing for 40 Years - Increasing Monthly Investments</h3><p>Let's imagine an investor (we'll call her Sabrina) who is about to embark on a lifetime of textbook S&P investing:</p><ul><li><p>She has purchased $50,000 worth of shares (at the start of 1983).</p></li><li><p>She is willing to invest an additional $2,000 per month from 1983 to 2023.</p></li><li><p>She is willing to <strong>increase her monthly investment by the inflation rate</strong>. For example, if the inflation rate is 1%, then instead of investing $2,000, she'll invest $2,020.</p></li><li><p>When she receives dividends (once per year), she will automatically reinvest them into the index.</p></li></ul><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">A <strong>dividend</strong> is a small payment made by a company to its shareholders. In the context of typical indices that track the S&P 500, the yearly dividend was <strong>between 1% and 4%</strong> from 1985 to 2024.</div></div><p>It's as simple as it gets as far as investing goes. 
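</p><p>The rules above can be sketched as a short simulation. This is a simplified illustration rather than the code behind the charts in this article: it assumes constant, made-up yearly rates (7% price growth, a 2% dividend, 3% inflation) instead of historical data, so it will not reproduce the figures quoted here. The function name <code>simulate_investor</code> and all of its parameters are hypothetical.</p>

```python
def simulate_investor(years=40, start=50_000.0, monthly=2_000.0,
                      annual_price_return=0.07, dividend_yield=0.02,
                      inflation=0.03, raise_with_inflation=True,
                      reinvest_dividends=True):
    """Return (balance, inflation-adjusted balance) under constant, made-up rates."""
    balance = start
    price_level = 1.0  # cumulative inflation since the starting year
    monthly_growth = (1 + annual_price_return) ** (1 / 12)
    for _ in range(years):
        for _ in range(12):
            # Grow last month's balance, then add this month's deposit.
            balance = balance * monthly_growth + monthly
        if reinvest_dividends:
            balance *= 1 + dividend_yield  # yearly dividend goes back into the index
        price_level *= 1 + inflation
        if raise_with_inflation:
            monthly *= 1 + inflation  # next year's deposits keep pace with prices
    return balance, balance / price_level


nominal, real = simulate_investor()
print(f"balance after 40 years:      ${nominal:,.0f}")
print(f"same balance in start-year $: ${real:,.0f}")
```

<p>The two flags, <code>raise_with_inflation</code> and <code>reinvest_dividends</code>, mirror the last two rules in the list, so either one can be switched off to see how much it contributes under these assumed rates.</p><p>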
It will be interesting to take a look at two main metrics: her <strong>nominal net worth</strong>, and her <strong>real net worth</strong>.</p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Nominal money</strong> is the amount of money measured in current dollars without adjusting for inflation, whereas <strong>real money</strong> accounts for inflation by measuring how valuable your money is in today's terms.</div></div><p>Essentially, the most important number to consider is the <strong>real</strong> net worth. When planning for the future, we need to understand how much our money will be worth in the future. Therefore, it's crucial to account for inflation and focus on the real net worth. Nonetheless, we can plot both Sabrina's real and nominal net worth after 40 years of investing in the S&P 500.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719637045237/f1383208-9fff-4b03-ac5a-7facfb4ad706.png" alt class="image--center mx-auto" /></p><p>In the lighter blue, we can see that Sabrina will have a net worth of $17,100,000, which is equivalent to enjoying a 10% year-over-year (YoY) return. However, as explained previously, life (on Earth) is a lot more expensive in the future, and prices also grew at an exponential rate. Once we take into account inflation, we get the darker blue curve, which illustrates that Sabrina has $8,200,000 of "real" money (equivalent to a little less than 7% YoY return).</p><p>There are a couple of interesting things to note here:</p><ul><li><p>The <strong>gap</strong> between the light blue (nominal) and the dark blue (real) lines <strong>seems to get wider and wider as time goes on</strong>, even though she is increasing her monthly investments to follow the inflation rate. This showcases the magnitude of the inflation rate, which cuts Sabrina's spending power in half. 
More on this below.</p></li><li><p>This method of investing is similar to enjoying a 7% YoY return. When people say that investing in the S&P yields roughly 10% per year, this is only true at the nominal level, and therefore doesn't mean much.</p></li></ul><p>For those who prefer tables, we can see the growth in both nominal and real terms, as well as the percentage difference between the two.</p><div class="hn-table"><table><thead><tr><td>Year</td><td>Nominal Net Worth</td><td>Real Net Worth</td><td>Percentage Difference</td></tr></thead><tbody><tr><td>Year 0</td><td>$50,000</td><td>$50,000</td><td>0%</td></tr><tr><td>Year 10</td><td>$680,000</td><td>$538,000</td><td>-21%</td></tr><tr><td>Year 20</td><td>$2,630,000</td><td>$1,780,000</td><td>-32%</td></tr><tr><td>Year 30</td><td>$6,350,000</td><td>$3,830,000</td><td>-40%</td></tr><tr><td>Year 39</td><td>$17,120,000</td><td>$8,180,000</td><td>-52%</td></tr></tbody></table></div><p>As explained by the first bullet point above, inflation does not seem to matter much in the first 10 years. The difference between nominal and real net worth is only about 21% (or $142,000). However, once we allow a lot more time to pass (39 years), we see that inflation has wiped out almost half of Sabrina's spending power (which amounts to more than $8,000,000!). Even though the average inflation rate was around 3.5% in the United States from 1985 to 2024, it was solely responsible for a 50% decrease in Sabrina's spending power...</p><h3 id="heading-investing-for-40-years-stagnant-monthly-investments-and-no-dividends">Investing for 40 Years - Stagnant Monthly Investments and No Dividends</h3><p>Let's imagine two other investors (we'll call them Mark and Donald), who are going to follow the same investing rules as Sabrina, except for two things:</p><ul><li><p>Mark is <strong>not willing to increase his monthly investment by the inflation rate</strong>. 
He will always invest $2,000 per month for the next 40 years.</p></li><li><p>Donald is <strong>not willing to reinvest his yearly dividends</strong> and decides to spend them on luxury goods instead.</p></li></ul><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text"><strong>Side note</strong>: I called <code>M</code>ark "<code>M</code>ark" because he doesn't increase his <code>M</code>onthly investments, and <code>D</code>onald "<code>D</code>onald" because he spends his yearly <code>D</code>ividends.</div></div><p>To summarize the differences between Sabrina, Mark, and Donald, we can refer to this table:</p><div class="hn-table"><table><thead><tr><td></td><td>Sabrina</td><td>Mark</td><td>Donald</td></tr></thead><tbody><tr><td>Starting Investment</td><td>$50,000</td><td>$50,000</td><td>$50,000</td></tr><tr><td>Monthly Investment</td><td>$2,000</td><td>$2,000</td><td>$2,000</td></tr><tr><td>Reinvesting Dividends</td><td>Yes</td><td>Yes</td><td>No</td></tr><tr><td>Increase Monthly Investments</td><td>Yes</td><td>No</td><td>Yes</td></tr></tbody></table></div><p>What will be the impact of Sabrina and Donald increasing their investments to keep up with inflation compared to Mark, who did not? What about the impact of Donald consistently taking out his dividends to spend on other things? We can look at the next graph.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719679483831/b6eb1c77-fdfc-4c5c-bad2-5d8114956eb6.png" alt class="image--center mx-auto" /></p><p>As we already know, Sabrina ends up with roughly $8,200,000 of real net worth. This is compared to Mark's $5,900,000 and Donald's $5,100,000. 
If the primary goal is to save the most money for retirement, the moral of the story is quite clear:</p><ul><li><p>For the periods between 1985 and 2024, investors who increased their investments by the going inflation rate had a large advantage compared to those who did not. 
In the short term, this meant increasing the monthly amount by only 3% (ish) each year, which is equivalent to $60 (if we're investing $2,000 per month). Those seemingly minor incremental increases (like the $60) were equivalent to over $2,000,000 in the long run.</p></li><li><p>For the periods between 1985 and 2024, even though dividends were between 1% and 4%, reinvesting instead of spending them would have a monstrous impact, increasing total net worth by 61%!</p></li></ul><h3 id="heading-conclusion">Conclusion</h3><p>Exponential growth is a powerful concept that should significantly impact our financial decisions, especially when it comes to investing and dealing with inflation. By understanding how inflation eats away at our cumulative wealth over time, we can make more informed choices about where to place our savings.</p><p>Investing in the stock market, especially in index funds like the S&P 500, can help reduce the negative effects of inflation and grow our wealth over time. As shown above, two key strategies to combat inflation are (1) consistently increasing investments to match inflation and (2) reinvesting dividends. By following these practices, and assuming similar market behaviour in the future, investors can expect upwards of 60% more retirement savings after 35+ years compared to not using these strategies.</p><blockquote><p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the Python code that made this article possible.</p></blockquote><p><strong>Disclaimer</strong>: The information provided in this article is for general informational purposes only and is based on historical data of the S&P 500 from 1983 to 2024. Past performance is not indicative of future results, and the financial markets are subject to various risks and uncertainties. This article does not constitute financial advice and should not be taken as such. 
Readers are encouraged to conduct their own research and consult with a qualified financial advisor before making any investment decisions. The author and publisher of this article are not responsible for any financial losses or damages incurred from following the information presented herein.</p>]]><![CDATA[<blockquote><p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the python code that made this article possible.</p></blockquote><p>Isn't exponential growth a mind-blowing concept? I remember first learning about it and not thinking much of it. However, the more I thought about it, the more I realized how common it is around us, and how powerful it can be. It's easy to say things like: "<em>Only 42 people have COVID in the United States, what's the big deal?</em> "or "<em>Sure I'll sign the mortgage, the 5% fixed interest rate is not that bad." or "I'll keep my money under my mattress, I don't want to risk it in the stock market"</em>. At first glance, these statements seem reasonable. However, when we consider their compounding nature, they take on a completely different meaning:</p><ul><li><p>In March 2020 (in the US), the number of COVID cases jumped from 42 at the start of the month to 185,000 by the end (a 4400x increase!).</p></li><li><p>Even if the fixed interest rate is set at 5%, mortgages usually last for many decades. This means that 5% is compounded yearly and could end up growing to over 100% of the home's value.</p></li><li><p>If you kept your money under your mattress from 1983 to 2024, it would be worth a third of its initial value (in real terms).</p></li></ul><p>While there are countless examples of exponential growth that affect us in our lives, we will be focusing on it in the context of investing in the stock market. 
In this article, <strong>we will explore the long-term effects of periodically investing in the stock market</strong>, as well as <strong>how inflation affects your bottom line</strong>.</p><h3 id="heading-inflation">Inflation</h3><p>Before getting into the importance/benefits of investing in the stock market, let's explore the <em>worst</em> thing that was ever invented: inflation.</p><div data-node-type="callout"><div data-node-type="callout-emoji">ðŸ’¡</div><div data-node-type="callout-text"><strong>Inflation</strong> is the term used to describe how the money we have today will be worth less in the future.</div></div><p>One of the ways we track inflation is by looking at the consumer price index (CPI), which <strong>measures the average change over time in the prices paid by consumers for a basket of goods and services</strong>. For example, if inflation was 3% last year, this means that, on average, something that used to cost $100 will now cost $103. In other words, your money is losing value... every. day.</p><p>So, how bad is it? We can plot the CPI in the United States from 1947 to 2024 to help answer that question.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719626639204/c1158a85-9580-4a36-a607-66f4ccdc87c3.png" alt class="image--center mx-auto" /></p><p>This shows the CPI to be 100 in 1983, in contrast to roughly 300 in 2024. This means that $100 of typical expenses in 1983 would cost 3x that amount in 2024!</p><p>At first, this is terrifying. Thankfully, there is a way to bypass some of the negative impacts that come with inflation, and that is for investors to place their money somewhere it will appreciate in value: real estate, private lending, <s>loansharking</s>, or investing in the stock market.</p><h3 id="heading-the-stock-market">The Stock Market</h3><p>There are many ways to invest in the stock market. 
For the purpose of this article, we will be looking at investing in an index fund (like <code>TSE: XSP</code>) that mimics the Standard & Poor's 500. The S&P 500 tracks the performance of the 500 largest companies in the United States. In fact, these top 500 companies make up roughly <strong>80% of the total U.S. equity market capitalization</strong> (source: <a target="_blank" href="https://www.morningstar.ca/ca/news/185437/sp-500-or-total-stock-market-index-for-us-exposure.aspx#:~:text=Stocks%20in%20the%20S%26P%20500,the%20presence%20of%20smaller%20stocks.">Morning Star</a>). Investing in an index fund that includes all the companies in the S&P 500 is common advice, even from people like Warren Buffet:</p><blockquote><p>"In my view, for most people, the best thing to do is own the S&P 500 index fund. The trick is not to pick the right company. The trick is to essentially buy all the big companies through the S&P 500 and to do it consistently and to do it in a very, very low-cost way".</p></blockquote><p>Now, let's look into the outcome of investing in the S&P 500 for 40 years, from 1983 to the end of 2023.</p><h3 id="heading-investing-for-40-years-increasing-monthly-investments">Investing for 40 Years - Increasing Monthly Investments</h3><p>Let's imagine an investor (we'll call her Sabrina) who is about to embark on a lifetime of textbook S&P investing:</p><ul><li><p>She has purchased $50,000 worth of shares (at the start of 1983).</p></li><li><p>She is willing to invest an additional $2,000 per month from 1983 to 2023.</p></li><li><p>She is willing to <strong>increase her monthly investment by the inflation rate</strong>. 
For example, if the inflation rate is 1%, then instead of investing $2,000, she'll invest $2,020.</p></li><li><p>When she receives dividends (once per year), she will automatically reinvest them into the index.</p></li></ul><div data-node-type="callout"><div data-node-type="callout-emoji">ðŸ’¡</div><div data-node-type="callout-text">A <strong>dividend </strong>is a small payment made by a company to its shareholders. In the context of typical indices that track the S&P 500, the yearly dividend was <strong>between 1% and 4%</strong> from 1985 to 2024.</div></div><p>It's as simple as it gets as far as investing goes. It will be interesting to take a look at two main metrics: her <strong>nominal net worth</strong>, and her <strong>real net worth</strong>.</p><div data-node-type="callout"><div data-node-type="callout-emoji">ðŸ’¡</div><div data-node-type="callout-text"><strong>Nominal money</strong> is the amount of money measured in current dollars without adjusting for inflation, whereas <strong>real money</strong> accounts for inflation by measuring how valuable your money is in today's terms.</div></div><p>Essentially, the most important number to consider is the <strong>real</strong> net worth. When planning for the future, we need to understand how much our money will be worth in the future. Therefore, it's crucial to account for inflation and focus on the real net worth. Nonetheless, we can plot both Sabrina's real and nominal net worth after 40 years of investing in the S&P 500.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719637045237/f1383208-9fff-4b03-ac5a-7facfb4ad706.png" alt class="image--center mx-auto" /></p><p>In the lighter blue, we can see that Sabrina will have a net worth of $17,100,000, which is equivalent to enjoying a 10% year-over-year (YoY) return. However, as explained previously, life (on Earth) is a lot more expensive in the future, and inflation was able to grow at an exponential rate. 
Once we take into account inflation, we get the darker blue curve, which illustrates that Sabrina has $8,200,000 of "real" money (equivalent to a little less than 7% YoY return).</p><p>There are a couple interesting things to note here:</p><ul><li><p>The <strong>gap</strong> between the light blue (nominal) and the dark blue (true) lines <strong>seems to get wider and wider as time goes on</strong>, even though she is increasing her monthly investments to follow the inflation rate. This showcases the magnitude of the inflation rate, which cuts Sabrina's spending power in half. More on this below.</p></li><li><p>This method of investing is similar to enjoying a 7% YoY return. When people say that investing in the S&P yields roughly 10% per year, this is only true at the nominal level, and therefore doesn't mean much.</p></li></ul><p>For those who prefer tables, we can see the growth in both nominal and real terms, as well as the percentage difference between the two.</p><div class="hn-table"><table><thead><tr><td>Year</td><td>Nominal Net Worth</td><td>Real Net Worth</td><td>Percentage Difference</td></tr></thead><tbody><tr><td>Year 0</td><td>$50,000</td><td>$50,000</td><td>0%</td></tr><tr><td>Year 10</td><td>$680,000</td><td>$538,000</td><td>-21%</td></tr><tr><td>Year 20</td><td>$2,630,000</td><td>$1,780,000</td><td>-32%</td></tr><tr><td>Year 30</td><td>$6,350,000</td><td>$3,830,000</td><td>-40%</td></tr><tr><td>Year 39</td><td>$17,120,000</td><td>$8,180,000</td><td>-52%</td></tr></tbody></table></div><p>As explained by the first bullet point above, inflation does not seem to matter much in the first 10 years. The difference between nominal and real net worth is only about 21% (or $142,000). However, once we allow a lot more time to pass (39 years), we see that inflation has wiped out almost half of Sabrina's spending power (which amounts to more than $8,000,000!). 
Even though the average inflation rate was around 3.5% in the United States from 1985 to 2024, it was solely responsible for a 50% decrease in Sabrina's spending power...</p><h3 id="heading-investing-for-40-years-stagnant-monthly-investments-and-no-dividends">Investing for 40 Years - Stagnant Monthly Investments and No Dividends</h3><p>Let's imagine two other investors (we'll call them Mark and Donald), who are going to follow the same investing rules as Sabrina, except for two things:</p><ul><li><p>Mark is <strong>not willing to increase his monthly investment by the inflation rate</strong>. He will always invest $2,000 per month for the next 40 years.</p></li><li><p>Donald is <strong>not willing to reinvest his yearly dividends</strong> and decides to spend them on luxury goods instead.</p></li></ul><div data-node-type="callout"><div data-node-type="callout-emoji">ðŸ’¡</div><div data-node-type="callout-text"><strong>Side note</strong>: I called <code>M</code>ark "<code>M</code>ark" because he doesn't increase his <code>M</code>onthly investments, and <code>D</code>onald "<code>D</code>onald" because he spends his yearly <code>D</code>ividends.</div></div><p>To summarize the differences between Sabrina, Mark, and Donald, we can refer to this table:</p><div class="hn-table"><table><thead><tr><td></td><td>Sabrina</td><td>Mark</td><td>Donald</td></tr></thead><tbody><tr><td>Starting Investment</td><td>$50,000</td><td>$50,000</td><td>$50,000</td></tr><tr><td>Monthly Investment</td><td>$2,000</td><td>$2,000</td><td>$2,000</td></tr><tr><td>Reinvesting Dividends</td><td>Yes</td><td>Yes</td><td>No</td></tr><tr><td>Increase Monthly Investments</td><td>Yes</td><td>No</td><td>Yes</td></tr></tbody></table></div><p>What will be the impact of Sabrina and Donald increasing their investments to keep up with inflation compared to Mark, who did not? What about the impact of Donald consistently taking out his dividends to spend on other things? 
We can look at the next graph.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719679483831/b6eb1c77-fdfc-4c5c-bad2-5d8114956eb6.png" alt class="image--center mx-auto" /></p><p>As we already know, Sabrina ends up with roughly $8,200,000 of real net worth. This is compared to Mark's $5,900,000 and Donald's $5,100,000. If the primary goal is to save the most money for retirement, the moral of the story is quite clear:</p><ul><li><p>For the period between 1985 and 2024, investors who increased their investments by the going inflation rate had a large advantage compared to those who did not. In the short term, this meant raising the monthly contribution by only 3% (ish) each year, which is equivalent to $60 (if we're investing $2,000 per month). Those seemingly minor incremental increases (like the $60) were equivalent to over $2,000,000 in the long run.</p></li><li><p>For the period between 1985 and 2024, even though dividends were between 1% and 4%, reinvesting them instead of spending them had a monstrous impact, increasing total net worth by 61%!</p></li></ul><h3 id="heading-conclusion">Conclusion</h3><p>Exponential growth is a powerful concept that should significantly impact our financial decisions, especially when it comes to investing and dealing with inflation. By understanding how inflation eats away at our cumulative wealth over time, we can make more informed choices about where to place our savings.</p><p>Investing in the stock market, especially in index funds like the S&P 500, can help reduce the negative effects of inflation and grow our wealth over time. As shown above, two key strategies to combat inflation are (1) consistently increasing investments to match inflation and (2) reinvesting dividends. 
By following these practices, and assuming similar market behaviour in the future, investors can expect upwards of 60% more retirement savings after 35+ years compared to not using these strategies.</p><blockquote><p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the Python code that made this article possible.</p></blockquote><p><strong>Disclaimer</strong>: The information provided in this article is for general informational purposes only and is based on historical data of the S&P 500 from 1983 to 2024. Past performance is not indicative of future results, and the financial markets are subject to various risks and uncertainties. This article does not constitute financial advice and should not be taken as such. Readers are encouraged to conduct their own research and consult with a qualified financial advisor before making any investment decisions. The author and publisher of this article are not responsible for any financial losses or damages incurred from following the information presented herein.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1719685209843/7e8829d8-2282-4167-b198-2823f4172e30.png<![CDATA[Blackjack: Is It Beatable?]]>https://mathieutorchia.com/blackjack-is-it-beatablehttps://mathieutorchia.com/blackjack-is-it-beatableTue, 11 Jun 2024 04:22:25 GMT<![CDATA[<p>This is my first blog post and first Python coding project. I have always been fascinated with casino games, especially Blackjack and Texas Hold'em. The appeal of Texas Hold'em is that if played correctly, it can be beaten. Why? Because your opponents are humans, and humans make mistakes. All I had to do was capitalize on their mistakes to make money in the long term. But Blackjack... Your opponent in Blackjack is not a potentially drunken human staying up too late on a Sunday night. 
Your opponent is the infamous <em>house</em>, which has supposedly made all the calculations to ensure they will win in the end. No matter how well you play, <strong>you will lose in the long run, they say</strong>. So, I decided to turn my first coding project into a Blackjack simulator to see if that statement holds true.</p><hr /><h3 id="heading-player-and-dealer-logic">Player and Dealer Logic</h3><p>This section explains the choices that were made to model the game of Blackjack as faithfully as possible. Here are the main points:</p><ul><li><p>There is an infinite number of cards (we do not assume a finite number of decks)</p></li><li><p>The dealer stands on soft 17</p></li><li><p>A blackjack pays 3 to 2 (unless the player split aces)</p></li><li><p>The player can double</p></li><li><p>The player can split equal cards as many times as they'd like</p></li><li><p>The player can only split aces once, and can only receive one additional card for each ace</p></li><li><p>The player cannot surrender</p></li><li><p>The player plays the <em>optimal strategy</em> as shown in <a target="_blank" href="https://www.blackjackapprenticeship.com/blackjack-strategy-charts/">this</a> diagram.</p></li><li><p>The player bets $10 per hand</p></li></ul><h3 id="heading-simulation-logic">Simulation Logic</h3><p>Once we have the logic of the game properly coded, the rest is easy: <em>just press run 100,000 times and manually record the results with a pen and paper</em>. Kidding, of course. We can simply write a little bit of code that will run the game as often as we'd like, while recording the relevant information. But what kind of "relevant information" do we need? 
We decided to record the following information for every hand that was played:</p><ul><li><p>The sum of the player's first two cards</p></li><li><p>The sum of all the player's cards (when he is done playing)</p></li><li><p>The dealer's first card</p></li><li><p>The sum of all the dealer's cards (when he is done playing)</p></li><li><p>A boolean where <em>true</em> signifies that the player split his cards during this specific hand, and <em>false</em> signifies that the player did not split his cards</p></li><li><p>The result (W, L, T)</p></li><li><p>The money won or lost</p></li><li><p>The cumulative total money won or lost for a given player</p></li></ul><p>There are three other things that we record for every hand. They can be confusing, so I would like to take the time to elaborate on them here:</p><ol><li><p>The simulation number (one game)</p></li><li><p>The hand number</p></li><li><p>The meta simulation number (one player)</p></li></ol><p>The <strong>simulation number</strong> is used to track which iteration of the game we are currently on. For example, let's say the player is playing his 54th game (in other words, his 54th simulation), and is showing an [8,8]. He then decides to take one more card and gets to [8,8,5] for a total of 21. In this case the simulation number would be 54, since this was the 54th simulation for this particular player.</p><p>Using the example from above, his <strong>hand number</strong> would be "0", since this was the first hand he played in the 54th simulation. Most of the time, the hand number will be "0", since players usually play one hand per game. However, if a player is showing an [8,8] and decides to split his cards, he will then be given 2 hands, for example [8,3] and [8,5]. Now the [8,3] will be categorized as hand "0", and the [8,5] will be categorized as hand "1".</p><p>Finally, the <strong>meta simulation number</strong> categorizes a given player. 
As you'll see in the next section, we simulate the game of Blackjack for 100 different players, or, in other words, we have 100 different meta simulations. If player 3 is playing his 55th game and has a [6,5,10], his meta simulation number would be 3, the simulation number would be 55, and the hand number would be 0.</p><p>Putting everything together, we get a table that looks like this:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717993361272/ed324eeb-7b8c-417a-b06f-099cf86e2de6.png" alt class="image--center mx-auto" /></p><p>To help read it, the first row shows that for meta simulation number 1 (or player 1), for their first game (first simulation number), and for their first hand (hand 0), their first two cards gave them a total of 16. Since the dealer was showing a 10, the strategy called for a "hit", which put them at a total of 26. The final result is an "L" (since their sum went above 21), so the player lost $10 and is currently sitting at -$10 in cumulative earnings.</p><hr /><h3 id="heading-results-first-simulation">Results - First Simulation</h3><p>The first simulation runs the Blackjack game 100 times (100 <strong>simulations</strong>), records the results, and then does this 100 times (100 <strong>meta simulations</strong>). Therefore, in total, there are 10,000 simulations being played (with 10,293 hands since the player split a total of 293 times throughout the simulation). 
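</p><p>The bookkeeping described above (100 meta simulations of 100 simulations each, with one record per hand) can be sketched as a pair of nested loops. This is only an illustration: <code>HandRecord</code>, <code>play_game</code> and <code>run</code> are hypothetical names, the 1-based numbering of players and games is an assumption, and <code>play_game</code> is a random placeholder with made-up split and outcome frequencies, not the real game logic.</p>

```python
import random
from dataclasses import dataclass


@dataclass
class HandRecord:
    meta_simulation: int  # which player (numbered from 1 here, an assumption)
    simulation: int       # which game for that player (numbered from 1 here)
    hand: int             # 0 for the first hand, 1 for the extra hand after a split
    result: str           # "W", "L" or "T"
    profit: float         # money won or lost on this hand
    cumulative: float     # running total for this player


def play_game(rng):
    """Hypothetical placeholder for the real game logic.

    Returns one (result, profit) pair per hand. A real implementation would
    deal cards and follow the strategy chart; the frequencies are made up.
    """
    n_hands = rng.choices([1, 2], weights=[97, 3])[0]
    outcomes = [("W", 10.0), ("L", -10.0), ("T", 0.0)]
    return [rng.choice(outcomes) for _ in range(n_hands)]


def run(n_players=100, n_games=100, seed=0):
    rng = random.Random(seed)
    records = []
    for player in range(1, n_players + 1):      # meta simulations
        cumulative = 0.0                         # each player starts at $0
        for game in range(1, n_games + 1):       # simulations
            for hand, (result, profit) in enumerate(play_game(rng)):
                cumulative += profit
                records.append(
                    HandRecord(player, game, hand, result, profit, cumulative)
                )
    return records


records = run()
print(len(records), "hands recorded")
```

<p>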
The purpose of running the simulation this way is twofold:</p><ol><li><p>To mimic a scenario where a single person plays for 2 hours, which comes down to approximately 100 hands</p></li><li><p>To simulate observing 100 different people, each playing 100 hands</p></li></ol><p>With each coloured line representing a given player in the figure below, we plot the total profit per player over 100 hands, assuming the player bets $10 per hand:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717916027481/c39df27d-66db-46cf-baba-87164a7a76c1.png" alt class="image--center mx-auto" /></p><p>It's interesting to note that the graph doesn't clearly show that the house has an edge. There are a lot of players that are above the $0 line (meaning they secured a profit at the end of the 100 simulations). In fact, in this simulation, we find that out of the 100 players:</p><ul><li><p>45 made a profit, where the average profit was $0.87 <strong>per hand</strong></p></li><li><p>54 made a loss, where the average loss was $0.87 <strong>per hand</strong></p></li><li><p>1 broke even</p></li></ul><p>From the points above, it might look like the player wins 45% of the time. However, this is not the case, as the following table shows. It gives the number of individual hands that were won, lost, and tied, along with the average profit for each outcome:</p><div class="hn-table"><table><thead><tr><td>Result</td><td>Count</td><td>Count (%)</td><td>Average Profit</td></tr></thead><tbody><tr><td>Win</td><td>4,449</td><td>43.22%</td><td>$11.84</td></tr><tr><td>Loss</td><td>4,895</td><td>47.56%</td><td>-$10.93</td></tr><tr><td>Tie</td><td>949</td><td>9.22%</td><td>$0</td></tr><tr><td><strong>Total</strong></td><td><strong>10,293</strong></td><td><strong>100%</strong></td><td></td></tr></tbody></table></div><p><strong>QUESTION</strong>: Even though the player wins at a lower rate than the dealer (43.22% vs 47.56%), the average gain is greater than the average loss ($11.84 vs $10.93). So what's the final result? 
Are we profitable?</p><p>To answer that question, we can plot the distribution of the <strong>average profit per hand</strong> for every player. Essentially, we look at a given meta simulation <em>i</em> (which represents a player), and we apply the following formula:</p><p>$$\frac{\text{(Total Profit or Loss)}_i}{\text{(Total Number of Hands)}_i}$$</p><p>So, if a player ended with a profit of $50 after 100 simulations (and 105 hands), then his average profit per hand would be:</p><p>$$\frac{\$50}{105\text{ hands}} \approx \$0.48\text{/hand}$$</p><p>We do this for each player (each meta simulation), and we plot them in the following bar chart:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717919095116/94e20971-f15a-4b32-abb9-0acd4cdd3af4.png" alt class="image--center mx-auto" /></p><div data-node-type="callout"><div data-node-type="callout-emoji">💡</div><div data-node-type="callout-text">Keep in mind these values are at the "per hand" level. So if the average loss per hand is $3, then this would come to roughly $300 total loss if a player played 100 hands.</div></div><p>In the plot above, the leftmost bar shows, for example, that roughly 2.5% of players ended their night with an average profit per hand between -$3 and -$2. However, the red dotted line shows that the average result is a $0.08 loss. We can now answer the question from above.</p><p><strong>ANSWER</strong>: There are two opposing dynamics: the dealer tends to win more hands, but when the player wins, their winnings are typically larger. The question is, which dynamic has a greater impact overall? Unfortunately, even though the player wins more money during a winning hand, <strong>the dealer wins too many hands for the player to remain profitable</strong>, which is why we see an average loss of $0.08 as shown in the figure above.</p><p>You may be wondering, isn't the sample size of 10,000 total simulations a little small? Maybe this is just a fluke? 
You may be correct, so let's run this simulation 1,000,000 times.</p><h3 id="heading-results-second-simulation">Results - Second Simulation</h3><p>In this new simulation, we keep everything the same, except we allow each of the 100 players (meta simulations) to play 10,000 simulations (instead of 100). This will give us a more accurate depiction of the long-term reality when playing Blackjack.</p><p>Let's start by plotting the profit per player across the 10,000 simulations:</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717956858777/5e3bbc4f-741b-452a-ab8b-d410f1db586c.png" alt class="image--center mx-auto" /></p><p>In contrast to the first version of this graph, we can see more clearly that most players end up below the $0 profit line. In fact, in this simulation, we find that out of the 100 players:</p><ul><li><p>25 made a profit, where the average profit was $0.08 <strong>per hand</strong></p></li><li><p>75 made a loss, where the average loss was $0.13 <strong>per hand</strong></p></li></ul><p>This also shows that the more hands a player plays, the less likely it is that they will end the day with a profit. Even though the casino's edge is very small, it adds up quite quickly in the long run.</p><p>When we rerun the distribution plot (as shown below), we see a similar result, where the average loss holds steady at $0.08 per hand.</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717957250920/5a540219-3e98-4022-b3fd-5f801d8abe47.png" alt class="image--center mx-auto" /></p><p>The distribution is significantly narrower compared to the previous simulation. Earlier, the average profit per $10 hand ranged from -$4 to +$4. Now, it fluctuates only between -$0.30 and +$0.20. This reduction in variance is due to the increased number of simulations played (10,000 versus 100). 
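</p><p>The narrowing is roughly what the square-root rule for averages predicts: with 100 times more hands, the spread of the average profit per hand should shrink by a factor of about 10. Here is a minimal sketch, assuming independent hands and a per-hand standard deviation of $11.50 (an illustrative assumption, not a value measured from this simulation):</p>

```python
import math

# Assumed per-hand standard deviation in dollars for a $10 bet (illustrative only).
PER_HAND_STD = 11.5


def standard_error(per_hand_std, n_hands):
    """Standard deviation of the average profit per hand over n_hands independent hands."""
    return per_hand_std / math.sqrt(n_hands)


for n_hands in (100, 10_000):
    spread = standard_error(PER_HAND_STD, n_hands)
    print(f"{n_hands} hands: typical spread of about ${spread:.3f} per hand")
```

<p>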
With more simulations played, the results tend to cluster closely around the true average profit per hand.</p><p>Consider this analogy: if you flip a coin 10 times, it's believable to get heads more than 70% of the time (8 or more heads). The probability of this occurrence is roughly 5.5%. However, if you flip the coin 1,000,000 times, it becomes virtually impossible for heads to appear more than 70% of the time (more than 700,000 heads). The probability is nearly 0%, as this outcome would be around 400 standard deviations away from the mean. This is an example of the <a target="_blank" href="https://www.britannica.com/science/law-of-large-numbers">Law of Large Numbers</a>, which was proven by the Swiss mathematician Jakob Bernoulli in 1713, and is widely used in the fields of statistics, economics, mathematics, etc.</p><h3 id="heading-conclusion">Conclusion</h3><p>Our analysis shows that, for this specific type of Blackjack game and assuming the player uses the optimal strategy described <a target="_blank" href="https://www.blackjackapprenticeship.com/blackjack-strategy-charts/">here</a>, players will, on average, lose $0.08 per $10 hand. <strong>This indicates a house edge of about 0.8%</strong>. This edge is highly specific to the conditions outlined at the start of the article. If any of these conditions change, such as a 6 to 5 Blackjack payout, the number of decks used, or restrictions on splitting, the house edge will also change. However, regardless of the Blackjack variation, the house will always have an advantage. The only way to potentially shift the odds in your favor is through card counting, which allows you to adjust your strategy based on the current state of the deck. Implementing a card counting feature in the code could be an interesting next step, to explore at what point it tips the profit in favour of the player.</p><p>Thank you for reading the very first blog post on mathieutorchia.com. 
I am excited to explore more questions and learn new methods on Python along the way. Follow my <strong>GitHub</strong> <a target="_blank" href="https://github.com/mathieutorchia/blackjack_simulator">here</a> to see the Python code that made this article possible (I am still learning the ins and outs about GitHub so please bear with me - if anyone has any suggestions, I am all ears). And of course, feel free to reach out if you have any questions or comments!</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1717960418739/5dc9cade-58ec-4a7c-bad4-fac6566645be.png