<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Mathieu's Blog]]></title><description><![CDATA[I enjoy questioning anything and everything. This blog is my attempt to tackle the more challenging part: finding answers to those questions.]]></description><link>https://mathieutorchia.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 08:13:56 GMT</lastBuildDate><atom:link href="https://mathieutorchia.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Are You Making the Right Decision? The Economics of Opportunity and Sunk Costs]]></title><description><![CDATA[A couple of weeks ago, I posted my first LinkedIn poll: “You spend $1,000 on a concert ticket. The day before the event, you notice that the resale value of your ticket is $10,000. Which answer is true”:

You spent $0 if you attend

You spent $1,000 ...]]></description><link>https://mathieutorchia.com/are-you-making-the-right-decision-the-economics-of-opportunity-and-sunk-costs</link><guid isPermaLink="true">https://mathieutorchia.com/are-you-making-the-right-decision-the-economics-of-opportunity-and-sunk-costs</guid><category><![CDATA[opportunity cost]]></category><category><![CDATA[sunk cost]]></category><category><![CDATA[economics]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Mon, 17 Feb 2025 04:27:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739766362712/9e1072ca-2a68-45bf-8183-5b0c372ec6e3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A couple of weeks ago, I posted my first LinkedIn poll: “You spend $1,000 on a concert ticket. The day before the event, you notice that the resale value of your ticket is $10,000. Which answer is true”:</p>
<ul>
<li><p>You spent $0 if you attend</p>
</li>
<li><p>You spent $1,000 if you attend</p>
</li>
<li><p>You spent $10,000 if you attend</p>
</li>
<li><p>You spent $9,000 if you attend</p>
</li>
</ul>
<p>While the second option got the most votes (68%), each of these answers got at least <em>some</em> votes. What is it about this seemingly simple question that is getting people to answer so differently? It all comes down to how people define “cost” and “spending”. More specifically, this question touches on two concepts from economics: <strong>opportunity cost</strong>, and <strong>sunk cost</strong>.</p>
<h3 id="heading-opportunity-cost">Opportunity Cost</h3>
<p>Have you ever been scheduled for a work shift and then asked a coworker to cover for you so you could attend an event you wanted to go to, such as a restaurant outing with close friends? Let’s assume that the food would end up costing you $100 (you ordered a medium rare Chicago style rib steak). If we <strong>do not</strong> take opportunity cost into account, the cost of going is simple: it cost you $100. If we <strong>do</strong> consider the opportunity cost, this night out cost you $100 plus the income you would have earned if you worked your shift that night (let’s say $150), for a total cost of $250.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Opportunity cost </strong>is defined as the loss of potential gain from other alternatives when one alternative is chosen.</div>
</div>
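<p>The restaurant arithmetic above can be sketched in a few lines of Python (the dollar amounts are the hypothetical ones from the example):</p>

```python
# Hypothetical numbers from the restaurant example.
dinner_cost = 100     # what actually leaves your bank account
forgone_wages = 150   # income given up by skipping the shift

# Accounting cost: only the money spent.
accounting_cost = dinner_cost

# Economic cost: money spent plus the opportunity cost of the forgone shift.
economic_cost = dinner_cost + forgone_wages

print(accounting_cost, economic_cost)  # 100 250
```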

<p>By solely looking at your bank account, you would only see a decrease of $100, which is why it would be misleading to think that it only cost you that amount. On the other hand, by not going, you might miss out on the experience of spending time with friends, which might be worth more than the $250 you’re “spending” by choosing to go out.</p>
<h3 id="heading-sunk-cost">Sunk Cost</h3>
<p>The other concept we need to define is sunk cost. Let’s say you’re planning a trip to Italy and you purchase a $100 ticket to visit a museum on a specific day. When you arrive at FCO (Rome’s airport), you notice that the weather forecast shows that the only sunny day of your trip is the same day you planned to visit the museum indoors... Now, you face a decision: should you visit the museum or seize the sunny day to enjoy Italy’s beautiful weather?</p>
<p>Here’s the main point: the $100 you spent on the museum ticket is a sunk cost. It’s already gone and cannot be refunded, no matter what decision you make. Therefore, the amount you paid should not influence your choice. If you’d rather enjoy the sunny day, then that should be your focus, because the money spent on the ticket is gone either way. If you decide to go to the museum simply because you paid for it, knowing it is the option that brings you less happiness, you’re making an irrational decision. In that case, you’re choosing the outcome that makes you less happy, even though both options (museum or sunny day) <em>cost</em> you the same amount in the end. The goal should be to maximize your happiness (short and long term) and make decisions that reflect that, without being influenced by sunk costs.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A <strong>sunk cost</strong> is a cost that has already been incurred and cannot be recovered.</div>
</div>

<p>On the other hand, one could argue that attending the museum might bring satisfaction simply because you're making use of something you already paid for. For some, the thought of not utilizing something they've spent money on can trigger negative feelings, making attending the museum feel like the better option, even if it's not the choice that maximizes overall happiness.</p>
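<p>To make the decision rule concrete, here is a minimal sketch: we compare only the forward-looking enjoyment of each option, and the $100 ticket price appears nowhere in the comparison (the enjoyment scores are invented for illustration):</p>

```python
# Hypothetical enjoyment scores for the Italy example. The $100 ticket
# price is deliberately absent: it is sunk no matter what you choose.
enjoyment = {"museum": 6, "sunny day outside": 9}

# Rational rule: pick the option with the highest forward-looking payoff.
best_choice = max(enjoyment, key=enjoyment.get)
print(best_choice)  # sunny day outside
```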
<h3 id="heading-concert-ticket-example">Concert Ticket Example</h3>
<p>Now, let’s go back to the poll at hand: how much does the concert cost you if you decide to attend, knowing that the ticket is valued at $10,000? It’s important to position ourselves at the moment when you’ve already purchased the ticket (non-refundable), and you’re deciding to either (1) attend the concert, or (2) sell it for $10,000. Let’s examine each answer and show how a case can be made for each one.</p>
<p><strong>You spent $0 if you attend</strong>: Since you already paid the $1,000 (a sunk cost), it costs you nothing more to go to the concert.</p>
<p><strong>You spent $1,000 if you attend</strong>: You paid $1,000 for the concert, and you attend, so it simply cost $1,000 (not considering sunk or opportunity costs).</p>
<p><strong>You spent $10,000 if you attend:</strong> You can either attend, which costs you $0 out of pocket since the price is sunk, or sell the ticket and pocket $10,000. Attending means giving up that $10,000, so attending is synonymous with spending $10,000.</p>
<p>A renowned data scientist commented on my post suggesting that $9,000 could be the answer. This makes sense from a net-worth perspective, since selling the ticket would effectively increase your net worth by $9,000 (you spent $1,000 to purchase the ticket and would earn $10,000 by selling it).</p>
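<p>All four poll answers can be reproduced from the same two numbers, depending on which costs you choose to count. A sketch, not an endorsement of any single framing:</p>

```python
ticket_price = 1_000   # what you paid (sunk)
resale_value = 10_000  # what the ticket would fetch today

framings = {
    "sunk-cost view (already paid, attending adds nothing)": 0,
    "face-value view (ignore sunk and opportunity costs)": ticket_price,
    "opportunity-cost view (attending forgoes the resale)": resale_value,
    "net-worth view (gain you pass up, net of the purchase)": resale_value - ticket_price,
}
for view, cost in framings.items():
    print(f"${cost:,}: {view}")
```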
<h3 id="heading-conclusion">Conclusion</h3>
<p>So, why does this seemingly simple question spark such varied answers? The key takeaway is that our understanding of costs, and how we make decisions, is deeper than just the money spent. By considering opportunity cost and sunk costs, we gain a more nuanced perspective that can change the way we approach not only financial decisions but everyday choices as well.</p>
<p>Think about how these concepts can influence your decision-making in areas like investing, career choices, or even personal relationships. For example, when holding on to a losing investment, many people fall into the trap of “I’ve already lost this much, so I’ll hold on until it recovers”, without considering whether it’s the best use of their resources now, <em>today</em>. Understanding that past decisions (sunk costs) shouldn’t influence present ones can allow you to make choices that truly align with your short and long term goals.</p>
<p>In the end, making use of concepts like opportunity cost and sunk cost can help us make decisions that will bring us closer to our long-term happiness, goals, and success. So the next time you’re faced with a decision, ask yourself: “<strong>What’s the real cost here?</strong>”. You might find that a seemingly simple choice has more to it, and that awareness could lead to a more thoughtful, informed decision.</p>
]]></content:encoded></item><item><title><![CDATA[Pedalling Through Montreal's BIXI Data]]></title><description><![CDATA[In our previous article, we discussed the risks of oversimplification and showed how common statistics like the average and standard deviation can be misleading if the raw data is not properly examined first. We’d like to continue this discussion by ...]]></description><link>https://mathieutorchia.com/pedalling-through-montreals-bixi-data</link><guid isPermaLink="true">https://mathieutorchia.com/pedalling-through-montreals-bixi-data</guid><category><![CDATA[BIXI]]></category><category><![CDATA[Python]]></category><category><![CDATA[Montreal]]></category><category><![CDATA[demand]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Sat, 16 Nov 2024 18:06:22 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1731698569789/436572ca-f81c-4e3b-ad11-1791a2d9cf81.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our previous article, we discussed the risks of oversimplification and showed how common statistics like the average and standard deviation can be misleading if the raw data is not properly examined first. We’d like to continue this discussion by using a practical example, while also learning more about a Montreal staple: our BIXI bikes. In this article, we will <strong>analyze the demand for BIXI bikes per hour of the day</strong>. This work will lay the foundation for a future blog, where we will construct some basic models to forecast the future hourly demand of BIXI depending on multiple factors.</p>
<h3 id="heading-the-big-picture-what-does-hourly-demand-look-like">The Big Picture: What Does Hourly Demand Look Like?</h3>
<p>Let’s start by looking at the bigger picture: <strong>how many rides commence within each hour of the day</strong>? We can aggregate our data to find the total number of rides that occurred at 5AM, at 1PM, at 8PM, or at any time of the day. This gives us a very clean looking graph:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731697183877/77e88ae7-d8bd-49a9-b3f4-66a08da02d46.png" alt class="image--center mx-auto" /></p>
<p>We can get a good sense of the demand behaviour by looking at this line chart. There are a few things to point out:</p>
<ul>
<li><p>Peak demand is between 5PM and 6PM, with almost 1,200,000 rides starting between those times.</p>
</li>
<li><p>There seems to be a “local” peak in the morning before 9AM</p>
</li>
<li><p>Very little demand occurs in the middle of the night (&lt;100,000 rides)</p>
</li>
</ul>
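<p>The aggregation behind this chart is just a count of trip start times by hour. A standard-library sketch with toy timestamps (the real BIXI open data has millions of rows):</p>

```python
from collections import Counter
from datetime import datetime

# Toy stand-ins for BIXI trip start times.
start_times = [
    datetime(2023, 6, 1, 8, 15), datetime(2023, 6, 1, 8, 40),
    datetime(2023, 6, 1, 17, 5), datetime(2023, 6, 2, 17, 30),
]

# Count how many rides commence within each hour of the day.
rides_per_hour = Counter(t.hour for t in start_times)
print(sorted(rides_per_hour.items()))  # [(8, 2), (17, 2)]
```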
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">These results make sense when we consider typical<strong> commuting patterns</strong>, with peaks aligning to BIXI users heading to work <strong>around 9 AM</strong> and returning <strong>around 5 PM</strong>.</div>
</div>

<p>While this is a good start, there is a lot of <strong>missing</strong> or <strong>misleading</strong> information coming from this graph. We lack details on how this trend changes with the seasons, weather, day of the week, BIXI station, and probably many other important features. It can also be misleading since there is no concept of how representative this trend is of a normal day in Montreal. For example, we stated that the peak was between 5 PM and 6 PM, but what if this is true for weekdays and completely wrong for weekends? The graph above does not allow us to understand the more nuanced aspects. Let’s dive a bit deeper!</p>
<h3 id="heading-a-closer-look-weekdays-vs-weekends">A Closer Look: Weekdays vs. Weekends</h3>
<p>Out of the missing features enumerated above, we will start by diving deeper into the difference in demand patterns depending on the day of week. Specifically, <strong>how does the demand for BIXIs differ between weekdays and weekends?</strong></p>
<p>Let’s rerun the same graph as before, while changing two important parts:</p>
<ol>
<li><p>Graph two lines, one representing the week (purple), and another representing the weekend (green)</p>
</li>
<li><p>Instead of showing the <strong>total</strong> BIXI rides across the entire year, show the <strong>average</strong> rides per day, as this produces numbers that are easier to digest and more relevant for the reader</p>
</li>
</ol>
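<p>Both changes, splitting weekdays from weekends and averaging per day instead of totalling, can be sketched like this (toy data; in Python, weekday() >= 5 means Saturday or Sunday):</p>

```python
from collections import defaultdict
from datetime import datetime

# Toy trip start times: two weekdays and two weekend days.
start_times = [
    datetime(2023, 6, 1, 8, 0),   # Thursday
    datetime(2023, 6, 2, 8, 30),  # Friday
    datetime(2023, 6, 3, 14, 0),  # Saturday
    datetime(2023, 6, 4, 15, 0),  # Sunday
]

totals = defaultdict(int)                     # rides per (day type, hour)
days = {"weekday": set(), "weekend": set()}   # distinct days of each type
for t in start_times:
    day_type = "weekend" if t.weekday() >= 5 else "weekday"
    totals[(day_type, t.hour)] += 1
    days[day_type].add(t.date())

# Average rides per day rather than yearly totals.
avg_per_day = {key: n / len(days[key[0]]) for key, n in totals.items()}
print(avg_per_day)
```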
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731697214629/59ec2ee9-fdc0-44bf-9c08-07002a8c4991.png" alt class="image--center mx-auto" /></p>
<p>While the weekday trend looks similar to the first graph, the weekend trend clearly differs. Here are some characteristics of the weekend trend that differ from the weekday trend:</p>
<ul>
<li><p>There is only one peak, a little earlier than the weekday peak</p>
</li>
<li><p>The peak is much lower than the weekday peak (3,500 vs 5,000)</p>
</li>
<li><p>There are more riders riding throughout the night</p>
</li>
</ul>
<p>While the first graph gave a good indication of the general demand for BIXIs, it definitely failed at showing this divergence between week and weekend. We can go even further to see how this relationship would change at the day of week level, which we can do with a simple heat map:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731697231060/e04c8678-dd5e-4b20-b2db-ad446e2f9e51.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In this heat map, the x-axis shows the hour of the day, the y-axis represents the days of the week, and the shading of the squares indicates the average number of riders, with darker squares reflecting higher ridership.</div>
</div>
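<p>Under the hood, a heat map like this is simply the hourly averages reshaped into a 7 × 24 grid (days by hours). A sketch with invented placeholder values:</p>

```python
from collections import defaultdict

# Average riders per (day-of-week index, hour); the numbers here are
# invented placeholders (0 = Monday ... 6 = Sunday).
avg_riders = defaultdict(float)
avg_riders.update({(1, 8): 4800.0, (1, 17): 5200.0, (5, 15): 3400.0})

# Reshape into the 7 x 24 grid that the heat map shades.
grid = [[avg_riders[(day, hour)] for hour in range(24)] for day in range(7)]

# The grid can then be handed to a plotting library,
# e.g. seaborn.heatmap(grid) or matplotlib's imshow.
print(grid[1][17])  # Tuesday at 5 PM
```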

<p>We can see many of the characteristics that we saw in the previous graphs, such as weekday peaks around 8AM and 5PM, gradual weekend peaks around 3PM, and low ridership during the middle of the night. There are also some additional insights to gain from this illustration:</p>
<ul>
<li><p>As we move from Monday to Friday, we tend to see a bit more users later at night (from 10PM to 3AM), and then a clear increase in late night rides on Saturday and Sunday</p>
</li>
<li><p>Tuesdays, Wednesdays, and Thursdays have the highest peaks, which can be shown by the darker squares at 8AM and 5PM</p>
</li>
<li><p>Overall, while there are differences between the days, all weekdays tend to follow a similar pattern, and all weekends show a consistent behaviour as well.</p>
</li>
</ul>
<h3 id="heading-the-limitations-of-averages-are-these-numbers-representative">The Limitations of Averages: Are These Numbers Representative?</h3>
<p>While these visualizations allow for a deeper understanding of the demand for BIXIs, there are always ways to improve them. One of those ways is to add some insight into how meaningful these averages really are. For example, consider these two sets of numbers:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>First Set</strong></td><td><strong>Second Set</strong></td></tr>
</thead>
<tbody>
<tr>
<td>23</td><td>555</td></tr>
<tr>
<td>101</td><td>556</td></tr>
<tr>
<td>314</td><td>558</td></tr>
<tr>
<td>1790</td><td>559</td></tr>
</tbody>
</table>
</div><p>While these two sets of numbers appear to have nothing in common, they share one key property: both have an average of 557. However, everyone can agree that the average is far more meaningful and representative of the second set, because its values sit much closer to it than those in the first set, i.e. the variance is lower. How can we introduce this type of analysis into our previous illustration?</p>
<h3 id="heading-understanding-percentiles-what-do-percentiles-tell-us">Understanding Percentiles: What Do Percentiles Tell Us?</h3>
<p>To better understand how well our average curves represent the demand for BIXIs in Montreal, we can add visuals for the 5th and 95th percentiles.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Percentiles</strong> divide a dataset into 100 equal parts, showing the value below which a given percentage of data falls. For example, the 5th percentile is the value below which 5% of data points lie. Percentiles help us understand the spread and variability of the data.</div>
</div>

<p>If the 5th and 95th percentiles are close to the average, it suggests that the demand curves are fairly consistent throughout the year, regardless of other factors. Let’s add these curves to the weekday and weekend charts.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731697248169/51e632d7-1b79-4c61-adaf-dc38b0a0f369.png" alt class="image--center mx-auto" /></p>
<p>Ouf. These lower and upper bounds are quite large. For example, if we look at the left graph, there is an average of ~5,000 riders at 5 PM on weekdays. However, if we take the number of riders for all the weekdays at 5PM, the middle 90% of those observations ranges from 300 to almost 8,000… This large range means that for the majority of the days, the demand for rides at 5PM can vary wildly: potentially due to other factors such as the weather and the season.</p>
<p>While the average still gives a great indication as to the general trend over the day, it does not necessarily mean that it is an accurate representation of any random day in Montreal.</p>
<p>Another interesting way to view this is by drawing a line for every single day in 2023 and overlaying them all in one graph, as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1731697270097/ef991a45-47b0-4c4c-b13f-9ee1aaa4bc57.png" alt class="image--center mx-auto" /></p>
<p>The advantage of this type of graph is that it avoids averages entirely! We can see the raw number of rides for every single day, represented in a purple line for the weekday and a green line for the weekend. Here, while we do see a lot of the rides following the average trends as we showed before, there are still many other purple and green lines that seem to be going in a completely different direction.</p>
<h3 id="heading-bringing-it-all-together-whats-next-for-bixi-demand-analysis">Bringing it All Together: What’s Next for BIXI Demand Analysis?</h3>
<p>This dive into BIXI ridership data really shows why raw data matters—averages and totals only scratch the surface. By looking at hourly trends and splitting weekdays from weekends, we found patterns that would’ve stayed hidden otherwise. It’s clear that weekdays and weekends tell completely different stories, with weekday patterns varying a bit depending on the day. It all reflects how people use BIXIs for everything from commuting to leisure.</p>
<p>The percentiles added another layer to the story, showing just how much demand can swing depending on things like weather or season. For example, the weekday 5 PM peak might average around 5,000 riders, but on some days, it’s as low as 300 or as high as 8,000. It’s a great reminder that averages don’t always show the whole picture. The graph with daily lines made this even clearer—most days follow the general trend, but some are total outliers, going in completely different directions.</p>
<p>In our next blog, we’ll take this analysis and build on it to create a predictive model. We’ll go step-by-step to explore how a model can help us forecast demand and, more importantly, how diving into the details of modelling shows just how powerful it can be. By considering factors like weather and season, we'll see how accurately we can predict BIXI demand and learn a lot about modelling in the process.</p>
]]></content:encoded></item><item><title><![CDATA[The Cost of Simplification]]></title><description><![CDATA[Numbers are everywhere. Whether we hear them from U.S. presidential candidates, find them in articles during our searches, or see them in financial reports, they form a significant part of the information we consume. We live in an era where "Big Data...]]></description><link>https://mathieutorchia.com/the-cost-of-simplification</link><guid isPermaLink="true">https://mathieutorchia.com/the-cost-of-simplification</guid><category><![CDATA[anscombe]]></category><category><![CDATA[Python]]></category><category><![CDATA[statistics]]></category><category><![CDATA[#responsibleai]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Mon, 21 Oct 2024 23:30:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1729523243477/fe992c71-d89c-4265-bc2e-33f9e8549c5a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Numbers are <strong>everywhere</strong>. Whether we hear them from U.S. presidential candidates, find them in articles during our searches, or see them in financial reports, they form a significant part of the information we consume. We live in an era where "Big Data" is everywhere, and the only way to make use of it is by simplifying it with data analytics. It’s not just in today’s world; we have always tried to express complex ideas in simpler ways. I’m reminded of the time in high school when we were told</p>
<blockquote>
<p>“You cannot take the square root of a negative number!”</p>
</blockquote>
<p>Only to find out years later that this was a lie, and that the square root of -1 is equal to the imaginary number “i”. Sometimes, lies like that are needed. If they tried to teach us everything about math during those years, without cutting corners, we would have been completely overwhelmed. However, there are cons when simplifying things. In this article, we'll explore a straightforward example that vividly illustrates how oversimplifying data insights can lead to misleading conclusions.</p>
<hr />
<p>As you know (or, in case you didn’t know), there are many reasons that I started this blog: one being to improve my coding and visualization skills. With that being said, let’s take a look at a seemingly busy (but colourful) scatter plot.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539568228/f9c5ce90-f2f2-4ed7-8faa-ba1cbe043c75.png" alt class="image--center mx-auto" /></p>
<p>At first glance, this may seem like a random collection of points in a graph. However, after paying attention to each colour, it’s clear that they all follow some sort of pattern:</p>
<ul>
<li><p>The blue dots are following an upwards trend</p>
</li>
<li><p>The red squares seem to be forming an arc</p>
</li>
<li><p>The green triangles mostly follow a straight upwards line</p>
</li>
<li><p>The yellow diamonds almost all have the same x value of 8.</p>
</li>
</ul>
<p>Nonetheless, they are all clearly very different from one another.</p>
<p>In practice, there are not many useful data sets out there that only have 44 data points… Data sets can have thousands, millions, and even billions of rows of data, which often makes the data impossible to visualize or extremely difficult to read. Due to this, data analysts will try to simplify the data with measures such as the average, the standard deviation, the coefficient of correlation, etc. But, <strong>is there a cost to these simplifications</strong>? Let’s dive a bit deeper into the blue dots to help answer that question.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539580440/ed2f25a8-c4e3-417d-87a8-66550f7d58a1.png" alt class="image--center mx-auto" /></p>
<p>In the graph above, we are plotting only the blue points. Additionally, we added a line that best fits the data (a linear OLS regression), and some common descriptive statistics at the bottom right-hand side of the graph. From here, without even plotting the dots, we can access a lot of interesting information:</p>
<ul>
<li><p>The average y value is 7.50</p>
</li>
<li><p>The standard deviation of the y values is 1.94</p>
</li>
<li><p>x and y have a correlation coefficient of 0.82</p>
</li>
<li><p>The line of best fit follows the following equation:</p>
</li>
</ul>
<p>$$y=0.5x+3.0$$</p><p>Great! We've gained a lot of insights from this data by finding key statistics like the average and the correlation coefficient. We can now use these in our discussions or when forming opinions on certain topics, without needing to spend more time exploring the data or checking for anything potentially misleading.</p>
<p>Right?</p>
<p>… Right … ?</p>
<p>Unfortunately, no.</p>
<p>While simplifying data with key statistics like those above is very useful, it's also important to be cautious. These simplifications can sometimes hide important details or lead to incorrect conclusions. To illustrate this point, let's plot the same graph as above, but this time include all the different data points from the first graph.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1729539593539/859355da-5add-4bfe-9d0e-08cf11cfbbfd.png" alt class="image--center mx-auto" /></p>
<p>As shown above, if we draw the line of best fit for each data set, as well as the mean, standard deviation, and coefficient of correlation, <strong>we get the exact same result</strong>. This is called Anscombe’s quartet, which was constructed in 1973 by Francis Anscombe:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Anscombe's quartet</strong> comprises four datasets that have nearly identical simple <strong>descriptive statistics</strong>, yet have very different <strong>distributions</strong> and appear very different when graphed.</div>
</div>

<p>I was amazed by this illustration, and had to post an article about it. It's one of those things that can be understood just by looking at the picture above, without needing many words to explain it.</p>
<p>The key takeaway here is to be careful whenever we attempt to simplify <strong>anything</strong>. There are benefits to simplification, but to enjoy these advantages, it's important to examine the raw data first to make sure nothing unusual is overlooked. This is more philosophical than mathematical: it’s always better to get information from the source than from a (potentially broken) telephone. If the source is unavailable, then make sure the telephone is a reliable one.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>To wrap up, this example resonates beyond mathematics and data analysis; it extends to how we approach real-world issues. I was reminded of this during a talk at the ALL IN AI conference in Montreal last September, where a representative from the "Conseil du statut de la femme" discussed AI's potential risks concerning gender equality. When I asked about her thoughts on mandating companies to hire equal numbers of men and women, she highlighted a critical flaw: while a 50/50 gender ratio may appear balanced on the surface, it may obscure deeper issues—such as women being concentrated in lower-growth roles or positions with limited decision-making power.</p>
<p>The lesson here is clear: <strong>simplified metrics can paint an incomplete picture</strong>. Whether analyzing data or tackling social challenges, we must look beyond surface-level statistics to ensure we’re not missing important details.</p>
<blockquote>
<p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/anscombe-quartet"><strong><em>link</em></strong></a><strong><em>) to learn more about the Python code that made this article possible.</em></strong></p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Is Everything... Normal?]]></title><description><![CDATA[Visit my GitHub page (link) to learn more about the python code that made this article possible.

Do you know what really grinds my gears, riles me up, bothers me, and frustrates me? It may sound dramatic, but my biggest pet peeve is when people are ...]]></description><link>https://mathieutorchia.com/is-everything-normal</link><guid isPermaLink="true">https://mathieutorchia.com/is-everything-normal</guid><category><![CDATA[normal distribution]]></category><category><![CDATA[Python]]></category><category><![CDATA[Jupyter Notebook ]]></category><category><![CDATA[Central Limit Theorem]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Mon, 05 Aug 2024 03:55:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694270612/07bfe787-2e17-4618-9989-5d867807b6b3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><strong><em>) to learn more about the Python code that made this article possible.</em></strong></p>
</blockquote>
<p>Do you know what really grinds my gears, riles me up, bothers me, and frustrates me? It may sound dramatic, but my biggest pet peeve is when people are 100% certain about something, only to be proven wrong. Here are two statements that are, in my very firm opinion, entirely different:</p>
<ul>
<li><p>"I am 100% sure the tennis courts close at 10PM, Mat"</p>
</li>
<li><p>"I am <em>pretty sure</em> the tennis courts close at 10PM, Mat"</p>
</li>
</ul>
<p>Don't get me wrong, people are free to use whichever statement they like. However, if the first statement is used and the tennis courts actually close at <em>11PM</em>, that person will lose all credibility, <strong>forever</strong>.</p>
<p>To maintain credibility, statisticians use various tools to make statements that are accurate and reliable. For example, it is not uncommon to hear things like the following:</p>
<ul>
<li><p>A researcher is studying the effect of a new drug on blood pressure. They conduct a study on a sample of patients and find that the new drug reduces blood pressure by between 6 mmHg and 10 mmHg, 95% of the time.</p>
</li>
<li><p>A polling organization surveys a sample of voters to estimate the support for a political candidate. They find there is a 90% chance that the candidate will receive between 49% and 55% of the votes.</p>
</li>
</ul>
<p>In this article, we will explore a key concept that enables statisticians to make precise and reliable statements like those mentioned above. Widely regarded as the backbone of inferential statistics, this concept is the <strong>Central Limit Theorem</strong> (CLT).</p>
<h3 id="heading-background">Background</h3>
<p>So, <em>what is the Central Limit Theorem, and why is it so important that it deserves to be the 3rd topic of this blog series?</em> Here are some examples of what we would have to do if the CLT <strong>did not exist</strong>:</p>
<ul>
<li><p>If we wanted to know which politician was ahead of the election, we would have to survey every single Canadian, instead of asking a smaller sample of people.</p>
</li>
<li><p>If we wanted to categorize a newly developed medication as <em>safe</em> or <em>effective</em>, we would have to test it on every single human being, instead of running a clinical trial and testing it on a smaller subset of people.</p>
</li>
</ul>
<p>These situations would cost an absurd amount of money, and take an insane amount of time to conduct. So, how does the CLT allow us to bypass this? First, let's look at the definition from <a target="_blank" href="https://www.investopedia.com/terms/c/central_limit_theorem.asp">Investopedia</a>:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">The <strong>Central Limit Theorem (CLT)</strong> is a statistical premise that, given a sufficiently large sample size from a population with a finite level of variance, the <strong>mean of all sampled variables</strong> from the same population will be <strong>approximately equal to the mean of the whole population</strong>.</div>
</div>

<p>In other words, <strong>the CLT gives us the ability to draw very firm conclusions about a population, by only doing analyses on a much smaller sample of that population!</strong></p>
<p>To illustrate this point, in the sections below, we will create a population and try to come to conclusions about that population by only analyzing a sample of it.</p>
<h3 id="heading-practical-example">Practical Example</h3>
<p>Let's imagine a very simple world, where there are 1,000,000 different people. Each person is assigned a number between 0 and 10. So one person might have the number 2.1423, and another might have 6.3245. In the form of a table, we would have something like this:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Person</td><td>Value</td></tr>
</thead>
<tbody>
<tr>
<td>1</td><td>2.1423</td></tr>
<tr>
<td>2</td><td>6.3245</td></tr>
<tr>
<td>3</td><td>3.2345</td></tr>
<tr>
<td>...</td><td>...</td></tr>
<tr>
<td>999,999</td><td>4.4152</td></tr>
<tr>
<td>1,000,000</td><td>0.9412</td></tr>
</tbody>
</table>
</div><p>If we were to plot the histogram representing this situation, we would obtain the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694728286/7f74d939-d059-4a52-942d-948060d9d087.png" alt class="image--center mx-auto" /></p>
<p>This histogram shows that there are about 100,000 people with values between 0 and 1, another 100,000 people with values between 1 and 2, and so on for each range. Since we currently have access to all the data that makes up the population, we can easily compute the true mean and standard deviation:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Type of Distribution</strong></td><td>Uniform</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Mean</strong></td><td>4.999</td></tr>
<tr>
<td><strong>Standard Deviation</strong></td><td>2.886</td></tr>
</tbody>
</table>
</div><p>However, let's say we are interested in finding the mean, but with one limitation: <strong>we do not have access to the entire population</strong>. Therefore, we need to find a way to make a good guess at what the true average is, by only looking at a <em>sample</em> of the population.</p>
<p>One way we could do that is by following these simple steps:</p>
<ol>
<li><p>Pick 100 people at random</p>
</li>
<li><p>Record the average of the group</p>
</li>
<li><p>Repeat steps 1-2 1000 times</p>
</li>
</ol>
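<p>The three steps above can be sketched in Python with NumPy. This is a minimal sketch, using an illustrative seed and a freshly generated uniform population rather than the exact data behind the figures in this article:</p>

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# The population: 1,000,000 people, each assigned a value
# drawn uniformly between 0 and 10
population = rng.uniform(0, 10, size=1_000_000)

# Steps 1-3: pick 100 people at random, record the group's
# average, and repeat 1,000 times
sample_means = [rng.choice(population, size=100).mean() for _ in range(1000)]

print(np.mean(sample_means))  # close to the true population mean of ~5
```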
<p>After performing steps 1 to 3, we will be left with 1000 averages. We can plot each of these averages in another histogram, which looks like the following:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694756042/cbc625fd-1336-491c-b186-850060754937.png" alt class="image--center mx-auto" /></p>
<p>Let's take some time to understand what's going on here.</p>
<ol>
<li><p>The distribution looks rather symmetric and resembles a bell curve</p>
</li>
<li><p>The average is close to 5.0.</p>
</li>
<li><p>We never see an average above 6, or below 4.2.</p>
</li>
</ol>
<p>These 3 points all represent the Central Limit Theorem in action.</p>
<p><strong>First Point:</strong> The CLT says that the distribution of the sample means will resemble a normal distribution (bell curve). This is great news since statisticians are very familiar with this type of distribution, and can therefore easily extract information from it.</p>
<p><strong>Second Point:</strong> The CLT says that the distribution of the sample means will create a normal distribution around the true population mean. Since we already know that the true population mean is 4.999, we can see that the sampling distribution's mean is very close to it.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">In fact, the CLT states that (given a few requirements) the distribution of the sample means will always form a normal distribution, <strong>regardless of the initial distribution</strong>! In this example, the initial distribution was uniform, but it could be anything (as long as it has finite variance). This is another incredible property of the CLT.</div>
</div>

<p><strong>Third Point:</strong> The CLT states that as the sample size increases, the variance of the distribution of the sample means becomes much smaller. To be more specific, the standard deviation of the distribution of the sample means (known as the <strong>standard error</strong>, <em>s</em>) will be equal to the following (where <em>σ</em> is the true <strong>population</strong> standard deviation, and <em>n</em> is the <strong>sample size</strong>):</p>
<p>$$s=\frac{\sigma}{\sqrt{n}}$$</p><p>This essentially says that as your sample size (<em>n</em>) gets larger, the standard error (<em>s</em>) gets smaller, and so the distribution of sample means gets narrower and narrower around the true population mean. To illustrate this point, let's compare the same situation above, except with <em>n=3</em>, <em>n=10</em>, and <em>n=100</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694764951/aa4ea4d0-c239-4850-95dc-60ccadc30d1d.png" alt class="image--center mx-auto" /></p>
<p>In the above graph, we can see that all three distributions are centred around 5 (the true population mean). However, as the sample size (<em>n</em>) goes from 3 to 10 to 100, the distributions get narrower and more concentrated around the true population mean.</p>
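<p>We can also verify this relationship empirically. The sketch below (again with an illustrative seed and population) draws 1,000 sample means for each value of <em>n</em> and compares their observed spread against the predicted σ/√n:</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = rng.uniform(0, 10, size=1_000_000)
sigma = population.std()  # true population standard deviation, ~2.886

for n in (3, 10, 100):
    # Distribution of 1,000 sample means for samples of size n
    means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    print(f"n={n:3d}  observed std={np.std(means):.3f}  "
          f"predicted sigma/sqrt(n)={sigma / np.sqrt(n):.3f}")
```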
<h3 id="heading-learnings">Learnings</h3>
<p>This is fantastic: by analyzing only a much smaller portion of the population, we can be quite confident that the true population mean is somewhere near 5. But, how sure can we be? Can we be 100% sure that the average is 5? Or 95% sure? How SURE are we? Let's introduce one more concept: confidence intervals.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A confidence interval is <strong>a range of values, bounded above and below the statistic's mean</strong>, that likely would contain an <em>unknown population parameter</em>. (Source: <a target="_blank" href="https://www.investopedia.com/terms/c/confidenceinterval.asp">Investopedia</a>)</div>
</div>

<p>In this case, the <em>unknown population parameter</em> would be the mean. Let's create one more plot to help illustrate this point. Before creating the plot, we will create 100 histograms like the one above, with each histogram having a different value for <em>n</em> (from 1 to 100). We will then plot the estimated mean from each histogram as well as our confidence intervals.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1722694776120/442990c1-3e45-4f8c-a5f2-dcef34be57e3.png" alt class="image--center mx-auto" /></p>
<p>When the sample size is low, the estimated mean is still rather close to 5.0 (shown by the solid green line); however, our confidence interval is quite <em>wide</em>, spanning 3.5 to 6.5 (the blue shaded area). This essentially means that we are 95% sure the true population mean is somewhere between 3.5 and 6.5. Unfortunately, <strong>this is quite a large gap</strong> and is <strong>not very meaningful</strong>. However, if we look at the confidence interval when <em>n</em> = 100, we are 95% sure that the true population mean must be somewhere between 4.92 and 5.08, which is <strong>a lot more precise than before</strong>! In practice, this means that the larger the sample size, the more precise our findings will be.</p>
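<p>For readers curious how such an interval is produced from a single sample, here is a minimal sketch using the normal approximation: roughly 1.96 standard errors on either side of the sample mean. The seed and population are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.uniform(0, 10, size=1_000_000)

def ci_95(sample):
    """Approximate 95% confidence interval for the population mean."""
    mean = sample.mean()
    std_err = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error
    return mean - 1.96 * std_err, mean + 1.96 * std_err

for n in (5, 100):
    lo, hi = ci_95(rng.choice(population, size=n))
    print(f"n={n:3d}  95% CI: ({lo:.2f}, {hi:.2f})")
```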
<p>Another interesting thing to note is that <strong>as the sample size increases, the standard error initially drops significantly but then begins to plateau once the sample size becomes sufficiently large</strong>. This behaviour is explained by the formula shared above, where the standard error equals the population standard deviation <strong>divided by the square root of the sample size</strong>. This relationship results in a curve that initially declines sharply and then levels off as the sample size continues to grow. From this, we can conclude that while increasing the sample size does improve the precision of our estimates, there are <strong>diminishing returns after a certain point</strong>. Beyond a certain sample size, the benefit of adding more data points becomes minimal.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>In summary, the Central Limit Theorem is a statistical powerhouse that lets us make reliable conclusions about entire populations by examining just a small sample. This theorem ensures that, with a large enough sample size, our sample means will dance around the true population mean in a familiar bell-shaped curve.</p>
<p>So, next time someone tells you they are 100% certain about something, you can <em>gently</em> remind them of the beauty of the CLT and the importance of confidence intervals. After all, in statistics and in life, it's not just about being sure, but about knowing how sure you are.</p>
<p>By embracing the principles of the Central Limit Theorem, we can save time, money, and a whole lot of effort while maintaining credibility and making well-informed decisions. Now, armed with this knowledge, you can approach data analysis with confidence and precision, knowing that the CLT has got your back. And remember, always be a little skeptical of anyone who is 100% certain—they probably haven't met the CLT yet!</p>
<blockquote>
<p><strong><em>Visit my GitHub page (</em></strong><a target="_blank" href="https://github.com/mathieutorchia/central-limit-theorem"><strong><em>link</em></strong></a><strong><em>) to learn more about the Python code that made this article possible.</em></strong></p>
</blockquote>
]]></content:encoded></item><item><title><![CDATA[Exponential Growth: Investing and Inflation]]></title><description><![CDATA[Visit my GitHub page (link) to learn more about the python code that made this article possible.

Isn't exponential growth a mind-blowing concept? I remember first learning about it and not thinking much of it. However, the more I thought about it, t...]]></description><link>https://mathieutorchia.com/exponential-growth-investing-and-inflation</link><guid isPermaLink="true">https://mathieutorchia.com/exponential-growth-investing-and-inflation</guid><category><![CDATA[finance]]></category><category><![CDATA[Investment]]></category><category><![CDATA[s&p500]]></category><category><![CDATA[Python]]></category><category><![CDATA[research]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Mon, 01 Jul 2024 16:00:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1719685209843/7e8829d8-2282-4167-b198-2823f4172e30.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the Python code that made this article possible.</p>
</blockquote>
<p>Isn't exponential growth a mind-blowing concept? I remember first learning about it and not thinking much of it. However, the more I thought about it, the more I realized how common it is around us, and how powerful it can be. It's easy to say things like: "<em>Only 42 people have COVID in the United States, what's the big deal?</em>", "<em>Sure, I'll sign the mortgage, the 5% fixed interest rate is not that bad</em>", or "<em>I'll keep my money under my mattress, I don't want to risk it in the stock market</em>". At first glance, these statements seem reasonable. However, when we consider their compounding nature, they take on a completely different meaning:</p>
<ul>
<li><p>In March 2020 (in the US), the number of COVID cases jumped from 42 at the start of the month to 185,000 by the end (a 4400x increase!).</p>
</li>
<li><p>Even if the fixed interest rate is set at 5%, mortgages usually last for decades. Compounded yearly over that span, the total interest paid can end up exceeding 100% of the home's value.</p>
</li>
<li><p>If you kept your money under your mattress from 1983 to 2024, it would be worth a third of its initial value (in real terms).</p>
</li>
</ul>
<p>While there are countless examples of exponential growth that affect us in our lives, we will be focusing on it in the context of investing in the stock market. In this article, <strong>we will explore the long-term effects of periodically investing in the stock market</strong>, as well as <strong>how inflation affects your bottom line</strong>.</p>
<h3 id="heading-inflation">Inflation</h3>
<p>Before getting into the importance/benefits of investing in the stock market, let's explore the <em>worst</em> thing that was ever invented: inflation.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Inflation</strong> is the rate at which prices rise over time, which means the money we have today will buy less in the future.</div>
</div>

<p>One of the ways we track inflation is by looking at the consumer price index (CPI), which <strong>measures the average change over time in the prices paid by consumers for a basket of goods and services</strong>. For example, if inflation was 3% last year, this means that, on average, something that used to cost $100 will now cost $103. In other words, your money is losing value... every. day.</p>
<p>So, how bad is it? We can plot the CPI in the United States from 1947 to 2024 to help answer that question.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719626639204/c1158a85-9580-4a36-a607-66f4ccdc87c3.png" alt class="image--center mx-auto" /></p>
<p>This shows the CPI to be 100 in 1983, in contrast to roughly 300 in 2024. This means that $100 of typical expenses in 1983 would cost 3x that amount in 2024!</p>
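<p>The arithmetic behind that claim is simple: deflating a nominal amount by the ratio of CPI values converts it into "base-year" purchasing power. A quick sketch, where the CPI values are rounded approximations read off the chart:</p>

```python
# Rounded CPI values read off the chart above (approximations)
cpi_1983 = 100
cpi_2024 = 300

# $100 kept under the mattress since 1983, expressed in 1983 purchasing power
nominal = 100.0
real_value_2024 = nominal * cpi_1983 / cpi_2024

print(f"${nominal:.0f} from 1983 buys about ${real_value_2024:.2f} "
      f"worth of 1983 goods in 2024")  # about a third of its original value
```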
<p>At first, this is terrifying. Thankfully, there is a way to bypass some of the negative impacts that come with inflation, and that is for investors to place their money somewhere it will appreciate in value: real estate, private lending, <s>loansharking</s>, or investing in the stock market.</p>
<h3 id="heading-the-stock-market">The Stock Market</h3>
<p>There are many ways to invest in the stock market. For the purpose of this article, we will be looking at investing in an index fund (like <code>TSE: XSP</code>) that mimics the Standard &amp; Poor's 500. The S&amp;P 500 tracks the performance of the 500 largest companies in the United States. In fact, these top 500 companies make up roughly <strong>80% of the total U.S. equity market capitalization</strong> (source: <a target="_blank" href="https://www.morningstar.ca/ca/news/185437/sp-500-or-total-stock-market-index-for-us-exposure.aspx#:~:text=Stocks%20in%20the%20S%26P%20500,the%20presence%20of%20smaller%20stocks.">Morningstar</a>). Investing in an index fund that includes all the companies in the S&amp;P 500 is common advice, even from people like Warren Buffett:</p>
<blockquote>
<p>"In my view, for most people, the best thing to do is own the S&amp;P 500 index fund. The trick is not to pick the right company. The trick is to essentially buy all the big companies through the S&amp;P 500 and to do it consistently and to do it in a very, very low-cost way".</p>
</blockquote>
<p>Now, let's look into the outcome of investing in the S&amp;P 500 for 40 years, from 1983 to the end of 2023.</p>
<h3 id="heading-investing-for-40-years-increasing-monthly-investments">Investing for 40 Years - Increasing Monthly Investments</h3>
<p>Let's imagine an investor (we'll call her Sabrina) who is about to embark on a lifetime of textbook S&amp;P investing:</p>
<ul>
<li><p>She has purchased $50,000 worth of shares (at the start of 1983).</p>
</li>
<li><p>She is willing to invest an additional $2,000 per month from 1983 to 2023.</p>
</li>
<li><p>She is willing to <strong>increase her monthly investment by the inflation rate</strong>. For example, if the inflation rate is 1%, then instead of investing $2,000, she'll invest $2,020.</p>
</li>
<li><p>When she receives dividends (once per year), she will automatically reinvest them into the index.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">A <strong>dividend </strong>is a small payment made by a company to its shareholders. In the context of typical indices that track the S&amp;P 500, the yearly dividend was <strong>between 1% and 4%</strong> from 1985 to 2024.</div>
</div>
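<p>Sabrina's four rules can be sketched as a simple compounding loop. The constant rates below (price return, dividend yield, inflation) are illustrative placeholders, not the actual month-by-month historical data behind the figures in this article:</p>

```python
def simulate(years=40, start=50_000, monthly=2_000,
             price_return=0.08, dividend_yield=0.02, inflation=0.035,
             grow_contributions=True, reinvest_dividends=True):
    """Sketch of the investing strategy with constant hypothetical rates."""
    balance = start
    contribution = monthly
    for _ in range(years):
        for _ in range(12):  # monthly compounding plus the contribution
            balance = balance * (1 + price_return) ** (1 / 12) + contribution
        dividend = balance * dividend_yield  # paid once per year
        if reinvest_dividends:
            balance += dividend
        if grow_contributions:  # bump next year's contributions by inflation
            contribution *= 1 + inflation
    real = balance / (1 + inflation) ** years  # in starting-year dollars
    return balance, real

nominal, real = simulate()
print(f"nominal: ${nominal:,.0f}   real: ${real:,.0f}")
```

The same function, with its flags flipped, also models Mark (<code>grow_contributions=False</code>) and Donald (<code>reinvest_dividends=False</code>) from the later section.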

<p>It's as simple as it gets as far as investing goes. It will be interesting to take a look at two main metrics: her <strong>nominal net worth</strong>, and her <strong>real net worth</strong>.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Nominal money</strong> is the amount of money measured in current dollars without adjusting for inflation, whereas <strong>real money</strong> accounts for inflation by measuring how valuable your money is in today's terms.</div>
</div>

<p>Essentially, the most important number to consider is the <strong>real</strong> net worth. When planning for the future, we need to understand how much our money will be worth in the future. Therefore, it's crucial to account for inflation and focus on the real net worth. Nonetheless, we can plot both Sabrina's real and nominal net worth after 40 years of investing in the S&amp;P 500.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719637045237/f1383208-9fff-4b03-ac5a-7facfb4ad706.png" alt class="image--center mx-auto" /></p>
<p>In the lighter blue, we can see that Sabrina will have a net worth of $17,100,000, which is equivalent to enjoying a 10% year-over-year (YoY) return. However, as explained previously, life (on Earth) is a lot more expensive in the future, and inflation was able to grow at an exponential rate. Once we take into account inflation, we get the darker blue curve, which illustrates that Sabrina has $8,200,000 of "real" money (equivalent to a little less than 7% YoY return).</p>
<p>There are a couple of interesting things to note here:</p>
<ul>
<li><p>The <strong>gap</strong> between the light blue (nominal) and the dark blue (real) lines <strong>seems to get wider and wider as time goes on</strong>, even though she is increasing her monthly investments to follow the inflation rate. This showcases the magnitude of the inflation rate, which cuts Sabrina's spending power in half. More on this below.</p>
</li>
<li><p>This method of investing is similar to enjoying a 7% YoY return. When people say that investing in the S&amp;P yields roughly 10% per year, this is only true at the nominal level, and therefore doesn't mean much.</p>
</li>
</ul>
<p>For those who prefer tables, we can see the growth in both nominal and real terms, as well as the percentage difference between the two.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Year</td><td>Nominal Net Worth</td><td>Real Net Worth</td><td>Percentage Difference</td></tr>
</thead>
<tbody>
<tr>
<td>Year 0</td><td>$50,000</td><td>$50,000</td><td>0%</td></tr>
<tr>
<td>Year 10</td><td>$680,000</td><td>$538,000</td><td>-21%</td></tr>
<tr>
<td>Year 20</td><td>$2,630,000</td><td>$1,780,000</td><td>-32%</td></tr>
<tr>
<td>Year 30</td><td>$6,350,000</td><td>$3,830,000</td><td>-40%</td></tr>
<tr>
<td>Year 39</td><td>$17,120,000</td><td>$8,180,000</td><td>-52%</td></tr>
</tbody>
</table>
</div><p>As explained by the first bullet point above, inflation does not seem to matter much in the first 10 years. The difference between nominal and real net worth is only about 21% (or $142,000). However, once we allow a lot more time to pass (39 years), we see that inflation has wiped out almost half of Sabrina's spending power (which amounts to more than $8,000,000!). Even though the average inflation rate was around 3.5% in the United States from 1985 to 2024, it was solely responsible for a 50% decrease in Sabrina's spending power...</p>
<h3 id="heading-investing-for-40-years-stagnant-monthly-investments-and-no-dividends">Investing for 40 Years - Stagnant Monthly Investments and No Dividends</h3>
<p>Let's imagine two other investors (we'll call them Mark and Donald), who are going to follow the same investing rules as Sabrina, except for two things:</p>
<ul>
<li><p>Mark is <strong>not willing to increase his monthly investment by the inflation rate</strong>. He will always invest $2,000 per month for the next 40 years.</p>
</li>
<li><p>Donald is <strong>not willing to reinvest his yearly dividends</strong> and decides to spend them on luxury goods instead.</p>
</li>
</ul>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><strong>Side note</strong>: I called <code>M</code>ark "<code>M</code>ark" because he doesn't increase his <code>M</code>onthly investments, and <code>D</code>onald "<code>D</code>onald" because he spends his yearly <code>D</code>ividends.</div>
</div>

<p>To summarize the differences between Sabrina, Mark, and Donald, we can refer to this table:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td>Sabrina</td><td>Mark</td><td>Donald</td></tr>
</thead>
<tbody>
<tr>
<td>Starting Investment</td><td>$50,000</td><td>$50,000</td><td>$50,000</td></tr>
<tr>
<td>Monthly Investment</td><td>$2,000</td><td>$2,000</td><td>$2,000</td></tr>
<tr>
<td>Reinvesting Dividends</td><td>Yes</td><td>Yes</td><td>No</td></tr>
<tr>
<td>Increase Monthly Investments</td><td>Yes</td><td>No</td><td>Yes</td></tr>
</tbody>
</table>
</div><p>What will be the impact of Sabrina and Donald increasing their investments to keep up with inflation compared to Mark, who did not? What about the impact of Donald consistently taking out his dividends to spend on other things? We can look at the next graph.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1719679483831/b6eb1c77-fdfc-4c5c-bad2-5d8114956eb6.png" alt class="image--center mx-auto" /></p>
<p>As we already know, Sabrina ends up with roughly $8,200,000 of real net worth. This is compared to Mark's $5,900,000 and Donald's $5,100,000. If the primary goal is to save the most money for retirement, the moral of the story is quite clear:</p>
<ul>
<li><p>For the periods between 1985 and 2024, investors who increased their investments by the going inflation rate had a large advantage compared to those who did not. In the short term, this meant increasing the monthly amount by only about 3% per year, which is roughly $60 (if we're investing $2,000 per month). Those seemingly minor incremental increases were worth over $2,000,000 in the long run.</p>
</li>
<li><p>For the periods between 1985 and 2024, even though dividends were between 1% and 4%, reinvesting instead of spending them would have a monstrous impact, increasing total net worth by 61%!</p>
</li>
</ul>
<h3 id="heading-conclusion">Conclusion</h3>
<p>Exponential growth is a powerful concept that should significantly impact our financial decisions, especially when it comes to investing and dealing with inflation. By understanding how inflation eats away at your cumulative wealth over time, we can make more informed choices about where to place our savings.</p>
<p>Investing in the stock market, especially in index funds like the S&amp;P 500, can help reduce the negative effects of inflation and grow our wealth over time. As shown above, two key strategies to combat inflation are (1) consistently increasing investments to match inflation and (2) reinvesting dividends. By following these practices, and assuming similar market behaviour in the future, investors can expect upwards of 60% more retirement savings after 35+ years compared to not using these strategies.</p>
<blockquote>
<p>Visit my GitHub page (<a target="_blank" href="https://github.com/mathieutorchia/SP500-Investing-Project">link</a>) to learn more about the Python code that made this article possible.</p>
</blockquote>
<p><strong>Disclaimer</strong>: The information provided in this article is for general informational purposes only and is based on historical data of the S&amp;P 500 from 1983 to 2024. Past performance is not indicative of future results, and the financial markets are subject to various risks and uncertainties. This article does not constitute financial advice and should not be taken as such. Readers are encouraged to conduct their own research and consult with a qualified financial advisor before making any investment decisions. The author and publisher of this article are not responsible for any financial losses or damages incurred from following the information presented herein.</p>
]]></content:encoded></item><item><title><![CDATA[Blackjack: Is It Beatable?]]></title><description><![CDATA[This is my first blog and first Python coding project. I have always been fascinated with casino games, especially Blackjack and Texas Hold'em. The appeal of Texas Hold'em is that if played correctly, it can be beaten. Why? Because your opponents are...]]></description><link>https://mathieutorchia.com/blackjack-is-it-beatable</link><guid isPermaLink="true">https://mathieutorchia.com/blackjack-is-it-beatable</guid><category><![CDATA[Python]]></category><category><![CDATA[blackjack]]></category><category><![CDATA[simulation]]></category><category><![CDATA[casino]]></category><category><![CDATA[data analysis]]></category><dc:creator><![CDATA[Mathieu Torchia]]></dc:creator><pubDate>Tue, 11 Jun 2024 04:22:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1717960418739/5dc9cade-58ec-4a7c-bad4-fac6566645be.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is my first blog and first Python coding project. I have always been fascinated with casino games, especially Blackjack and Texas Hold'em. The appeal of Texas Hold'em is that if played correctly, it can be beaten. Why? Because your opponents are humans, and humans make mistakes. All I had to do was capitalize on their mistakes to make money in the long term. But Blackjack... Your opponent in Blackjack is not a potentially drunken human staying up too late on a Sunday night. Your opponent is the infamous <em>house</em>, which has supposedly made all the calculations to ensure they will win in the end. No matter how well you play, <strong>you will lose in the long run, they say</strong>. So, I decided to turn my first coding project into a Blackjack simulator to see if that statement holds true.</p>
<hr />
<h3 id="heading-player-and-dealer-logic">Player and Dealer Logic</h3>
<p>This section will explain the various choices that were made to make a good attempt at modelling the game of Blackjack. Here are the main important points:</p>
<ul>
<li><p>There is an infinite number of cards (we do not assume a finite number of decks)</p>
</li>
<li><p>The dealer stands on soft 17</p>
</li>
<li><p>A blackjack pays 3 to 2 (unless the player split aces)</p>
</li>
<li><p>The player can double</p>
</li>
<li><p>The player can split equal cards as many times as they'd like</p>
</li>
<li><p>The player can only split aces once, and can only receive one additional card for each ace</p>
</li>
<li><p>The player cannot surrender</p>
</li>
<li><p>The player plays the <em>optimal strategy</em> as shown in <a target="_blank" href="https://www.blackjackapprenticeship.com/blackjack-strategy-charts/">this</a> diagram.</p>
</li>
<li><p>The player bets $10 per hand</p>
</li>
</ul>
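<p>As one example of how these rules translate into code, here is a sketch of the dealer's decision logic (standing on soft 17). It is an illustrative reconstruction, not the exact code from the project; aces are represented as the value 11 and downgraded to 1 when the hand would otherwise bust:</p>

```python
def hand_value(cards):
    """Best total for a hand. Aces start at 11 and are downgraded
    to 1 as needed. Returns (total, is_soft)."""
    total = sum(cards)
    aces = cards.count(11)
    while total > 21 and aces:
        total -= 10  # downgrade one ace from 11 to 1
        aces -= 1
    return total, aces > 0  # soft if an ace still counts as 11

def dealer_should_hit(cards):
    total, soft = hand_value(cards)
    # Dealer hits below 17 and stands on any 17+, including soft 17
    return total < 17

print(dealer_should_hit([11, 6]))  # soft 17 -> stand (False)
print(dealer_should_hit([10, 6]))  # hard 16 -> hit (True)
```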
<h3 id="heading-simulation-logic">Simulation Logic</h3>
<p>Once we have the logic of the game properly coded, the rest is easy: <em>just press run 100,000 times and manually record the results with a pen and paper</em>. Kidding, of course. We can simply write a little bit of code that will run the game as often as we'd like, while recording the relevant information. But what kind of "relevant information" do we need? We decided to record the following information for every hand that was played:</p>
<ul>
<li><p>The sum of the player's first two cards</p>
</li>
<li><p>The sum of all the player's cards (when he is done playing)</p>
</li>
<li><p>The dealer's first card</p>
</li>
<li><p>The sum of all the dealer's cards (when he is done playing)</p>
</li>
<li><p>A boolean where <em>true</em> signifies that the player split his cards during this specific hand, and <em>false</em> signifies that the player did not split his cards</p>
</li>
<li><p>The result (W, L, T)</p>
</li>
<li><p>The money won or lost</p>
</li>
<li><p>The cumulative total money won or lost for a given player</p>
</li>
</ul>
<p>There are three other things that we record for every hand, which can be confusing to explain, so I would like to elaborate on them here:</p>
<ol>
<li><p>The simulation number (one game)</p>
</li>
<li><p>The hand number</p>
</li>
<li><p>The meta simulation number (one player)</p>
</li>
</ol>
<p>The <strong>simulation number</strong> is used to track which iteration of the game we are currently on. For example, let's say the player is playing his 54th game (in other words, his 54th simulation), and is showing an [8,8]. He then decides to take one more card and gets to [8,8,5] for a total of 21. In this case the simulation number would be 54 since this was the 54th simulation for this particular player.</p>
<p>Using the example from above, his <strong>hand number</strong> would be "0", since this was the first hand he played in the 54th simulation number. Most of the time, the hand number will be "0", since players usually play one hand per game. However, for example, if a player is showing an [8,8] and decides to split his cards, he will then be given 2 hands: [8,3] and [8,5]. Now the [8,3] will be categorized as hand "0", and the [8,5] will be categorized as hand "1".</p>
<p>Finally, the <strong>meta simulation number</strong> categorizes a given player. As you'll see in the next section, we simulate the game of Blackjack for 100 different players, or, in other words, we have 100 different meta simulations. Let's say player 3 is playing his 55th game and has a [6,5,10], his meta simulation number would be 3, the simulation number would be 55, and the hand number would be 0.</p>
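<p>The three identifiers fall out naturally from three nested loops. The sketch below uses a stub in place of the full Blackjack logic; the <code>play_game</code> function and its split probability are purely illustrative:</p>

```python
import random

def play_game():
    """Stub standing in for the full Blackjack logic: most games return
    one hand, and an occasional 'split' game returns two."""
    n_hands = 2 if random.random() < 0.03 else 1
    return [{"result": random.choice("WLT")} for _ in range(n_hands)]

records = []
for meta_sim in range(1, 101):      # meta simulation number: player 1..100
    for sim in range(1, 101):       # simulation number: game 1..100
        for hand_number, hand in enumerate(play_game()):  # hand 0, 1, ...
            records.append({"meta_sim": meta_sim, "sim": sim,
                            "hand": hand_number, **hand})

print(len(records))  # at least 10,000: one row per hand, extras from splits
```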
<p>Putting everything together, we get a table that looks like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717993361272/ed324eeb-7b8c-417a-b06f-099cf86e2de6.png" alt class="image--center mx-auto" /></p>
<p>To help read it, the first row shows that for meta simulation number 1 (player 1), in their first game (simulation number 1), and for their first hand (hand 0), their first two cards gave them a total of 16. Since the dealer was showing a 10, the optimal strategy made them "hit", which put them at a total of 26. The final result is an "L" (since their sum went above 21), so the player lost $10 and is currently sitting at -$10 in cumulative earnings.</p>
<hr />
<h3 id="heading-results-first-simulation">Results - First Simulation</h3>
<p>The first simulation runs the Blackjack game 100 times (100 <strong>simulations</strong>), records the results, and then does this 100 times (100 <strong>meta simulations</strong>). Therefore, in total, there are 10,000 simulations being played (with 10,293 hands since the player split a total of 293 times throughout the simulation). The purpose of running the simulation this way is twofold:</p>
<ol>
<li><p>To mimic a scenario where a single person plays for 2 hours, which comes down to approximately 100 hands</p>
</li>
<li><p>To simulate observing 100 different people, each playing 100 hands</p>
</li>
</ol>
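<p>Structurally, the simulation is just two nested loops: one over players (meta simulations) and one over games (simulations). The sketch below shows the idea, with a placeholder <code>play_hand</code> function standing in for the full Blackjack logic:</p>

```python
import random

def play_hand(rng):
    # Placeholder for the full Blackjack hand logic; here each hand
    # simply wins, loses, or ties a $10 bet at random.
    return rng.choice([10, -10, 0])

def run_simulations(n_players=100, n_games=100, seed=0):
    rng = random.Random(seed)
    results = []  # one cumulative-profit trajectory per player
    for player in range(n_players):    # the meta simulations
        total = 0
        trajectory = []
        for game in range(n_games):    # the simulations per player
            total += play_hand(rng)
            trajectory.append(total)
        results.append(trajectory)
    return results

profits = run_simulations()
```

Each inner list is one coloured line in the figure below: a player's running profit over their 100 games.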
<p>With each coloured line representing a given player in the figure below, we plot the total profit per player over 100 hands, assuming the player bets $10 per hand:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717916027481/c39df27d-66db-46cf-baba-87164a7a76c1.png" alt class="image--center mx-auto" /></p>
<p>It's interesting to note that the graph doesn't clearly show that the house has an edge. There are a lot of players that are above the $0 line (meaning they secured a profit at the end of the 100 simulations). In fact, in this simulation, we find that out of the 100 players:</p>
<ul>
<li><p>45 made a profit, where the average profit was $0.87 <strong>per hand</strong></p>
</li>
<li><p>54 made a loss, where the average loss was $0.87 <strong>per hand</strong></p>
</li>
<li><p>1 broke even</p>
</li>
</ul>
<p>From the points above, it looks like the player wins 45% of the time. However, this is not the case, as the following table shows the number of individual hands that were won, lost, and tied, along with the average profit:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Result</td><td>Count</td><td>Count (%)</td><td>Average Profit</td></tr>
</thead>
<tbody>
<tr>
<td>Win</td><td>4,449</td><td>43.22%</td><td>$11.84</td></tr>
<tr>
<td>Loss</td><td>4,895</td><td>47.56%</td><td>-$10.93</td></tr>
<tr>
<td>Tie</td><td>949</td><td>9.22%</td><td>$0</td></tr>
<tr>
<td><strong>Total</strong></td><td><strong>10,293</strong></td><td><strong>100%</strong></td><td></td></tr>
</tbody>
</table>
</div><p><strong>QUESTION</strong>: Even though the Player wins at a lower rate than the dealer (43.22% vs 47.56%), the average gain is greater than the average loss ($11.84 vs $10.93). So what's the final result? Are we profitable?</p>
<p>To answer that question, we can plot the distribution of the <strong>average profit per hand</strong> for every player. Essentially, we look at a given meta simulation <em>i</em> (which represents a player), and we apply the following formula:</p>
<p>$$\frac{\text{(Total Profit or Loss)}_i}{\text{(Total Number of Hands)}_i}$$</p><p>So, if a player ended with a profit of $50 after 100 simulations (and 105 hands), then his average profit per hand would be:</p>
<p>$$\frac{\$50}{105\text{ hands}} = \$0.48\text{/hand}$$</p>
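<p>In code, this is a one-line calculation per player, sketched here with the $50 / 105-hand example:</p>

```python
def avg_profit_per_hand(total_profit, total_hands):
    # Average profit per hand for one player (one meta simulation)
    return total_profit / total_hands

# The example above: $50 profit over 105 hands
print(round(avg_profit_per_hand(50, 105), 2))  # → 0.48
```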
<p>We do this for each player (each meta simulation), and we plot them in the following bar chart:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717919095116/94e20971-f15a-4b32-abb9-0acd4cdd3af4.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Keep in mind these values are at the "per hand" level. So if the average loss per hand is $3, then this would come to roughly $300 total loss if a player played 100 hands.</div>
</div>

<p>In the plot above, we can see, for example, that at the leftmost point roughly 2.5% of players ended their night with an average loss per hand between -$3 and -$2. However, the red dotted line shows that the average result is a $0.08 loss. We can now answer the question from above.</p>
<p><strong>ANSWER</strong>: There are two opposing dynamics: the dealer tends to win more hands, but when the player wins, their winnings are typically larger. The question is, which dynamic has a greater impact overall? Unfortunately, even though the player wins more money during a winning hand, <strong>the dealer wins too many hands for the player to remain profitable</strong>, which is why we see an average loss of $0.08 as shown in the figure above.</p>
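<p>As a quick sanity check, we can recover the same figure from the results table earlier: weighting each outcome's average profit by its frequency gives the expected profit per hand.</p>

```python
# Expected profit per hand, using the frequencies and average
# payoffs from the results table above
ev = 0.4322 * 11.84 + 0.4756 * (-10.93) + 0.0922 * 0.0
print(round(ev, 2))  # → -0.08
```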
<p>You may be wondering, isn't the sample size of 10,000 total simulations a little small? Maybe this is just a fluke? You may be correct, so let's run this simulation 1,000,000 times.</p>
<h3 id="heading-results-second-simulation">Results - Second Simulation</h3>
<p>In this new simulation, we keep everything the same, except we allow each of the 100 players (meta_simulations) to play 10,000 simulations (instead of 100). This will give us a more accurate depiction of the long-term reality when playing Blackjack.</p>
<p>Let's start by plotting the profit per player across the 10,000 simulations:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717956858777/5e3bbc4f-741b-452a-ab8b-d410f1db586c.png" alt class="image--center mx-auto" /></p>
<p>In contrast to the first simulation, we can now see more clearly that most players end up below the $0 profit line. In fact, in this simulation, we find that out of the 100 players:</p>
<ul>
<li><p>25 made a profit, where the average profit was $0.08 <strong>per hand</strong></p>
</li>
<li><p>75 made a loss, where the average loss was $0.13 <strong>per hand</strong></p>
</li>
</ul>
<p>This also shows that the more hands a player plays, the less likely it is that they will end the day with a profit. Even though the casino's edge is very small, it adds up quite quickly in the long run.</p>
<p>When we rerun the distribution plot (as shown below), we see a similar result, where the average loss per hand holds steady at $0.08.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1717957250920/5a540219-3e98-4022-b3fd-5f801d8abe47.png" alt class="image--center mx-auto" /></p>
<p>The distribution is significantly narrower compared to the previous simulation. Earlier, the average profit per hand ranged from -$4 to +$4. Now, it fluctuates only between -$0.30 and +$0.20. This reduction in variance is due to the increased number of simulations played (10,000 versus 100). With more simulations played, the results tend to cluster closely around the true average profit per hand.</p>
<p>Consider this analogy: if you flip a coin 10 times, it’s believable to get heads more than 70% of the time (8 or more heads). The probability of this occurrence is roughly 5%. However, if you flip the coin 1,000,000 times, it becomes virtually impossible for heads to appear more than 70% of the time (700,000+ heads). The probability is nearly 0%, as this outcome would be around 400 standard deviations away from the mean. This is an example of the <a target="_blank" href="https://www.britannica.com/science/law-of-large-numbers">Law of Large Numbers</a>, which was proven by the Swiss mathematician Jakob Bernoulli in 1713 and is widely used in statistics, economics, and mathematics.</p>
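<p>We can verify the coin-flip numbers directly. Getting strictly more than 70% heads in 10 flips means landing at least 8 heads, and the exact binomial probability works out to about 5.5%:</p>

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# More than 70% heads in 10 flips = 8, 9, or 10 heads
print(round(prob_at_least(8, 10), 3))  # → 0.055
```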
<h3 id="heading-conclusion">Conclusion</h3>
<p>Our analysis shows that, for this specific type of Blackjack game and assuming the player uses the optimal strategy described <a target="_blank" href="https://www.blackjackapprenticeship.com/blackjack-strategy-charts/">here</a>, players will, on average, lose $0.08 per $10 hand. <strong>This indicates a house edge of about 0.8%</strong>. This edge is highly specific to the conditions outlined at the start of the article. If any of these conditions change—such as a 6 to 5 Blackjack payout, the number of decks used, or restrictions on splitting—the house edge will also change. However, regardless of the Blackjack variation, the house will always have an advantage. The only way to potentially shift the odds in your favor is through card counting, which allows you to adjust your strategy based on the current state of the deck. Implementing a card counting feature in the code could be an interesting next step, to explore when the odds turn in the player's favour.</p>
<p>Thank you for reading the very first blog to be posted on mathieutorchia.com. I am excited to explore more questions and learn new methods in Python along the way. Follow my <strong>GitHub</strong> <a target="_blank" href="https://github.com/mathieutorchia/blackjack_simulator">here</a> to see the Python code that made this article possible (I am still learning the ins and outs of GitHub, so please bear with me - if anyone has any suggestions, I am all ears). And of course, feel free to reach out if you have any questions or comments!</p>
]]></content:encoded></item></channel></rss>