Monte Carlo or Bust

Joseph Buchdahl
Description

Almost everyone is familiar with Monte Carlo's association with gambling, and its famous Casino. Many may also have come across the Monte Carlo fallacy, so-called after the Casino's roulette wheel ball fell on black 26 times in a row, costing players, who believed that the law of averages made such streaks impossible, millions of dollars. However, the Casino also lends its name to a tool of statistical forecasting, the Monte Carlo simulation, used to model the probability of uncertain outcomes that cannot be easily predicted from mathematical equations. This book provides a detailed account of how aspiring sports bettors can use a Monte Carlo simulation to improve the quality, and hopefully profitability, of their betting, and in doing so unravels the mystery of probability and variance that lies at the heart of all gambling.




PRAISE FOR JOSEPH BUCHDAHL

‘Joseph delivers on his promise to familiarise anybody with an interest in betting or investing with the workings of the betting mind through an abundance of practical examples… A book you can’t afford to miss’

– Pinnacle on Squares & Sharps, Suckers & Sharks

‘How to Find a Black Cat in a Coal Cellar ranks amongst the more important books on sports betting’

– Betfair Pro Trader

‘Fixed Odds Sports Betting is one of the best books on betting and statistical analysis’

– A Football Trader’s Path

CONTENTS


Praise for Joseph Buchdahl

The Story of the Monte Carlo Simulation

A Little Probability & Statistics

How to Build a Monte Carlo Simulation

Prediction Models

Winning

Losing

Staking

Tipping

Odds & Sods

A Game of Luck or Skill

A Cautionary Tale

Also by Joseph Buchdahl

About the Author

Copyright

THE STORY OF THE MONTE CARLO SIMULATION

The name ‘Monte Carlo’ (literally Charles' Mountain, after Prince Charles III of Monaco) has been synonymous with gambling ever since the opening of its Casino de Monte-Carlo in 1863. It lends its name to the Monte Carlo fallacy, otherwise known as the gambler’s fallacy or the fallacy of the maturity of chances, the erroneous belief that if a particular event has occurred more frequently than normal during the past, it is less likely to happen in the future, and vice versa. On the fateful night of 18 August 1913, the roulette ball kept landing on black spin after spin. The longer the sequence continued, the more people started to take notice and place bets on red, believing that such an unlikely sequence could not possibly continue if reds and blacks occur about half of the time each, discounting the influence of the green zero. After the 26th consecutive black, with a probability of less than 1 in 136 million, a lot of roulette players had lost a lot of money. It did not stop there; having seen a long sequence of blacks end on the 27th spin, some players now believed it would be followed by another long sequence of reds to redress the balance.

Belief in the Monte Carlo fallacy stems from a mistaken interpretation of the law of large numbers, more commonly and wrongly understood as the law of averages, where the individual believes that, following an unlikely sequence of events, things must even out to ensure that observations match expectations. Independent events like coin tosses or roulette wheel spins, however, do not have memories. There is nothing compelling them to return towards an expected average, just a mathematical tendency for this to happen as the sample of observations becomes larger and larger.

Monte Carlo is also famous for its car rally, first raced in 1911 and immortalised in the 1969 film Monte Carlo or Bust! Whilst this book has nothing to do with car racing, the film has afforded me the opportunity to come up with a snappy book title; all the more fortunate since it was originally intended to be called Rome or Bust! So why Monte Carlo? In addition to the fallacy, the municipality’s connection to gambling also lends its name to a computerised method of repeated random sampling used to obtain numerical results when more formal mathematical approaches prove too difficult. It is a technique used to understand the impact of uncertainty in prediction and forecasting. It helps us define the most likely or expected outcome, for example the result of a tennis match given some quantified superiority of one player over another, or the most likely betting returns from a series of wagers given some information about the predictive abilities of the bettor. In addition to the most likely outcome, we can also use it to estimate the range of possibilities that surround it, which is very often more informative than simply knowing what is most probable. Since chance and random outcomes are central to the modelling technique, much as they are to games like roulette and dice played at the Monte Carlo Casino, it was perhaps an obvious choice of name. Indeed, its 1946 origin story is all about cards, as we shall see.

While there is some debate about the nature of the first application of the Monte Carlo method, with some suggesting its use may date back as far as the times of the ancient Babylonians, it is generally accepted that the first modern Monte Carlo experiments were carried out during the latter part of the 18th century. One notable example was the Comte de Buffon’s needle problem: what is the probability that a needle thrown randomly on to a horizontal plane ruled with parallel straight lines will intersect one of them, assuming the length of the needle is less than the distance separating the lines? The problem was named after Georges-Louis Leclerc, a French polymath, also known as the Comte de Buffon, who first proposed the thought experiment. Amongst his other scientific exploits was his estimation of the age of the Earth at about 75,000 years, at a time when the 18th century consensus was that, following the Old Testament, it could not be older than about 6,000 years. Whilst his needle problem can be solved precisely with integral geometry, a simpler way to estimate a solution is to throw a sample of needles on to the surface and count how many of them intersect a line. By repeating this many times and calculating an average, one can arrive at an ever more refined estimation of the probability.
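This counting approach is easy to reproduce on a computer. Below is a minimal sketch, in Python rather than the Excel used later in this book, which throws virtual needles of length 1 between lines spaced 2 apart; integral geometry gives the exact crossing probability as 2L/πd, here 1/π, or about 0.318:

```python
import math
import random

def buffon_estimate(n_throws, needle_len=1.0, line_gap=2.0):
    """Estimate the crossing probability by throwing n_throws random needles."""
    hits = 0
    for _ in range(n_throws):
        # Distance from the needle's centre to the nearest line, and a random angle.
        centre = random.uniform(0, line_gap / 2)
        angle = random.uniform(0, math.pi / 2)
        # The needle crosses a line if its half-length projection reaches the line.
        if centre <= (needle_len / 2) * math.sin(angle):
            hits += 1
    return hits / n_throws

exact = (2 * 1.0) / (math.pi * 2.0)  # 2L / (pi * d), about 0.318
for n in (100, 10_000, 1_000_000):
    print(n, buffon_estimate(n), "exact:", round(exact, 4))
```

Notice how the estimate tightens as the number of throws grows; this is the law of large numbers at work, exactly as described above.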

[Figure: estimating a probability by scattering grains at random over a ruled surface. Adapted from Johansen, Adam M. (2010) Monte Carlo methods. In: Baker, Eva L., Peterson, Penelope L. and McGaw, Barry (eds.) International Encyclopedia of Education (3rd Edition). Burlington: Elsevier Science, pp. 296-303.]

A single run of this experiment might not provide a reliable answer. Perhaps the grains happened to clump more in the middle than at the edges, just because of luck. To increase the accuracy of the estimate, one would repeat it many times. The greater the number of runs, iterations or samples, the more confident we can be that the influence of luck in the way the rice grains land has been eliminated, provided there is no underlying bias in the way we are throwing them.

This repeated random sampling is the basis of the Monte Carlo method. It was not until the middle part of the 20th century, however, that the Monte Carlo method gained its name and popularity as a technique for solving deterministic problems probabilistically. Stanisław Marcin Ulam, a Polish-American scientist who had worked on the Manhattan Project to develop nuclear weapons towards the end of the Second World War, was, in January 1946, recovering from surgery following a bout of encephalitis when his mind wandered on to the topic of calculating the chances of winning a game of Canfield Solitaire. Named after noted gambler Richard Canfield, owner of the Canfield Casino in Saratoga Springs, New York, at the end of the 19th century, the game is notoriously hard to win, with only about 1 in 30 attempts successful. Ulam recounted his inspiration as follows:

‘After spending a lot of time trying to estimate them by pure combinatorial calculations, I wondered whether a more practical method than “abstract thinking” might not be to lay it out say one hundred times and simply observe and count the number of successful plays.’

Ulam was suggesting that, with a bit of code replicating the rules of play, a computer could be used to simulate the evolution of many games far more quickly than one could play a series of hands oneself. Having done so, it would then be a simple matter of counting how many of the played hands ended with a successful completion to estimate the probability of it happening. Obviously, the more repetitions in the simulation, the more reliable the estimate will be. The Manhattan Project had been a motivating force behind the development of computers, and Ulam realised that the availability of such computing power made such a statistical method easily achievable. Computers such as ENIAC, or Electronic Numerical Integrator and Computer to give it its full title, were being designed with military purposes in mind. The simulations that were run on such computers were thus regarded as secret government work, and hence needed a code name. ‘Monte Carlo’ was chosen as a nod to the Monte Carlo Casino, where Ulam’s uncle, borrowing money from relatives, would gamble. It would seem to represent a very appropriate choice.

The Monte Carlo method is now used in many fields of investigation where uncertainty of outcome plays a significant role, including finance, weather forecasting (where it is known as ensemble forecasting), engineering and the development of artificial intelligence, to name but a few. It has even been used in baseball to show that the sacrifice bunt, where a batter aims to advance his fellow team players to other bases, often at his own expense, is an ineffective strategy. We can use it in betting too. In this book, I will look at how simple Monte Carlo simulations can be used to assist the bettor in a number of domains: forecasting outcomes, expectations about winning and losing, the role of money management, the influence of luck, and the assessment of touts and tipsters, amongst other things. I will do this with the aid of arguably Microsoft’s best consumer product, Excel, the ubiquitous spreadsheet tool first released in 1985, which organises data in columns and rows that can be manipulated through formulas to perform mathematical functions on the data. Whilst there are other, more powerful data analysis and programming packages available, like SPSS, SAS and R, they may require a more comprehensive mathematical and programming background to use comfortably. Excel, however, is easier to learn and has a proven longevity, having been made available via Microsoft’s suite of Office software. It’s probable that most of you reading this will have used it at one time or another, or will be familiar with its basic functionality on a more regular basis. And it’s great for organising the sorts of data – bets, odds, stakes, profits, losses etc. – that you will be handling if you have any aspirations of becoming a more serious bettor. Throughout, I have assumed that readers will have a basic working knowledge of Excel’s functionality. If you don’t, it doesn’t take much to self-teach; that, after all, is how I acquired it, and what you don’t know already can easily be Googled.

In writing this book, I’ve attempted to do something that, thus far, I possibly haven’t been particularly good at, but which a friend of mine challenged me to try: to explain betting to those who know nothing about probability. This is like trying to explain skiing to someone who’s never seen snow. I don’t think that’s practically possible, so instead, I’m going to try to explain the world of probability from the bottom up, in the easiest possible terms, introducing new concepts gradually without, hopefully, losing too many readers along the way. Explaining betting without explaining probability is pointless, because betting odds, after all, are just another way of representing probability. Any sensible and serious discussion about betting, therefore, must always begin with understanding the mathematics of likelihood: statistics.

The poet John Lydgate once famously said, “You can please some of the people all of the time, you can please all of the people some of the time, but you can’t please all of the people all of the time”. I fear with this project this is where I will end up. There will be some who will chastise me for encouraging anyone, via the subtitle of this book, to aspire to be a bettor. To those I say this: yes, betting has its dangers, and unfortunately for a few this will come to impact their lives, and the lives of those close to them, in unpleasant ways. But for most, it can be fun, provided it is not indulged excessively and beyond one’s means, and for some, if they are prepared to put in a little effort, it can be rewarding too, even if not necessarily profitable. But a word of caution here, and again, I can hear the criticism ringing in my ears. Betting to win, that is to make a consistent and sustained profit over the long term, is exclusively the domain of the “one percenters”. It takes hard work to make it pay. There are reasons for this, which I intend to address within. I apologise if you find this message discouraging, but I’d rather be a realist than an idealist when it comes to telling the story about betting.

Then there will be those with a suitable mathematical training who may consider most of what follows largely superfluous and obvious. For those, I hope they might at least find something new and of value inside. Finally, there will be those who either can’t remember how to multiply together two fractions or have little inclination to want to learn; they will probably lose the will by the second page of the next chapter. For you, unless betting is simply a recreational pastime (and that’s perfectly acceptable, by the way), I would suggest not betting ever again. Otherwise, I hope that you will make the effort to learn new things. The reason I’ve chosen to tell this betting story via the Monte Carlo simulation is that perhaps it, more than any other tool, allows me to find a balance between these mathematical backgrounds. It’s sophisticated enough to provide a meaningful interpretation to some of the ideas that bettors deal with, yet simple and intuitive enough to easily understand what it tells you without having to know any of the mathematics that powers it. Despite this, however, you will always be better prepared to take on the bookmaker if you know a little maths. Thus, I will begin, first, with a little primer on probability. By the end of the next chapter, hopefully the mathematical novices amongst you will be more comfortable with statistics, the concept of expectation and a probability distribution.

A LITTLE PROBABILITY & STATISTICS

A few things are certain. We might include in that subset the sun rising tomorrow, or that it will not rain at the South Pole, although even those events technically have a finite but infinitesimally small chance of being false. But many things we experience in everyday life are uncertain; will it snow on Christmas Day, will I pick a King when drawing a card from a deck, will Liverpool win the Premier League again? We humans like to quantify the likelihood of uncertain things happening. It gives us a sense of being in control, even if, ultimately, we often prove to be wrong about the numbers. The likelihood or chance of something happening has, since its development in the 18th century, been described by the mathematics of probability.

Probability

On probability, Wikipedia has this to say:

Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes (‘heads’ and ‘tails’) are both equally probable; the probability of ‘heads’ equals the probability of ‘tails’; and since no other outcomes are possible, the probability of either ‘heads’ or ‘tails’ is ½ (which could also be written as 0.5 or 50%).

For bettors who simply bet on one single thing at a time, for example the winning team in a football match or the correct score, this is pretty much the only probability rule they need to know. Unsurprisingly, such bets are known as singles. The hard part, of course, is figuring out how to calculate, or forecast, those probabilities. Others, however, may prefer betting on more than one thing at the same time, for example the outcome of two matches together. Provided they are independent of each other, with the outcome of one match not influencing the outcome of the other, then the probability of both occurring is described by the multiplication rule.
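Written as a formula, the multiplication rule says:

p(A and B) = p(A) × p(B)

For example, the probability that two fair coin tosses both land heads is ½ × ½ = ¼.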

The final probability rule that will prove useful for bettors to know is the addition rule. Where two events, A and B, are mutually exclusive, that is to say they cannot occur at the same time, the addition rule states that the probability of either A or B occurring is given by p(A) + p(B).

For events that are not mutually exclusive, for example drawing a queen or a spade from a deck of cards, the rule changes subtly. Here, one of the possible cards satisfies both conditions. Now we must use the following rule:
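p(A or B) = p(A) + p(B) − p(A and B)

For the queen-or-spade example, that is 4/52 + 13/52 − 1/52 = 16/52, or 4/13.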

Odds

So much for probability; but don’t bookmakers quote things in odds? Yes, they do, but odds are just another way to describe the likelihood of something happening; that is to say, the probability. Bettors in the UK are familiar with expressions like 2 to 1 against (written as 2-1, 2:1 or 2/1). These don’t mean quite the same thing as a 1-in-x chance, and hence don’t exactly correspond to the probability fractions, even though this odds notation is known as fractional. Here, 2 to 1 against means that 2 out of every 3 times we expect our forecast outcome not to happen, whilst 1 out of every 3 times we expect it to occur. Consequently, odds of 2 to 1 against imply a probability of 1/3. Conversely, 2 to 1 on (written as 1-2, 1:2 or 1/2) would imply that 2 out of every 3 times our forecast outcome will happen, and 1 out of every 3 times it will fail to.

If the sum of the probabilities for all possible outcomes of an event is 100% (or 1), then why do the probabilities implied by bookmakers’ odds come to more than 100%? Let’s look at an example. The final of the French Open in 2020 was played between Nadal and Djokovic. The bookmaker bet365 quoted decimal odds of 1.72 and 2.1, respectively. This implies that bet365 believed Nadal had a 1/1.72 or 58.1% (or 0.581) chance of victory, whilst for Djokovic it was 1/2.1 or 47.6% (or 0.476). Summing the two makes 105.7% (or 1.057). That makes no sense; you can’t have the probabilities for all possible outcomes sum to more than certainty, right? The answer to the original question is to be found by remembering that bookmakers are not charities designed to give you a fair chance of winning. They are businesses which exist to make money for the effort they go to in offering people bets in the first place. Instead of charging you an entry or subscription fee, they charge you by shortening the odds. The amount they do this by can be seen by the size of the excess beyond 100%. In this case the excess is 5.7% (or 0.057). This is called the bookmaker’s margin. In reality, bet365 probably believed that Nadal had a 56% chance of winning, Djokovic 44%. Had they quoted fair odds without their margin included, we would have seen 1/0.56 or 1.79 and 1/0.44 or 2.27.
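Here is that arithmetic as a short Python sketch; note that scaling the implied probabilities so they sum to 1 is just one common convention for removing the margin, and, as noted above, the bookmaker’s actual assessment may distribute it unevenly:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to the probabilities they imply (margin included)."""
    return [1 / o for o in decimal_odds]

odds = [1.72, 2.10]             # Nadal, Djokovic
implied = implied_probabilities(odds)
overround = sum(implied)        # about 1.0576
margin = overround - 1          # about 0.0576, the 5.7% quoted above

# One simple way to strip the margin: scale so the probabilities sum to 1.
fair_probs = [p / overround for p in implied]
fair_odds = [1 / p for p in fair_probs]

print(f"margin = {margin:.2%}")           # 5.76%
print([f"{o:.2f}" for o in fair_odds])    # about 1.82 and 2.22
```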

The bookmaker’s margin provides a measure, albeit indirectly, of how much profit they are aiming to make. Sometimes you might hear the term ‘overround’. The overround is simply the sum of the probabilities, or the margin plus 100%, in this case 105.7%. Confusingly, you might also have come across the term ‘vig’, short for ‘vigorish’. Its usage is more common in America. The vig is analogous to the margin but not precisely synonymous. It is a direct measure of the bookmaker’s expected percentage profit on the total stakes taken on an event. The vig and margin are related in the following way:
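vig = margin / (1 + margin), and conversely margin = vig / (1 − vig)

In this example, the vig is 0.057/1.057, or about 5.4%.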

Thus, the margin and vig are what are called bijective reciprocals.

If bookmakers’ odds are unfair, how can you make a profit? Well, firstly you can get lucky. Betting is largely a game of chance where you win some and you lose some. The problem is that if you kept betting and betting many times like this, in the end all the good and bad luck would cancel out and you’d end up losing an amount dictated by the bookmaker’s margin, or more precisely the vig. Betting, unlike roulette, however, can also be a game of skill, although it’s a rather difficult game to become good at. This possibility arises because the true probability of an event in sports cannot be known perfectly, unlike in roulette where simple mathematics allows one to calculate the odds exactly. Given this, the possibility always exists that the bookmaker has made a mistake. The skilled bettor’s job is to learn how to find those mistakes. Sometimes they are large enough that even after the bookmaker has applied their margin, the odds will still be longer than the true odds (whatever they may be). Suppose in this example Nadal really had a 60% chance of winning, meaning he really would win 60 out of every 100 matches played against Djokovic in exactly the same circumstances, more than bet365 believes. OK, so we still don’t know which ones he wins and which ones he loses, and luck will dictate in the short term how well we will do, but we do now know that 60 times out of every 100 we will make a profit of £0.72 for a £1 stake, whilst the other 40 times we’ll lose £1. A quick summing up of the net profits and losses reveals that, overall, we should make a net profit of £3.20 for every £100 staked. This sum is known as our expected profit. It’s not guaranteed, because good and bad luck can influence it just as they do for a coin toss, but the mathematics of the probabilities tells us that this is the profit we should expect to make on average after good and bad luck have cancelled out. We will return to the concept of expectation again a little later. For now, there is an important take-home message: the accuracy of our expected profit calculation depends entirely upon our accuracy in ‘knowing’ the ‘true’ chances of Nadal beating Djokovic. (I use inverted commas to remind you that, in reality, knowing the true probability perfectly is impossible; I will review why a little later.) If we’re wrong, then what we expect to happen may be far removed from what actually ends up happening, on average. Furthermore, the problem of good and bad luck in the short term (in fact even over quite long terms, as I’ll show later in the book) will often have us deceived.
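Expressed generally, for decimal odds o and a ‘true’ win probability p, the expected profit per unit staked is p(o − 1) − (1 − p), which simplifies to po − 1. Here that is 0.6 × 1.72 − 1 = 0.032, or £3.20 per £100.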

Some statistics

To many non-mathematicians, ‘statistics’ can be a dirty word used in conversations to argue that your opponents can basically say anything they like because the ‘statistics’ prove their case. Perhaps there is a kernel of truth to that, but for the most part statistics should simply be seen as a way of organising, analysing, describing, and interpreting data to help answer questions that you may be asking about uncertain events. For example, when attempting to forecast the likelihood that a team or player you want to bet on will win, you might want to know how many times they have won in the last 4, 6 or 10 games. Perhaps you might also want to know how many goals a football team has scored in the last 10 games. If the total is 30, then that tells us the average is 3 goals per game, since 30 divided by 10 is 3. You might also want to see how those goals have been scored in the past 10 games. It’s highly unlikely that the team scored 3 goals in every game. Perhaps they won their most recent 4 games 6-0 and then scored only once in each of the 6 games before that. There are statistics that can give us useful information about how those goals have been distributed. Perhaps most significantly of all, statistics can help us unpick the competing influences of luck and skill that are so deeply intertwined in betting.

Broadly speaking, there are two types of statistics that will concern us here: descriptive statistics, which summarise the nature of the data, like the average or by how much it varies; and inferential statistics, which attempt to infer or draw conclusions from data that are subject to random variation. Random is just another word for chance or luck, implying no cause, or at least hidden causes that we are unable to ascertain. For Henri Poincaré, the famous 19th-century French mathematician, luck was simply a measure of our own ignorance.

‘Every phenomenon, however trifling it be, has a cause, and a mind infinitely powerful, and infinitely well-informed concerning the laws of nature could have foreseen it from the beginning of the ages. If a being with such a mind existed, we could play no game of chance with him; we should always lose.’

So much of what happens in sport is luck. Think of a tennis player with a 70% first serve percentage. 70% of the time they will make a first serve. But what determines whether their next serve will be successful or a fault? There are so many factors – hidden variables, as Poincaré might call them – which will dictate the outcome, operating in a sequential line of cause and effect. The speed and trajectory of the ball toss will depend on the state of the player’s arm muscle fibres, the positioning of the opponent, the movement of the air and so on. The way the racket connects with the ball, and ultimately the speed and direction it is sent, will depend on the movement of the serving arm, the server’s eyes and perhaps even the movement on the opposite side of the court of the opponent. All these influences will, in turn, depend on the nerve impulses operating in the server’s brain, which send signals to the nerves of the muscles engaged in the serving action. Tiny differences in the starting conditions of any of these can, in some instances, magnify through the cascade of events that takes place during the serve, to the point where we might see a completely different outcome. Colloquially, this process has come to be known as the butterfly effect, from the idea that the simple air perturbation arising from the flapping of a butterfly’s wings could, two weeks hence, result in a hurricane thousands of miles away. In chaos theory, the butterfly effect describes the sensitive dependence on initial conditions in which a small difference in one state of a system can result in large differences in a later state. If we could know precisely all these tiny initial differences in the server’s action, the motivations behind them, and how their influence cascades through the system over time, we’d know for every serve whether it would be successful or not. But we can’t possibly know this much; only Poincaré’s hypothesised infinitely powerful and infinitely well-informed mind could. All we can know is that over a large number of previous first serves, the player has historically been successful 70% of the time, from which we infer that they have a 70% chance of being successful with their next serve. It is the imprecision of knowledge about causes that creates uncertainty, and it is this uncertainty which means we must use the language of probabilities, not guarantees or certainties, to describe outcomes in sports, and hence outcomes in betting too.

Never mind a single tennis serve; imagine the number of hidden variables operating in a 90-minute football game with 22 players on a 100m by 65m football pitch making 1,000 passes of the ball that could potentially influence the outcome. The number and nature of these competing but hidden variables is so immense that it is simpler to give them all a name: random. The word is often understood to mean uncaused. In the context we are talking here, that is not quite correct; certainly, Poincaré would say so. Rather the word is best used to describe a process that is sufficiently complex, such that the outcome is completely unpredictable given the information we have. For our purpose that will suffice, although it is worth noting in passing that at the scale of the subatomic (quantum mechanical) world, the meaning of ‘random’ can indeed imply ‘causeless’. Since big things like players’ brains are made up of little things like atoms and quarks, philosophically speaking, at least, we might be forced into redefining what ‘random’ actually means. But this is neither the time nor the place for a philosophical debate about causality. For another book, maybe.

If sporting outcomes are heavily influenced by random variables, that must mean the bets we strike on them are heavily influenced by them too. Statistics provides us with the tools to reveal what influence they have and gives the bettor some means of separating luck from forecasting skill, if indeed they have any. The job for the bettor is to try to uncover as many hidden variables as they can to make a better estimate of the ‘true’ probability of an outcome than the bookmaker. The task is fraught with danger. We might, for example, fail to pay attention to a highly significant variable. Suppose our tennis player’s opponent is a top 10 player. Against top 10 players they have a first serve success percentage of 60%, because they try harder to serve faster and with more precision against a better player. If we happen to overlook this variable, the quality of the opponent, we are likely to draw less reliable conclusions than our bookmaker, and that can cost us financially.

We can, in addition, draw incorrect inferences from correlations between variables. Correlation, a mutual relationship or connection between two (or more) things, doesn’t always imply one of them caused the other; when it doesn’t, it is known as a spurious correlation. Some spurious correlations are obvious, and funny, for example the number of Americans who drowned by falling into a pool correlates very well with the number of films Nicolas Cage appeared in between 1999 and 2009. Only a fool, however, would believe Nicolas Cage’s level of creative output had an influence on pool drownings. Others, however, are trickier to spot, particularly when the patterns which emerge from correlations make it easier to infer a causal relationship. Between 2012 and 2017, teams that played away in the English League 2, the fourth tier of English professional football, won so often that, even after taking into account the bookmaker’s margin, you could have made a good profit just betting on all of them. But does that mean something had changed in League 2 that caused teams playing away to win more often than the bookmaker believed, and which, furthermore, the bookmaker hadn’t noticed? Almost certainly not. This profitable run, from the point of view of the bettor, was almost certainly nothing more than pure luck. In the end, if there is no causation, good luck should eventually run out, as it did in this case.

Let’s look at some descriptive statistics. Recall the 10-game goal-scoring record, where our team has scored 30 goals. Dividing 30 goals by 10 games gives a figure of 3 goals per game. In common parlance, this would be known as the average. Statisticians, however, use different types of average, more technically known as measures of central tendency, which mean slightly different things, just as the Inuit have different words for ‘snow’ and Hawaiians have different words for ‘wave’. These different averages can prove more, or less, useful depending on the context and the data that you have. Our 3-goals-per-game average here would be known to a statistician as the ‘mean’, or more specifically the ‘arithmetic mean’. An arithmetic mean is calculated by adding several quantities together and dividing the sum by the number of quantities. Provided there are 30 goals scored over 10 games, you will always calculate the same arithmetic mean regardless of the games in which those goals were scored. Even if the team scored 30 goals in 1 game and none in the other 9, the arithmetic mean is still 3. There are three other means: the weighted mean, the geometric mean and the harmonic mean, but it would be rare for a bettor to ever need these, so I won’t waste time now describing them. Feel free to look them up. I’ll introduce one or two of them later as and when I need them. Where the word ‘mean’ is used on its own, it’s safe to say that it is the arithmetic mean that is implied. It is certainly the most commonly used. Recall from a little earlier the expected profit of £3.20 for every £100 bet. ‘Expected’ is just another way of saying ‘mean’. We didn’t know in which order we would win and lose bets on Nadal, but our expectation was that, on average, we would make £3.20 for every £100 bet, or £0.032 for every £1 wagered. ‘Expected’ and ‘mean’ essentially mean the same thing, pardon the pun.

Sometimes, however, the arithmetic mean is not a particularly useful way of describing the average. Suppose in the 10 games our team scored the following goals: 3, 0, 1, 0, 11, 8, 0, 0, 6, 1. An unlikely set of goal tallies, granted, but it will help to describe two additional averages. Here, you will notice that as many as 6 of the goal counts are below the arithmetic mean and only 3 of them above it. For such a skewed set of data, the arithmetic mean is not as useful at describing the central tendency. Instead, we might observe that the most common number of goals scored is 0; it happens in 4 out of the 10 games. This would be called the ‘mode’, the most frequent value in a set of data. It describes what our team does most often. Another type of average is known as the ‘median’. It describes the middle number in a given set of data when it is ordered sequentially. Re-ordering the goals scored we have: 0, 0, 0, 0, 1, 1, 3, 6, 8, 11. With an even number of data points there are two middle numbers, here the 5th and 6th values, both of which are 1, so the median of the data is 1. Like the mode, the median is typically a better measure of the central tendency of some data when that data is skewed or asymmetric. In the chapter on staking, we will see that the median can be a much better measure of profit expectation than the mean where the amount you bet is measured in percentages rather than fixed units.

The central tendency of a set of data is just one way to describe it. Another is knowing how it varies, or how it is spread. The most basic way is by means of the ‘range’. The range is simply the difference between the highest and lowest values, in this case 11. It’s descriptively informative but not very powerful. A measure of variability far more widely utilised by statisticians is something called the standard deviation. The standard deviation, written as ‘s’ or with the lowercase Greek letter σ (sigma), is the average amount of variability in your data. It tells you, on average, how far each score lies from the arithmetic mean, which is often denoted by the Greek letter μ (mu). The larger the standard deviation, the more variable or spread out the data is. It’s with the introduction of the standard deviation that most non-mathematicians start to lose the will, and that’s hardly surprising when you see the formula,
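σ = √( ∑(xᵢ − μ)² / n )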

Nevertheless, it is an incredibly important descriptive statistic to know for bettors who are serious about trying to win and profit beyond the simple entertainment that betting may offer, so I will persevere in explaining how it is calculated. You only have to bother reading this once. Thereafter, calculators or Excel can do all the work for you. The important point is that you understand the concept behind it, rather than the mathematics involved. The arithmetic isn’t that hard anyway. You only really need to know what a squared number and a square root are.

Let’s go back to our ten goal tallies. There are 6 steps to finding the standard deviation in these data, all of which are contained in the formula above. Going through them one at a time is much easier than trying to interpret the formula directly if you lack a mathematical background.

1) Calculate the arithmetic mean. We already know this is 3. This is your value of μ in the formula.

2) Subtract the mean from each of the ten goal tallies to calculate each difference, or deviation, from the mean. In the formula, each goal tally is described by the letter x. The subscript i merely indexes the x’s, in this case 10 of them; xᵢ means the ith value of x.

3) Square all the deviations. A square of a number is a number multiplied by itself. 5-squared, written 5², is 25; 10² is 100; (½)² is ¼.

4) Add all the squared deviations together. The shorthand used in the formula is the character ∑. This simply means ‘the sum of’ and is a quicker way of writing the formula than having to write (x − μ)² 10 times in a long line.

5) Divide this number by the total number of data points, n, in this case 10, since there are 10 goal tallies.

6) Calculate the square root of this number. A square root is just the reverse of squaring. For example, the square root of 36, written as √36, is 6.

It is easier to show these steps in a table.
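Goals, xᵢ | Deviation, xᵢ − μ | Squared deviation, (xᵢ − μ)²
3 | 0 | 0
0 | −3 | 9
1 | −2 | 4
0 | −3 | 9
11 | 8 | 64
8 | 5 | 25
0 | −3 | 9
0 | −3 | 9
6 | 3 | 9
1 | −2 | 4
Sum: 30 | 0 | 142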

You may remember that squaring a negative number makes a positive number, since any two negative numbers multiplied together make a positive one. That is the point of squaring: the raw deviations from the mean always sum to zero, so their sum would be a useless statistic for measuring the variability of a set of data. Our goals data has a sum of squared deviations of 142, hence an average squared deviation of 14.2, and finally a standard deviation, σ, of the square root of 14.2, written √14.2, which equals 3.77. This means that, on average, a goal tally deviates from the mean by 3.77 goals.

Sometimes there is a slight difference to step 5, depending on what type of data you have. If the data being analysed represent a population on their own, we divide by the number of data points, in this case 10. More usually, however, the data will represent a sample taken from a larger population. For example, the population might be all 38 goal tallies scored by a team in a Premiership season, from which we select a sample of 10. At other times, we may not know the population but assume that our data represent a sample from it; we then draw inferences from its descriptive statistics and extrapolate those to the population. In such cases, the number we use to divide the sum of the squared deviations by is n − 1. In this case it would be 9, and we use the symbol ‘s’ instead of ‘σ’, which is reserved for the population standard deviation. Why we divide by n − 1 and not n for samples is rather complex and beyond the scope of this book. All that you need to observe is that as the sample size, n, increases, the closer s approaches σ, since n − 1 as a proportion of n tends increasingly towards 1.
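Both conventions are built into most software. A quick check in Python (rather than Excel) on the goals data:

```python
import statistics

goals = [3, 0, 1, 0, 11, 8, 0, 0, 6, 1]

# Population standard deviation: divide the sum of squared deviations by n.
sigma = statistics.pstdev(goals)   # about 3.77

# Sample standard deviation: divide by n - 1 instead.
s = statistics.stdev(goals)        # about 3.97

print(round(sigma, 2), round(s, 2))
```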

Sometimes you may also see the word ‘variance’ used. Descriptively, it is another measure of how much your data vary, or are spread. Mathematically, it is the average of the squared deviations from the mean; in other words, it is just the square of the standard deviation. The more spread out the data, the larger the variance is in relation to the mean. In later chapters you will see how these descriptive statistics are used to describe the data that you will be handling as part of your betting, from building forecast models to analysing profits and losses.

Distributions

The standard deviation and variance of a set of data tell you how that data is spread. However, it is also useful to know how the data is distributed, and most significantly how often each value occurs. Typically, the data in a distribution will be ordered from smallest to largest and displayed by means of a chart or histogram, allowing you to easily see both the values and the frequency with which they appear. A distribution can help you calculate the probability of any one particular observation in some data, or the likelihood that an observation will have a value which is less than or greater than a point of interest.

Let’s plot a simple frequency distribution of the 30-goals data, showing the number of times a particular number of goals were scored over the 10 games. From the distribution below, what is the probability that our team scores 0 goals? If this sample of goals is representative of our team’s goal scoring output over a much larger number of games (the population), we could say that they have a 4-in-10, or 40%, chance of failing to score. To calculate this, simply divide the number of times they failed to score, 4, by the total number of observations, 10. What is the probability our team scores more than 3? Again, we simply divide the number of times in this sample where that has happened, 3, by the total, 10, giving 30%.
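The same two calculations, sketched in Python:

```python
from collections import Counter

goals = [3, 0, 1, 0, 11, 8, 0, 0, 6, 1]
freq = Counter(goals)   # {0: 4, 1: 2, 3: 1, 6: 1, 8: 1, 11: 1}

p_zero = freq[0] / len(goals)                                           # 4/10
p_more_than_3 = sum(c for g, c in freq.items() if g > 3) / len(goals)   # 3/10

print(f"P(0 goals) = {p_zero:.0%}, P(>3 goals) = {p_more_than_3:.0%}")
```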

Of course, in our example, the sample is so small that we probably didn’t need to bother drawing the distribution to make these calculations. When data sets are much larger, however, and particularly when they may be closely represented by well-known data distributions that have predefined mathematical functions that describe them, they can be particularly useful. Let’s look at some of them.

A uniform distribution is a probability distribution in which all outcomes are equally likely. The drawing of a suit from a card deck will conform to a uniform distribution because hearts, clubs, diamonds and spades are all equally likely, each with a probability of 25%. So, too, for drawing any of the 13 card values. On a frequency histogram, every bar would have the same height. In practice, if you were to draw cards, you might see deviations from those expected probabilities. These deviations from expectation are just down to chance, but if you kept drawing cards for an infinite length of time, the heights of the bars would become the same. I drew a card (with replacement after drawing) from a deck 10,000 times – well, actually I simulated it on a computer – and here were the results.
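A minimal Python sketch of that simulation (the book itself works in Excel):

```python
import random
from collections import Counter

suits = ['hearts', 'clubs', 'diamonds', 'spades']
draws = Counter(random.choice(suits) for _ in range(10_000))

# Each count should be near 2,500, but chance guarantees small deviations.
for suit, count in draws.items():
    print(f"{suit}: {count} ({count / 10_000:.1%})")
```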

Look again at the distribution of goal tallies in the histogram earlier. Clearly, that is not a uniform distribution. There are more occasions when fewer (or no) goals are scored than when many are scored. In fact, if we look at goal scoring by football teams in general, this remains the case for large numbers of games. The frequency distribution below shows how often a particular number of goals were scored by Premier League home teams during the 2017/18 to 2019/20 seasons, a total of 1,140 separate goal tallies with a total of 1,754 goals scored.

I’ve shown the frequencies for 0 to 6 goals. There were only two occasions in the three seasons when the home team scored more. We see a similar distribution for away teams, although because of home advantage, there are proportionally more low-score tallies. Again, for clarity, I’ve omitted the single occasion when there were more than 6 goals scored (Leicester’s 9-goal thrashing of Southampton).

Rather than showing the absolute frequencies or counts of different goal tallies, we can instead display them as percentages of the total. For example, 3 home goals were scored on 130 occasions, or 11.40% of all the goal tallies (130/1,140).

These distributions have low arithmetic means, 1.54 goals for home teams and 1.20 goals for away teams. Consequently, they are highly asymmetric, or skewed, where most of the data are pushed towards the left-hand side and there is a longer tail with higher values to the right. This asymmetry also ensures that the median is even smaller than the mean, since a few much larger tallies will introduce an arithmetic bias when calculating the mean. In this case the halfway point in both home and away distributions (at the 570th data point) is 1 goal.

The distribution of goals in football conforms quite closely to something called the Poisson distribution, named after the French mathematician Siméon Denis Poisson, who developed the mathematical function that underlies it. Like the uniform distribution, the Poisson distribution counts the number of things, in this case the number of times a particular tally of goals was scored. Such distributions are called discrete since we are counting discrete events. For the uninitiated, the Poisson distribution function, unlike the uniform distribution function, is algebraically more complex, so I won’t frighten you with it. You can always look up Wikipedia if you’d like to see what it looks like. For our purposes here it’s worth noting one significant aspect of the Poisson distribution: the arithmetic mean is equal to the variance, or the square of the standard deviation. Poisson distributions with a low mean and variance, as above, will tend to exhibit a high degree of asymmetry. Those with a higher mean and variance will be more symmetrical. For my Premier League goals, the variance is actually a little higher than the mean for both home and away distributions, 1.69 and 1.41 goals respectively, but these are not terrible matches and Poisson is not a bad approximation for the distribution of goal tallies in football. I’ve calculated the precise home goals frequencies that would be predicted if the data perfectly fitted a Poisson distribution, and plotted these as a comparison to the original data.

Similarly, for away goals.

In the next chapter I will explain how you can perform this calculation in Excel.
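For readers who prefer to see the calculation in code first, here is a rough Python equivalent (a sketch only; the 1.54 mean comes from the home-goals data above, and Excel’s POISSON.DIST function performs the same job):

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing exactly k events when the average rate is `mean`."""
    return (mean ** k) * math.exp(-mean) / math.factorial(k)

home_mean = 1.54  # Premier League home goals per game, 2017/18 to 2019/20
for k in range(7):
    print(f"P({k} home goals) = {poisson_pmf(k, home_mean):.1%}")
```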

One final observation worth remembering about the binomial distribution is that, whilst in general there is no single formula for its median, if our value for np is an integer or whole number, then the mean, median and mode will all equal np. This means that, in our first example, the mean, median and mode all equal 5, and 3 in the second, asymmetric example. So many betting propositions are fundamentally binary in nature, with just two possible outcomes: win or lose. Consequently, understanding these simple rules behind the binomial distribution can prove immensely useful for bettors analysing their prediction strategies, staking plans and betting histories.
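As a quick illustration of that integer-np property, here is a minimal Python sketch; the n = 10 and p = 0.5 are hypothetical choices made so that np = 5, and are not taken from the book’s own examples:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical instance with np = 5: n = 10 bets, each won with probability 0.5.
n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = n * p                                    # 5.0
mode = max(range(n + 1), key=lambda k: pmf[k])  # 5
# Median: smallest k whose cumulative probability reaches 0.5.
cum = 0.0
for k, prob in enumerate(pmf):
    cum += prob
    if cum >= 0.5:
        median = k
        break

print(mean, median, mode)  # 5.0 5 5
```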