Wednesday, November 22, 2017

Frequentists, Bayesians, and the Law of Large Numbers

Here is a question I answered a while back on Math StackExchange, wherein the asker seemed confused about the undue philosophical credit that the Law of Large Numbers (LLN) supposedly lends to frequentism by essentially proving it. Does it, in fact, say that the probability of any event must correspond to its relative frequency? Of course, the LLN does no such thing; since the LLN is an artefact of a purely mathematical theory, it cannot force us to say anything concrete about the real world, and in particular it cannot tell us what interpretation we should give the word "probability". The asker actually gave a very insightful argument: if we can give two different formal probability distributions to the same real event, both of which satisfy the formal axioms of probability theory, how can we possibly interpret them both as real-life frequencies? I expound upon this argument in my answer.
Anyways, it turns out the confusion was caused by the unavoidably imprecise wording that arises when one tries to state the LLN in layman's terms, as is often the case in mathematics (... and in many other fields -- but that's a topic for another time). 

In the following aside, I define the two key terms -- "frequentist" and "Bayesian" -- in the way I've used them in my answer.
First, it is important to note that both camps agree on the theory of probability -- a purely mathematical theory needing no particular interpretation. It is typically given by Kolmogorov's axioms, although other equivalent axiomatizations are possible. Classically, Bayesians interpret probability in a natural way: probability is a measure of a degree of belief. Frequentists, on the other hand, assert that the probability of X is only a measure of the relative frequency of X: i.e. the limit of the ratio of occurrences of X to the total number of trials.
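For concreteness, here is a rough statement of Kolmogorov's axioms (glossing over the measure-theoretic fine print about which sets count as events): a probability measure $P$ on a sample space $\Omega$ assigns numbers to events such that

$$P(E) \ge 0 \ \text{ for every event } E, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_{i} E_i\Big) = \sum_{i} P(E_i) \ \text{ for pairwise disjoint events } E_i.$$

Nothing in these axioms says what the number $P(E)$ is supposed to *mean* -- that is exactly where the two camps part ways.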

But if we stare at this definition a little longer, we notice something strange. The distinction seems trivial; after all, we are separating people into two different camps based on how they use the word "probability"! Surely, one wouldn't think to classify people based on what they think "atheism" means, right? (Incidentally, they do tend to divide neatly into named groups, but that's beside the point.) But upon further inspection, we might discover the intention behind such a shorthand definition.
It's not so much about the word "probability" but about what "probability" means to us, feeling-wise. That is, probability can be considered an indescribable quale, a seemingly natural quality (or perhaps quantity?) or notion embedded in human experience, but one which we have difficulty formalizing. Whatever probability is, it corresponds to a "measure of belief", and both Bayesianism and frequentism are attempts to give it rigorous structure. Frequentists avoid talking about belief/probability directly and take an ad-hoc approach instead -- they talk about the relative frequency of some event X, which, if we think about it, should exactly inform our belief in event X: "It happens 50% of the time? What other belief should we assign it, if not 50%?"
But beyond that, they can't say much -- thus one-off events (will the sun rise tomorrow?) are off limits, unless we can construct a suitable series of trials that effectively relates to them (how often do stars go supernova?).
Bayesians just literally, axiomatically TALK about it directly, which isn't as problematic and subjective as it seems -- we do the same when we speak English: it's supposedly a subjective assignment of words to sensory images in our minds, but we tend to have a standard, a collective consensus. In the same way, when a Bayesian statistician says something will happen with 50% probability, we understand that their 50% is probably the same as what we ourselves, in our own minds, mean by 50% (this can actually be checked more rigorously by betting arguments -- see 'Dutch book').
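As a rough illustration of the Dutch-book idea (the numbers and setup here are my own toy example, not from the original answer): if someone's degrees of belief in an event and its complement sum to more than 1, a bookie can sell them a pair of bets, each priced at exactly what they themselves consider fair, that loses money no matter which way the event turns out.

```python
# Toy Dutch-book sketch (hypothetical example, assuming the standard betting setup:
# a degree of belief p in E means a $1 ticket on E is considered fairly priced at $p).

beliefs = {"A": 0.7, "not A": 0.5}      # incoherent: 0.7 + 0.5 = 1.2 > 1

ticket_cost = sum(beliefs.values())     # the agent willingly pays 1.20 for both tickets

for outcome in ("A", "not A"):
    payout = 1.0                        # exactly one ticket pays $1, whichever outcome occurs
    print(f"outcome {outcome!r}: net = {payout - ticket_cost:+.2f}")
# Both lines print -0.20: a guaranteed loss, i.e. a Dutch book against the incoherent beliefs.
```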
Additionally, it is worth noting that the divide between Bayesians and frequentists isn't as sharp as one might imagine. Bayesians don't necessarily disagree that relative frequency informs belief, as we saw with the 50% example (betting arguments may lend this further support). Another interesting way they coincide is that both are, in a sense, very mathematical. Frequentists seek an objective definition for a subjective concept and, faced with the lack of one, simply characterize it conservatively -- only with what they know works, even when the result might seem weaker and thus convoluted to apply in real scenarios. For a classical example, see the 20th-century attempt to define "computability", or what an "effective method/algorithm" is in mathematics: Turing machines are simple and conservative, and we would definitely call everything they compute "computable"... but do they characterize all computable functions? That is, is the characterization mathematically equivalent, if not a definition? The answer is yes: professional consensus deems it successful in this regard, in light of many other characterizations of computability turning out to be effectively identical. Whether frequentism achieves the same level of success is open to debate. 
And on the other hand, Bayesians exemplify the axiomatic method. Under a Bayesian framework, we don't need to objectively understand what belief is, since we may still attach numbers to it, operate with a simple set of rules, and end up with a coherent system. Bayesians do with "belief" what axiomatic geometry does with the concept of a "line": they leave it undefined as a primitive notion. 

The actual answer from MSE follows:

You are correct. The Law of Large Numbers does not actually say as much as we would like to believe. Confusion arises because we try to ascribe too much philosophical importance to it. There is a reason the Wikipedia article puts quotes around 'guarantees': nobody actually believes that some formal theory, on its own, guarantees anything about the real world. All the LLN says is that some (uninterpreted) probability approaches 1 -- nothing more, nothing less. It certainly doesn't prove for a fact that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, actually uses the term P() in its own statement.
I will use this version of the LLN:
"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."
Interpreting "probability" in the frequentist sense, it becomes this:
Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."
Interpret the statement: "... Now if we were given enough instances of n-numbered samplings, the ratio of those that closely resemble (within $\epsilon$) the original frequency distribution vs. those that don't approaches 1 to 0. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both n and the number of instances."
You can imagine it like a table. Suppose, for example, that our coin comes up T-H with 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're kind of doing several samplings in parallel. Now add more columns, i.e. add more tosses to each sequence, and add more rows, increasing the number of sequences themselves. As we do so, count the number of rows whose frequency distribution is nearly 50-50 (within some $\epsilon$), and divide by the total number of rows. This number should certainly approach 1, according to the theorem.
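Here is a quick simulation sketch of that table; the function name, the NumPy dependency, and the particular sizes are just my choices, and it assumes a fair coin:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

def fraction_of_rows_near_half(num_rows, num_tosses, eps=0.05):
    """Each row is one sequence of fair-coin tosses; return the fraction of rows
    whose heads-frequency lands within eps of 0.5."""
    tosses = rng.integers(0, 2, size=(num_rows, num_tosses))  # 1 = heads, 0 = tails
    heads_freq = tosses.mean(axis=1)                          # per-row relative frequency
    return float(np.mean(np.abs(heads_freq - 0.5) < eps))

# Grow the table in both directions: more rows (sequences) and more tosses per sequence.
for n in (10, 100, 1000, 10000):
    print(n, fraction_of_rows_near_half(num_rows=n, num_tosses=n))
# The printed fraction creeps toward 1, as the (interpreted) theorem predicts.
```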
Now, some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising, because if you look closely at our frequentist interpretation example, we assumed "Suppose that our coin comes up T-H with 50-50 relative frequency." In other words, we have already assumed that any particular sequence of tosses will, with logical certainty, approach a 50-50 frequency split. So it should not be surprising when we then say, with logical certainty, that a progressively larger proportion of these tossing-sequences will resemble 50-50 splits if we toss more in each and recruit more tossers. It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).
So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).
To be clear: with a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this.
It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" in the LLN to mean "... approaches certainty", but leave the inner statement "relative frequency dist. resembles probability dist." up to a (different) interpretation. Then of course you get "the relative frequency dist. resembles the probability dist. in the limit". It's kind of like having a limit of an integral of an integral, then deleting the outer integral and applying the limit to the inner integral.
Interestingly, if you interpret probability as a measure of belief, you might get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects the actual belief measures in its relative frequencies, within $\epsilon$ error' approaches certainty as we choose bigger samples." However, this is still different from "Samples, as they get larger, approach the actual belief measures in their relative frequencies." As an illustration, imagine you have two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $= p$, taking $p_n$ to be a constant for simplicity), whereas this is not true for the former. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.
In fact, you were on the right track with the "absurd belief" argument. Suppose that probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations of probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two *models* of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and another satisfies the statement "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is *independent*, as either one is consistent with the theory. Thus, Kolmogorov's probability theory is not powerful enough to prove a statement of the form "frequency approaches probability". (Now, if you were to force the issue by saying "probability should equal relative frequency", you'd essentially trivialize the issue by baking frequentism into the theory. The only possible model for this probability theory would be frequentism or something isomorphic to it, and the statement becomes obvious.)
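As a quick sanity check on the "two models" point: for a single coin flip, both of the assignments

$$P_1(H) = \tfrac{1}{2},\ P_1(T) = \tfrac{1}{2} \qquad \text{and} \qquad P_2(H) = \tfrac{1}{\pi},\ P_2(T) = 1 - \tfrac{1}{\pi}$$

are nonnegative and sum to 1 over the two outcomes, so both satisfy Kolmogorov's axioms equally well -- and nothing in the axioms prefers one to the other.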