
Wednesday, November 22, 2017

Frequentists, Bayesians, and the Law of Large Numbers

Here is a question I answered a while back on Math StackExchange, wherein the asker seemed confused about the undue philosophical credit that the Law of Large Numbers (LLN) supposedly lends to frequentism by essentially proving it. Does it, in fact, say that the probability of any event must correspond to its relative frequency? Of course, the LLN does no such thing: since the LLN is a theorem of a purely mathematical theory, it cannot force us to say anything concrete about the real world, and in particular it cannot tell us what interpretation we should give the word "probability". The asker actually gave a very insightful argument: if we can give two different formal probability distributions to the same real event, both of which satisfy the formal axioms of probability theory, how can we possibly interpret them both as real-life frequencies? I expound upon this argument in my answer.
Anyway, it turns out the confusion was caused by the unavoidably imprecise wording that arises when one tries to state the LLN in layman's terms, as is often the case in mathematics (... and in many other fields -- but that's a topic for another time). 

In the following aside, I define the two key terms -- "frequentist" and "Bayesian" -- in the way I've used them in my answer.
                                                                                                   
First, it is important to note that both camps agree on the theory of probability itself -- a purely mathematical theory needing no particular interpretation. It is typically given by Kolmogorov's axioms, although other equivalent axiomatizations are possible. Classically, Bayesians interpret probability in a natural way: probability is a measure of a degree of belief. Frequentists, on the other hand, assert that the probability of X is nothing more than the relative frequency of X: i.e. the limit of the ratio of X events to the total number of events.
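(For concreteness, Kolmogorov's axioms say that a probability measure $P$ on a sample space $\Omega$ satisfies, for any event $E$ and any pairwise disjoint events $E_1, E_2, \ldots$:

$$P(E) \ge 0, \qquad P(\Omega) = 1, \qquad P\Big(\bigcup_{i} E_i\Big) = \sum_{i} P(E_i).$$

Nothing in these axioms says what the number $P(E)$ is supposed to mean.)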

But stare at these definitions a little longer and something strange appears. The distinction seems trivial; after all, we are separating people into two camps based on how they use the word "probability"! Surely one wouldn't think to classify people based on what they think "atheism" means, right? (Incidentally, they do tend to divide neatly into named groups, but that's beside the point.) Upon further inspection, though, we might discover the intention behind such a shorthand definition.
It's not so much about the word "probability" as about what "probability" means to us, feeling-wise. That is, probability can be considered a kind of quale: a seemingly natural quality (or perhaps quantity?) embedded in human experience, but one which we have difficulty formalizing. Whatever probability is, it corresponds to a "measure of belief", and both Bayesianism and frequentism are attempts to give it rigorous structure. Frequentists avoid talking about belief entirely and go about it with an ad-hoc approach, instead talking about the relative frequency of some event X -- which, if we think about it, should exactly inform our belief in event X: "It happens 50% of the time? What other degree of belief should we assign it, if not 50%?"
But beyond that, they can't say much -- thus one-off events (will the sun rise tomorrow?) are off limits unless one can devise a suitable series of trials that effectively relates to them (how often do stars go supernova?).
Bayesians just axiomatically TALK about belief directly, which isn't as problematic and subjective as it seems -- we do the same when we speak English. Language is supposedly a subjective assignment of words to sensory images in our minds, yet we tend to have a standard, a collective consensus. In the same way, when a Bayesian statistician says something will happen with 50% probability, we understand that their 50% is probably the same as what we ourselves, in our own minds, mean by 50% (this can actually be checked more rigorously by betting arguments -- see "Dutch book").
Additionally, it is worth noting that the divide between Bayesians and frequentists isn't as sharp as one might imagine. Bayesians don't necessarily disagree that relative frequency informs belief, as in the 50% example above (betting arguments lend this further support). Another interesting way in which they coincide is that both are, in a sense, very mathematical. Frequentists seek an objective definition for a subjective concept and, lacking one, characterize it conservatively -- only with what they know works, even when the result seems weaker and thus convoluted to apply in real scenarios. For a classic parallel, see the 20th-century attempt to define "computability", i.e. what an "effective method" or algorithm is in mathematics: Turing machines are simple and conservative, and we would certainly call everything they compute "computable"... but do they characterize all computable functions? That is, is Turing computability extensionally equivalent to the intuitive notion, if not a definition of it? The professional consensus is yes, in light of many other characterizations of computability turning out to be equivalent. Whether frequentism achieves the same level of success is open to debate. 
And on the other hand, Bayesians exemplify the axiomatic method. Under a Bayesian framework, we don't need to objectively understand what belief is: we may still attach numbers to it, operate with a simple set of rules, and end up with a coherent system. Bayesians do with "belief" what axiomatic geometry does with the concept of a "line": they leave it undefined as a primitive notion. 
                                                                                                   

The actual answer from MSE follows:

You are correct. The Law of Large Numbers does not actually say as much as we would like to believe; the confusion arises because we ascribe too much philosophical importance to it. There is a reason the Wikipedia article puts quotes around 'guarantees': nobody actually believes that a formal theory, on its own, guarantees anything about the real world. All the LLN says is that some probability, left uninterpreted, approaches 1 -- nothing more, nothing less. It certainly doesn't prove that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, uses the term P() in its own statement.
I will use this version of the LLN:
"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."
Interpreting "probability" in the frequentist sense, it becomes this:
Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."
Interpret the statement: "... Now if we were given enough instances of n-element samplings, the proportion of those that closely resemble (within $\epsilon$) the original frequency distribution approaches 1. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both n and the number of instances."
You can imagine it like a table. Suppose, for example, that our coin lands T-H with a 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're doing several samplings in parallel. Now add more columns (more tosses to each sequence) and more rows (more sequences). As we do so, count the number of rows whose frequency distribution is nearly 50-50 (within some $\epsilon$), and divide by the total number of rows. This number should approach 1, according to the theorem.
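Here is a minimal simulation sketch of that table (Python with NumPy; the choice of $\epsilon = 0.05$ and the particular table sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05  # a row counts as "correct" if its head-frequency is within eps of 0.5

for n_tosses, n_rows in [(10, 1_000), (100, 10_000), (1_000, 100_000)]:
    # Each row is one sequence of fair-coin tosses (1 = heads, 0 = tails).
    table = rng.integers(0, 2, size=(n_rows, n_tosses))
    head_freq = table.mean(axis=1)                  # per-row relative frequency of heads
    frac_correct = np.mean(np.abs(head_freq - 0.5) < eps)
    print(f"n = {n_tosses:5d}: fraction of near-50-50 rows = {frac_correct:.3f}")
```

As n grows, the printed fraction climbs toward 1, exactly as the interpreted statement predicts.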
Now some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising: if you look closely at our frequentist interpretation, we assumed that our coin lands T-H with a 50-50 relative frequency. In other words, we have already assumed that any particular sequence of tosses will, with logical certainty, approach a 50-50 frequency split. So it should not be surprising when we say, with logical certainty, that a progressively larger proportion of these toss sequences will resemble 50-50 splits as we toss more in each and recruit more tossers. It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).
So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).
To be clear: under a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this.
It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" in the LLN to mean "... approaches certainty", but leave the inner statement "relative frequency dist. resembles probability dist." up to a (different) interpretation. Then of course you get "the relative frequency dist. resembles the probability dist. in the limit". It's as if you had a limit of an integral of an integral, deleted the outer integral, and applied the limit to the inner integral.
Interestingly, if you interpret probability as a measure of belief, you get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects the actual belief measures in its relative frequencies, within $\epsilon$ error' approaches certainty as we choose bigger samples." However, this is still different from "Samples, as they get larger, approach the actual belief measures in their relative frequencies." As an illustration, imagine two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $= p$, taking $p_n$ to be the constant $p$ for simplicity), whereas the former does not. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.
In fact, you were on the right track with the "absurd belief" argument. Suppose probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations of probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two *models* of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and the other satisfies "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is *independent*, since either is consistent with the theory. Thus Kolmogorov's probability theory is not powerful enough to prove a statement of the form "frequency approaches probability". (Now, if you were to force the issue by declaring "probability should equal relative frequency", you would have trivialized the issue by baking frequentism into the theory: the only possible models of such a theory would be frequentist ones, or something isomorphic to them, and the statement becomes obvious.)



Saturday, September 23, 2017

Learning hyperplanes

You don't need very many points to learn a lot! Suppose you have a given "experience space". 
Each point admits a radius of applicability -- an experience allows you to extend it to slightly different test cases within some radius of difference e. 

But say you're given two points far apart. That means you can probably interpolate between these two points to construct a "line" covering a large number of test cases. And since each point on this line also has a radius of applicability, we essentially have a thick line. We can also extrapolate -- extend the line past the given points in either direction. 

Example: Suppose you have no idea what "taste" is like. You taste coffee, and you only understand things sufficiently "like" coffee. Then perhaps you taste hot chocolate. You interpolate between the two points, so you can now "recognize" things like frappuccinos and other sweet drinks.
Further, you can extrapolate to the extremes: now that you understand that coffee is more "bitter" than hot chocolate, you can extend from coffee to even more bitter things, like espresso, or go the opposite way from hot chocolate toward sweeter drinks. However, you can't yet imagine things like fruit juice -- that requires an additional dimension (sour). Of course, it is a fallacy to assume that the natural-seeming taste basis -- sweet, sour, salty, bitter, spicy -- is the only one. Perhaps we could do some sort of principal components analysis. It might also be that our brains are structured so that, upon tasting coffee and hot chocolate, we choose this natural basis to compare the two: i.e. we break the experience down and say "coffee is more bitter than hot chocolate". Either way, we can at least state that we cannot learn any direction along which coffee and hot chocolate project to the same value. For example, since coffee is about as "sour" as hot chocolate, we can't exactly learn "sourness". More precisely: fruit juice will come as a complete surprise to someone who has tasted only hot chocolate and coffee.

It may seem like we're defining "learning" or "covering" as "not being surprised when we encounter it", but we can actually extend this kind of analysis to different examples of "learning". For example, with people: "learning some point (x,y,z)" in this case might mean "knowing how to act around a person with personality values (x,y,z)". 

Let us analyze more carefully how it is possible to construct "lines". It's quite simple: we follow the same analysis we do in geometry with vectors. We take the vector defined by the two points and add copies of it to one of our points -- we take "that quality that differentiates coffee from hot chocolate" and "add it to coffee" to get espresso. The picture is kind of like: 

HOTCHOC ---------------> COFFEE 
Then just copy/extend:
HOTCHOC ---------------> COFFEE -----------------> ESPRESSO
Of course we're essentially doing x = HOTCHOC + t(v_0), where v_0 = COFFEE - HOTCHOC; with t = 1 we get COFFEE and with t = 2 we get ESPRESSO. But let t vary over a continuum and we get a line. 

Again it is important to note that we in fact have more than a line -- we have a "thick line", since around every point there is a radius of applicability. So it's more like a tube. 
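Here is a minimal sketch of the idea (Python with NumPy; the taste axes, the particular drink vectors, and the radius are invented for illustration):

```python
import numpy as np

# Taste axes: (sweet, sour, bitter). All numbers are made up for illustration.
hot_choc = np.array([0.9, 0.1, 0.3])
coffee   = np.array([0.5, 0.1, 0.7])

v0 = coffee - hot_choc        # "that quality that differentiates coffee from hot chocolate"
radius = 0.15                 # radius of applicability around the learned line

def covered(x):
    """Is experience x inside the 'tube' around the (unbounded) line through the two points?"""
    t = np.dot(x - hot_choc, v0) / np.dot(v0, v0)   # project x onto the line hot_choc + t*v0
    nearest = hot_choc + t * v0
    return np.linalg.norm(x - nearest) <= radius

espresso    = hot_choc + 2.0 * v0          # extrapolate past coffee toward "more bitter"
frappuccino = hot_choc + 0.5 * v0          # interpolate between the two experiences
fruit_juice = np.array([0.7, 0.9, 0.0])    # strongly "sour" -- a direction we never learned

print(covered(espresso))      # True: on the learned line
print(covered(frappuccino))   # True: between the two points
print(covered(fruit_juice))   # False: far from the tube -- "a complete surprise"
```

Espresso and the frappuccino fall inside the tube built from just two experiences, while fruit juice, which differs along the untouched "sour" axis, does not.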

An aside: the sort of thinking wherein you consider events/objects/etc. as tuples is obviously common in fields such as machine learning, but I'd like to point out another way it illuminates a particular phenomenon. 
As with music theory and learning, we notice that music was made perfectly well long before any real music theory was developed (I mentioned this in a previous post). If we model this in terms of a vector space: when humans developed music, we were discovering some particular subset of a space; when we developed music theory, we stepped back, looked at our subset, and performed something like principal components analysis to break it down into simpler parts. Again, it is tempting to claim that the eigenvectors given by such a method are somehow canonical or natural, in the sense that they ARE why we developed music a certain way, but that's not necessarily so!
For an analogy: there are several symmetric ways to look at an n-gon, and there are several bases for a vector space, some more convenient than others in particular cases. (In more mathematical detail, there are conjugate transformations that express a reflection along another line of symmetry in terms of the original reflection and a "change of coordinates", just as there is a change of coordinates that expresses a linear transformation as a simple shear.) In the same way, there might be another nice basis for music that more clearly explains why we call certain things music (i.e. they belong to the subset, i.e. we "recognize" them) and other things not (i.e. they don't belong) -- in other words, a basis that is more natural. Or perhaps there is no real "basis" at all -- some other mechanism (whim?) causes us to identify what's music and what's not. Nobody says that we identify music by projecting the item onto some vector subspace. 
Anyway, we can probably extend this beyond problems of "recognizing" to a lot of different skills: musical instruments, sports, etc. consist of a family of individual micro-skills that are often, after the fact, distilled down to core skills and given structure. Mathematics itself is built the same way -- the clean organization and categorization of subfields didn't always exist, nor does it need to! 

Appendix:
Define "recognize": Essentially tasting them will be "familiar" -- won't feel new -- you can construct it as a point on the line. 
Define "radius of applicability": A point has a "radius of applicability". If another point (i.e. experience) is sufficiently "similar" to our "learned" point in that we may also say this new point is roughly "learned", then we say that this new point is within the first point's "radius of applicability". We could hypothetically also quantify "roughly learned" with some finer model, but we want to begin with a simple model. 

Monday, April 3, 2017

Change-of-definition does not preserve structure

There will often be words that academics use differently from laymen. For example, many social scientists take "racism" to be defined in such a way that an underprivileged class cannot, by definition, be racist. The layman definition is simpler, though it varies from person to person. But universally, to any layman, "racism" carries a strong, personal, negative stigma that "prejudice" does not; being called racist is very much an insult. Defining racism so that black people cannot be racist, then, provides a kind of immunity, since being called "prejudiced" doesn't leave the same bad taste in the mouth. So it is not entirely productive to insist, mid-conversation, on a definition that isn't widely accepted, especially when that definition does not preserve fairness in the discussion.

The situation is similar in the religion debate. The atheist community disagrees with both philosophical circles and laymen on the terms "atheist" and "agnostic". To the atheist community, the word "atheist" simply means one who does not believe in God. However, for a long time the word "atheist" has meant, to both philosophers and laymen, strictly a person who disbelieves in God, with the word "agnostic" reserved for those who hold no belief either way. It was in 1972 that the philosopher Antony Flew attempted to change the definition of atheism to the broader sense. On this, Uri Nodelman, editor of the Stanford Encyclopedia of Philosophy, writes:

Not everyone has been convinced to use the term in Flew's way simply on the force of his argument. For some, who consider themselves atheists in the traditional sense, Flew's efforts seemed to be an attempt to water down a perfectly good concept. For others, who consider themselves agnostics in the traditional sense, Flew's efforts seemed to be an attempt to re-label them "atheists" -- a term they rejected.


Basically, Flew's definition is favorable to the atheist community, since it labels more people "atheists". But in the view of those who call themselves agnostic in the layman's sense, this definition takes away their ability to distinguish themselves from atheists -- i.e. they lose expressibility. As such, the atheist community's definition of atheism goes for the most part unused.

We see another example in the feminist movement. According to feminists, a feminist is simply one who believes in equal rights for men and women. Yet plenty of people hesitate to call themselves feminists -- and if such a person happens to be a celebrity, or especially a female celebrity, they are scorned for it. It is no wonder, then, that the public's definition of feminism differs from the movement's own. To the public, feminism means actual advocacy: the word refers to a movement, not an idea. However, it is advantageous for the movement that the word be associated with the idea, since the movement would then receive a free upgrade from a controversy-stirring movement to an idea impervious to assault.

The problem is that definitions aren't constructed by textbooks or academics. Formal definitions are post hoc: they are written down and systematized only after a large part of society has started using a word in some regular fashion. That definitions are innately human, determined by layman consensus, is essentially a tautology -- it's simply how words are used. It's not particularly right or wrong; it's just that attempting to write down your own definition to suit your needs, and then getting everyone else to conform to it "because it's right", is missing the forest for the trees.

We have an analogy in music. Music was written, and existed, long before music theory. Much like dictionary definitions, music theory is post hoc, a fact that is emphasized in music theory courses. Yes, theory provides a nice standard, but it doesn't make sense to preach theory to the thousands of brilliant musicians who don't know how to read music. Additionally, it would make much less sense to construct a music theory that disagrees with how the majority of musicians see music. Science seeks to explain the world via theory. Certain parts of social science do the opposite -- they attempt to conform the world to their theory. Perhaps this is why many do not regard social science as a true science?

In the case of atheists and feminists, it can be argued that their changes of definition serve mainly to facilitate communication amongst themselves and to further their groups' causes. From their perspective, this is inherently good and useful. However, it also changes the dynamic of the discussion, by disenfranchising those who wish to distinguish themselves from the group despite not opposing it completely.

In short, adopting a definition does not preserve the status quo. Like a change of basis, a change of definition can add to or detract from expressibility and nuance, and so change the power dynamic in social dialogue.

Monday, January 16, 2017

Optimizing rote memorization

My grandfather has been working at the DFW airport as a janitor for years. Recently, his coworkers all got a 75-cent raise, from $9.25 up to $10. There was one caveat: you have to pass what seems to be a written exam on airport security (the questions relate to TSA policy and the like). The only problem is, my grandpa doesn't speak English, so even new employees are now being paid more than he is. What makes the situation worse is that the exam is taken on a computer -- my grandpa hardly knows how to move a mouse.

However, we have something we can work with: a list of handwritten questions and answers. My grandpa asked that I print it out in a bigger font so he can read it.
But I want to take it further: how can I create a study plan that guarantees his success?

We have the following facts:
1. He cannot understand written English.
2. He can, however, recognize English letters and the sounds they make.
3. We have a list of questions, paired with answers (TRUE or FALSE).
4. He hasn't been in school for a really long time.
5. He would have no trouble understanding this if it were in Korean.

Let us examine the list. An obvious first approach would be to memorize each question and its corresponding answer. But we immediately see that this isn't necessary: each question can be identified by less than its full text. The question now is: is there a uniform minimal "key length" -- a number of words to memorize from each question that suffices to identify it? For example, suppose we are given the following sentences:

· Bob is a chef.
· Bob is not a chef.
· Alice is a chef.
· Alice is not a chef.
Notice that memorizing only the first word of each question is not sufficient to distinguish the questions, since two of them start with "Bob". We quickly see that 2 words are not enough either, but 3 work -- so 3 is our key length.
After we establish the key length, we need only to learn to associate the keys -- "Bob is a" is one in this example -- with the correct answer (in our case, TRUE or FALSE).
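Here is a small sketch of how one might compute this uniform key length (Python; the function name and the simple word-level tokenization are my own choices, not from the original):

```python
def uniform_key_length(sentences):
    """Smallest k such that the first k words identify each sentence uniquely."""
    tokenized = [s.split() for s in sentences]
    for k in range(1, max(len(t) for t in tokenized) + 1):
        prefixes = [tuple(t[:k]) for t in tokenized]
        if len(set(prefixes)) == len(sentences):
            return k
    return None  # some sentences are identical

sentences = [
    "Bob is a chef.",
    "Bob is not a chef.",
    "Alice is a chef.",
    "Alice is not a chef.",
]
print(uniform_key_length(sentences))   # 3 -> "Bob is a", "Bob is not", ...
```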

The only problem is, the set of questions we are working with has a minimal key length of almost the entire length of some of the sentences! So here is the situation:
· Bob is a chef.
· Bob is not a chef.
· Alice is a chef.
· Alice is not a chef.
· Carol is an absolutely fantastic chef who lives in New York.
· Carol is an absolutely fantastic chef who lives in New Jersey.
Notice that our minimal uniform key length here would be the entire length of the last two sentences, since they differ only in their final word. This means we would effectively be memorizing the first four sentences in full.
But naturally, one would quickly realize that it's better to memorize the first 3 words of the first 4 sentences and the last word of the 5th and 6th. In other words, the uniform key length restriction is not only inefficient, but also artificial.
Once we drop the restriction, we need a method to generate a key for each sentence. A key must be unique, and the method must be effective: that is, there needs to be a fast, human-computable function that takes a sentence and yields its unique key. Essentially, we have compressed our data.

If we restrict our range to "substrings starting from the beginning or the end", an example mapping is not hard to figure out:

·  Bob is a chef.  => Bob is a
·  Bob is not a chef. => Bob is not
·  Alice is a chef. => Alice is a
·  Alice is not a chef. => Alice is not
·  Carol is an absolutely fantastic chef who lives in New York. => York
·  Carol is an absolutely fantastic chef who lives in New Jersey. => Jersey

The corresponding method to compute this map: start from the beginning of the sentence and try to recognize a unique prefix; if that fails, start from the end and try to recognize a unique suffix.
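A sketch of this prefix-then-suffix key generation (Python; again, the function name and tokenization are my own):

```python
def keys_by_prefix_or_suffix(sentences):
    """For each sentence, return the shortest word-prefix that no other sentence
    shares; if no proper prefix is unique, fall back to the shortest unique suffix."""
    tok = [s.rstrip(".").split() for s in sentences]
    keys = []
    for i, words in enumerate(tok):
        others = [t for j, t in enumerate(tok) if j != i]
        key = None
        for k in range(1, len(words)):                      # try prefixes first
            if all(t[:k] != words[:k] for t in others):
                key = " ".join(words[:k])
                break
        if key is None:                                     # no unique proper prefix
            for k in range(1, len(words) + 1):              # then try suffixes
                if all(t[-k:] != words[-k:] for t in others):
                    key = " ".join(words[-k:])
                    break
        keys.append(key)
    return keys

sentences = [
    "Bob is a chef.",
    "Bob is not a chef.",
    "Alice is a chef.",
    "Alice is not a chef.",
    "Carol is an absolutely fantastic chef who lives in New York.",
    "Carol is an absolutely fantastic chef who lives in New Jersey.",
]
for s, key in zip(sentences, keys_by_prefix_or_suffix(sentences)):
    print(f"{s} => {key}")
# Bob is a chef. => Bob is a
# ...
# Carol is an absolutely fantastic chef who lives in New York. => York
```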

The great benefit of using substrings as keys is that the key actually exists inside of the sentence already, so computing the function is reduced to a recognition task – once you spot the substring in the input, you output it.

[An aside: Human recall/recognition is what I like to call "partially fast" -- if something exists in our memory vault, we're usually pretty quick to recognize it, but if it doesn't, the process is a lot less reliable*. We can loosely model this with an associative memory: lookup by cue is usually really fast, but if the item isn't there (or we lack a good cue) we might have to search through countless locations. And that's exactly the thing: we're bad at searching, since we don't have location-addressable memory. We see this effect in test taking: multiple choice is far easier than free response, since the latter requires searching through a swath of potentially relevant information. Whether we retrieve the proper information depends on whether the information has the proper associative links (i.e., X reminds me of Y which is related to Z, so I recall Z), and on whether our mind has enough time, energy, or luck to traverse the right connections. Autoassociative tasks, on the other hand, are easy as long as the cue is sufficiently clear.]

But we have pigeonholed ourselves into learning only by syntactic methods. There are various other, more natural methods, including but not limited to: sounding out the sentences and mapping the sounds to the correct answer, or interpreting the sentences semantically. The benefit of sounding out the words is that my grandpa already knows how to rudimentarily translate groups of letters into sounds, so that function is already there. And sounds tend to be more familiar, which might help with memorization.

My grandpa also suggested a method: he wants a translation of all the sentences into Korean, along with a sounded-out phonetic transliteration. There are two ways we can implement this. The first would be to take an entire sentence and map it to its meaning. To me this is horribly inefficient -- if you can map it to a meaning, why not map it directly to the binary value we need in the first place, TRUE or FALSE?
The second way makes a little more sense. You take some significant words from each sentence -- in our example they might be "chef", "Bob", "Alice", "fantastic", etc. -- and map these to the meaning. This is essentially dimensionality reduction: each sentence is treated as the combination of a few special "parts", in this case a small set of important words which we have him learn. The caveat is when the sentences don't share much of a common basis -- then memorizing the words and their meanings becomes an extra layer of inefficiency. So this method is situational.

When a student speaks of "understanding", dimensionality reduction may be what he or she is trying to do. Memorization of independent facts, or even of raw strings, is no intellectual challenge, but it is time-consuming. So the mind longs for a coherent system, one that is smaller than the initially presented, overwhelming batch of apparently true sentences. Surely there must be a simple framework from which all of this was derived, the brain thinks.
But perhaps it's not only about smallness; there's also the factor of familiarity. We are already familiar with a space of countless principles and ideas -- so can we "embed" these new concepts as a subspace thereof?

In summary, when it comes to choosing a method for a memorization task, we have to keep two things in mind:
Dimensionality: Are the objects we are memorizing (sounds, words) complex or simple? That is, how large is the smallest set of objects (basis) that form the original set by combinations? For instance, our example with Alice, Bob, and Carol turned out to have a fairly low dimensionality, since for our problem we only had to recognize a few key words.
Familiarity: Are the objects we are memorizing familiar to us – i.e. do the objects or related objects already exist in memory to any extent?
For example, if we substitute every letter of an English sentence with some unique hieroglyph or code number, the dimensionality stays the same (we're just applying a one-to-one transformation), but the familiarity is drastically reduced. It is probably easier for most people to memorize "fnefhew" than "5 4 1 5 9 1 8", so it might be easier to learn the encoding and decode each number into a letter first, especially if you have lots of these numbers. This is probably why sounding out the words works as a method, and why mnemonics actually help you memorize things despite often being longer than the original object. In the same way, memorizing the sentences by sound seems advantageous.


*Although we seem quick to declare "I don't recognize this", this is not the same as saying "I have never seen it". One may sometimes not recognize a person's face until much later.