Saturday, April 11, 2020
birthdays
There's nothing special about being born.
Well... That's not quite true; we (humans) are pretty amazing, if only as information processors, neural networks.
But a neural network is really not all that interesting in a vacuum, is it? What does it do? What is it supposed to do? How is it useful?
In the same way, none of us come into this world with a sense of meaning or purpose (at least I didn't). The way I see it, we come into this world as painters with blank canvases. But we color each others' canvases, not our own. Every greeting, every interaction amounts to a sharing of paint on these canvases. Over the years, the empty canvas fills and transforms into a work of art.
For those who have known me, seen me, interacted with me in some way, the 11th day of April marks the beginning of this life, one of many that would touch and color their lives.
For me, it was the day I joined this weird drunk-painting class.
Like a single violin lifted into a dissonant but beautiful harmony, the singular me is granted meaning through the collection of singular others.
I want to take this day to really notice this fact, really appreciate others for their contribution in creating, for lack of a better word, this work of art. And, as art, whether it's good or bad seems irrelevant. Comparing it to "other pieces" becomes meaningless to me, since there are no other works I can see in the same full detail as my own.
So why not take time on your birthday to meditate on your (but not really your) painting? To open up to yourself, and then open up to others by recognizing that you are part of this beautiful collage?
I often say half-jokingly that mothers are essentially the best machine learning experts. Every child birthed is the result of future-state engineering -- hardware, firmware, software, model implementation. On top of that, mothers need to train their children using supervised learning, reinforcement learning, and sometimes unsupervised learning.
I love nature (modulo humans) as much as anybody. But humans are on a different level... in part because I am one. The complexity you can perceive even through the most basic communication channels: words, speaking, actions... it's mind-blowing. And that, in my opinion, is what makes appreciating this collective of humans so worthwhile. You'll never run out of richness, depth and beauty to explore. It's the combination of complexity and the ability to comprehend that complexity.
Tuesday, July 2, 2019
Aphorisms
It's really funny how useless aphorisms are. There's always a true lesson behind each and every aphorism, but no aphorism is enough to actually teach that lesson.
It's basically a cryptographic hash function, if you think about it.* The profound lesson maps to some trite saying via this mapping we call an "aphorism". Yet the inverse function is ridiculously difficult to compute, unless you... learn the lesson through some other means. That is, the only way to figure out what an aphorism means is to actually learn the lesson behind the aphorism. Thanks, aphorism.
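As a toy sketch of the analogy (the lesson string below is invented purely for illustration, not a real aphorism's meaning): hashing the lesson forward is trivial, but recovering the lesson from the digest is infeasible.

```python
import hashlib

# Hypothetical "profound lesson" -- any string works for the analogy.
lesson = "what you had to figure out yourself"

# The forward map is easy: lesson -> short, opaque "saying".
aphorism = hashlib.sha256(lesson.encode()).hexdigest()[:12]
print(aphorism)

# The inverse map is computationally infeasible: holding only
# `aphorism`, you cannot recover `lesson`. But someone who already
# knows the lesson can recompute the hash and recognize the match.
```

The analogy is loose, of course; the point is only the asymmetry between computing forward and inverting.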
So what, exactly, has the aphorism taught you? What have you gained from the aphorism, besides a common way to express what you had to figure out yourself (with no help from the aphorism), and a sense of pride in finally having that a-ha moment?
Maybe it has some use in serving as a reminder for those who have already experienced it -- meaningless to anyone else, but poignant and deep to those who "get it". Kind of like a souvenir.
*I just realized that my reference to a "cryptographic hash function" is itself an example of this kind of phenomenon for those who don't know what it is...
Monday, July 1, 2019
Souvenir
Do you ever do that thing where you keep a tab open thinking you'll come back to it later, but eventually because of that 50-tab limit you just overwrite it with something else, and something else, and then something else... and before you know it you've completely forgotten what you wanted to remember under a chain of internet history?
Well today, I wanted to remember. So I kept pressing back, back, back in the browser -- and found that I was scrolling through the memories of a couple nights ago, and some other night before that, but backwards.
Then I realized... these were events and experiences with people I love, that I want to remember -- much more so than the title of that book I'd been wanting to buy. By accidentally overwriting something I thought was important, I had kept for myself a souvenir of something truly important.
People know what they want. Anyone reading this will agree that cherished memories are a good thing -- this is little more than a reminder. So what's the problem?
It's more like we forget what we want. You might be so busy optimizing for all the things you think you want that before you know it, you forget to work for things that you should want (scratching an itch doesn't make you any happier after it's gone, but it sure feels good while you're doing it). Or maybe like me, you were never made aware that you wanted it or had to work for it, maybe you never even thought about it. One might call these types of wants "schematically inaccessible".
I wonder if this is the kind of thing people regret on their deathbed. How many people right now are regretting that they forgot to not forget? What other things are we forgetting to want?
Come to think of it, buying souvenirs and taking pictures have never been a priority for me. "Be present", "enjoy the moment", "people will think I'm a tourist".
I feel a bit differently now, but I'm still not going to purchase some generic model Eiffel Tower on a visit to Paris (the picture is obligatory though). But I just might take an escargot shell from that overpriced restaurant to memorialize the funny conversation we were having about snails over lunch, or something to that effect.
A relevant song
The 9/11 Memorial means something special and poignant to a small subset of us. I could empathize with the sadness while I was there... but it's nothing close to how it feels to those who were closer to the event and suffered real loss. My memories are secondhand -- my parents talking about it, videos of the event, stories... the rest I've filled in to the best of my imaginative ability.
As humans we have a tendency to build permanent things to remember other, impermanent things by. It makes sense for those who have the corresponding memory, but can that object really hold the same meaning for their children, much less their grandchildren?
Of course, when those of us who remember die, other memories take their place -- you may have had an exceptionally memorable date on the Broome Bridge, once remembered by the late Sir William Rowan Hamilton as the place he wrote down the now-famous "i² = j² = k² = ijk = −1". But as for those memorials that claim to represent some piece of history... those "true" memories vanish. To call it a "memorial" is to lie, if only a little.
It's an interesting perspective though -- to understand that someone, at some point, prescribed some very specific memories and emotions of an event to the structure that claims to memorialize it... but no longer truly does.
Sunday, April 14, 2019
Words
Words are themselves "images". The sound of the word, or perhaps its shape on paper reminding you of its sound, is only an image -- that is, a sense experience.
Sense experience is all we know. To make "sense" of it all, we impose structure. Certain images seem similar; we group them. Sets of associated images... classes, or ideas if you will. We seem to be born with this innate sense of "association" -- or at least I was. I can only think about how others think via some kind of projection, and by assuming some complicated array of associations built upon frameworks upon frameworks of ideas.
A "word" is just another class: the way I said it once, the way your mother said it that one time, the visual image reaction in my mind when I hear it. The utterance of that word is but one image that reminds us of these other things: "representing the broader idea", if you will.
So a word is part of an idea, at least within this framework. But words are special for most in that they are somehow "canonical" -- most people think of language as something separate: the medium through which ideas are transferred.
That's fine. We still have the fact that words associate to ideas.
But of course, words are also noisy and ambiguous. Describing an experience using words -- we often fool ourselves.
The purest form of the idea is simply the idea itself. But the human memory is... well, it works the way it works. And the mind likes to make associations, build structure.
By verbalizing an idea, you run the risk of associating the event, the experience, with those words. Those words, in turn, may be associated to a much larger set of experiences and ideas.
And thus, by metaphor, the particular experience may become associated to this larger set. Perhaps things become what they aren't. Or perhaps not.
Correctness can only be judged within another framework. But it happens to be that using words can lead us to "incorrectness" -- we see things that aren't there, we judge when we shouldn't. It gets worse when others are involved, and you give them those words. But that's a whole separate problem.
It's probably best to be mindful. What is there is all that is there. Labeling actions, people, events... sometimes makes it seem worse than it actually is.
Wednesday, November 22, 2017
Frequentists, Bayesians, and the Law of Large Numbers
Here is a question I answered a while back on Math StackExchange, wherein the asker seemed confused about the undue philosophical credit that the Law of Large Numbers (LLN) supposedly lends to frequentism by essentially proving it. Does it, in fact, say that the probability of any event must correspond to its relative frequency? Of course, the LLN does no such thing; since the LLN is an artifact of a mathematical theory, it cannot force us to say anything concrete about the real world, and particularly not what interpretation we should give the word "probability". The asker actually gave a very insightful argument: if we can give two different formal probability distributions to the same real event, both of which satisfy the formal axioms of probability theory, how can we possibly interpret them both as real-life frequencies? I expound upon this argument in my answer.
Anyways, it turns out the confusion was caused by the unavoidably imprecise wording that arises when trying to state the LLN in layman's terms, as is often the case in mathematics (... and in many other fields -- but that's a topic for another time).
In the following aside, I define the two key terms necessary -- "frequentist" and "Bayesian" -- in the way I've used them in my answer.
First, it is important to note that both agree on a theory of probability -- which is a purely mathematical theory needing no particular interpretation. It is typically given by Kolmogorov's axioms, although other equivalent axiomatizations are possible. Classically, Bayesians interpret probability in a natural way: probability is a measure of a degree of belief. On the other hand, frequentists assert that the probability of X is only a measure of relative frequency of X: i.e. the limit of the ratio of X events to the total number of events.
But if we stare at the definition a little longer, we notice something strange. The distinction seems trivial; after all, we are separating people into two different camps based on how they use the word "probability"! Surely, one wouldn't think to classify people based on what they think "atheism" means, right? (Incidentally, they tend to divide neatly into named groups, but that's beside the point.) Yet upon further inspection, we might discover the intention behind such a shorthand definition.
It's not so much about the word "probability" but what "probability" means to us, feeling-wise. That is, probability can be considered an indescribable quale, a seemingly natural quality (or perhaps quantity?) or notion embedded in human experience, but one which we have difficulty formalizing. Whatever probability is, it corresponds to a "measure of belief", and both Bayesianism and frequentism are attempts to give it rigorous structure. Frequentists avoid talking about belief/probability entirely and go about it with an ad-hoc approach -- instead talking about the relative frequency of some event X, which, if we think about it, should technically exactly inform our belief in event X: "It happens 50% of the time? What other belief should we assign it, if not 50%?"
But beyond that, they can't say much -- thus one-off events (will the sun rise tomorrow?) are off limits, barring a proper ad-hoc series of trials that effectively relates to them (how often do stars go supernova?).
Bayesians just literally, axiomatically TALK about it directly, which isn't as problematic and subjective as it seems -- we do the same when we speak English: it's supposedly a subjective assignment of words to sensory images in our minds, but we tend to have a standard, a collective consensus. In the same way, when a Bayesian statistician says something will happen with 50% probability, we understand that their 50% is probably the same as what we ourselves, in our own minds, mean by 50% (this can actually be checked more rigorously by betting arguments -- see 'Dutch book').
Additionally, it is worth noting that the divide between Bayesians and frequentists isn't as sharp as one might imagine. Bayesians don't necessarily disagree that relative frequency informs belief, as we mentioned with the 50% example (betting arguments may lend this further support). Another interesting way they coincide is that both are, in a sense, very mathematical. Frequentists seek an objective definition for a subjective concept, and faced with the lack of one, simply characterize it conservatively -- only with what they know works, even when the result might seem weaker and thus convoluted to apply in real scenarios. For a classical example, see the 20th-century attempt to define "computability", or what an "effective method/algorithm" is in mathematics: Turing machines are simple and conservative, and we would definitely call everything they "compute" "computable"... but do they characterize all computable functions? That is, is the characterization mathematically equivalent, if not a definition? The answer is yes: professional consensus deems it successful in this regard, in light of many other characterizations of computability turning out to be effectively identical. Whether frequentism achieves the same level of success is open to debate.
And on the other hand, Bayesians exemplify the axiomatic method. Under a Bayesian framework, we don't need to objectively understand what belief is, since we may attach numbers to it still, and operate with a simple set of rules and end up with a coherent system. Bayesians do with "belief" what axiomatic geometry does with the concept of a "line": they leave it undefined as a primitive notion.
The actual answer from MSE follows:
You are correct. The Law of Large Numbers does not actually say as much as we would like to believe. Confusion arises because we try to ascribe too much philosophical importance to it. There is a reason that the Wikipedia article puts quotes around 'guarantees' because nobody actually believes that some formal theory (on its own) guarantees anything about the real world. All LLN says is that some notion of probability, without interpretation, approaches 1 -- nothing more, nothing less. It certainly doesn't prove for a fact that relative frequency approaches some probability (what probability?). The key to understanding this is to note that the LLN, as you pointed out, actually uses the term P() in its own statement.
I will use this version of the LLN:
"The probability of a particular sampling's frequency distribution resembling the actual probability distribution (to a degree) as it gets large approaches 1."
Interpreting "probability" in the frequentist sense, it becomes this:
Interpret "actual probability distribution": "Suppose that as we take larger samples, they converge to a particular relative frequency distribution..."
Interpret the statement: "... Now if we were given enough instances of n-numbered samplings, the ratio of those that closely resemble (within $\epsilon$) the original frequency distribution vs. those that don't approaches 1 to 0. That is, the relative frequency of the 'correct' instances converges to 1 as you raise both n and the number of instances."
You can imagine it like a table. Suppose for example that our coin has T-H with 50-50 relative frequency. Each row is a sequence of coin tosses (a sampling), and there are several rows -- you're kind of doing several samplings in parallel. Now add more columns, i.e. add more tosses to each sequence, and add more rows, increasing the number of sequences themselves. As we do so, count the number of rows which have a near 50-50 frequency distribution (within some $\epsilon$), and divide by the total number of rows. This number should certainly approach 1, according to the theorem.
Now some might find this fact very surprising or insightful, and that's pretty much what's causing the whole confusion in the first place. It shouldn't be surprising, because if you look closely at our frequentist interpretation example, we assumed "Suppose for example that our coin has T-H with 50-50 relative frequency." In other words, we have already assumed that any particular sequence of tosses will, with logical certainty, approach a 50-50 frequency split. So it should not be surprising when we say with logical certainty that a progressively larger proportion of these tossing-sequences will resemble 50-50 splits as we toss more in each and recruit more tossers. It's almost a rephrasing of the original assumption, but at a meta-level (we're talking about samples of samples).
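The table picture can be simulated directly. Here is a minimal sketch (the row/column counts, the epsilon, and the function name `fraction_near_half` are all arbitrary choices for illustration):

```python
import random

def fraction_near_half(n_rows, n_tosses, epsilon, seed=0):
    """Fraction of rows (toss-sequences) whose empirical heads-frequency
    lands within epsilon of the assumed 50-50 split."""
    rng = random.Random(seed)
    near = 0
    for _ in range(n_rows):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        if abs(heads / n_tosses - 0.5) < epsilon:
            near += 1
    return near / n_rows

# Short rows: many rows miss the 50-50 split by more than epsilon.
small = fraction_near_half(n_rows=200, n_tosses=10, epsilon=0.05)

# Long rows: nearly every row lands within epsilon, as the LLN predicts.
large = fraction_near_half(n_rows=200, n_tosses=10_000, epsilon=0.05)

print(small, large)
```

Note that the simulation itself already bakes in the assumption (a fair pseudo-random coin), which is exactly the point made above: the interpreted certainty comes from the assumed certainty.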
So this certainty about the real world (interpreted LLN) only comes from another, assumed certainty about the real world (interpretation of probability).
First of all, with a frequentist interpretation, it is not the LLN that states that a sample will approach the relative frequency distribution -- it's the frequentist interpretation/definition of $P()$ that says this.
It sure is easy to think that, though, if we interpret the whole thing inconsistently -- i.e. if we lazily interpret the outer "probability that ... approaches 1" to mean "... approaches certainty" in LLN but leave the inner statement "relative frequency dist. resembles probability dist." up to (different) interpretation. Then of course you get "relative frequency dist. resembles probability dist. in the limit". It's kind of like if you have a limit of an integral of an integral, but you delete the outer integral and apply the limit to the inner integral.
Interestingly, if you interpret probability as a measure of belief, you might get something that sounds less trivial than the frequentist's version: "The degree of belief in 'any sample reflects actual belief measures in its relative frequencies within $\epsilon$ error' approaches certainty as we choose bigger samples." However this is still different from "Samples, as they get larger, approach actual belief measures in their relative frequencies." As an illustration, imagine you have two sequences $f_n$ and $p_n$. I am sure you can appreciate the difference between $\lim_{n \to \infty} P(|f_n - p_n| < \epsilon) = 1$ and $\lim_{n \to \infty} |f_n - p_n| = 0$. The latter implies $\lim_{n \to \infty} f_n = \lim_{n \to \infty} p_n$ (or $= p$, taking $p_n$ to be a constant for simplicity), whereas this is not true for the former. The latter is a very powerful statement, and probability theory cannot prove it, as you suspected.
In fact, you were on the right track with the "absurd belief" argument. Suppose that probability theory were indeed capable of proving this amazing theorem, that "a sample's relative frequency approaches the probability distribution". However, as you've found, there are several interpretations for probability which conflict with each other. To borrow terminology from mathematical logic: you've essentially found two *models* of probability theory; one satisfies the statement "the rel. frequency distribution approaches $1/2 : 1/2$", and another satisfies the statement "the rel. frequency distribution approaches $1/\pi : (1-1/\pi)$". So the statement "frequency approaches probability" is neither true nor false: it is *independent* as either one is consistent with the theory. Thus, Kolmogorov's probability theory is not powerful enough to prove a statement in the form "frequency approaches probability". (Now, if you were to force the issue by saying "probability should equal relative frequency" you've essentially trivialized the issue by baking frequentism into the theory. The only possible model for this probability theory would be frequentism or something isomorphic to it, and the statement becomes obvious.)
Saturday, September 23, 2017
Learning hyperplanes
You don't need very many points to learn a lot! Suppose you have a given "experience space".
Each point admits a radius of applicability -- an experience allows you to extend it to slightly different test cases within some radius of difference e.
But say you're given two points far away. That means you can probably interpolate between these two points to construct a "line" covering a large number of test cases. And each point on this line also has a radius of applicability, so we have essentially a thick line. We can also extrapolate -- extend points in different directions.
Example: Suppose you have no idea what "taste" is like. You taste coffee, and you only understand things sufficiently "like" coffee. Then perhaps you taste hot chocolate. You interpolate between the two points, so you can now "recognize" things like frappuccinos and other sweet drinks.
But further you can extrapolate to the extremes, now that you understand that coffee is more "bitter" than hot chocolate, so you can extend from coffee to even more bitter things, like "espresso". You might go the opposite way from hot chocolate, to sweeter drinks. However you can't yet imagine things like fruit juice -- it adds an additional dimension (sour). Of course it is a fallacy to assume that the natural-seeming taste basis -- sweet, sour, salty, bitter, spicy -- is the only one. Perhaps we can do some sort of principal components analysis. It might also be that our brains are structured in a way so that upon tasting coffee and hot chocolate, we choose this natural basis to compare the two: i.e. we break it down and say "coffee is more bitter than hot chocolate". Either way, we can at least state that we cannot learn along a basis vector unless coffee's and hot chocolate's projections onto that vector actually differ. For example, since coffee is about as "sour" as hot chocolate, we can't exactly learn "sourness". More precisely: fruit juice will come as a complete surprise to someone who has tasted only hot chocolate and coffee.
It may seem like we're defining "learning" or "covering" as "not being surprised when we encounter it", but we can actually extend this kind of analysis to different examples of "learning". For example, with people: "learning some point (x,y,z)" in this case might mean "knowing how to act around a person with personality values (x,y,z)".
Let us rigorously analyze how it is possible we can construct "lines". It's quite simple: we follow the same analysis as we do in geometry with vectors. As we take the vector defined by the two points and add copies of it to one of our points, we might take "that quality that differentiates coffee from hot chocolate" and "add it to coffee several times" to get "espresso". The picture is kind of like:
HOTCHOC ---------------> COFFEE
Then just copy/extend:
HOTCHOC ---------------> COFFEE -----------------> ESPRESSO
Of course we're essentially doing x = t(v_0) + HOTCHOC, where t = 2 gives x = ESPRESSO (and t = 1 gives COFFEE), but we let t vary on some continuum and we get a line.
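The picture above can be sketched numerically. Everything here is invented for illustration -- the (sweetness, bitterness) coordinates, the numbers, and the helper name `extend` are all made up:

```python
# Hypothetical coordinates on an assumed (sweetness, bitterness) basis.
hotchoc = (0.9, 0.1)
coffee = (0.5, 0.6)

# v0: the differentiating direction from HOTCHOC to COFFEE.
v0 = tuple(c - h for c, h in zip(coffee, hotchoc))

def extend(t):
    """x = t * v0 + HOTCHOC: t=1 recovers coffee, t=2 extrapolates past it,
    and letting t vary continuously traces out the line."""
    return tuple(h + t * v for h, v in zip(hotchoc, v0))

espresso = extend(2)       # extrapolation: more bitter, less sweet than coffee
frappuccino = extend(0.5)  # interpolation: between hot chocolate and coffee
```

Note that every point on the line moves only along v0; a direction orthogonal to it (say, "sourness") never changes, which is exactly why fruit juice stays unreachable from these two data points.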
Again it is important to note that in fact we have more than a line -- we have "thick line", since around every point we have a radius of applicability. So it's more like a tube.
An aside: Obviously the sort of thinking wherein you consider events/objects/etc. as tuples is pretty common in fields such as machine learning, but I'd like to point out another way it illuminates a particular phenomenon.
As with music theory and learning: music was done perfectly well far before any real music theory was developed (I mentioned this in a previous post). If we model this in terms of a vector space, when humans developed music, we were discovering, say, some particular subset of a space. When we developed music theory, we stepped back from our subset and performed some principal components analysis to break it down into simpler parts. Again it is tempting to claim that the eigenvectors given by such a method are somehow canonical or natural in the sense that they ARE why we developed music a certain way, but that's not necessarily so!
For analogy, we have seen that there are several symmetric ways to look at an n-gon, and there are several bases for a vector space, some of which are more convenient than others in particular cases. (In more mathematical detail, there are conjugate transformations that express a particular reflection along another line of symmetry in terms of the original reflection and a "change of coordinates", just as there is a change of coordinates that can express a linear transformation as a simple shear transformation in a vector space.) In the same way, there might be another nice basis for music that perhaps more clearly explains why we call certain things music (i.e. they belong to the subset, i.e. we "recognize" it) and other things not (i.e. they don't belong to the subset) -- i.e. it might be more natural. Or perhaps there is no real "basis" -- there's some other mechanism (whim?) that causes us to identify what's music and what's not. Nobody says that we identify music by projecting the item onto some vector subspace.
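The parenthetical remark about conjugate transformations can be checked numerically. A small sketch (Python with NumPy; the particular angle is an arbitrary choice): a reflection across a rotated line of symmetry is the conjugate R X R⁻¹ of the reflection X across the x-axis, with the rotation R playing the role of the "change of coordinates".

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix by angle theta (the change of coordinates)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# The "original" reflection: across the x-axis.
reflect_x = np.array([[1.0, 0.0], [0.0, -1.0]])

# Reflection across the line at angle theta, expressed as a conjugate:
# reflect_theta = R(theta) @ reflect_x @ R(theta)^-1
theta = np.pi / 5  # e.g. an axis of symmetry of a regular pentagon
R = rotation(theta)
reflect_theta = R @ reflect_x @ np.linalg.inv(R)

# Sanity checks: conjugation preserves the structure of the original map.
# It is still an involution (applying it twice is the identity) ...
assert np.allclose(reflect_theta @ reflect_theta, np.eye(2))
# ... and still orientation-reversing, like any reflection.
assert np.isclose(np.linalg.det(reflect_theta), -1.0)
```

Conjugation preserves the essential properties of the map (involution, determinant) while re-expressing it in new coordinates, which is the sense in which the different symmetries of the n-gon are "the same reflection seen differently".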
Anyways, we can probably extend this beyond problems of "recognizing" to a lot of different skills: musical instruments, sports, etc. consist of a family of individual micro-skills that are often distilled after the fact into core skills and given structure. Mathematics itself is built the same way -- the clean organization and categorization of subfields didn't always exist, nor does it need to!
Appendix:
Define "recognize": Essentially, tasting them will feel "familiar" -- they won't feel new -- you can construct each one as a point on the line.
Define "radius of applicability": A point has a "radius of applicability". If another point (i.e. experience) is sufficiently "similar" to our "learned" point in that we may also say this new point is roughly "learned", then we say that this new point is within the first point's "radius of applicability". We could hypothetically also quantify "roughly learned" with some finer model, but we want to begin with a simple model.
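As a toy illustration of these two definitions (a sketch only -- the points, the radius, and the use of Euclidean distance are all hypothetical modeling choices), "recognizing" a new experience amounts to checking whether it falls within some learned point's radius of applicability:

```python
import math

def within_radius(learned, new, radius):
    """A new point is "roughly learned" if it lies inside the
    radius of applicability around a learned point."""
    return math.dist(learned, new) <= radius

# Learned experiences as points; one shared radius for simplicity.
learned_points = [(0.0, 0.0), (3.0, 4.0)]
radius = 1.5

def recognize(new):
    """The new experience "won't feel new" if any learned point covers it."""
    return any(within_radius(p, new, radius) for p in learned_points)

print(recognize((0.5, 0.5)))    # near a learned point -> True
print(recognize((10.0, 10.0)))  # far from everything  -> False
```

The union of these balls around learned points is exactly the "tube" around the line described above: not just the learned points themselves, but a thickened neighborhood of them.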
Monday, April 3, 2017
Change-of-definition does not preserve structure
There will often be words that academics use differently from laymen. For example, a lot of social scientists take "racism" to mean something that, by definition, an underprivileged class cannot exhibit. The layman definition is simpler but varies between people. But universally, to any layman, racism carries a strong, personal, and negative stigma that "prejudice" does not. Being called "racist" is very much an insult. Declaring that black people cannot be racist, then, provides a kind of immunity, as being called "prejudiced" doesn't leave the same bad taste in your mouth. So it is not entirely productive to insist on a definition that isn't widely accepted during a conversation, especially when your definition does not preserve fairness in discussion.
The situation is similar in the religion debate. The atheist community and philosophical circles along with laymen disagree on the terms "atheist" and "agnostic". To the atheists, the word "atheist" simply means those who do not believe in God. However for a long time the word "atheist" has meant, to both philosophers and laymen, strictly a person who disbelieves in God, with the word "agnostic" being reserved for those who hold no belief either way. It was in 1972 that philosopher Antony Flew attempted to change the definition of atheism to the broader sense. On this, Uri Nodelman, editor of the Stanford Encyclopedia of Philosophy, writes:
Not everyone has been convinced to use the term in Flew's way simply on the force of his argument. For some, who consider themselves atheists in the traditional sense, Flew's efforts seemed to be an attempt to water down a perfectly good concept. For others, who consider themselves agnostics in the traditional sense, Flew's efforts seemed to be an attempt to re-label them "atheists" -- a term they rejected.
Basically, Flew's definition is favorable to the atheist community, as it labels more people "atheists". But in the view of those who call themselves agnostic in the layman's sense, this definition takes away their ability to distinguish themselves from atheists -- i.e. they lose expressibility. As such, the atheist community's definition of atheism is for the most part unused.
We see another example in the feminist movement. According to feminists, a feminist is simply one who believes in equal rights for men and women. But plenty of people hesitate to call themselves feminists -- and if such a person happens to be a celebrity, or especially a female celebrity, they are scorned. It is no wonder, then, that the public's definition of feminism differs from the feminists' own. To the public, feminism means actual advocacy. The word refers to a movement, not an idea. However, it is advantageous for the movement itself that the word be associated with the idea, since the movement would receive a free upgrade from a controversy-stirring movement to an idea impervious to assault.
The problem is that definitions aren't constructed by textbooks or academics. Formal definitions are post hoc. That is, definitions are written down and systematized only after a large part of society actually starts using the word in some regular fashion. That definitions are innately human, and that definitions are determined by layman consensus, is essentially a tautology -- it's simply how words are used. It's not particularly right or wrong; it's just that attempting to write down your own definition to suit your needs, and trying to get everyone else to conform to it by saying "because it's right", is missing the forest for the trees.
We have an analogy in music. Music was written, and existed, long before music theory. Much like dictionary definitions, music theory is post hoc, a fact that is emphasized in music theory courses. Yes, theory provides a nice standard, but it doesn't make sense to preach theory to the thousands of brilliant musicians who don't know how to read music. Additionally, it would make much less sense to construct a music theory that disagrees with how the majority of musicians see music. Science seeks to explain the world via theory. Certain parts of social science do the opposite -- they attempt to conform the world to their theory. Perhaps this is why many do not regard social science as a true science?
In the case of atheists and feminists, it can be argued that their change-of-definitions serves mainly to facilitate communication amongst themselves and further their own groups' cause. From their perspective, this is inherently good and useful. However this actually changes the dynamic of the discussion, by disenfranchising those who wish to distinguish themselves from the group despite not opposing them completely.
In short, adopting a definition does not preserve the status quo. A change of definition, like a change of basis, can add to or detract from expressibility or nuance, and so essentially change the power dynamic in social dialogue.