- probability: Bayesians interpret a probability as a measure of belief, or confidence, of an event occurring.
- p(s)
- p(r|s)
- bayes: p(s|r) . p(r) = p(r|s) . p(s)
- cost function

Visualizing Probability Distributions

Before we dive into information theory, let’s think about how we can visualize simple probability distributions. We’ll need this later on, and it’s convenient to address now. As a bonus, these tricks for visualizing probability are pretty useful in and of themselves!

I’m in California. Sometimes it rains, but mostly there’s sun! Let’s say it’s sunny 75% of the time. It’s easy to make a picture of that:

Most days, I wear a t-shirt, but some days I wear a coat. Let’s say I wear a coat 38% of the time. It’s also easy to make a picture for that!

What if I want to visualize both at the same time? Well, it’s easy if they don’t interact, that is, if they’re what we call independent. For example, whether I wear a t-shirt or a raincoat today doesn’t really interact with what the weather is next week. We can draw this by using one axis for one variable and one for the other:

Notice the straight vertical and horizontal lines going all the way through. That’s what independence looks like! The probability I’m wearing a coat doesn’t change in response to the fact that it will be raining in a week. In other words, the probability that I’m wearing a coat and that it will rain next week is just the probability that I’m wearing a coat, times the probability that it will rain. They don’t interact.
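As a rough sketch of that product rule, here is how the independent picture could be computed in Python. The 75% and 38% figures come from the text; the variable names are my own, and I’m assuming that rain next week has the same 25% chance as rain on any other day.

```python
# Marginal probabilities from the text. The rain and t-shirt entries are
# simply the complements of the stated 75% sunny and 38% coat figures.
p_weather_next_week = {"sunny": 0.75, "rain": 1 - 0.75}
p_clothing_today = {"coat": 0.38, "t-shirt": 1 - 0.38}

# Under independence, each cell of the joint distribution is just the
# product of the two marginals that meet at that cell.
joint = {
    (weather, clothing): p_w * p_c
    for weather, p_w in p_weather_next_week.items()
    for clothing, p_c in p_clothing_today.items()
}

for (weather, clothing), p in joint.items():
    print(f"p({weather} next week, {clothing} today) = {p:.4f}")

# The four cells tile the whole square, so they should sum to 1.
total = sum(joint.values())
print(f"total = {total:.4f}")
```

Each cell here is a rectangle whose sides are the two marginals, which is why the dividing lines in the picture run straight across.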

When variables interact, there’s extra probability for particular pairs of variables and missing probability for others. There’s extra probability that I’m wearing a coat and it’s raining because the variables are correlated: they make each other more likely. It’s more likely that I’m wearing a coat on a day that it rains than that I wear a coat on one day and it rains on some other random day.

Visually, this looks like some of the squares swelling with extra probability, and other squares shrinking because the pair of events is unlikely together:

But while that might look kind of cool, it isn’t very useful for understanding what’s going on.

Instead, let’s focus on one variable like the weather. We know how probable it is that it’s sunny or raining. For both cases, we can look at the conditional probabilities. How likely am I to wear a t-shirt if it’s sunny? How likely am I to wear a coat if it’s raining?

There’s a 25% chance that it’s raining. If it is raining, there’s a 75% chance that I’d wear a coat. So, the probability that it is raining and I’m wearing a coat is 25% times 75% which is approximately 19%. The probability that it’s raining and I’m wearing a coat is the probability that it is raining, times the probability that I’d wear a coat if it is raining. We write this:

p(rain, coat) = p(rain) ⋅ p(coat | rain)

This is a single case of one of the most fundamental identities of probability theory:

p(x, y) = p(x) ⋅ p(y | x)
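Here is a minimal Python sketch of that factoring, applied to every weather and clothing pair rather than just rain and coat. The numbers p(rain) = 25%, p(coat | rain) = 75% and p(coat) = 38% are from the text. The value of p(coat | sunny) is not stated there, so I derive it from those three figures; treat it (and all the variable names) as an assumption of this sketch.

```python
# Stated in the text.
p_weather = {"rain": 0.25, "sunny": 0.75}
p_coat = 0.38
p_coat_given_rain = 0.75

# Assumption: derived so that the overall coat probability comes out to 38%.
# p(coat) = p(rain) * p(coat | rain) + p(sunny) * p(coat | sunny)
p_coat_given_sunny = (
    p_coat - p_weather["rain"] * p_coat_given_rain
) / p_weather["sunny"]

# Conditional distribution of clothing for each kind of weather.
p_clothing_given_weather = {
    "rain": {"coat": p_coat_given_rain, "t-shirt": 1 - p_coat_given_rain},
    "sunny": {"coat": p_coat_given_sunny, "t-shirt": 1 - p_coat_given_sunny},
}

# p(x, y) = p(x) * p(y | x), with x = weather and y = clothing.
joint = {
    (weather, clothing): p_weather[weather] * p_c
    for weather, conditional in p_clothing_given_weather.items()
    for clothing, p_c in conditional.items()
}

print(joint[("rain", "coat")])  # 0.1875, the "approximately 19%" from the text
```

Summing the two coat cells recovers the 38% marginal, which is a quick sanity check that the factored pieces describe the same distribution.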

We’re factoring the distribution, breaking it down into the product of two pieces. First we look at the probability that one variable, like the weather, will take on a certain value. Then we look at the probability that another variable, like my clothing, will take on a certain value conditioned on the first variable.

The choice of which variable to start with is arbitrary. We could just as easily start by focusing on my clothing and then look at the weather conditioned on it. This might feel a bit less intuitive, because we understand that there’s a causal relationship of the weather influencing what I wear and not the other way around… but it still works!

Let’s go through an example. If we pick a random day, there’s a 38% chance that I’d be wearing a coat. If we know that I’m wearing a coat, how likely is it that it’s raining? Well, I’m more likely to wear a coat in the rain than in the sun, but rain is kind of rare in California, and so it works out that there’s roughly a 50% chance that it’s raining. And so, the probability that it’s raining and I’m wearing a coat is the probability that I’m wearing a coat (38%), times the probability that it would be raining if I were wearing a coat (about 50%), which is approximately 19%.

p(rain, coat) = p(coat) ⋅ p(rain | coat)

This gives us a second way to visualize the exact same probability distribution.

Note that the labels have slightly different meanings than in the previous diagram: t-shirt and coat are now marginal probabilities, the probability of me wearing that clothing without consideration of the weather. On the other hand, there are now two rain and sunny labels, for the probabilities of them conditional on me wearing a t-shirt and me wearing a coat respectively.

(You may have heard of Bayes’ Theorem. If you want, you can think of it as the way to translate between these two different ways of displaying the probability distribution!)
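To make that translation a bit more concrete, here is a small Python sketch. It starts from the weather-first numbers in the text (p(rain) = 25%, p(coat | rain) = 75%) plus the 38% coat marginal, and uses Bayes’ Theorem in its usual form, p(rain | coat) = p(coat | rain) ⋅ p(rain) / p(coat), to get the clothing-first conditional. The variable names are just my own choices.

```python
# Stated in the text.
p_rain = 0.25
p_coat = 0.38
p_coat_given_rain = 0.75

# Bayes' Theorem: flip the conditional from "coat given rain"
# to "rain given coat".
p_rain_given_coat = p_coat_given_rain * p_rain / p_coat

# Both factorizations should describe the same joint probability.
joint_weather_first = p_rain * p_coat_given_rain   # p(rain) * p(coat | rain)
joint_clothing_first = p_coat * p_rain_given_coat  # p(coat) * p(rain | coat)

print(f"p(rain | coat) = {p_rain_given_coat:.3f}")
print(f"weather first:  {joint_weather_first:.4f}")
print(f"clothing first: {joint_clothing_first:.4f}")
```

The conditional comes out just under one half, which is where the “roughly 50%” figure comes from, and both routes land on the same joint probability of about 19%.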