Entropy and Quantifying Information

Back in the early 1900s electrical engineers were focused on a problem: how can we send more information, faster. This was, of course, of major interest to the military as it allowed for faster response and better coordination.

Speed and Communication in the 1900s

If you wanted to send information across long distances you did it using a telegraph and Morse code. A message was encoded using a series of beeps and, on the receiving end, the receiver would listen to those beeps and decode them. If you had two very good telegraphers you could trasmit 10 - 15 words a minute.

Not very fast, especially when 5 of those words not essential to the information in the message itself:

REGRET TO INFORM YOU YOUR BARN BURNED DOWN DURING THE NIGHT | YOUR COWS ARE SAVED | THE PIGS WERE LOST

The Message

Sending information from one point to another involves a few terms we need to get to know:

A sender sends the information over a
Channel to a
Receiver using a
Transmitter
The information is contained in a message

In the message above, the receiver is informed that their barn burned down, which is something the receiver didn't know prior to receiving the message. Here's another message that might follow:

WEATHER CLEAR TODAY IN THE 70s WITH SLIGHT CHANCE OF RAIN | CLEANUP SHOULD BE CARRIED WITHOUT PROBLEM

These two messages contain roughly the same number of characters but one message is a bit more impactful, or surprising than the other. Learning that your barned burned down has much more impact (and therefore more information) than a mundane weather report. But how much more?

All Possible Messages

If you lived in the early 1900s and stood next to your telegraph, staring at it, you would receive a number of messages:

Nothing. Believe it or not, something not happening is information too! There are nearly infinite possibilities of a particular thing happening in the future, some more probable than others (which is important to us - we'll come back to that). When nothing happens, that's not surprising at all and therefore contains very little information.
A message comes in with the weather report. "Yes yes" you think, "I can see outside and I can tell it'll be a cloudy day". That message does contain information, but not very much. Of all the possible messages that could come in, that was the most predictable.
Later that day you get another message, explaining that it wasn't your barn that burned down, it was your neighbors! This has a TON of information! You didn't expect this at all (though maybe you hoped for it because you hate your neighbor).

Messages don't just happen, they are "pulled from the chaos" of all possible things that could happen. This is the main definition of a message that we'll be working on from here on out:

It marks the intersection of probability to causality. Future to past. The one, finite thing that happened out of all possible things.
It defines how much you learned after receiving the message.

It turns out that this is the key to quantifying information.

Entropy

Entropy is the "degree of randomness" of a system. If we're talking about a binary system (1s and 0s) there is very little entropy: something can be a 1 or a 0. If we're talking about a message in Morse code there's a bit more entropy because we have 26 uppercase alpha-numeric characters, 10 digits and a space.

If the message is in the English language, there are 240,000 possible words. We can pare that down a bit to "common words" which brings it to more like 20 or 30,000. Either way - the degree of randomness tells us a lot.

The idea is a simple one: the greater the entropy, the greater the information a message contains. This only works, however, if we separate the meaning of a message from the information it contains. Yes, it's sad if your barn burns down! It's the chance of you reading those words, however, that define the information.

Fire, cows saved, pigs lost.

Those words, in that order, have a very low probability and thus a high amount of information. Ralph Hartley, the first engineer to try to calculate the speed of information, gave us the very first equation to express this idea:

H(M) = log |M|

The entropy, H, of a message M is equal to the log of all possible messages from that system |M|. In the case of our barn burning down, "all possible messages" would be the 240,000 words in the English language.

To calculate the entropy of this message we simply need to multiply the entropy of each word. If I tell you your barn burned down, that's 4 words, or individual messages that make up the whole:

H("YOUR BARN BURNED DOWN") = log(240000) * 4

Entropy of Your Barn Burning

That gives us "21 Hartleys" of information, a unit of measure you've probably not heard before. Me neither. But that's not really the weird thing here.

We have a bigger problem, which is that I could have said "THE SKY IS BLUE" and, using Hartley's equation, that would contain the exact same amount of information as "YOUR BARN BURNED DOWN", which is absolutely not the case.

The Shannon Entropy

This is where we get to meet Claude Shannon again, who expanded on Hartley's idea, taking into account the probability of each message. To Shannon, the amount of information in a message is inversely related to its probability - something Shannon called the "surprise" of a message.

That, Shannon reasoned, is the true measure of information and he calculated it using a simple probability equation:

s(M) = log2(1/p(M))

The surprise, s of a given message M is equal to the log (base 2) of 1/p(M). Why log 2? Because Shannon was working in the world of digital communication, sending messages using binary meant that the 1 or 0 was the correct way of thinking about (and quantifying) information.

So, the surprise of the message "YOUR BARN BURNED DOWN" is directly tied to the probability of each of those words coming in. "YOUR" is pretty normal, "BARN" isn't all that common but following "YOUR" it's not unexpected. "BURNED" is extremely unexpected but, once again, the word "DOWN" is not, because it follows the word "BURNED".

To Shannon, the words "YOUR" and "DOWN" would be redundant in this message and could likely be stripped out, conveying close to the same idea: "BARN BURNED". Not exactly the same as there is a bit of ambiguity, but this is a very important breakthrough!

Shannon gave us a way to think about shortening a message without losing too much information. Shorter messages means more efficient transmission! Brilliant!

This led to Shannon's Entropy equation, which is one of the most important equations ever produced:

Shannon's Entropy

The Seeds of Data Compression

If all of this has seemed academic up until now, this is where it all gets real. We use data compression every day - from GZIP on web pages to image compression in our texts. It all started right here with Claude Shannon's realization that words could be stripped from a message without altering its information.

We'll get to that in the next video.