How to count in binary using any ordered sequence of symbols
A precursor to how to generate all subsequences of an ordered sequence
If a sequence X has m symbols then the number of possible subsequences of X is 2m โ CLRS, Introduction to Algorithms
Intro
In this article I'll be showing how to count using any ordered sequence of symbols, which I'll claim is useful when generating all subsequences of that ordered sequence.
This may seem obtuse, not useful, and as though it would never come up in a real world setting, so to motivate this subject I'll simply note that a real world use case is in generating subsequences of DNA sequences,
which is a specific instance of generating subsequences of strings or more broadly, sets of values, which is something software engineers can find themselves needing to do often.
Why read this
During my morning reading of Introduction to Algorithm, I came across the quoted claim from earlier,
If a sequence X has m symbols then the number of possible subsequences of X is 2m โ CLRS, Introduction to Algorithms
but it didn't seem intuitive to me and I felt a more intuitive value for the total number of X's ordered subsequences would be m! (m factorial) .
If like me you look at 2m but don't think it's an intuitive value for the total number of ordered subsequences of X, then you may like this article.
Why 2m and not m!
Simply put, m! would give us the number of times we can rearrange all m symbols in the sequence of X. The very fact that we're rearranging breaks the sequencing of the symbols, and so we can quickly discount this number.
Okay, but why not m2?
To which I would say fair point, because, given a sequence X, and two empty boxes A and B, each having m slots, we can generate two ordered subsequences by repeatedly picking a symbol from X, and placing in either box A or box B while preserving symbol order.
From which it's easy to think that box A and B can each take on m possible values, and given that there are just two boxes, the total possible values these can represent is m2.
And easier still to mistakenly extrapolate that value m2 to be the total number of subsequences of X.
Unfortunately, it's wrong to say that A and B can each take on m possible values. I think it's more correct to say they can each take on m combination m/2 values, which resolves to some factorial.
Which using Wolfram Alpha, we can see grows really quickly to become much larger than m for sequence lengths as small as 15.
So we can conclude that it's also wrong to think that m2 gives us the number of ways in which we can form two sub sequences out of X.
So why 2m
Let's focus on what 2m means in terms of counting.
It means we have m possible values and each can take on only 2 states.
In other words, it means if we had a binary number sequence like 1 1 (read as 1, 1, not 11), then 2m, which in this case equals 22 = 4, represents the total number of subsequences we can derive from the sequence " 1 , 1 ".
Which are
0: " , " otherwise represented in binary " 0 , 0 "
1: " , 1 " otherwise represented in binary as " 0 , 1"
2: " 1 , " otherwise represented in binary as " 1 , 0 "
3: " 1 , 1 "otherwise represented in binary as " 1 , 1 "
Okay, how does that help us?
We'll use that to help us count in binary using any ordered sequence of symbols by following these three steps:
- Take a sequence of ordered symbols
- Represent all possible states for each symbol
- Count!
1 - Take a sequence of ordered symbols
Given the sequence is ordered, we can easily replace the binary symbols with any symbols of our choice. So let's take a simple sequence " D , K "
And note that when generating a subsequence, each symbol in the original sequence can take on only one of two possible values. The symbol value itself, or no value at all.
Side note: with quantum computing, the number of possible states each symbol can take on expands to three, but thank goodness we don't have to worry about writing quantum programs just yet.
2 - Represent all possible states for each symbol
Each symbol has two possible states in a subsequence, itself or no value at all. Let's represent them for " D , K " as such:
- D: D or !D
- K: K or !K
This way, for any new symbol N we can simply use !N to represent it's no value state.
3 - Count
Now that we have our representations, let's use them as though counting in binary.
0: " , " otherwise represented as " !D , !K"
1: " , K " otherwise represented as " !D , K"
2: " D , " otherwise represented as " D , !K "
3: " D , K " equally represented as " D , K"
and in this way, when we eventually reach the original sequence, we will have generated every possible subsequence along the way, which for "DK" are
"", "K", "D", and "DK"
Conclusion
Ordered subsequences are really just binary numbers with different symbols, so I hope this has been fun to read, and has shown how given any ordered sequence, when trying to generate all possible ordered subsequences we can simply think of doing this as counting up to a binary number represented by the original symbol sequence ๐
Attributions
- Image gotten from this DNA sequencing article