Intro to Number Bases and the Cambia.BaseN Package
Methodology and Design Principles Behind the Cambia.BaseN Package
By
Steve on
Monday, August 29, 2016
Updated
Wednesday, September 07, 2016
Viewed
21,679 times. (
2 times today.)
This article explores concepts in number bases and the design considerations that went into the Cambia.BaseN Nuget package.
See this article for detailed documentation on how to use the package including code samples.
Key Concepts
Before we get started let's define some key terminology to make things easier and more clear.
Radix
Sometimes just called the base, the radix is the size of the alphabet used to represent a number. We normally use base-10 {0,1,2,3,4,5,6,7,8,9}. For base-10 numbers we would say the radix is 10. Hexadecimal is base-16 and therefore the radix is 16.
Alphabet
We already mentioned the term alphabet when defining radix, but an alphabet is the set of characters used to represent a number. In large bases, digits, letters and punctuation characters may be used. That set of characters and their specific ordering is called the alphabet. All characters in the alphabet must be unique. The length of the alphabet is called the radix or base.
Base/Radix | Alphabet |
2 | 01 |
10 | 0123456789 |
36 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ |
4 | 3A'? |
While the binary and decimal alphabets may look familiar, the last one, the base-4 alphabet, may look very strange.
While it's not commonly (or ever) used, it is a perfectly valid, albeit strange, base-4 alphabet.
Base
The base is the same as the radix and is the length of the alphabet.
Digit
Most of us typically use the term digit to refer to the numbers 0 through 9 in our decimal numbering system. However, the term digit also applies to any character in your alphabet. So whether it's a 2 or an A or a (, if it is a character in your alphabet then we may refer to it as a digit.
We use the term digit to refer to any character in any base-n alphabet.
As an example, the hexadecimal number A56ADE contains six digits.
Why Different Bases?
Aside from simply expanding your mathematical horizons, there are many reasons to represent numbers in different bases.
Compression
One common reason in computing is simply to compress the number of bytes required to represent that number. Suppose you want to use a GUID in a URL. We generally like URLs to be as short as possible.
A GUID is represented as a hexadecimal (base 16) number containing 32 digits such as
8C55203D-5DCF-4CC3-828C-A841F5EE5668
The dashes are just for readability and historical reasons. The following is equivalent:
8C55203D5DCF4CC3828CA841F5EE5668
This number could be compressed by representing it as a safe-for-URLs base-72 number as follows:
DMB!Ii;,n+Yb5l16c!uT!
That's only 21 characters now instead of 32. A 34% reduction!
Compression of numbers isn't just for computers though. It makes processing numbers easier for humans also.
A primitive way to write the number 20 would be to scratch 20 marks on a piece of papyrus or on a rock:
||||||||||||||||||||
It takes a while to figure out that the above representation is the number 20. It's obviously much easier to look at 20 and immediately understand the value.
Rather than a base of one, a base of 10 allows us to process that information much quicker and much more accurately.
Have a look here: Number Systems and Bases for further reading on how our number systems evolved.
Computing
We also need different number bases for computers.
Computers speak binary. Their preferred alphabet contains only two digits: 01.
All numbers and all data in a computer are represented in this binary format. Obviously, binary is a bit cumbersome for humans so we have decimal (base-10) which is a little more compact and easier to parse.
You might ask whether humans would be better using a base even larger than 10, such as base 60.
Whether or not it's better I don't know, but the ancient Babylonians used a base-60 system for counting. We can still see the vestiges of this in how we represent time as 60 seconds in a minute and 60 minutes in an hours.
For some additional common ways we use different bases in everyday life, have a look at this article: Use of Bases Other than 10
Number Bases Vs. Data Encoding
Converting numbers between bases is not the same as encoding data in different bases.
It's similar, but not the same.
Bytes Versus Numbers
For one, encoding does not support negative numbers. If a negative number such as -45678 where base-64 encoded it would look like 5LaS.
What happened to the minus sign! How do we know that the number is negative?
We don't know when we do a base-64 encoding. Encoding looks at the bytes used to represent the number and encodes those bytes using a base-64 alphabet.
But if we do a base-64 conversion from base-10 to base-64 (not an encoding), -45678 becomes -64mu.
When handling numbers rather than data the minus sign has special meaning.
Although the processes are different, number conversions and data encoding both rely on a fixed alphabet.
Padding
In addition to handling of negative numbers, some encoding may also require additional processing beyond converting the data to a different base.
For example, base-64 encoding may require padding the end of an encoded string with equal signs. See RFC4648 for details on that.
A Number By Any Other Name Is Still Just a Number
Before we proceed I just want to clarify something that may get lost in all of the different number representations and funky alphabets.
Whether a number is represented in base 10 (100) or base 36 (2S) it is just a number.
It still represents exactly 100 apples, just as the word apple in English means precisely the same thing as pomme in French.
The representation has changed, but the meaning is exactly the same.
Whether a number is in base-10 or base-533, only the representation is different. The number is still the same.
Dealing With Negative Numbers
So we saw above that data encoding does not handle negative numbers, yet conversions between bases does...or should.
Since encoding and conversion have so much in common we'd like to use standard alphabets for both, if possible.
But there are some problems.
A Leading Minus
Negative numbers are traditionally represented by a leading dash such as -221. It makes sense that we would preserve this standard even in other bases such as -4AF7EB to indicate a negative hexadeximal number.
However, this has an interesting side effect. It means that we cannot use the minus character in any of our alphabets. It is now a reserved character.
What does this mean? Would we lose support for negative numbers if our alphabet contains a minus sign?
Well, yes, we would. If you have a minus sign - in your alphabet then we have to treat it like a digit rather than a negative sign.
In the Cambia.BaseN package an alphabet may contain the minus sign or not. But the behavior changes.
The alphabet class (BaseNAlphabet) has a property called SupportsNegative. SupportsNegative will be false if the minus sign is present in the alphabet.
In this case, a negative number like -456 will be converted as though the minus sign were a normal digit. It sees the number actually as a positive number!
Because of this potential confusion our Cambia95 built-in alphabet (discussed below) puts the minus sign at the end of the alphabet. As long as the base you are dealing with is less than 95 the minus sign is excluded and your conversions will support negative numbers.
The Base-64 Alphabet for Encoding is Bad for Conversions
Base-64 is a commonly used base for data encoding.
It's somewhat awkward because the alphabet requires two punctuation characters.
If the alphabet contained the digits 0..9, the uppercase letters A..Z and the lowercase letters a..z, well that's a clean set of characters. But they only amount to 62 characters. We'll have to add a couple of punctuation characters to have an alphabet with 64 unique digits.
In computing we like to use numbers that are powers of two like 64.
Perhaps there are other reasons base-64 is a common standard rather than base-62, but the point is that base-62 is a much more natural alphabet because it doesn't require any non-alphanumeric characters.
Anyway, why is this a problem? Who cares about a couple of extra puncutation characters!?
The standard base-64 alphabet from RFC4648 uses + and / as the two additional punctuation characters. This is a somewhat unfortunate choice because the plus and slash are more likely than some other punctuation to introduce conflicts in URLs or filenames.
Therefore RFC4648 introduces a "safe" alphabet for use with URLs and filenames. Instead of +/ as the punctuation chacters to round out the base 64 alphabet they replace those with dash and underscore -_.
Dash and underscore are almost universally allowed in URLs and filenames.
Unfortunately, the dash, or minus sign, creates a complication for us when dealing with negative numbers, hence our divergence from the data encoding standards.
Instead of the dash we will use the tilde ~. This is safe in URLs and filenames and allows us to support negative numbers because the minus sign will not be in our alphabet and can be reserved strictly for indicating negative numbers.
Now a leading dash will always indicate that the number is a negative number.
The tilde replacement is part of our Cambia95 alphabet described in the next section.
The Cambia95 Alphabet
There are many inconsistencies and conflicts between the various different alphabets used for encoding or for conversion.
AS RFC4648 points out there is no universal alphabet that works in all cases.
However, it strikes me that we could do better.
As part of the Cambia.BaseN package I implement an alphabet which is at least a little better than other systems at being useful in more situations and therefore something more amenable to standardization.
The goals of the Cambia95 alphabet are as follows:
- Put potentially problematic characters toward the end of the alphabet.
- Use an ordering that makes sense for truncated alphabets.
- Include the full 95 printable characters from the US-ASCII standard.
- Let the US-ASCII ordering of characters be a guide.
- Put the minus sign at the end of the alphabet in order to have the largest alphabet possible (base-94) that will support negative numbers.
Let's look at each of these in a little more detail.
1. Problematic Characters
One of the goals of the Cambia95 alphabet is that you can take the first 10, 36, 62, 64, 72, 85 characters or whatever and still have an optimal alphabet for that base.
This means the first ten elements should be 0-9 so that if you are working with base 10 that you get normal decimal numbers. It means that the first 16 characters should match the standard hexadecimal alphabet. It means that the first 62 characters should include all of the alphanumeric characters: digits, uppercase letters, lowercase letters.
Up to an alphabet of length 62 things are pretty standard. (Although the official Ascii85 alphabet puts the lowercase letters before the uppercase ones. This is less common than upper first and furthermore the US-ASCII table has upper first.)
Once we move to base-64 we have to include two non-alphanumeric characters. RFC4648 unfortunately is not ideal by supporting +/, or -_ as the "safe" alternative.
The slash is commonly used in paths and the minus would ideally be reserved for negation so that an alphabet could be maximally suitable for both encoding and number representations.
I'm sure there were other considerations at the time these choices were made so I'm not knocking the choices.
However, Cambia95 includes tilde and underscore ~_ as the 63rd and 64th characters in the alphabet. We make this choice because tilde and underscore are almost universally supported in URLs and filenames. And they don't include the minus sign.
2. Ordering for Shorter Alphabets
We touched on this in the previous section, but the idea is that you could take the first N characters of the Cambia95 alphabet and still have an optimal set of characters with the lowest possibility for problems and conflicts.
Base/Radix | Alphabet based on Cambia95 |
10 | 0123456789 |
16 | 0123456789ABCDEF |
36 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ |
62 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz |
64 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_ |
72 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@ |
85 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@.:=^*?&<>[]{} |
95 | 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@.:=^*?&<>[]{}%#|`/\ "'- |
3. Include all 95 Printables
Among the many common alphabets most of them stop at a certain length such as 16, 36 or 64 and don't include all 95 printable US-ASCII characters.
Some applications of base-N conversions are intended primarily for compression of data. Many of those scenarios don't need to worry about safe characters etc.
With Cambia95 we extend the standardization to the full 95 characters to serve the situations where some additional compression would be useful.
Users can assess their own needs and truncate the alphabet to the length that is most suitable for their situation.
The shorter the alphabet the fewer the problematic characters.
By choosing a radix suitable for their situation and just truncating the Cambia95 alphabet accordingly, users can maximize their compression with a long alphabet (high radix) and minimize compatibility issues knowing that the more commonly problematic characters are toward the end of the Cambia95 alphabet and may be left out if desired.
Of course, the suitability will depend on each user's particular problem domain which is why the Cambia.BaseN package also allows you to define your own alphabet.
4. Let the US-ASCII Standard Ordering Be a Guide
When other considerations don't apply, the ordering in the US-ASCII table is considered in determining the order of characters.
For example, whether capital letters should precede lowercase letters can be decided by the ordering in the US-ASCII table. Upper precedes lower.
Sometimes we considered the visual aesthetic of otherwise equally problematic characters. But when even this doesn't seem to apply the ASCII order offers guidance.
5. Minus Sign at the End
You'll notice the minus sign (dash) is the very last, 95th, character in the Cambia95 alphabet.
This is because it is reserved as the negation character when representing negative numbers in different bases.
By having the minus sign last in the alphabet, we can offer support for negative numbers in larger bases, up to a radix of 94.
If an alphabet contains the minus character, then it can't support negative numbers.
That's fine for encoding data, but not for representing numbers in different bases.
Suppose our alphabet consists of
0123456789ABCDEF-
for a base-17 alphabet where the minus sign is the 17th digit in the alphabet.
Is the following number a negative number?
-4F7BC623A
What about these?
--4F7BC623A
-4F-BAC623A
We can't tell. This is a problem.
There are other ways we could do this that would allow us to support negative numbers while still having a minus sign in our alphabet.
For example, we could require that every number representation begin with a sign character.
In other words, every number would have to begin with - or +. If the sign character were missing it would be an invalid number.
The problem with this is that it's very non-standard. Almost universally, a leading minus sign indicates a negative number and no sign indicates a positive number. Not supporting this standard would lead to a lot of confusion.
Special Alphabets
The Cambia.BaseN package contains a class called BaseNAlphabet. It represents an alphabet like what we've been talking about above.
With this class you can define any alphabet you like so long as it has at least two unique characters. You can then convert a number into that base.
However, most of the time, you won't need to define your own alphabet. There are many standard alphabets including the Cambia95 standard we describe above which is the main alphabet used to power the Cambia.BaseN package.
We list several of the built-in alphabets below.
If the alphabet name is of the form BaseN, where N is the radix, then the alphabet is based on the Cambia95 alphabet and is just a truncated version comprised of the first N digits.
In order to include some of the other standards out there, we'll qualify the BaseN designation with additional information.
For example, Base64_RFC4648 is the standard Base-64 alphabet defined in RFC4648 and is not based on the Cambia95 alphabet.
These special alphabets are accessible as static properties on the BaseNAlphabet class:
//
// How to instantiate the special alphabets in C#
//
BaseNAlphabet alpha = null;
alpha = BaseNAlphabet.Base2;
alpha = BaseNAlphabet.Base8;
alpha = BaseNAlphabet.Base10;
alpha = BaseNAlphabet.Base16;
alpha = BaseNAlphabet.Base36;
alpha = BaseNAlphabet.Base62;
alpha = BaseNAlphabet.Base64;
alpha = BaseNAlphabet.Base64_RFC4648;
alpha = BaseNAlphabet.Base64_RFC4648_Safe;
alpha = BaseNAlphabet.Base72;
alpha = BaseNAlphabet.Base72_Safe;
alpha = BaseNAlphabet.Base85;
alpha = BaseNAlphabet.Base85_Ascii85;
alpha = BaseNAlphabet.Base94;
alpha = BaseNAlphabet.Base95;
Base2:
01
Name | Binary |
Radix | 2 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 2 digits. |
Base8:
01234567
Name | Octal |
Radix | 8 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 8 digits. |
Base10:
0123456789
Name | Decimal |
Radix | 10 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 10 digits. |
Base16:
0123456789ABCDEF
Name | Hexadecimal |
Radix | 16 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 16 digits. |
Note that in Cambia.BaseN only uppercase letters are valid hexadeximal digits. If you want to support lowercase letter you'll need to conver the string to uppercase before processing.
Base36:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
Radix | 36 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 36 digits. |
Like the base-16 alphabet only uppercase letters are valid digits. If you want to support lowercase letters on base-36 you'll need to convert the string to uppercase before processing.
Base62:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Radix | 62 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 62 digits. |
Base62 is the upper limit of clean ASCII alphabets because to go larger we must include punctuation and other ugly printable characters that are not purely alphanumeric.
Base64:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_
Radix | 64 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 64 digits. |
Base64 is a special alphabet for data encoding, but it's not so special for number representations. Nonetheless, because base-64 is so common for encoding we include our own version which supports negative numbers and is still safe for URLs and filenames.
Base64_RFC4648:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/
Radix | 64 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | No |
Based on | RFC4648 Section 4 |
NOTE: Cambia.BaseN is for representing numbers in different bases. It does not do proper Base-64 encoding of data as described in RFC4648. Base-64 encoding requires some padding characters, usually represented by = which the Cambia.BaseN library does not concern itself with.
Base64_RFC4648_Safe:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_
Radix | 64 |
Supports Negative Numbers | No |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | RFC4648 Section 5 |
The plus and forward slash +/ were replaced by the dash and underscore -_ which are usually safe for URLs and filenames.
Unfortunately, because of the dash this alphabet cannot support negative numbers.
Base72:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@
Radix | 72 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 72 digits. |
This alphabet is the largest in which number representations would be valid in URLs and filenames on most operating systems.
Base72_Safe:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@
Radix | 72 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Yes |
Based on | Cambia95 — first 72 digits. |
This alphabet is exactly the same as Base72. We just add this one to specifically call out that it is safe for URLs and filenames on most operating systems.
This alphabet and all shorter truncations of Cambia95 are safe for URLs and filenames. But you will need to evaluate whether it is safe for your scenario.
Longer truncations of Cambia95 could also still be valid for filenames or even possibly URLs, but 72 marks a point of increasing complexity. If you'd rather not think too hard about it, use this as your maximum sized alphabet.
Base85:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@.:=^*?&<>[]
Radix | 85 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | No |
Safe for URLs and filenames | Maybe, but probably not |
Based on | Cambia95 — first 85 digits. |
Base85_Ascii85:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.-:+=^!/*?&<>()[]{}@%$#
Radix | 85 |
Supports Negative Numbers | No |
Has Whitespace Digits | No |
Safe for URLs and filenames | No |
Based on | ASCII85 |
As with Base64, the Base85 or ASCII85 alphabet is intended for data encoding and not representation of numbers in a different base. But it has a formal specification so we include support for it.
Note that this standard puts the lowercase letters before the uppercase ones. This is uncommon and the opposite order of our Cambia95 alphabet.
Base94:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@.:=^*?&<>[]{}%#|`/\ "'
Radix | 94 |
Supports Negative Numbers | Yes |
Has Whitespace Digits | Yes |
Safe for URLs and filenames | No |
Based on | Cambia95 — first 94 digits. |
Again a truncation of the Cambia95 alphabet, this alphabet includes all digit except the minus.
Base95:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~_!$()+,;@.:=^*?&<>[]{}%#|`/\ "'-
Radix | 95 |
Supports Negative Numbers | No |
Has Whitespace Digits | Yes |
Safe for URLs and filenames | No |
Based on | Cambia95 — first 95 digits. |
This alphabet is what we call the Cambia95 alphabet. See above for more details on how we forumulated the Cambia95 alphabet.
A Word About Upper and Lower Case
In situations where casing doesn't matter people often equate upper and lower case letters.
For example, in hexadecimal (base-16) people often think of a and A as equivalent.
The following two numbers are thought of as identical:
1FA68B45
1fa68b45
The following two GUIDs are thought of as identical (GUIDs are base-16 numbers):
07CBA0E1-48A8-49CA-9CB0-5897AF73C7D9
07cba0e1-48a8-49ca-9cb0-5897af73c7d9
But in a computer, A and a are completely distinct entities.
The Cambia.BaseN library will not equate upper and lower case letters.
This is because once you get to bases above 36 you have both upper and lower case letters in your alphabet. In base-62 for example A really is distinct from a.
Our standard alphabets based on Cambia95 will support only capital letters for a radix of 36 or lower.
Hexadecimal digits will only support uppercase letters.
0123456789ABCDEF
If the letters were lowercase then this would not be a valid hexadecimal number:
0123456789abcdef (invalid base-16 number)
Therefore when using the Cambia.BaseN library, uppercase your hexadecimal values before processing them if you want both upper and lowercase letters to be accepted.
Of course, life is a little simpler when working with GUIDs because the package supports them directly.