In many situations, it’s necessary to have a unique identifier for a person or thing, and such identifiers are especially useful when we as humans can use them for verification and trust purposes. Many people can have the same name, so things like social security numbers, phone numbers, and driver’s license numbers have been used to uniquely identify people. In the past, memorizing the 10-digit phone numbers of all your associates was more commonplace, but as phone numbers went from “party lines” shared by many families, to “land lines” representing one family of multiple people, to mobile numbers where every person has their own, it became harder to memorize them all.
Online, usernames and email addresses have been used as unique identifiers for individual users of a service. But as more and more people use online services, username collisions occur, we need longer and longer usernames to stay unique (which makes them harder to memorize), and one person may not have the same username across all services. Email addresses have become more commonplace too, and each person now likely has even more email addresses than phone numbers!
So, what’s the future of identifiers? How can we uniquely identify all the people and things we want to keep track of in this “Internet of Things” connected web we live in now?
The Key to the Problem
The solution of a “universally unique identifier” (UUID) has been a part of computer science for a while, and the easiest solution is to assign a number to everything.
However, if there’s no central authority registering who has which number, there’s a problem if two things are assigned the same number. To get around that, instead of assigning numbers sequentially, have everyone pick a random number from the pool, and make the pool big enough that the odds of two parties picking the same number are vanishingly small.
The current standard for UUIDs (also called GUIDs) is to use a 128-bit number as the identifier. 128 bits gives you 2¹²⁸ different possibilities (340,282,366,920,938,463,463,374,607,431,768,211,456 possibilities, or about 3.4 × 10³⁸). Given that there are only about 7 billion people on the planet (roughly 10¹⁰), each person could have a different identifier for every other person on the planet (about 5 × 10¹⁹ identifiers in total) and still barely make a dent in the pool.
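As a rough illustration of the "big enough pool" argument, here is a minimal Python sketch. It uses the standard-library `uuid` module to generate a random (version 4) UUID, which carries 122 random bits because 6 of the 128 bits are fixed version and variant markers. The `collision_probability` helper is a name made up for this example, and it relies on the usual birthday-problem approximation rather than an exact calculation.

```python
import math
import uuid

UUID_BITS = 128
pool_size = 2 ** UUID_BITS  # number of distinct 128-bit values


def collision_probability(id_count: int, random_bits: int) -> float:
    """Approximate chance that at least two of id_count random IDs collide.

    Uses the birthday-problem approximation 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    exponent = id_count * (id_count - 1) / (2 * 2 ** random_bits)
    return -math.expm1(-exponent)


print(f"{pool_size:,}")   # size of the full 128-bit pool
print(uuid.uuid4())       # one randomly picked identifier
# Chance of any collision if every person on Earth picked one random UUID.
print(collision_probability(7_000_000_000, 122))
```

The last line prints a probability so small that it is effectively zero for everyday purposes, which is the whole point of picking from such a large pool.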
Several cryptocurrencies, including Bitcoin and Ethereum, push that number higher, using a 160-bit number to represent each “address” in their ecosystems (2¹⁶⁰ possibilities: 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976, or about 10⁴⁸).
IPFS uses 256-bit numbers (as the current default, with the option of longer identifiers too), since it needs to give every version of every document ever created its own identifier.
Now, those sorts of identifiers are great in terms of uniqueness and can be used for great security (they’re hard to guess), so computers love them, but they really are horrible to use as a human. If you are trying to make a friend request to your buddy who’s already got an ID assigned, typing out that ‘username’ into a search field is going to be frustrating, and your buddy’s profile page on that service needs a lot more screen real estate to show off that longer ‘username’. So how can we keep the security of long identifiers, but add more usability for humans who interact with them?
Scan for Efficiency
Being able to identify a thing is only really useful to a human if we can accurately verify that the identifier is what we expect. If I tell you my username is “midnightlightning”, and you arrive at a user’s profile that’s “midnight1ightning”, are you on my page? Hopefully it doesn’t take you too long to figure out whether those two usernames match (they don’t; one letter ‘l’ has been swapped for the digit ‘1’).
But now, if I told you my Bitcoin address is 1MgPvHbMKCXZqeBpDTPH9a8B9cNr8E7hv6, and you arrive at a payment page that is sending coins to 1MgPvHbMKCXZqeBpDTPH9a8B9cNr8E7hv6, are you sending them to the right person? Did you find the difference between those two identifiers? No? Good, because they’re actually the same. But how long did it take you to determine that? If they were vertically-aligned it would be easier to scan by eye character by character, but very seldom would that situation arise.
Most of the time a person would look at the first 4–8 characters and the last 4–8 characters and check that they match and call it good. But knowing that, it’s relatively easy to generate another address that has the first four and last four characters the same as another address (by brute-force generating random addresses until you find one that matches). So, even though it’s an important thing to verify that a Bitcoin address is correct before sending funds to it, it’s a frustrating experience to do properly as a human user.
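To put a rough number on that brute-force attack, here is a back-of-the-envelope Python sketch. It assumes every character of an address is an independent, uniformly random pick from a 58-character alphabet, which is only an approximation for real Bitcoin addresses (the leading character, for instance, is not random). The function name `expected_attempts` is made up for this example.

```python
def expected_attempts(matching_chars: int, alphabet_size: int = 58) -> int:
    """Average number of random addresses to try before one matches.

    Each attempt succeeds with probability 1 / alphabet_size**matching_chars,
    so the expected number of attempts is the inverse of that.
    """
    return alphabet_size ** matching_chars


# Matching the first four and last four characters (eight in total).
print(f"{expected_attempts(4 + 4):,}")
# Matching the first eight and last eight characters (sixteen in total).
print(f"{expected_attempts(8 + 8):,}")
```

Each extra character a human actually checks multiplies the attacker's work by 58 under these assumptions, which is why checking only a handful of characters leaves so much room for look-alike addresses.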
So how can we help humans compare them better, knowing that attackers may try and generate an identifier very similar to the real one?
One thing the Ethereum Mist wallet application does is show an identicon next to each identifier/address. Identicons are simple shape-and-color pictures, which are generated from the identifier number in such a way that identifiers that differ by only a character or two generate wildly different images. It’s not foolproof, as there are far fewer possible identicons than there are addresses, so there are different addresses that will generate the same identicon. But those addresses are likely very different from each other, so even just scanning the first four characters of the address would tell you it’s wrong.
This works within the application, but breaks down if you use the same scheme on a website that’s displaying Ethereum addresses. How do you trust that the website is showing the true identicon for that address, and not trying to con you into trusting a bad address? I think this scheme could work if implemented as a browser plugin: the plugin code could be audited for truthfulness, and then on any website, hovering over anything that could be an Ethereum address could show a popup with the identicon of that address. The plugin could also offer an area to paste in an address from the clipboard, so you can compare the identicon for an address from another source (e.g. email) to one being displayed on a website.
So, an identicon can help, but you still need to check against the actual identifier, so not a bulletproof solution.
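To show the general idea of deriving a picture from an identifier, here is a minimal Python sketch. It is not the algorithm Mist uses; it is an assumed, simplified scheme that hashes the identifier with SHA-256 and uses a few bits of the hash to fill a mirrored 5×5 grid of text "pixels". The function name `identicon_rows` is made up for this example.

```python
import hashlib


def identicon_rows(identifier: str) -> list[str]:
    """Build a 5x5, left-right symmetric text pattern from an identifier."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    rows = []
    for row_index in range(5):
        # Three bits per row decide the two left cells and the middle cell.
        bits = [(digest[row_index] >> column) & 1 for column in range(3)]
        left = ["#" if bit else "." for bit in bits]
        # Mirror the first two cells so the picture is symmetric.
        rows.append("".join(left + [left[1], left[0]]))
    return rows


for name in ("midnightlightning", "midnight1ightning"):
    print(name)
    print("\n".join(identicon_rows(name)))
    print()
```

Because the pattern comes from a hash, changing a single character of the input scrambles the bits that drive the picture. With only 15 bits of pattern in this toy version, though, many different identifiers necessarily share the same image, which mirrors the limitation described above.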
Beyond the Digits
Part of the pain of verifying a UUID is how long the number is when expressed in decimal. Decimal uses only ten characters to represent numeric values, so lots of characters are needed to represent a large number. The upside for human perception is that we all know those ten characters (0123456789) mean the thing being represented is a number, and not some other data (a word or password).
To make an identifier more recognizable to a human, shortening it would help. For a number, that can be done without losing specificity by using a higher-radix numbering system. Hexadecimal uses 16 characters (0123456789ABCDEF) to represent numbers, with ‘A’ representing 10, ‘B’ representing 11, and so on up to ‘F’ representing 15. You get some shortening with low numbers just by being able to express the numbers 10–15 in one character, while in decimal they take two. For higher-value numbers, the second digit of a hexadecimal number represents “groups of 16” rather than “groups of 10” in decimal, and the third digit is “groups of 256” instead of “groups of 100” (16² vs. 10²). The difference between the place values gets bigger and bigger as this progresses, such that a number that takes 15 digits in decimal (100,000,000,000,000, or 10¹⁴) takes only 12 digits in hexadecimal (the 13th hexadecimal digit has a place value of 16¹² = 281,474,976,710,656, which is greater than 10¹⁴).
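The place-value idea can be sketched in a few lines of Python. The `to_radix` function below is a name made up for this example; it repeatedly divides by the size of the alphabet and maps each remainder to a character, which works for any radix as long as you supply one character per digit value.

```python
DECIMAL_DIGITS = "0123456789"
HEX_DIGITS = "0123456789ABCDEF"


def to_radix(value: int, alphabet: str) -> str:
    """Write a non-negative integer using len(alphabet) as the radix."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    base = len(alphabet)
    if value == 0:
        return alphabet[0]
    digits = []
    while value > 0:
        value, remainder = divmod(value, base)
        digits.append(alphabet[remainder])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))


number = 100_000_000_000_000
decimal_text = to_radix(number, DECIMAL_DIGITS)
hex_text = to_radix(number, HEX_DIGITS)
print(decimal_text, len(decimal_text))
print(hex_text, len(hex_text))
```

Running it prints the same number in both radixes along with the digit count of each, so you can see how much shorter the hexadecimal form is.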
So we can make it slightly shorter by using 16 characters to represent numeric values (“radix-16 number”). And indeed the standard for representing UUID values is to show them in hexadecimal, with hyphens in certain places to make it easier for the human eye to compare clusters of numbers. So, a UUID would typically be shown as:
Rather than its decimal representation, which is:
But can we push that further? Bitcoin takes this out to 58 characters (123456789 ABCDEFGHJKLMNPQRSTUVWXYZ abcdefghijkmnopqrstuvwxyz) to create a radix-58 number. The 58 characters were picked from the common Latin digits and letters, excluding those that look similar to each other (zero and capital ‘o’, capital ‘i’ and lower-case ‘L’). Removing characters that look similar helps with hand-written identifiers or ones shown in fonts that don’t distinguish well between them.
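Here is a minimal Python sketch of radix-58 encoding using that alphabet. It follows the convention Bitcoin uses of writing each leading zero byte as a ‘1’, but it deliberately leaves out the checksum and version-byte handling that real Bitcoin addresses include, so treat it as an illustration of the radix conversion only. The function names are made up for this example.

```python
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as a radix-58 string (no checksum is added)."""
    value = int.from_bytes(data, "big")
    encoded = ""
    while value > 0:
        value, remainder = divmod(value, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    # Leading zero bytes would vanish in the integer, so keep them as '1's.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def base58_decode(text: str) -> bytes:
    """Decode a radix-58 string back into bytes (no checksum is verified)."""
    value = 0
    for character in text:
        value = value * 58 + BASE58_ALPHABET.index(character)
    leading_ones = len(text) - len(text.lstrip("1"))
    body = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_ones + body


sample = b"\x00\x01\x02 some bytes"
round_trip = base58_decode(base58_encode(sample))
print(base58_encode(sample), round_trip == sample)
```

Note that `BASE58_ALPHABET.index` raises a `ValueError` for any of the excluded look-alike characters, which is a small usability win in its own right: a mistyped ‘0’ or ‘O’ is rejected outright instead of silently decoding to a different number.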
So, my Bitcoin address that I mentioned earlier is typically presented as:
Rather than the decimal representation, which is:
Base64 encoding uses 64 characters to represent numeric values. It’s efficient for computers since 64 is a power of 2, so each character maps to a whole number of bits. It takes even fewer characters to represent a number in base64 than in hexadecimal, but you are now using characters that could be mistaken for each other. Base64 sits in an uncomfortable middle ground where it’s very efficient for computers, but starting to look like random garbage to a human.
Does it help to go out further? The standard ASCII alphabet has 94 printable characters (the others are whitespace, carriage returns, and other control characters). Using all of them to represent numeric values would give a radix-94 number, but that would include characters like slashes and quotation marks, which generally have special meaning in computer code, so they can be problematic to even display on a website, as those special characters would need to be escaped.
So, a radix-94 number is able to represent a numeric value in only a few characters, but it also loses a lot of its identifiability for human interpretation at a glance. Randomly-generated “secure passwords” often end up looking like this: a jumble of symbols that no human actually memorizes, with its main virtue being that it’s very short, and little else.
Pushing it to the extreme of shortness like that ends up being not as useful, but what other options are there? The Bitcoin Improvement Proposal #39 (BIP39) started out as a way to translate/encode any binary data as a mnemonic sentence (the final version of that proposal was changed to not be a two-way encoding, but the original idea is relevant here). It proposed that each word in a list of 2,048 words be assigned a numeric value, and that a sentence then be constructed from those words, effectively making a radix-2048 number. Now you only need a few words to describe a large number, though the words themselves are multiple characters, so they take up more space when printed out. But we get many benefits with this trade-off:
- Because the representation is words, existing spell-checkers help correct them, and on mobile devices they’re easier to type, as mobile keyboards tend to help with real words.
- Text highlighting for copy/pasting is more natural.
- Humans have a much easier time memorizing them (e.g. “correct horse battery staple”).
- The identifier can wrap in different places as there are spaces between words, so large horizontal stretches of screen real estate don’t need to be reserved.
Implementing such a scheme has many options that would need to be standardized. 2,048 words is a short enough list that you can make every word have a unique four letter start to it in English, and different word lists for different languages could be devised to confer the same typing and spell-checking benefits to other nationalities.
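The word-list idea is the same radix conversion as before, just with words as the "digits". The Python sketch below uses a toy eight-word list invented for this example (so each word carries 3 bits, rather than the 11 bits a 2,048-word list would carry); it is not the BIP39 word list, and the function names are hypothetical.

```python
import math

# Toy list for illustration only; a real scheme would use 2,048 words.
TOY_WORDS = ["apple", "brick", "cloud", "drum", "eagle", "flame", "grape", "house"]


def to_words(value: int, words: list[str]) -> str:
    """Write a non-negative integer as a sentence in radix len(words)."""
    base = len(words)
    if value == 0:
        return words[0]
    sentence = []
    while value > 0:
        value, remainder = divmod(value, base)
        sentence.append(words[remainder])
    return " ".join(reversed(sentence))


def from_words(sentence: str, words: list[str]) -> int:
    """Turn a sentence produced by to_words back into the integer."""
    lookup = {word: index for index, word in enumerate(words)}
    value = 0
    for word in sentence.split():
        value = value * len(words) + lookup[word]
    return value


print(to_words(83, TOY_WORDS))                      # brick cloud drum
print(from_words("brick cloud drum", TOY_WORDS))    # 83
# Words needed for a 128-bit number with a 2,048-word (11-bit) list.
print(math.ceil(128 / 11))                          # 12
```

With a full 2,048-word list, a 128-bit identifier fits in a 12-word sentence, which is long to print but far friendlier to type and proofread than a run of random characters.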
Words are something we’ve been dealing with as humans for all of recorded history, so we’re quite familiar with them, but can we also make the identifier shorter? In recent years, there’s a new player on the block that represents whole words/concepts in a single character: emoji! 😄👍🚀🎉 The list of emoji included in the Unicode standard keeps growing, and is 1,085 characters currently (this excludes flags and “zero-width joiner” (ZWJ) sequences). That’s slightly more than 1,024, which is a power of two, so it works well for computer storage and transmission. So, if we made a “word list” of 1,024 emoji characters, we could use that to form a radix-1024 representation of large numbers with only one character per digit.
Here’s a sample implementation of that, setting 1,024 emojis to represent 0–1,023: https://github.com/MidnightLightning/radix-emoji
As an example, taking the first few characters from the standard “Man is distinguished, not only by his reason…” quote, the first five characters (“Man i”) can be expressed with only four emoji: ❎🏀📱🐻
Saving one character isn’t all that impressive, but that’s encoding ASCII values. Going back to that Bitcoin address of mine, in standard base-58 encoding it is:
Which in hexadecimal is:
That’s 25 bytes of data, which can be expressed as only 20 emoji characters:
That’s better than the 34 characters needed for base-58 representation, or 50 characters needed for hexadecimal representation. Now, showing that on-screen may take up slightly more horizontal space than the base-58 one, but in terms of human readability, it contains more that is immediately recognizable. Try comparing that address to 🆓🕒🚣🍮⛰🍂🗣☪🐌🏒⚫💃🚧🙉👸🃏🍤🏓🙋⏱🎂. Even though it wraps and is not vertically-aligned, how long does it take you to determine if they are the same? (they’re not; outbox has become black joker).
So, what do you think; easier to utilize, and compare? This idea would need further fleshing out, to figure out some sort of padding scheme if the data to be encoded is not evenly divisible by 5 bytes, and make sure the list of emojis is a set everyone can utilize, but I’d love to hear what others think of this sort of identifier identification!
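To make the 5-byte grouping concrete, here is a minimal Python sketch of the arithmetic behind it: 5 bytes are 40 bits, which split evenly into four 10-bit digits, each with a value from 0 to 1,023. The big-endian bit order used here is an assumption and may not match the sample implementation linked above, and the sketch stops at the digit values because mapping them to pictures depends on the particular 1,024-emoji list chosen. The function name is made up for this example.

```python
def to_base1024_digits(data: bytes) -> list[int]:
    """Split bytes into 10-bit digits, four digits per 5-byte group."""
    if len(data) % 5 != 0:
        # No padding scheme is defined in this sketch.
        raise ValueError("length must be a multiple of 5 bytes")
    digits = []
    for offset in range(0, len(data), 5):
        chunk = int.from_bytes(data[offset:offset + 5], "big")
        # Peel off four 10-bit digits, most significant first.
        for shift in (30, 20, 10, 0):
            digits.append((chunk >> shift) & 0x3FF)
    return digits


print(to_base1024_digits(b"Man i"))            # [309, 534, 904, 105]
print(len(to_base1024_digits(bytes(25))))      # 20
```

Each resulting number would then index into the emoji list, so five ASCII characters become four emoji and a 25-byte address becomes twenty.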