Aims
- Short, comparable with existing URL shortening services
- Infer something about the destination content type
- Pretty (avoid unnecessary encoding)
- Easily routable in our framework
- Reproducible in print (avoid easy confused characters)
- High numbers should still be short (new tracks, albums, artists)
Research
Existing URL shortening services
There are dozens of URL shortening services, they generally map a URL to an incremental key. Length is then dependent on the popularity of the service and the time the URL was for, key encoding/compression is far less important.
-
bit.ly has a 14 char fixed length (protocol = 7, domain = 6, / = 1),
eg http://bit.ly/b11vF. Unique ID is around 3-5 chars -
TinyURL has a 19 char fixed length (protocol = 7, domain = 11, / = 1),
eg http://tinyurl.com/a6lvy. Unique ID is around 3-5 chars -
tr.im has a 13 char fixed length (protocol = 7, domain = 5, / = 1),
eg http://tr.im/z2GU. Unique ID is 2-4 chars
Last.fm has a 15 char fixed length (protocol = 7, domain = 7, / = 1), this is actually very competitive with existing services. Only one character shorter than bit.ly, the most popular shorterner, and shorter than TinyURL. Our challenge is keeping the key short, and a mechanism to identify resource types and simple request routing.
Our unique IDs
While URL shortening services generally map a URL to an incremental key, we can map directly to a resource with a restype/id combination. Assuming we shorten the key by converting the ID to a higher base then our max key length is directly related to the highest resource ids. For our most common (and largest) resource types these are:
Redacted: Specific id-ranges from Last.fm tables removed
The track table clearly (and unsurprisingly) has the largest ID. Assuming a max resource ID of 1,000,000,000 makes a very future-proof upper bound for estimating max key length.
External Link
We have an External Link class that was designed to solve some of the problems we're looking at here. It would be preferable to use existing technology if possible, however there some serious limitations to this implementation:
- The key-to-URL mapping is DB backed, this would require a DB row for each resource we map, for catalogue items this would be an upper bound of ~605,000,000 rows
- Uses a limited char set (partial lower case) and is randomly generated, not sequentially
- Existing routing uses this format, http://last.fm/inlink - though could be mapped elsewhere
This service was designed with a small set of bespoke link shortenings and would not scale to map our entire catalogue. It does have the advantage of being generated on demand, this means that popular artists, tracks and albums would be much more likely to have shorter keys. However if we can archive good key compression this is not as important.
Short URL Formats
Routing
Our main technical requirement is to easily identify and route short URL requests. They should also be distinct from any existing or reasonable future URLs to avoid confusion and to prevent blocking paths we may wish to use later.
Existing patterns include;
- Tilde is a common URL convention for denoting user pages, eg
~davids
- We use the + pattern on catalogue pages, eg
/music/Radiohead/+biography
to differentiate sub-pages and albums
Resource Type Hinting
If we plan to map a resource ID to a key we also need to include a resource mapping in the URL. This also gives us a chance to add some meaning/type-hinting to the user.
- The most useful, and verbose, method is prefixing the key with the resource type, eg artist, which is in stark contrast to out shortness aims.
- A 3 letter code, eg, tag, art, album, is a trade off between clarity and length.
- The shortest option is a single character which maps to a res type. This offers little in the way of type-hinting to the user, particularly as there will be clashes (a(rtist) vs a(lbum)) and we'd need to use non-intuitive characters.
ID/Key Encoding
The simplest way reduce the length of a resource ID is to convert it to a higher base, with a larger set of characters to represent them.
Base / ID | 10^2 | 10^3 | 10^4 | 10^5 | 10^6 | 10^7 | 10^8 | 10^9 |
---|---|---|---|---|---|---|---|---|
26 (a-z) | 3m | 1cc | ekg | 5ho4 | 24n7e | lmona | 8alepm | 3647joc |
32 (Base32) | 34 | v8 | 9og | 31l0 | ugi0 | 9h5k0 | 2vbo80 | tplig0 |
52 (a-zA-Z) | 1M | jc | 3Ag | AP4 | 75GE | 1j6bA | dzacM | 2wDOpc |
55 (readable) | 3Q | mc | 5jQ | C5c | 82AQ | 377Nc | cW4RQ | 3Zhxxc |
62 (0-9a-zA-Z) | 1C | g8 | 2Bi | q0U | 4c92 | FXsk | 6LAze | 15FTGg |
64 (Base64) | 1A | fE | 2sg | oqw | 3Q90 | C9q0 | 5Zu40 | XCIE0 |
As the base increases length decreases but so does readability. A wider range of characters means a greater chance of easily confused characters (I and l or O and 0, etc), particuarly in print.
Interestingly the length of the encoded ouput does not vary significantly between the smallest and largest encoding. Inly a character or two difference in most cases. This suggests it's not worth expensing readbility for shortness, which will often not even be noticable.
We can dimiss Base64, as it contains the + and / characters, which requires URL encoding and makes it unworkable in print.
Taking the 62 character set (0-9a-zA-Z) and removing potentially confusing characters (0, 1, i, l, o, L, O
) gives us the 55 character set that is quite readable in print, with minimal increase in encoded output length.
New Base 60
While writing this document the Last.fm office had a visit from Tantek who gave a talk on some of his recent projects including NewBase60, "A base 60 numbering system using only ASCII numbers and letters."
NewBase60 was the result of Tanteks own "algorithmically reversible shortener", with similar goals in terms of readability. Rather than removing the full set of potentially confusable characters it maps a group of confusable characters (1, i, l
) to a single character from that group.
This reduces the characters lost but maintains readability as any wrongly transcribed characters will still be correct.
Unfortunately, this only brings it to 59 available encoding characters and to make an even 60 an underscore character is included in the set. While having a round number is appealing, I think I would be happier staying at 59 and excluding underscore in the interests of readability and retaining underscore as a separator or other use in URLs.
Conclusions
- Use /+ as a routing prefix to indicate a short URL. Avoids URL encoding issues and is short, attractive and readable.
- Use a single character to indicate type, only to indicate type for decoding, not for user.
- Don't try to type-hint for user benefit, without verbosity it won't help and hinders readability
- Use programmatically generated, and reversible, keys. Avoid lookup tables.
- Use a key-encoding character set is alpha-numeric, but groups confusable characters, to balance readability.
- When dealing with smaller set of items to represent I would recommend a case-insensitive, in the interests of readability.
Suggested Format
http://last.fm/ + t d4Fa
(Root Domain) (route) (type) (encoded id)
There are some other points to consider, notably auto-generated profanity, which I will try to find time to write up soon.