Short URLs

Aims

Short, comparable with existing URL shortening services
Infer something about the destination content type
Pretty (avoid unnecessary encoding)
Easily routable in our framework
Reproducible in print (avoid easy confused characters)
High numbers should still be short (new tracks, albums, artists)

Research

Existing URL shortening services

There are dozens of URL shortening services, they generally map a URL to an incremental key. Length is then dependent on the popularity of the service and the time the URL was for, key encoding/compression is far less important.

bit.ly has a 14 char fixed length (protocol = 7, domain = 6, / = 1),
eg http://bit.ly/b11vF. Unique ID is around 3-5 chars
TinyURL has a 19 char fixed length (protocol = 7, domain = 11, / = 1),
eg http://tinyurl.com/a6lvy. Unique ID is around 3-5 chars
tr.im has a 13 char fixed length (protocol = 7, domain = 5, / = 1),
eg http://tr.im/z2GU. Unique ID is 2-4 chars

Last.fm has a 15 char fixed length (protocol = 7, domain = 7, / = 1), this is actually very competitive with existing services. Only one character shorter than bit.ly, the most popular shorterner, and shorter than TinyURL. Our challenge is keeping the key short, and a mechanism to identify resource types and simple request routing.

Our unique IDs

While URL shortening services generally map a URL to an incremental key, we can map directly to a resource with a restype/id combination. Assuming we shorten the key by converting the ID to a higher base then our max key length is directly related to the highest resource ids. For our most common (and largest) resource types these are:

Redacted: Specific id-ranges from Last.fm tables removed

The track table clearly (and unsurprisingly) has the largest ID. Assuming a max resource ID of 1,000,000,000 makes a very future-proof upper bound for estimating max key length.

External Link

We have an External Link class that was designed to solve some of the problems we're looking at here. It would be preferable to use existing technology if possible, however there some serious limitations to this implementation:

The key-to-URL mapping is DB backed, this would require a DB row for each resource we map, for catalogue items this would be an upper bound of ~605,000,000 rows
Uses a limited char set (partial lower case) and is randomly generated, not sequentially
Existing routing uses this format, http://last.fm/inlink - though could be mapped elsewhere

This service was designed with a small set of bespoke link shortenings and would not scale to map our entire catalogue. It does have the advantage of being generated on demand, this means that popular artists, tracks and albums would be much more likely to have shorter keys. However if we can archive good key compression this is not as important.

Short URL Formats

Routing

Our main technical requirement is to easily identify and route short URL requests. They should also be distinct from any existing or reasonable future URLs to avoid confusion and to prevent blocking paths we may wish to use later.

Existing patterns include;

Tilde is a common URL convention for denoting user pages, eg ~davids
We use the + pattern on catalogue pages, eg /music/Radiohead/+biography to differentiate sub-pages and albums

Resource Type Hinting

If we plan to map a resource ID to a key we also need to include a resource mapping in the URL. This also gives us a chance to add some meaning/type-hinting to the user.

The most useful, and verbose, method is prefixing the key with the resource type, eg artist, which is in stark contrast to out shortness aims.
A 3 letter code, eg, tag, art, album, is a trade off between clarity and length.
The shortest option is a single character which maps to a res type. This offers little in the way of type-hinting to the user, particularly as there will be clashes (a(rtist) vs a(lbum)) and we'd need to use non-intuitive characters.

ID/Key Encoding

The simplest way reduce the length of a resource ID is to convert it to a higher base, with a larger set of characters to represent them.

Base / ID	10^2	10^3	10^4	10^5	10^6	10^7	10^8	10^9
26 (a-z)	3m	1cc	ekg	5ho4	24n7e	lmona	8alepm	3647joc
32 (Base32)	34	v8	9og	31l0	ugi0	9h5k0	2vbo80	tplig0
52 (a-zA-Z)	1M	jc	3Ag	AP4	75GE	1j6bA	dzacM	2wDOpc
55 (readable)	3Q	mc	5jQ	C5c	82AQ	377Nc	cW4RQ	3Zhxxc
62 (0-9a-zA-Z)	1C	g8	2Bi	q0U	4c92	FXsk	6LAze	15FTGg
64 (Base64)	1A	fE	2sg	oqw	3Q90	C9q0	5Zu40	XCIE0

As the base increases length decreases but so does readability. A wider range of characters means a greater chance of easily confused characters (I and l or O and 0, etc), particuarly in print.

Interestingly the length of the encoded ouput does not vary significantly between the smallest and largest encoding. Inly a character or two difference in most cases. This suggests it's not worth expensing readbility for shortness, which will often not even be noticable.

We can dimiss Base64, as it contains the + and / characters, which requires URL encoding and makes it unworkable in print.

Taking the 62 character set (0-9a-zA-Z) and removing potentially confusing characters (0, 1, i, l, o, L, O) gives us the 55 character set that is quite readable in print, with minimal increase in encoded output length.

New Base 60

While writing this document the Last.fm office had a visit from Tantek who gave a talk on some of his recent projects including NewBase60, "A base 60 numbering system using only ASCII numbers and letters."

NewBase60 was the result of Tanteks own "algorithmically reversible shortener", with similar goals in terms of readability. Rather than removing the full set of potentially confusable characters it maps a group of confusable characters (1, i, l) to a single character from that group.

This reduces the characters lost but maintains readability as any wrongly transcribed characters will still be correct.

Unfortunately, this only brings it to 59 available encoding characters and to make an even 60 an underscore character is included in the set. While having a round number is appealing, I think I would be happier staying at 59 and excluding underscore in the interests of readability and retaining underscore as a separator or other use in URLs.

Conclusions

Use /+ as a routing prefix to indicate a short URL. Avoids URL encoding issues and is short, attractive and readable.
Use a single character to indicate type, only to indicate type for decoding, not for user.
Don't try to type-hint for user benefit, without verbosity it won't help and hinders readability
Use programmatically generated, and reversible, keys. Avoid lookup tables.
Use a key-encoding character set is alpha-numeric, but groups confusable characters, to balance readability.
When dealing with smaller set of items to represent I would recommend a case-insensitive, in the interests of readability.

Suggested Format

http://last.fm/    +        t          d4Fa

(Root Domain)   (route)   (type)   (encoded id)

There are some other points to consider, notably auto-generated profanity, which I will try to find time to write up soon.