Designing Short URLs (for Last.fm)

By David Singleton

I wrote this document in 2009 while working at Last.fm. It shows my approach to designing a short URL system for Last.fm pages.

I've published this with the hope that, despite some Last.fm specific bits, it's useful for anyone else embarking on the design of a short URL system.

Aims

Research

Existing URL shortening services

There are dozens of URL shortening services, they generally map a URL to an incremental key. Length is then dependent on the popularity of the service and the time the URL was for, key encoding/compression is far less important.

Last.fm has a 15 char fixed length (protocol = 7, domain = 7, / = 1), this is actually very competitive with existing services. Only one character shorter than bit.ly, the most popular shorterner, and shorter than TinyURL. Our challenge is keeping the key short, and a mechanism to identify resource types and simple request routing.

Our unique IDs

While URL shortening services generally map a URL to an incremental key, we can map directly to a resource with a restype/id combination. Assuming we shorten the key by converting the ID to a higher base then our max key length is directly related to the highest resource ids. For our most common (and largest) resource types these are:

Redacted: Specific id-ranges from Last.fm tables removed

The track table clearly (and unsurprisingly) has the largest ID. Assuming a max resource ID of 1,000,000,000 makes a very future-proof upper bound for estimating max key length.

We have an External Link class that was designed to solve some of the problems we're looking at here. It would be preferable to use existing technology if possible, however there some serious limitations to this implementation:

This service was designed with a small set of bespoke link shortenings and would not scale to map our entire catalogue. It does have the advantage of being generated on demand, this means that popular artists, tracks and albums would be much more likely to have shorter keys. However if we can archive good key compression this is not as important.

Short URL Formats

Routing

Our main technical requirement is to easily identify and route short URL requests. They should also be distinct from any existing or reasonable future URLs to avoid confusion and to prevent blocking paths we may wish to use later.

Existing patterns include;

Resource Type Hinting

If we plan to map a resource ID to a key we also need to include a resource mapping in the URL. This also gives us a chance to add some meaning/type-hinting to the user.

ID/Key Encoding

The simplest way reduce the length of a resource ID is to convert it to a higher base, with a larger set of characters to represent them.

Base / ID 10^2 10^3 10^4 10^5 10^6 10^7 10^8 10^9
26 (a-z) 3m 1cc ekg 5ho4 24n7e lmona 8alepm 3647joc
32 (Base32) 34 v8 9og 31l0 ugi0 9h5k0 2vbo80 tplig0
52 (a-zA-Z) 1M jc 3Ag AP4 75GE 1j6bA dzacM 2wDOpc
55 (readable) 3Q mc 5jQ C5c 82AQ 377Nc cW4RQ 3Zhxxc
62 (0-9a-zA-Z) 1C g8 2Bi q0U 4c92 FXsk 6LAze 15FTGg
64 (Base64) 1A fE 2sg oqw 3Q90 C9q0 5Zu40 XCIE0

As the base increases length decreases but so does readability. A wider range of characters means a greater chance of easily confused characters (I and l or O and 0, etc), particuarly in print.

Interestingly the length of the encoded ouput does not vary significantly between the smallest and largest encoding. Inly a character or two difference in most cases. This suggests it's not worth expensing readbility for shortness, which will often not even be noticable.

We can dimiss Base64, as it contains the + and / characters, which requires URL encoding and makes it unworkable in print.

Taking the 62 character set (0-9a-zA-Z) and removing potentially confusing characters (0, 1, i, l, o, L, O) gives us the 55 character set that is quite readable in print, with minimal increase in encoded output length.

New Base 60

While writing this document the Last.fm office had a visit from Tantek who gave a talk on some of his recent projects including NewBase60, "A base 60 numbering system using only ASCII numbers and letters."

NewBase60 was the result of Tanteks own "algorithmically reversible shortener", with similar goals in terms of readability. Rather than removing the full set of potentially confusable characters it maps a group of confusable characters (1, i, l) to a single character from that group.

This reduces the characters lost but maintains readability as any wrongly transcribed characters will still be correct.

Unfortunately, this only brings it to 59 available encoding characters and to make an even 60 an underscore character is included in the set. While having a round number is appealing, I think I would be happier staying at 59 and excluding underscore in the interests of readability and retaining underscore as a separator or other use in URLs.

Conclusions

Suggested Format

http://last.fm/    +        t          d4Fa

(Root Domain)   (route)   (type)   (encoded id)

There are some other points to consider, notably auto-generated profanity, which I will try to find time to write up soon.