
rev="canonical" and extra burden on services

With the debate about rev="canonical" being the next big thing in the land of Twitter and URL-shortening services, I wanted to throw in two extra things to consider:


How can we trust the rev="canonical" URL? Whose burden is it to prove that these URLs are correct? What should be done with misconfigured rev="canonical" targets?

The proposal states that the short URL should return a 301 redirect, but this means three things for a service like Twitter to check:

1. Fetch the original URL and parse its HTML.
2. Fetch the short URL found there and check that it returns a 301 redirect.
3. (Optional) If the 301 redirect is missing, or the response is some other type of 3xx, does the service follow it and check that it leads back to the original URL?
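The three checks above can be sketched roughly as follows. This is only an illustration, not anything Twitter or the proposal prescribes: `fetch` is a hypothetical callable supplied by the caller that returns `(status, headers, body)` for a URL without following redirects, and the example URLs are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RevCanonicalFinder(HTMLParser):
    """Collects href values of <link rev="canonical"> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rev may hold several space-separated tokens
        rev_tokens = (attr.get("rev") or "").lower().split()
        if "canonical" in rev_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def find_short_url(page_url, html):
    """Return the first advertised short URL, resolved against page_url."""
    finder = RevCanonicalFinder()
    finder.feed(html)
    return urljoin(page_url, finder.hrefs[0]) if finder.hrefs else None


def verify_short_url(page_url, fetch):
    """fetch(url) -> (status, headers_dict, body_text); hypothetical helper
    that must NOT follow redirects on its own."""
    # Step 1: fetch the original page and parse its HTML.
    status, _headers, body = fetch(page_url)
    if status != 200:
        return None
    short_url = find_short_url(page_url, body)
    if short_url is None or short_url == page_url:
        return None
    # Step 2: the short URL must answer with a 301 pointing back at the page.
    status, headers, _body = fetch(short_url)
    location = headers.get("Location")
    if status == 301 and location and urljoin(short_url, location) == page_url:
        return short_url
    # Step 3 (optional) is left out: any other answer is treated as unverified.
    return None
```

Even in this stripped-down form, every shortened link costs two HTTP requests and one HTML parse before the service can trust it.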

What to do when rev="canonical" points to the same URL that was just parsed, as ArsTechnica appeared to do? Do we say fine, let's use that long URL, or do we then fall back to a third-party URL shortener? (Marko points out that they're correctly using rel=, not rev=.)

What do you do when you can't resolve the domain, or something else goes wrong in our oh-so-stable interwebs? Does the HTML need to be valid, or do we just use a regular expression to find the rev="canonical" part?
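One possible answer to both questions, sketched under assumptions: scan the page with regular expressions so that invalid HTML doesn't matter, and fall back to a third-party shortener whenever the fetch fails or the page only points at itself. The names `fetch_html` and `third_party_shorten` are hypothetical callables the service would supply; nothing here is a real shortener API.

```python
import re

# Matches any <link ...> tag, valid document or not.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
# Matches name="value", name='value' or name=value inside a tag.
ATTR = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")


def rev_canonical_from_text(html):
    """Regex scan: survives tag soup, but can be fooled by <link> text
    sitting inside comments or scripts."""
    for tag in LINK_TAG.findall(html):
        attrs = {}
        for name, dq, sq, bare in ATTR.findall(tag):
            attrs[name.lower()] = dq or sq or bare
        if "canonical" in attrs.get("rev", "").lower().split() and attrs.get("href"):
            return attrs["href"]
    return None


def pick_short_url(page_url, fetch_html, third_party_shorten):
    """fetch_html(url) -> str and third_party_shorten(url) -> str are
    hypothetical callables supplied by the service."""
    try:
        html = fetch_html(page_url)
    except OSError:
        # DNS failures, timeouts and refused connections are OSError
        # subclasses in Python (urllib.error.URLError is one too).
        return third_party_shorten(page_url)
    found = rev_canonical_from_text(html)
    if found is None or found == page_url:
        # Nothing advertised, or the page just points at itself.
        return third_party_shorten(page_url)
    return found
```

The trade-off is that the regular expression is cheap and tolerant but not a real parser, so a service would still have to decide how much it trusts whatever it finds.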

The second question is: do we really expect services to accept this extra burden?

Off-loading tiny URL generation to a third-party service like bit.ly gives you a URL, but doesn't guarantee that there's anything behind it. You can easily shorten http://foo.foo into a bit.ly link.

This means that an operation that once took a single call to bit.ly now takes at least two more HTTP requests plus HTML parsing, and correspondingly more CPU and network resources, as pages need to be fetched, parsed and checked for validity. While this might be feasible for smaller services, I highly doubt Twitter wants to implement this any time soon.

Any alternatives?

There might be a cheat that Twitter and other services could use. If we're so afraid that we'll lose the links, it seems that they should be kept in a database under the control of the service.

This doesn't fully solve the problem of long-term URL maintenance, but at least the links are under the control of the same provider that stores the original context (e.g. tweets), enabling it to give you nice exports and faster expansion, together with one less (perceived) liability.
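As a rough illustration of such a service-owned link table, here is a minimal sketch using SQLite. The `LinkStore` class, its schema and the base-36 keys are all assumptions made for the example, not a description of how Twitter or any shortener actually stores links.

```python
import sqlite3
import string

ALPHABET = string.digits + string.ascii_lowercase  # base 36


def encode(n):
    """Turn a positive row id into a short base-36 key."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


class LinkStore:
    """Hypothetical in-house link table kept next to the posts themselves."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS links ("
            "id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL)"
        )

    def shorten(self, url):
        # Re-submitting a known URL returns the existing key.
        self.db.execute("INSERT OR IGNORE INTO links (url) VALUES (?)", (url,))
        self.db.commit()
        (row_id,) = self.db.execute(
            "SELECT id FROM links WHERE url = ?", (url,)
        ).fetchone()
        return encode(row_id)

    def expand(self, key):
        # Expansion is a local lookup, no third-party round trip.
        row = self.db.execute(
            "SELECT url FROM links WHERE id = ?", (int(key, 36),)
        ).fetchone()
        return row[0] if row else None

    def export(self):
        # The "nice export": every key with its original long URL.
        rows = self.db.execute("SELECT id, url FROM links ORDER BY id")
        return [(encode(i), u) for i, u in rows]
```

Note that, just like bit.ly, this happily stores http://foo.foo without checking that anything lives behind it; the only difference is who holds the data.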


2 responses to “rev="canonical" and extra burden on services”

  1. ArsTechnica has reL=canonical (not reV) that points to the page itself, which it should if this is the canonical URL. It also has rev="alternate short_url", which means the same as rev=canonical. So what they do is quite right.

    There is also a proposal for how to provide this information through HTTP HEAD requests, which would further alleviate a problem that really doesn't exist.

    What I don't understand in all this madness is why everyone is trying to solve a problem that doesn't exist, and when Twitter became a necessary part of Internet infrastructure.

    I should probably write my own post about this…

  2. Thanks Marko, I've updated the blog entry.

    While Twitter isn't part of the infrastructure, I believe that we're worried about historical data in proprietary silos.
