The best-designed website in the world is no good if no one finds it, right? That’s what we mean by findability. “You can’t like what you can’t find,” said the wise man. So much for the “why bother with semantic HTML”: findability.

As to the “How” of achieving findability: Search Engine Optimisation. Hopefully the good kind, known as “white hat SEO”, which is based on a sound understanding of what the bots want technically – good markup – and strategically – as many inbound links to your website’s URLs from relevant websites as possible.

Why the How?

Why do search engine crawler bots care about semantic markup and good inbound linking? Because these bots are designed to do good: to serve their human audience by matching the expectations of someone typing a keyword into a search field.

Imagine that: a computer – which in essence is just a great calculating machine – tries to model the human way of assessing the relative value of a given piece of information against a set of human-entered keywords. To do that, it runs a very complex equation that weights the value and interaction of several factors.
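Very roughly – and this is a made-up sketch, the real factors and weights are the engines’ best-kept secret – you can picture that equation as:

score(page, query) = w1 × content_relevance(page, query) + w2 × inbound_link_authority(page) + …

where each w is a weight the engine tunes over time. The two points below map onto those two factors.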

  1. Semantic HTML markup, whose semantics reflect the structure of an argument, helps the calculator tell what the content published at a specific URL is about (see the sketch after this list).
  2. A great number of inbound links from websites related to your website’s area of expertise – sharing its terminology – is the bot’s way of assessing “peer recognition”. I picture the Googlebot thinking:
    Hey, if this guy on site 1, which is about insects, says that this external URL on site 2 is a “great collection of pictures of red ants”, then it confirms my assumption that site 2 is indeed about ants (which are insects). If a human entomologist values site 2, I, robot, should too!
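To make point 1 concrete, here is a minimal sketch (the ant-themed content is made up for illustration) of the same snippet with and without semantic markup:

<!-- Div soup: the bot only sees anonymous boxes -->
<div class="big">Red ants</div>
<div>Red ants live in large colonies.</div>

<!-- Semantic markup: headings and paragraphs spell out the argument's structure -->
<h1>Red ants</h1>
<p>Red ants live in large colonies.</p>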

This article addresses the first point, semantic markup, and shows the latest effort to go beyond basic HTML and give it more semantic weight.

Schema.org

You can only express so much with HTML: the structure and hierarchy of a document. Human readers can infer from context that “avatar” on this page refers to a blockbuster lame-ass movie, while on this other page it is actually about an iconic representation of a person’s online presence. But search bots can’t. Not that they are stupid (they are, but aren’t we all?); they simply lack the contextual information that would let them pick the right sense of a term and point searchers accordingly.

There have been lots of different structured data formats over the past few years that promised to provide a standardized way of dealing with data. RDFa, microformats and microdata all had some level of success, but to be really successful we need a single vocabulary – a single markup language – to use in our websites.
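For a taste of what those earlier formats look like, here is a minimal hCard microformat sketch (the person’s details are reused from the movie example below): the data is conveyed through reserved class names such as vcard, fn and role:

<div class="vcard">
  <span class="fn">James Cameron</span>, <span class="role">film director</span>
</div>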

Google, Bing and Yahoo! jointly created the schema.org website for that purpose. Schema.org provides a collection of schemas – vocabularies expressed through simple HTML attributes – that webmasters can use to add additional information about their web pages. There are schemas for people, places, events, recipes, books, movies and much more.

Basically, it’s a bunch of custom attributes that let you enrich the markup and give it context. It’s as if you were answering the bot: “Hey, this DIV’s content refers to a Movie, dude!”.

Example:


<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954</span>)
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>
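With those attributes in place, a crawler no longer has to guess: it can extract a small fact sheet stating that this page describes a Movie named “Avatar”, directed by a Person named James Cameron (born August 16, 1954), of genre science fiction, with a trailer at the linked URL.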

I find it way cool, but it’s just too bad it can render your page invalid (custom attributes like itemscope and itemprop don’t validate against pre-HTML5 doctypes). Also, there are some redundancies: shouldn’t this be included in the core HTML reference? I mean, <article> looks a lot like itemtype to me, and the way you can now have several <section><header><h1>…</h1></header></section> structures in the same file is kind of a way of doing that microdata thing.
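A quick sketch of what I mean (content made up for illustration): HTML5 lets every section carry its own header and h1, each scoped to its own section, a bit like nested items:

<article>
  <header><h1>Insects of the garden</h1></header>
  <section>
    <header><h1>Red ants</h1></header>
    <p>Red ants live in large colonies.</p>
  </section>
  <section>
    <header><h1>Black ants</h1></header>
    <p>Black ants are common in Europe.</p>
  </section>
</article>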

My bet is that this is another step towards search engines modelling the human mind a little better, but the destination ain’t reached yet. Still, good to know, always good to know.