The semantics of citizen science

The Web is a source of rich and varied content. If that content, or at least the important data within it, were accessible not only to humans but also to computers, it could be filtered, grouped, analyzed, and served to citizens in ways that humans would be hard-pressed or unable to achieve themselves. The Semantic Web aims to transform the World Wide Web into exactly that sort of system, an idea full of both technical possibilities and market opportunities (Bonner, 2002).


Like its predecessor, the Semantic Web rides on the Internet backbone and uses HTTP’s simple methods (GET, POST, and PUT) to transfer resources located and addressed through uniform resource identifiers (URIs). But whereas the content presented on the World Wide Web is designed almost entirely with people in mind, the content of the Semantic Web is intended to be optimally accessible and comprehensible to automated software agents. The simple change of augmenting human-readable content with machine-comprehensible content (or producing pages or even entire sites of purely machine-oriented content) could have a huge effect.

Consider the seemingly simple task of determining a company’s mailing address from the “Contact us” or “About us” page of its website. A human performing that task would navigate to the page and scan it until she located the address. A software program, lacking the extraordinary pattern recognition skills that make the task child’s play for a human, would have a harder time. Unless the address is preceded by a specific label such as “Mailing Address Starts Here:” and followed by “Mailing Address Ends Here”, and the program is coded accordingly, it would be very difficult to come up with an algorithm that could correctly parse an unknown address out of the stream of ASCII characters that make up a Web page. A Web page author could make the task a lot easier by embedding the address in a block of XML code:


<address>
  <street> 52 Alzina Street </street>
  <city> Barcelona </city>
  <province> Barcelona </province>
  <postal_code> 08024 </postal_code>
</address>


Such coding is more machine-readable than straight HTML, because a program could scan the page’s code for an XML-encoded address and then parse the XML to extract the address. But that still presupposes that the program doing the scanning knows exactly what definition of an address object will appear on the page. What if the company is headquartered in New York City rather than Barcelona? The page could use a different definition of an address object, substituting zip for postal code or state for province. Or what if the page contained several address objects, some pointing to post office boxes and some to various mailing addresses within the company? Any software program trying to determine the address of the company’s headquarters is likely to fail without more guidance.
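The brittleness described above can be illustrated with a minimal Python sketch. It assumes the address block is wrapped in a hypothetical <address> root element and that the postal code element is written as postal_code so the XML is well-formed; the New York sample data and the function name extract_address are invented for illustration only.

```python
import xml.etree.ElementTree as ET

# Address block modeled on the example in the text. The <address> wrapper and
# the underscore in <postal_code> are assumptions needed for well-formed XML.
BARCELONA = """
<address>
  <street> 52 Alzina Street </street>
  <city> Barcelona </city>
  <province> Barcelona </province>
  <postal_code> 08024 </postal_code>
</address>
"""

# Hypothetical variant using US-style element names (made-up sample data).
NEW_YORK = """
<address>
  <street> 1 Example Avenue </street>
  <city> New York </city>
  <state> NY </state>
  <zip> 10001 </zip>
</address>
"""

# The only address definition this program knows about.
EXPECTED_FIELDS = ("street", "city", "province", "postal_code")


def extract_address(xml_text):
    """Return the expected fields as a dict; a missing field maps to None."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for field in EXPECTED_FIELDS:
        value = root.findtext(field)  # None when the element is absent
        result[field] = value.strip() if value is not None else None
    return result


barcelona = extract_address(BARCELONA)
new_york = extract_address(NEW_YORK)
print(barcelona)
print(new_york)
```

The first call recovers every field, while the second returns None for province and postal_code even though the page clearly contains a state and a zip code. The program is correct only for the one address definition it was written against.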

The Semantic Web aims to provide that guidance in the form of encoded metadata that provide a context for Web-based data. Its goal is to turn the Web into a vast, decentralized, machine-readable database. The owners of data will be able to determine who has access, using standard HTTP access control methods. And as a data consumer, a citizen will be able to indicate the sources whose data she trusts. Even with limits imposed by both sides, any given Semantic Web application will have access to a vast range of data sources.
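One way to picture such metadata is as a mapping from each site's local element names to identifiers in a shared vocabulary. The Python sketch below is only an illustration of that idea under stated assumptions: the http://example.org/vocab# namespace, the two mapping dictionaries, the New York sample data, and the function name normalize are all hypothetical. Real Semantic Web deployments typically express this kind of information with RDF vocabularies rather than Python dictionaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for shared concepts; not a real published vocabulary.
VOCAB = "http://example.org/vocab#"

# Hypothetical per-site metadata: local element name -> shared concept URI.
EUROPEAN_MAPPING = {
    "street": VOCAB + "streetAddress",
    "city": VOCAB + "locality",
    "province": VOCAB + "region",
    "postal_code": VOCAB + "postalCode",
}
US_MAPPING = {
    "street": VOCAB + "streetAddress",
    "city": VOCAB + "locality",
    "state": VOCAB + "region",
    "zip": VOCAB + "postalCode",
}

BARCELONA = """
<address>
  <street> 52 Alzina Street </street>
  <city> Barcelona </city>
  <province> Barcelona </province>
  <postal_code> 08024 </postal_code>
</address>
"""

# Made-up sample data using US-style element names.
NEW_YORK = """
<address>
  <street> 1 Example Avenue </street>
  <city> New York </city>
  <state> NY </state>
  <zip> 10001 </zip>
</address>
"""


def normalize(xml_text, mapping):
    """Re-key an address block by shared concept URI using the site's mapping."""
    root = ET.fromstring(xml_text.strip())
    return {
        mapping[child.tag]: (child.text or "").strip()
        for child in root
        if child.tag in mapping  # ignore elements the metadata does not describe
    }


barcelona = normalize(BARCELONA, EUROPEAN_MAPPING)
new_york = normalize(NEW_YORK, US_MAPPING)
print(barcelona[VOCAB + "postalCode"], new_york[VOCAB + "postalCode"])
```

Once both pages are re-keyed by the shared identifiers, a consuming program can ask for the postal code or region without knowing which local names each site chose.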

The possibilities for Semantic-Web–based software are nearly boundless: complex business-to-business (B2B) or business-to-consumer (B2C) transactions with no human intervention; aggregation, amalgamation and mining of research and historical data in ways beyond human capability; transparent, on-the-fly assembly, instantiation, and linking of distributed virtual applications; and intelligent sifting through data from millions of connected computers.


Bonner, P. (2002). The Semantic Web. PC Magazine, July 1, 2002.