Metadata Discussion Group: data nerdery in the service of resource discovery
Indiana University Bloomington Libraries


Practical introduction to microdata and Schema.org

Posted on March 11, 2014 by Jennifer A. Liss

At the upcoming March 25 meeting, the group will explore what it means to do business at web scale. This is the first post in a two-part series on making metadata scalable for the web.

Perhaps you’ve heard of SEO, or search engine optimization. Once meant to refer to strategies for making websites more discoverable to search engines, SEO has evolved into a business sector in its own right. SEO companies sprang up to help businesses “game” search engine algorithms in order to make those businesses appear at the top of search result lists. Years of increasing attention to SEO seem to have driven search engines like Google, Yahoo, and Bing to find better ways of leveraging web content to deliver relevant results to searchers. It’s not hard to imagine a future in which it isn’t enough to populate webpages with descriptive metadata about the content, authorship, and characteristics of that webpage. Doing business on the web is beginning to mean that organizations must mark up webpage content in a semantically meaningful and machine-processable way. This post introduces microdata and Schema.org as a way of telling machines the meaning of text.

HTML meta

Before elaborating on what microdata is, let’s back up and talk about how HTML has conveyed metadata in the past. HTML documents consist of two areas, the head element (HTML tag: <head>) and the body element (HTML tag: <body>). The body element is where you put all of the content you want people to see. The text you’re reading right now resides in the <body> element of this HTML page. HTML body elements include tags for demarcating headings, paragraphs, lists, etc. In other words, HTML marks up syntactic or structural information in a block of text. Without the structure provided by HTML tags, text would display in browsers as one long continuous clump without line breaks, white space, or font variation.

Though not typically displayed to users, the HTML head element provides information about the webpage, such as the type of content and character set encoding (e.g., text/html, UTF-8), the website title (which is visible at the top of the browser window or tab), and sometimes the website author, description, and keywords. Apart from the title, which has its own <title> tag, these website characteristics appear inside <meta> tags (meta is short for metadata). The content of the <meta name="description"> tag is most often used by the search engines Yahoo and Bing for search result display. Yahoo and Bing retrieved the search result snippets shown in Figure 1 from the quoted search “krups ea9000 barista automatic espresso machine black stainless.” For comparison’s sake, I’ve selected the search result for the product as it appears on Zappos.com.

FIGURE 1. Bing and Yahoo search result snippets for a product on Zappos.com

If I look at the HTML source code for the product webpage[1], I can see the full text of the meta description element, as it appears within the HTML head (Figure 2).

FIGURE 2. Meta description tag for a product on Zappos.com
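
For readers who would rather see this in code than in a screenshot, here is a minimal sketch of how a program might read the meta description out of a page's head. It uses Python's standard html.parser module. The sample page, its description text, and the names MetaCollector and read_meta are all made up for illustration; this is not the Zappos markup, and it is certainly not how Bing or Yahoo actually do it.

```python
from html.parser import HTMLParser

# Hypothetical page: the description text is invented for illustration,
# not copied from Zappos.com.
SAMPLE_HTML = """<html>
<head>
<meta charset="UTF-8">
<title>Example Espresso Machine</title>
<meta name="description" content="An example automatic espresso machine in black and stainless steel.">
<meta name="keywords" content="espresso, coffee">
</head>
<body><p>Visible page content goes here.</p></body>
</html>"""


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Skip <meta> tags without a name, such as <meta charset="UTF-8">.
        if name and "content" in attributes:
            self.meta[name.lower()] = attributes["content"]


def read_meta(html_text):
    """Return a dict of the named <meta> values found in html_text."""
    parser = MetaCollector()
    parser.feed(html_text)
    return parser.meta


meta = read_meta(SAMPLE_HTML)
print(meta["description"])
```

Notice that the program gets the whole description back as one string. Nothing in the markup says which part of it is most worth showing to a searcher.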

Bing and Yahoo chose the same portion of the text included in the meta description element. Why did both search engines display this particular section of the description text? Only by looking at their proprietary algorithms could we attempt to find a reason.

Google also retrieves the Zappos page for this search; however, Google displays what they call a “rich snippet” (Figure 3). Google’s snippet includes some of the text from the meta description element, but it includes other text as well. You’ll notice that the terms I searched for appear in bold text. Google pulled text not only from the <meta> tag in the <head> of the HTML document but also from the <body> of the webpage, where my search terms appear.

FIGURE 3. Google search result rich snippet for a product on Zappos.com

Google also displayed the list and sale price of the product, probably because someone at Google decided that such information is useful to searchers. How did Google know that those numbers were prices and not the number 9000 from the EA9000 model number or the number 23 from the product weight information? Because the prices on the Zappos webpage were encoded in microdata.

Microdata and Schema.org

FIGURE 4. Zappos.com product price

In the web context[2], microdata is an HTML specification for embedding semantically meaningful markup chiefly within the HTML body. Microdata isn’t the same thing as metadata, as microdata isn’t restricted to conveying only information about the creation of the text. Microdata becomes part of the web document itself and serves somewhat like an annotation within the HTML body text. Microdata tells machines something more about the meaning of the text. On the Zappos product page, we see a nice display of the list price and sale price in the upper right-hand corner of the webpage (Figure 4). Search engine web crawlers mining the same text in the HTML file see that the text “$2,499.99” is tagged with the Schema.org price property (Figure 5). Ah, so now we’ve come to it: how are microdata and Schema.org related? Basically, microdata is an HTML specification that allows for the expression of vocabularies, such as Schema.org, within a webpage[3]. Just as XML provides syntax for expressing TEI or EAD or MODS, microdata provides syntax for expressing Schema.org. (RDFa is not a vocabulary but an alternative syntax that can carry Schema.org in the same way.)

FIGURE 5. Zappos.com product pricing marked up with Schema.org properties
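
To make the idea concrete, here is a rough sketch of what a crawler gains from microdata, again using Python's standard html.parser module. The HTML fragment is my own simplified guess at what Schema.org product markup could look like; it is not the actual Zappos source, and ItempropCollector and extract_itemprops are hypothetical names. The itemscope, itemtype, and itemprop attributes come from the microdata specification, and Product, Offer, name, offers, and price are Schema.org terms.

```python
from html.parser import HTMLParser

# Hypothetical fragment: a simplified guess at Schema.org product markup,
# not the actual Zappos.com source.
SAMPLE_HTML = """<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Example Espresso Machine EA9000</h1>
  <p>Weight: 23 lbs</p>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">$2,499.99</span>
  </div>
</div>"""


class ItempropCollector(HTMLParser):
    """Collect (property, text) pairs from elements carrying itemprop.

    Simplification: elements that open a nested item (itemscope) are
    skipped, and a property element is assumed not to contain another
    element with the same tag name.
    """

    def __init__(self):
        super().__init__()
        self.properties = []
        self._current = None  # name of the property being captured
        self._tag = None      # tag that opened the current property
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_leaf = "itemprop" in attributes and "itemscope" not in attributes
        if is_leaf and self._current is None:
            self._current = attributes["itemprop"]
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._tag:
            text = "".join(self._text).strip()
            self.properties.append((self._current, text))
            self._current = None


def extract_itemprops(html_text):
    """Return the list of (property, text) pairs found in html_text."""
    parser = ItempropCollector()
    parser.feed(html_text)
    return parser.properties


for name, value in extract_itemprops(SAMPLE_HTML):
    print(name, "=", value)
```

The weight paragraph has no itemprop attribute, so the 23 never shows up in the results, while the price arrives already labeled as a price. That labeling is the whole trick.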

I won’t go into the history of Schema.org (I touched upon it in past posts and this post has gotten quite a bit longer than I intended!); however, it’s worth noting that the espresso machine example I’ve given above is limited, as Zappos hasn’t deployed Schema.org as extensively on their website as other companies have.

Try searching Google for movie times for a specific theater in Bloomington. At the very top of the search result list you should find a structured display of movies, runtimes, MPAA ratings, and showtimes, with links to trailers. How does this work? With Schema.org.

Welcome to the semantic web.

In the next post of this two-part series, Rachel Wheeler will look at how libraries and library discovery layers are using Schema.org to expose resources.

References

Bradley, A. (2013, November 5). Basic vocabulary for schema.org and structured data. SEO Skeptic. Retrieved from http://www.seoskeptic.com/basic-vocabulary-for-schema-org-and-structured-data/


[1] For instructions on how to view HTML source code in Internet Explorer, Chrome, Firefox, or Safari, see http://www.wikihow.com/View-Source-Code.

[2] The statistical community also uses the term “microdata” to describe individual response data in surveys and censuses. That’s a completely different beast!

[3] I would have spent hours trying to figure out the distinction between microdata, microformats, schema.org, etc. if not for an incredibly thorough description by Aaron Bradley, former cataloger turned web consultant.

 

Posted in: Resources | Tags: microdata, schema.org, search engines, web metadata

Next meeting: Revealing Shareable Metadata for Complex Research Objects

Posted on February 11, 2014 by Jennifer A. Liss

This month, we’ll be thinking about large-scale metadata sharing. Join us for a discussion about exposing metadata for complex research objects, such as datasets, to metadata harvesters. All are welcome to participate!

DATE: Tuesday, February 18
TIME: 9:30-10:30am
PLACE: Wells Library Room 043
TOPIC: Indecent Exposure: Revealing Shareable Metadata for Complex Research Objects
MODERATOR: Julie Hardesty

RESOURCES YOU MIGHT CONSULT

Opening up data as “data packets” on servers:

  • Pollock, R. (2013, July 2). Git (and Github) for data. Open Knowledge Foundation Blog. Retrieved from http://blog.okfn.org/2013/07/02/git-and-github-for-data/
  • Gandrud, C. (2012, June 13). Data on GitHub: The easy way to make your data available. R-bloggers. Retrieved from http://www.r-bloggers.com/data-on-github-the-easy-way-to-make-your-data-available/
  • Indiana University Libraries. (2013, October 17). Tei_text: Free-for-all repository of TEI and plain text files for you (to do cool stuff) provided by the Digital Collections Services Group at the Indiana University Libraries under the CC BY-NC 3.0 License. GitHub. Retrieved from https://github.com/iulibdcs/tei_text

Overview (and more!) on OAI-PMH:

  • Lagoze, C., Van de Sompel, H., Nelson, M. & Warner, S. (2002, June 14). Open Archives Initiative – Protocol for Metadata Harvesting – V.2.0. Retrieved from http://www.openarchives.org/OAI/openarchivesprotocol.html
    • NOTE: for the 20 minute version, read 1. Introduction and 2. Definitions and Concepts

Examples of library web service APIs:

  • United States National Library of Medicine. (n.d.). APIs. Retrieved from http://www.nlm.nih.gov/api/
    • NOTE: gives a super-brief explanation of what APIs are; the Voyager ILS supports an app that allows for searching NLM MARC records via the NLM API
  • Digital Public Library of America. (n.d.). API. Retrieved from http://dp.la/info/developers/codex/

What microdata is and why it’s important:

  • Dean, R. & Green, J. (2013, July 11). What to do when Google ignores your Fedora objects [PDF document]. Retrieved from http://or2013.net/sessions/what-do-when-google-ignores-your-fedora-objects

Posted on February 11, 2014 / February 19, 2014 by Jennifer A. Liss | Posted in: Housekeeping | Tags: search engines, shareable metadata, web metadata

Summary of MDG Session, 9-17-09

Posted on September 24, 2009 by Jennifer A. Liss

* Written by Jenn Riley *

Article discussed: Schaffner, Jennifer. (May 2009). “The Metadata is the Interface: Better Description for Better Discovery of Archives and Special Collections.” Report produced by OCLC Research. Published online at: http://www.oclc.org/programs/publications/reports/2009-06.pdf.

An online, user editable resource list accompanying this report can be found at https://oclcresearch.webjunction.org/mdidresourcelist.

Posted on September 24, 2009 / January 12, 2012 by Jennifer A. Liss | Posted in: Meeting Notes | Tags: archives metadata, finding aids, search engines, special collections metadata
