Summary of MDG Session, 10-16-08

* Written by Jenn Riley *

Article discussed: Eklund, Janice. (2007) “Herding Cats: CCO, XML, and the VRA Core.” VRA Bulletin 34, no. 1: 45-68.

The Discussion Group began by picking up a theme from the first meeting of the semester: effective use of terminology in writing about metadata. This article did a good job using new terms consistently and frequently, although terms from VRA Core 3 were occasionally applied to a discussion of Core 4. The discussion of consistency then expanded to consistency in metadata itself. Consistency is very useful when one is combining metadata from multiple sources, and content standards like CCO can go a long way towards promoting this consistency.

The mention of CCO sparked a lively conversation about the way the word “standards” is tossed about in metadata circles. Is CCO a standard or not? CCO and VRA Core are not in total agreement, so what does it mean if both are standards we should follow? One can track why the difference exists: CCO has a broader scope than VRA Core, extending to museums. CCO is a standard in the way AACR2 is a standard, but not in the way MARC is a standard. AACR2 is learned more by practice than by reading the book. CCO is still evolving, and takes time to learn and implement. It’s more a guide to best practice than AACR2 is. CCO is principle-based, as RDA is intended to be, because it needs to be applicable to many communities.

The next topic of discussion was whether or not VRA Core is really “core.” Its greater coverage for works of art than Dublin Core certainly speaks to it being a domain-specific “core.” The group was less sure if it represented an “exhaustive core.” Tracking VRA Core’s history could be instructive in this analysis – the evolution from Core 2 to Core 3 to Core 4 shows some stabilization, so this could be evidence that they’ve achieved an agreed-upon core. The only really new thing in Core 4 is the collection root element (in addition to work and image).

The linking capability of VRA Core was singled out as an especially effective part of the format, encouraging the use of identifiers, rather than text strings, to track relationships. The visual resources community does not have the infrastructure for collaborative development and sharing of authority records that the library community does, so the process of record linking is currently more manual in the VR environment than in the library/MARC community. But significant progress is being made. The community needs to build good systems and cooperate between institutions. It also needs to expand the notion of authority control, to allow for more variety in name references, for example.
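The identifier-based linking idea can be made concrete with a short sketch. This is a toy Python data structure, not actual VRA Core markup; the record IDs and titles are invented for illustration.

```python
# Toy linked records: the relationship between a work and its images is
# carried by identifiers, not by repeating descriptive text strings.
# All record IDs and titles here are invented for illustration.
records = {
    "w_001": {"type": "work", "title": "David", "relations": ["i_042"]},
    "i_042": {"type": "image", "title": "David, front view",
              "relations": ["w_001"]},
}

def related_titles(rec_id):
    """Follow identifier links; no fragile matching on title strings."""
    return [records[r]["title"] for r in records[rec_id]["relations"]]
```

Because the link is an identifier, either record's descriptive text can change without breaking the relationship.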

Efforts such as CONA (Cultural Objects Name Authority, forthcoming from the Getty) and the Society of Architectural Historians Architectural Visual Resources Network are helping to build the needed infrastructure. More cooperation overall is needed – the VR community and library community are both starting to realize that each of us having our own copies of records isn’t sustainable. Formats like VRA Core can promote fuller record sharing.

Using separate fields for display and indexing was another feature of VRA Core of interest to the discussion group. It was noted that this practice allows a great deal of flexibility but also requires twice as much work. To decide when it is necessary, one must consider how the information will be used: For search or display? In current systems in addition to future ones? How easy will it be to upgrade systems? It’s more important to include both for data elements that represent key features of the work or medium, for example, cultural context.
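The display-versus-indexing trade-off can be sketched in a few lines. This is a hypothetical record structure in Python, not actual VRA Core XML; the element names and values are invented for illustration.

```python
# One element, two forms: a human-readable display value and a
# machine-actionable index value (hypothetical structure for illustration).
record = {
    "date": {
        "display": "ca. 1510-1515",                   # shown to users as-is
        "index": {"earliest": 1510, "latest": 1515},  # used for searching
    },
    "culturalContext": {
        "display": "Italian (Florentine)",
        "index": ["Italian", "Florentine"],
    },
}

def date_matches(rec, year):
    """Search runs against the index form; the display form never needs parsing."""
    idx = rec["date"]["index"]
    return idx["earliest"] <= year <= idx["latest"]
```

The extra work of recording the index form pays off at search time: no system ever has to parse “ca. 1510-1515” to answer a date query.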

The discussion group noted that cultural objects cataloging could be a model for library catalogers looking to re-examine which aspects of their work require the attention of cataloging professionals. Cultural objects cataloging places a greater emphasis on analysis than on transcription, which is necessary because cultural objects in general don’t explain themselves. Interestingly enough, some visual resources units are “outsourcing” subject indexing to traditional catalogers. Many catalogers on both sides don’t feel competent to do subject indexing, which is inherently subjective – is something “about death”? It’s much easier to record form and style, what something is, than what it is about.

Summary of MDG session, 9-30-08

* Written by Jenn Riley *

Article discussed: Greenberg, Jane. (2005). “Understanding Metadata and Metadata Schemes.” Cataloging & Classification Quarterly 40, no. 3/4: 17-36.

The discussion began with a general question: Does the MODAL framework appear to be a useful way of evaluating metadata schemas? The group in general thought it was, although expressed concern that some of the language in the article was very academic, which sometimes made it difficult for practicing librarians to follow the argument.

Participants appreciated the fact that some metadata schemas, such as TEI (p. 28 of the article), have as a stated principle the conversion of resources to newer communication formats. This principle is of great benefit, and would be useful for other metadata schemas as well. Data formats will not stay static – our metadata must adapt its format over time to accommodate new ways of communicating.

Some participants noticed a contrast between the design of metadata schemas, based on experience and observation, and library cataloging rules, which are more formalized and change less frequently. This observation led to the question of whether cataloging rules should be more fluid. When the rules do change, the changes are based on experience. From an implementation point of view, it is difficult both for libraries and for our users if the rules are constantly changing. Our legacy data is a very real consideration here. So how do we remain flexible and adaptable while also staying consistent and keeping up with legacy data?

The MODAL framework spoke to participants as an analysis tool – helping evaluate the fitness of a given schema for a given purpose. This gets us away from saying a metadata format is “bad” – rather it lets us say that records using the Dublin Core Metadata Element Set are not well-fit to handle FRBRized data, for example.

The article’s methodology of bringing in Cutter’s objectives as an example of underlying objectives and principles sat well with the discussion group. One participant noted that not many current studies do this. These assumptions can help us focus our efforts. Follow-up work could compare Cutter’s objectives to different metadata formats.

Terminology issues were a hot topic of discussion at the session. Participants thought some kind of collaboratively-developed metadata glossary would be a good idea. They felt it was important for librarians interested in metadata issues to learn new vocabularies. We need to read more, ingest as much as possible, make connections to what we already do. “Cardinality” was an example of a term which was unfamiliar – it brings in the repeatable vs. not repeatable notion that is familiar, but also covers required/not required. Domains do have specialized vocabularies – they serve as “rites of passage” into various professions. Metadata schemes all have context that assumes a specific knowledge base – this article recognizes that. It would be nice if articles had glossaries, though.
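The two notions bundled into “cardinality” (repeatable or not, required or not) can be captured in a small sketch. The field names and constraints below are invented for illustration, not taken from any particular schema.

```python
# Hypothetical cardinality constraints: (min_occurs, max_occurs).
# min_occurs=0 means optional; max_occurs=None means repeatable without limit.
CARDINALITY = {
    "title":   (1, 1),     # required, not repeatable
    "creator": (0, None),  # optional, repeatable
    "subject": (1, None),  # required, repeatable
}

def validate(record):
    """Check a record (field name -> list of values) against the constraints."""
    errors = []
    for field, (lo, hi) in CARDINALITY.items():
        n = len(record.get(field, []))
        if n < lo:
            errors.append(f"{field}: needs at least {lo}, got {n}")
        if hi is not None and n > hi:
            errors.append(f"{field}: allows at most {hi}, got {n}")
    return errors
```

This shows why “repeatable vs. not repeatable” is only half of the concept: the same pair of numbers also expresses “required vs. not required.”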

Even with discussion, definitions of some terms did not establish a clear consensus. The term “granularity” was defined in the group as “refinement,” “the amount you want to analyze down to,” “extent of the description,” “specificity,” and “granular means you can slice in different ways.”

Participants appreciated the empirical focus of the article, saying that metadata schema design should be observation- and experiment-based. It’s certainly a good thing to have metadata be practical – actually useful. To help decide what metadata schema to use, try out a couple of schemas and see how they work, rather than thinking more abstractly. But community also needs to be considered as a factor. The MODAL framework is “multi-focal” – focusing first on one aspect, then moving to another. This helps implementers think, for example, about both the community and the data itself.

Participants noted two schools of thought for metadata design, a difference of orientation: a problem looking for a solution, as contrasted with a solution looking for a problem. Is there still room for cataloger judgment? Absolutely. Perhaps cataloger’s judgment is needed more in the application of a content standard than of a structure standard.

This distinction led participants to speculate whether the line between the two is blurring (although all recognized it has always been somewhat blurry). RDA especially seems to be trying to do both simultaneously. One participant noted that libraries seem to be moving to blur the two, while other communities are moving to separate them more.

Is terminology the only barrier to learning more about metadata? Some individuals learn better with theory and others with practice. All need a little of both. It really just takes time – remember what it was like to learn cataloging? Getting out of one’s comfort zone is difficult. It’s also difficult to be adventurous when there is less precedent to follow. It’s hard to learn many standards – one doesn’t always know which to use. When you have to learn lots of things, you learn each of them less well. We also have new objectives, including reaching new people and operating in additional systems. It would be helpful to identify models of other institutions where a technical services unit has made significant progress in these areas.

The group found Table 1, which outlines some typologies of metadata schemas, to be interesting. The lines between them seem arbitrary at worst and murky at best. Over time the thinking in this area has gone from 7 categories to 4 – does this mean our community is looking for simplicity? Does this mean this environment is settling down? Maybe, but initiatives such as the DCMI Abstract Model seem to be going the other direction.

The discussion moved relatively seamlessly from topic to topic, and featured a number of insightful comments, often from new participants. Both nitty-gritty and “big picture” issues were raised. Thanks to all who participated for an enlightening discussion.

Summary of MDG session, 5-27-08

* Written by Jenn Riley *

Article discussed: Hagedorn, Kat, Suzanne Chapman, and David Newman. (July/August 2007) “Enhancing search and browse using automated clustering of subject metadata.” D-Lib Magazine 13, no. 7/8.

The session began with a brief explanation of the methodology employed by this experiment and the OAI-PMH protocol, as these may not have been clear to those who don’t deal with this sort of technology on a regular basis. After this introduction, discussion moved to why the Michigan “high-level browse list,” rather than a more standard list, was chosen for grouping clusters. The group realized the value of a short, extremely general list for this purpose, and noted our own Libraries use a similar locally-developed list. Most standard library controlled vocabularies and classification schemes have far too many top terms to be effective for this sort of use. It was noted that choosing cluster labels, if not the high-level grouping, from a library standard controlled vocabulary would promote interoperability of this enhanced data.

The question of quality control then arose: the article described one person performing a quality check on the cluster labels – this must have been an enormous task! The article mentioned mis-assigned categories that would have been found with a more formal quality review process. Have they thought about how they would fix things on the fly – features like “click here to tell us this is wrong”? Did the experiment designers talk to catalogers or faculty as part of the cluster labeling process? Who were the colleagues they asked to do the labeling?

Is their proposal not to label the clusters at all, but simply to connect to the high-level browse categories, a good one? The group posited that the high-level browse used the campus structure of majors, rather than the organizational structure of the university. (This is the way the IU Libraries web site is structured.) In this case, the subcategories are more meaningful than the main categories, so at least this level would likely be needed.

The discussion group noted evidence of campus priorities in the high-level browse list, for example that the arts and humanities seemed to be under-represented and lumped together while the sciences received more specific attention. Did this make a difference in the clustering too? As noted in the article, titles in the humanities can be less straightforward than in other disciplines, making greater use of metaphors. What do the science records have that humanities records don’t? Abstracts, probably – anything else? Perhaps it’s just that the titles were more specific. Do science subject headings contain more information? Might description in humanities collections be more varied than the language in the sciences? Many possibilities were presented, but the group wasn’t sure which would really affect the clustering methodology.

The group then wondered if the humanities/sciences differences noted in this article would show up in a single institution, or whether they appeared in OAIster only because different data providers tend to focus on one or the other, making the difference really between data providers rather than between disciplines. The group noted (as a gross generalization) that the humanities tend to be more interested in time period, people, and places, whereas the sciences are more interested in topic.

Would the clustering strategy work locally and not just on aggregations? The suggestion in the article that results might improve if run on just one discipline at a time suggests it might. In this case, clusters would likely be more specific. Perhaps an individual institution could employ this method on full text, and leave running it on metadata records alone to the aggregators. It would be interesting to find out if there’s a difference in the effectiveness of this methodology on metadata records for different formats, for example, image vs. text.

The group noted the clustering technique would only be as good as the records from the original site. What if context were missing (the “on the horse” problem)? Garbage in, garbage out, as they say. We understood why the experiment only used English-language records, but it would be interesting to extend this.

The clustering experiment was run using only the data from the title, subject, and description fields. Should they use more? Why not creator? This is useful information. Was it because clusters would then form around creators, which could be collocated using existing creator information? The stopword list was interesting to the group. It made sense that terms such as “library” and “copyright” were on it, but there are resources about these things, so we don’t want to artificially exclude them. What if the stopword list were not applied to the title field?
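The group’s “what if the stopword list were not applied to the title field” question can be sketched directly. The tokenizer is deliberately naive, and the stopword list below is invented for illustration, not the project’s actual list.

```python
STOPWORDS = {"library", "copyright", "university"}  # toy stopword list

def terms_for_clustering(record, skip_stopwords_in=("subject", "description")):
    """Collect clustering terms, dropping stopwords only in selected fields
    so that a title *about* libraries or copyright is still represented."""
    terms = []
    for field, text in record.items():
        for token in text.lower().split():
            if field in skip_stopwords_in and token in STOPWORDS:
                continue
            terms.append(token)
    return terms

rec = {"title": "copyright law primer",
       "description": "a university library guide to copyright"}
```

Here “copyright” survives from the title (where it signals aboutness) but is stripped from the description (where it is likely boilerplate).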

The discussion group wondered how these techniques relate to those operating in the commercial world. Amazon uses “statistically improbable phrases” which seems to be the opposite of this technique – identifying terminology that’s different rather than the same between resources. What about studies comparing these automatic methods to user tagging? No participants knew of such a study in the library literature, but it was noted there might be information on this topic in the information retrieval literature. It would be interesting to compare data from this process to the tags from users generated as part of the LC Flickr project.
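The contrast with “statistically improbable phrases” can be illustrated with a rough sketch: instead of clustering on terms records share, flag terms that are unusually frequent in one record relative to the whole corpus. The threshold is an arbitrary choice for illustration, not Amazon’s actual algorithm.

```python
from collections import Counter

def improbable_terms(doc_tokens, corpus_tokens, ratio=3.0):
    """Flag terms whose rate in this document exceeds their corpus-wide
    rate by `ratio` (an arbitrary threshold); terms unseen elsewhere in
    the corpus are always flagged."""
    doc, corpus = Counter(doc_tokens), Counter(corpus_tokens)
    n_doc, n_corpus = len(doc_tokens), len(corpus_tokens)
    flagged = []
    for term, count in doc.items():
        corpus_rate = corpus[term] / n_corpus
        if corpus_rate == 0 or (count / n_doc) / corpus_rate >= ratio:
            flagged.append(term)
    return flagged
```

Where clustering groups records by what their vocabulary has in common, this flags what makes one record’s vocabulary distinctive – the opposite emphasis the group noted.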

The article described the overall approach as attempting to create simple interfaces to complex resources. Is this really our goal? We definitely want to collocate like resources. The interface in the screenshots didn’t seem “Google-style” simple. The group noted that in the library field many believe simple interfaces can only yield simple answers, and that people searching with simple techniques are generally just looking for something rather than pursuing a comprehensive research goal. This article doesn’t have in its scope a discussion of whether this is true. One big problem is that the article never defines its user base, and different user bases employ different search techniques.

The discussion group believed that browseability, as promoted by the clustering technique, is a key idea. With a good browse, the interface can provide more ways to get at resources, making them more findable. Hierarchical information can be a good way to get users to resources. With the experiment described in this article, the hierarchy is discipline/genre. Would retrieval improve if we pulled in other data from the record to do faceted browsing? Would this work better for the humanities than for the sciences? Do we need to treat the disciplines differently?

Discussion group participants noted that “this isn’t moonwalking,” meaning that this technique looks promising. It needs some tweaking, but the technique hasn’t promised the moon – it’s not purporting to be a be-all, end-all solution. It’s just something we can do, as one part of the many other techniques we use. Can a simple, Google-style interface eventually work for intensive research needs on this data? Or should it? Should the search just lead users to a seminal article and let them citation-chase from there? These are interesting questions.

The group then wondered if the proposal to recluster only every few years was a good one. They would certainly need to do it when getting big new chunks of data that are dissimilar to what’s already in the repository. A possible method would be to randomly test once per month to see if clusters are working out well.

The session ended with some more philosophical questions. Why should services like OAIster exist at all if Google can pick these resources up? Is this type of service beneficial for resources that will never get to the top of a Google search in their native environments? What would happen if one were to apply these techniques to a repository with a more resource-based rather than subject-based collection development policy?

Summary of MDG session, 4-22-08

* Written by Jenn Riley *

The article for discussion this month was:

Borbinha, José. (2004). “Authority control in the world of metadata.” Cataloging & Classification Quarterly 38(3/4): 105-116.

The article provoked a lively discussion that centered largely around the future and functions of authority control. It began by wondering what the tie-in to authority control in the article, as implied by the title, really was. The creator concept is very strong in the article, and authors are something we traditionally control in libraries, although archives treat creators differently. The explicit connection of the article to authority control comes at the very end: it’s not so much that everyone must use the same rules, but that we know what rules are being used. Control is not as important as interoperability. Is this a good conclusion? It’s a practical one.

The discussion in this article is most useful to practitioners, in that it helps us think about why we do authority control in the first place. Some concern was expressed about the very general statements being made in what was perceived as overly technical language.

At this point, there was a bit of confusion in the room, as two participants realized they’d read the wrong article in preparation for the session. This article:

Vitali, Stefano. “Authority Control of Creators and the Second Edition of ISAAR(CPF), International Standard Archival Authority Record for Corporate Bodies, Persons, and Families.” Cataloging & Classification Quarterly 38(3/4):185-199

…is an interesting one, discussing in depth the motivations and methods for authority control in archives. It’s well worth a read.

That being settled, we returned to the Borbinha article. The effectiveness of crossing institutional borders was questioned – we don’t do this well, but Amazon seems to. Perhaps we’re still too focused on our methodology, rather than the goal.

The moderator asked if the conceptual vs. context, etc., perspective was useful. The group was uncertain on this issue, and the gulf between theory and practice emerged as a discussion topic. Practitioners mostly know the records one sees in the cataloging interface, and in this mode the distinction between, say, structure standards and content standards can be confusing. AACR, MARC, and the data entry system are all taught together – the distinction between them is not generally made in training or daily life. Practitioners tend to move through the learning curve with all of them integrated in their minds. So it’s hard to think that we can make an AACR record in Dublin Core – overall it’s very hard to talk about one without the other. Most never see the under-the-hood record coding at all. But in some cases it is useful to keep these distinctions in mind. What do we gain from thinking of them differently? It’s likely not going to be effective to teach the conceptual first and then the practical.

From a public services perspective, the functions shown in Figure 3 are a black box that searches go into and come out of. How useful are these distinctions to that community?

Discussion then turned to the Fellini example in the paper – how would a system bring these together without authority control? Can we live with a system that isn’t perfect? Can we trust a secret and proprietary algorithm? What about the model of a human-generated Wikipedia page with a disambiguation step? Is it better to do the matching up of names ahead of time, or at search time?

Can Google connect Mark Twain and Samuel Clemens, as well as just handle misspellings of “Mark Twain”? Our authority records handle forms of names found in items; Google handles common misspellings. Are these things different? Authority control serves lots of purposes: disambiguating the same name, collocating the same person under multiple names, etc. But there’s room for both approaches. Search systems could have an authority list running underneath. Google works differently, pulling data from the Web rather than from an authority file. Ranganathan’s principle is “every book its reader” – can we say every search term its hit? Can we guess OCLC WorldCat is employing both the authority-based and Google-style methodologies? Maybe not – it doesn’t seem like they’re stemming. Their searches seem to be very literal.
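The “authority list running underneath” idea can be sketched simply. The authority identifier and set of name forms below are invented for illustration, not actual LC authority data.

```python
# Toy authority file: one identity, several variant name forms.
# The identifier "auth-001" is invented for illustration.
AUTHORITY = {
    "auth-001": {"Twain, Mark", "Clemens, Samuel", "Clemens, Samuel Langhorne"},
}

def collocate(query):
    """Expand a name query to every variant sharing an authority identity."""
    q = query.lower()
    for names in AUTHORITY.values():
        if any(q == n.lower() for n in names):
            return sorted(names)
    return [query]  # no authority match: search the literal string
```

A search system could run this expansion before querying its index, giving the Twain/Clemens collocation without the user ever seeing the authority file.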

The dual approach seems promising, as authority files have different purposes than the Google-style work. Maybe we could pull in data from 670 references (or from more structured data proposed by FRAD and showing up in RDA drafts)? Ask the user: “do you want the chemist or social scientist”?

Heterogeneity is part of our life, as the article mentions. We simply have to deal with it. We should find small models that can deal with it, and build on those. What about heterogeneity of thesauri? In some cases it’s clear what thesaurus to use; in others it’s not. Using different vocabularies is a barrier to interoperability – how do we overcome this? This is the tension, which has come up in our discussions before, between doing one thing well and using the same standards for everything but doing each less well. Yet Google and Amazon aren’t worrying about this. Google connects Twain and Clemens because somebody made the connection on a web page.

This is one of the drivers behind the “OPAC sucks” movement – for the audience that just wants something, not everything. There’s a mismatch between this goal and the one OPACs are designed around. But maybe users actually want something good (not everything good). Our systems don’t put the most useful stuff at the top. WorldCat tries to do this by ranking by holdings. We’re doing something wrong when the Director of Technical Services goes to Amazon to find a book because she can’t find it in our catalog!

Summary of MDG session, 3-18-08

* Written by Jenn Riley *

The article for discussion this month was:

Yakel, Elizabeth, Seth Shaw, and Polly Reynolds. “Creating the Next Generation Archival Finding Aids.” D-Lib Magazine 13, no. 5/6 (May/June 2007).

Early on, the discussion focused around the predictability (or lack thereof) of EAD files. EAD as a markup language is designed to be flexible, for the encoding of many different types of finding aids. This means that any two EAD-encoded finding aids may not look very much alike. The potential of using a common controlled vocabulary across finding aids was envisioned as one way to tackle this fundamental unpredictability. The group expressed the idea that for sharing, broad subject headings are good, despite the claim of the article that these weren’t adequate. However, within the local environment, the specific ones this article says were needed make sense.

A large part of the group’s discussion of this article worked through how better access could be provided to these materials with some reasonable level of expediency. While detailed analysis such as noting that a proverb appears within a story within a volume in the IU Folklore collection could be beneficial, it’s unlikely we can afford to be this detailed. Respondents reported that it’s often difficult to resist the urge to provide this detailed analysis, even though there is pressure to process collections quickly. One has to ask, how meaningful will the description be if I don’t go into more detail? One has to stop and think. General practice is to only pull out the “important” data, such as only some names rather than all of them. One participant noted that a recent article in the American Archivist (Fall/Winter 2007, Vol. 70, No. 2), “Archives of the People, by the People, for the People,” by Max J. Evans discusses how one might get more mileage out of an EAD-encoded finding aid. [Note from after the meeting: this same volume has another article on the Polar Bear Expedition project which might address some of the issues the group was wishing were discussed in the article we read for this week. “Interaction in Virtual Archives: The Polar Bear Expedition Digital Collections Next Generation Finding Aid,” by Magia Ghetu Krause and Elizabeth Yakel.]

Some members of the group expressed interest in studying how keyword indexing of full text could be used to help add description for archival collections (although the group realized automatically generating transcriptions of scanned handwritten documents is currently not very feasible). The CLiMB project at Columbia was noted as an example of how this technology might work. The possibility of capturing transcriptions from users was discussed.
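A minimal sketch of the keyword-indexing idea, assuming transcriptions already exist; the document IDs and text below are invented, and the whitespace tokenization is deliberately naive.

```python
from collections import defaultdict

def build_keyword_index(transcriptions):
    """Map each keyword to the set of documents containing it
    (naive whitespace tokenization, for illustration only)."""
    index = defaultdict(set)
    for doc_id, text in transcriptions.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index
```

Such an index could supplement, not replace, the finding aid: a keyword hit points the researcher to a folder or item that the archivist never had time to describe in detail.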

Participants noted the potential utility of user-supplied information, as these users often have a vested interest in and knowledge of the materials.

The group wondered why the project staff was hesitant to include information from the database with data on soldiers, including birth dates, death dates, etc. The prevailing thought in the room was that if the catalog can include this sort of information, it should – that this sort of information was not fundamentally out of scope of the “catalog.”

Participants noted several features of the Polar Bear Expedition site that they believed had been implemented well, including providing coherence to a collection brought together by theme rather than by format, effective browsing (although it was noted the browse might be used more because the search feature was not very full-featured!), and the fact that the entire collection had been digitized rather than just highlights. Some drawbacks were mentioned as well, most notably the current lack of a critical mass of user comments, and the lack of clear information on what it is that brings these various collections together.

A “wish list” for more information on this project emerged, including specifics on the metadata implementation (e.g., what controlled vocabularies were used), and to what degree site features were developed in response to use cases and user studies. For example – the “visitor awareness” feature appears to be a way of getting users to talk to each other. The article didn’t describe how this feature was determined to be a priority – was it implemented in response to a defined need or just because it was interesting? Participants also wanted more information on balancing this sort of functionality with user privacy issues, while recognizing that this sort of project can open users’ minds as to what is possible, allow us to get feedback from them, and to ask them what they want, while they’re using it.

The challenges described in this article were disheartening to some participants, who felt that this project represents a best possible case, with all the material already digitized. The fact that there were still so many problems is a bit scary, as the mantra we’ve been hearing is that online materials were supposed to make this sort of thing much easier. Or are we just making these systems too complex? Flickr seems to work, and it operates at a much simpler level. To what degree does the system need to reflect the complexity of the collections and the items within them?

Summary of MDG session, 2-26-08

* Written by Jenn Riley *

The February 2008 meeting of the Metadata Discussion Group drew about 50 attendees. Thank you to everyone who continues to make this group a success.

The article for discussion this month was:

Chapman, John. “The Roles of the Metadata Librarian in a Research Library.” Library Resources & Technical Services v. 51 no. 4 (October 2007).

The discussion began by examining which job responsibilities presented in the article represented entirely new tasks, which were slight evolutions from current practice, and which seemed much the same as the duties of some current technical services positions. For the most part, the group felt that the four areas described (collaboration, research, education, development) were generally part of technical services responsibilities currently. Collaboration, especially with collection managers (Archive-It is an example of this at IU), was an area participants felt was already a strong part of technical services jobs, although expanding the scope to working directly with faculty might be necessary in the future.

The area of development was thought to be the most different from “traditional” technical services positions, requiring a stronger need to think about the final form of access for materials being described. Technical services staff needed to deal with this in the early days of automation, but since access hasn’t changed much since then, more thinking is needed in this area. Staff will increasingly need to deal with different levels and types of metadata – some web-presentable, some more internally focused. They will need to work closely with technology-intensive positions (although they do this already with MARC data). Designing new platforms and interfaces is what’s new.

The decision in this article to look only at positions within technical services may have been a practical one, but it does potentially introduce homogeneity into an inherently heterogeneous environment. Dealing with this heterogeneity is a key role of metadata staff. The MARC/AACR2 stack is well-tested; some of the newer ones aren’t. Metadata librarians will have to determine for each new set of materials which sets of standards to use. A major weak point now is that our mainstream cataloging system can only handle the one set of standards, so our users have to go to multiple places to access content. This heterogeneity makes interoperability difficult. How do we allow things to be different when they need to be, but not make them different just to be different? It’s fun to make up new things, but we have to be sustainable.

A participant noted that a colleague from another university had observed a trend that in places where there is significant funding for digital library work, metadata tends to be a separate operation, outside of technical services, and in places where there’s no funding, technical services is often asked to take non-MARC metadata work on themselves. Library organizational models are so fluid, it’s no surprise there are so many different models out there for metadata librarians and digital library work.

Asking technical services to do non-MARC metadata is a huge investment – it’s asking already busy people to do more things. We also think we need higher-level salary lines for this planning work. But technical services budgets are being cut. How do we deal with this? Don’t think of it as dumping more work on folks. Think of it as adapting to the world as it changes. It’s an exciting opportunity. Think outside the box – metadata work in acquisitions, perhaps.

Regardless of the reporting structure, the group felt a strong need to move to mainstream processes. We know enough about how to deal with many types of material, even with non-MARC metadata, to make it operationalized.

A participant posed an interesting hypothetical situation: tomorrow we all come to work and all jobs with “cataloging” in the title have changed to “metadata.” What would we need to make that happen? The first reaction to this proposal was that MARC is metadata, so this could be true now. To expand into other types of metadata, staff would need training. More contact with subject specialists would be required, to learn about needs these staff would not currently be aware of based on current standards. Staff would need a lot of support during the transition.

One function of metadata is to organize information, another function is to make connections between things. This means subject specialties will be more important into the future.

People think of cataloging online resources when you say “metadata.” But many definitions of metadata are broader, so the word is almost useless now in many cases. A view raised in discussion was that metadata facilitates online discovery, whether the material is online or not. It’s the long-held idea of metadata as a document surrogate.

Although descriptive metadata is what’s primarily being discussed, acquisitions departments may need to deal with other types of metadata, especially rights metadata. Flexible staff who can take on digital library-type activities when other duties lull are likely to be needed. Libraries will need to continue to prepare individuals for new work.