Metadata

From LISWiki

Metadata refers to the information used to describe other information or its structure; "data about data". The two main types of metadata are structural and descriptive. Below is an essay about metadata.

Introduction

Since creating and using metadata is such a long-standing function of libraries and librarians, it seems only natural that we should use it on our weblogs. After all, the MARC standard, created in the late sixties and early seventies, is still a viable structure, and the Paris Principles, on which the ISBDs are based, even predate that. How many metadata structures can claim such a long and useful life? Libraries and metadata are linked. We understand the utility, construction and exchange of bibliographic records and the importance of standards in that process. We have created an alphabet soup of committees to oversee our standards and guide their development.

Differences between bibliographic metadata and web site metadata

However, the reasons for adding metadata to Web pages in general and Web logs in particular differ from those used to justify cataloging. Use will probably not greatly increase after the effort. Not many of the general search engines index metadata tags on Web pages. There is the problem of unethical people using metadata to describe their site as something other than what it is; spamming the search engine, in effect. Within a community of trust we can accept the description as accurate. For instance, we trust the description in a record found on WorldCat or RLIN to be accurate. It never occurs to us to wonder if the description is misleading and exists only to lead us to pornography or on-line gambling. Metadata found on some Web sites may exist only for such a purpose. Metasearching or federated searching is a tool that can be used to search within a community of trust and so solve, to some extent, the problem of trust. The tools for metasearching are in the early stages of development, and it is not yet clear what metadata they will finally read and use. This is an area that deserves close attention and input from the library community.

Another difference between bibliographic metadata and that on our Web sites is that bibliographic records can be shared and reused. When we describe an on-line resource we are describing a unique entity. Metadata can be harvested and used to point to the material but the metadata can rarely be reused to describe another copy of the resource.

Standards development

Many of the Web metadata standards are in early phases of development and major changes may be necessary in the near future. The XML family of standards is still developing, so any tools built using those standards are also forced to change; that is in addition to any changes, tweaks and enhancements to the standard itself. Some standards may even fade away, and all effort put into them will be wasted. The developers of these standards are not librarians; most come from the computer field, and the standards are beyond our control. All these problems, and others, make adding metadata to our weblogs riskier than traditional cataloging.

These very problems are also the reasons we should be involved in the Web metadata process. We will not see a large increase in visitors to our site, but we should see some increase. Geographic metadata in particular has an important future. Someone should be able to locate your institution by searching geo-coordinates. That will only become more important over time. Other metadata allows filters to understand your site is suitable for all audiences; that may be important for users that have filtering software on their home machines. Some Web services may collocate sites based on their metadata. Bringing like items together is certainly something we want from our catalogs, and from Web services. Making a site more available for resource discovery is a valid reason for using metadata.

Our contribution to metadata development

The fact that information standards are being developed, and not by librarians, should be seen as a call to arms by the profession. We have something to contribute to this discussion, but unless we make ourselves heard on the standards committees our voice is lost. This can be done by reading the proposed standards and Requests for Comments and contributing to the discussion. We can beta test the tools and provide input to their development. Those of us with coding skills can participate in open-source projects. We can even get places on the committees setting information standards. We can save the developers from common errors and oversights and make our concerns heard. It is still early enough in the process of many standards for changes to be made. Information standards should not be beyond the scope of librarians. By using metadata on our pages we have something to contribute to the discussion.

Working with other metadata schemes can benefit the cataloger. Often seeing something done differently can lead to a better understanding of our own toolset, MARC/AACR/ISBD. Or maybe seeing something done better could lead to an improvement in our own standards. Other standards have their proper place in the information spectrum, and the institution may gain from an understanding of the available options. When all you have is a hammer, everything looks like a nail, and when all you have is MARC, everything looks like a bibliographic record. Many standards are discussed below; others, such as GILS, FGDC, Encoded Archival Description or METS, may be the best option in a particular circumstance. But to advise on the best course of action, we must know the options available to us.

What is metadata?

Before progressing further a discussion of metadata itself might be in order. I find this a useful definition:

Metadata are structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities. Committee on Cataloging Task Force on Metadata Summary Report (1999)

Metadata is structured and encoded, not free text. A book review may contain most of the information in a catalog record, but the review is not metadata while the catalog record is. Structured, encoded information is the difference. The MARC record structure provides access to particular fields and allows machine processing of the information. If the book review were marked up using the Text Encoding Initiative tags, then it too might enter the realm of metadata.

Metadata serves many purposes, "identification, discovery, assessment, and management." Some metadata, added to a Web site, may aid in the site being found by a user. Other metadata might help the host institution weed old and outdated information from the site. Both are valid uses and both can be aided by metadata.

Metadata for web sites

Metadata on our weblogs can be that designed specifically for use on Web logs or more generally designed for the Web. The Web has existed for better than a decade and so has more stable and developed metadata standards than those designed specifically for weblogs, which have only come to wide use in the past few years. This reduces the potential risks of using this metadata but also reduces our potential contribution, since these standards are more nearly finished.

The most basic metadata is that included in the HTML specification for use in the HEAD section of pages. Title is defined and used by many search engines and directories as the display element in their short list of hits. The META tag is less widely used and supported but much more flexible. It can be used to include keywords, a description of the site, the author, the dates the site is valid and the last revision date.

For example:

<head>
<title>Catalogablog</title>
<meta name="author" content="David Bigwood" />
<meta name="keywords" content="Library cataloging, Classification, Metadata, Subject access" />
<meta name="description" content="Web log concerned with library cataloging, metadata, classification and related topics" />
</head>
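
As a rough illustration of how a search engine or directory might read these tags, here is a minimal Python sketch using only the standard library. The class name HeadMetadataParser and the sample text are our own assumptions for this example; real harvesters will have their own, more robust, parsers.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            # Meta names are treated as case-insensitive in this sketch.
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


SAMPLE_HEAD = """
<head>
<title>Catalogablog</title>
<meta name="author" content="David Bigwood" />
<meta name="keywords" content="Library cataloging, Classification, Metadata, Subject access" />
</head>
"""

parser = HeadMetadataParser()
parser.feed(SAMPLE_HEAD)

# Keywords arrive as one comma-separated string; split them for indexing.
keywords = [k.strip() for k in parser.meta["keywords"].split(",")]

print(parser.title)
print(parser.meta["author"])
print(keywords)
```

The title becomes the display element and the keywords become index terms, which is essentially all a simple harvester needs from the HEAD section.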

Platform for Internet Content Selection

Platform for Internet Content Selection (PICS) is a metadata standard developed because not all Internet sites are appropriate for all viewers. It is designed to help parents and teachers control what children access on the Internet, but it also allows any user to avoid sites they may find offensive. It allows the host to rate their site on such topics as nudity, violence, sex, tobacco, gambling, and other sensitive areas. PICS was developed by a W3C committee, but they left the labeling function to other institutions. The Internet Content Rating Association (ICRA) has emerged as the premier rating service. They provide a fill-in checklist to generate the metadata tag to paste into the header of your Web site.

<meta http-equiv="pics-label" content='(pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true for "http://www.catalogablog.blogspot.com" r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true for "http://www.catalogablog.blogspot.com" r (n 0 s 0 v 0 l 0))' />
Example of a PICS label from Catalogablog

The PICS label is easy to generate and use. Simply choose the options and fill in the blanks at the ICRA Web site, then paste the code from their site into the HEAD section of your Web site. The site provides output in HTML, XHTML, or PHP to paste into the document, as well as HTTP headers for use with Apache and IIS servers. The label will allow users who block unrated sites to access the library's Web site. The number of such users is probably low; those concerned about offensive materials on the net probably either avoid it altogether or use third-party filtering software. Few would look at the Microsoft Internet Explorer options to enable the Content Advisor.
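
To see what filtering software has to work with, the ratings in a label like the one above can be pulled apart with a few lines of Python. This is only a sketch under simplifying assumptions: the regular expression handles labels shaped like the Catalogablog example (a quoted rating-service URL, label options, then "r (...)" with integer values) and is not a complete PICS-1.1 parser. The function name parse_pics_ratings is ours.

```python
import re

# Simplified pattern: a quoted rating-service URL, the label options, then
# "r (...)" holding category/value pairs. Not a full PICS-1.1 parser.
LABEL_PATTERN = re.compile(r'"([^"]+)"\s+l\b.*?\br\s*\(([^)]*)\)', re.S)


def parse_pics_ratings(label):
    """Map each rating-service URL to a dict of its category ratings."""
    ratings = {}
    for service, pairs in LABEL_PATTERN.findall(label):
        tokens = pairs.split()
        # Tokens alternate between category code and value: "n 0 s 0 ..."
        ratings[service] = {
            category: int(value)
            for category, value in zip(tokens[::2], tokens[1::2])
        }
    return ratings


LABEL = (
    '(pics-1.1 "http://www.icra.org/ratingsv02.html" l gen true '
    'for "http://www.catalogablog.blogspot.com" r (cz 1 lz 1 nz 1 oz 1 vz 1) '
    '"http://www.rsac.org/ratingsv01.html" l gen true '
    'for "http://www.catalogablog.blogspot.com" r (n 0 s 0 v 0 l 0))'
)

ratings = parse_pics_ratings(LABEL)
for service, values in ratings.items():
    print(service, values)
```

A filter would compare each category value against the threshold the user has set for that rating service and block the page if any value exceeds it.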

Dublin Core Metadata

The Dublin Core Metadata Initiative (DCMI) originated in a meeting held in Dublin, Ohio, attended by both librarians and the computer community. Dublin Core is a set of 15 elements, with recommendations on encoding, intended to aid resource discovery and to be interoperable among different user communities. It is much simpler, more readable and more basic than a MARC record.

01504nam 2200229 a 45000010008000000030006000080050017000 14008004100031024003200072040002200104 24501410012626000460026730000330031344 00085003465040041004315050414004725000 27000886650003601156650002401192700002 801216852003001244?SI02282?TxHLS?20040 416143235.3?040416s2004 ne a b 000 0 eng d?41a09242716903200411099583/401? aTxHLSbengcTxHLS?00aIntegration of geodata and imagery for automated refinement and update of spatial databases /cedited by Christian Heipke ... [ert al.].? aAmsterdam ;aNew York :bElsevier,c2004.? a127-258 p. :bill. ;c27 cm.? 0aISPRS journal of photogrammetry and remote sensing,x0924-2716 ;vv. 58, no. 3-4? aIncludes bibliographical references.?20tTowards an operational system for automated updating of road databases by integration of imagery and geodata/rChunsun Zhang --tDetecting building changes from multitemporal aerial stereopairs /rFranck Jung --tReconstruction of 3D building models from aerial images and maps/rIldiko Suveg and George Vosselman --tObject-based classification of remote sensing data for change detection /rVolker Walter.? aPapers "deal with the integration of topographic geo-spatial data in vector format (and in particular roads and buildings) with non-interpreted airborne imagery and digital terrain and surface models, for automated refinement and update of the vector data."--p. 27.? 0aGeographic information systems.? 0aAerial photography.?1 aHeipke, C.q(Christian)? aTxHLSbMobile 2820040416?
Example of a MARC record

<dc:title>Integration of geodata and imagery for automated refinement and update of spatial databases /</dc:title>
<dc:creator> Heipke, C. (Christian) </dc:creator>
<dc:type>text</dc:type>
<dc:publisher>Amsterdam ; New York : Elsevier,</dc:publisher>
<dc:date>2004.</dc:date>
<dc:language>eng</dc:language>
<dc:description>Includes bibliographical references.</dc:description>
<dc:description>Papers "deal with the integration of topographic geo-spatial data in vector format (and in particular roads and buildings) with non-interpreted airborne imagery and digital terrain and surface models, for automated refinement and update of the vector data."--p. 27.</dc:description>
<dc:subject>Geographic information systems.</dc:subject>
<dc:subject>Aerial photography.</dc:subject>
Example of the same record in Dublin Core

Adding Dublin Core metadata to your Web site or weblog will most likely have little effect on the number of visitors. Like PICS, it is easy to generate; several tools exist to create the structure and generate a set of tags suitable to paste into the header of the page. DC has become the basis of many other encoding schemes, so a basic understanding is very useful and will be applicable to those other schemas.

Some of the available tools include the Dublin Core Metadata Template provided by the Nordic Metadata Project and DC-dot (UKOLN). The commercial products TagGen and Metabrowser Client from Metabrowser Systems also produce Dublin Core records.

The Dublin Core Metadata Template provides a fill-in-the-blank template to create DC metadata. A nice feature is the ability to repeat a field by checking a box; all fields are repeatable in Dublin Core. It provides metadata in HTML to paste into the HEAD section of your Web page.

DC-dot provides output in XHTML, HTML, RDF and XML. It also allows the metadata to be output in MARC, GILS, as a TEI header and in many other formats. It will read a site and fill in the fields with data from the site. These and other fields can be edited and the result pasted into the page being described. DC-dot will also validate metadata already present.

The Reg Metadata Editor also reads a page and populates the form. It allows fields to be duplicated or removed. It provides output in RDF, HTML and SOIF. It will create records in another schema if pointed to a properly formatted schema description.

The Editor-Convertor Dublin Core Metadata, at the Chizhevsky Regional Universal Research Library, will read a page, allow changes to be made and then save the record in HTML or UNIMARC; records can also be saved in ISO 2709 format. DC-assist is a help tool for use with DC metadata. It shows the fields and qualifiers and gives examples.
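
What these generators do is not mysterious. The following Python sketch, which does not reproduce the code of any tool named above, turns a small record into Dublin Core META tags. The function name dublin_core_meta and the sample record are assumptions made for illustration; note how a list of values yields a repeated element.

```python
from html import escape


def dublin_core_meta(record):
    """Return one <meta> tag per value; every DC element may repeat."""
    lines = []
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Escape quotes and ampersands so the attribute stays well formed.
            lines.append(
                '<meta name="DC.%s" content="%s" />'
                % (element, escape(value, quote=True))
            )
    return "\n".join(lines)


record = {
    "Title": "Catalogablog",
    "Creator": "David Bigwood",
    "Subject": ["Cataloging", "Metadata"],  # repeated element
}

tags = dublin_core_meta(record)
print(tags)
```

The output can be pasted into the HEAD section of a page in the same way as the output of the tools described above.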

A-Core

A-Core (AC) is administrative metadata and can be used with any other metadata scheme. It exists to describe who created the metadata, how to contact them, when it was created, the dates it is valid, the location of the metadata, and rights ownership of the metadata. It is important to remember it describes the metadata not the resource the metadata describes. It is simple to create and easily added to the end of a Dublin Core or other metadata tag set.

<link rel="schema.DC" href="http://purl.org/dc" />
<meta name="DC.Type" content="Collection" />
<meta name="DC.Title" lang="en" content="Catalogablog" />
<meta name="DC.Creator" content="David Bigwood" />
<meta name="DC.Subject" lang="en" content="Cataloging; Classification; Thesaurus; Indexing; Subject headings; Metadata" />
<meta name="DC.Description" lang="en" content="Web log concerned with library cataloging, metadata, classification and related topics" />
<meta name="DC.Date" scheme="W3CDTF" content="2002" />
<meta name="DC.Type" scheme="DCMIType" content="Text" />
<meta name="DC.Format" scheme="IMT" content="text/html" />
<meta name="DC.Identifier" content="http://www.catalogablog.blogspot.com" />
<meta name="DC.Language" scheme="ISO639-2" content="eng" />
<link rel="schema.AC" href="http://metadata.net/admin/draft-iannella-admin-01.txt/" />
<meta name="AC.name" content="Bigwood, David" />
<meta name="AC.email" content="bigwood@lpi.usra.edu" />
<meta name="AC.contact" content="Phone 281-486-2134" />
Example of Dublin Core and A-Core tagging from Catalogablog

GeoURL

GeoURL uses the ICBM meta tags to record geographic information in latitude, longitude format. Their Web search engine harvests this information and makes it available to searchers. Using these tools it is possible to find all Web sites within a given number of miles of a location. They provide a tag generator, but you should find your co-ordinates before using it. An interesting feature they offer is a map showing recently updated sites. This is done by sending a ping to their site, or by using the form they provide to send the ping.

<meta name="ICBM" content="29.52043,-95.04799" />
<meta name="DC.title" content="Catalogablog" />
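
The idea of finding every site within a given number of miles of a location can be sketched in a few lines of Python. This is not GeoURL's own code: the use of the great-circle (haversine) formula, the function names, and the two comparison sites with their approximate coordinates are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean radius; adequate for a rough search


def parse_icbm(content):
    """Split an ICBM value such as '29.52043,-95.04799' into two floats."""
    lat, lon = (float(part) for part in content.split(","))
    return lat, lon


def distance_miles(a, b):
    """Great-circle distance between two (lat, lon) pairs, in miles."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def sites_within(origin, sites, radius_miles):
    """Names of the sites whose position lies within the radius of origin."""
    return [name for name, position in sites.items()
            if distance_miles(origin, position) <= radius_miles]


origin = parse_icbm("29.52043,-95.04799")  # the Catalogablog tag above

# Hypothetical harvested sites with approximate city coordinates.
sites = {
    "Hypothetical site near Houston": (29.7604, -95.3698),
    "Hypothetical site near Austin": (30.2672, -97.7431),
}

nearby = sites_within(origin, sites, 50)
print(nearby)
```

A harvester only needs the one ICBM tag per page to make this kind of search possible, which is why the tag is worth the minute it takes to add.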

Geo Tags

Geo Tags are another geographic location tool. They are used by the GeoSearch engine and can also be read by GeoURL. The site provides a tool to generate the tags so they may be pasted in the HEAD section of your page.

<meta name="geo.position" content="29.52;-95.1" />
<meta name="geo.region" content="US-TX" />
<meta name="geo.placename" content="League City" />

Location information will only become more important over time. An easy method of finding local institutions and their resources may be the next major refinement of Web searching. Just how this will be accomplished is not yet decided. Neither of the two tools above has gained universal acceptance, nor even widespread use. Whether the final tool is one of these or something yet to be developed remains to be seen. Our community has something to contribute to the final decision and our voice needs to be heard.

Creative Commons

The Creative Commons metadata is unique since it deals with copyright and use issues, not description and resource discovery. It provides the possibility of a less restrictive use policy than standard copyright. This metadata is often applied to photographs, music, and other creative materials as well as Web pages and Web logs. It is fairly widely used and tools exist to create and search the metadata. To create metadata suitable to paste into your 'blog or Web site, simply select from a few options, then generate, and cut and paste the metadata generated. The Creative Commons RDF-enhanced search engine allows limiting the search by format and intended use. There is also a tool to validate the metadata and a Mozilla plugin, MozCC, that reads and displays Creative Commons metadata.

Librarians, who often have an inclination to provide open access to information, may find the Creative Commons approach appealing. However, it would be best to consider the implications and policy of the institution before making a hasty decision.

Metadata specific to weblogs

We now turn to metadata formats that, if not created specifically for weblogs, are at least more common there than in other parts of the Internet. There are actually two different kinds of metadata in this area: those that describe some aspect of the weblog as a whole and those that apply to individual postings. The difference is like that between a bibliographic record and a back-of-the-book index. Metadata formats that describe something about the weblog as a whole will be considered first.

RSS

The most common metadata format associated with weblogs is RSS. The confusion over what the initialism stands for (RDF Site Summary, Really Simple Syndication, or Rich Site Summary) reflects the various flavors and competing standards. It has been widely adopted by the blogging community, and tools to create and process the format are available. It is often thought of as an alternative transmission format, competition for e-mail for example. However, often only the titles of the individual posts or a truncated version of each post is sent. It acts as the equivalent of a table of contents, which is certainly metadata. RSS is so important it is treated more fully in an entire chapter elsewhere in this book. Just remember it is an important, if not the most important, metadata standard available for weblogs.
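
The table-of-contents role of a feed is easy to see when one is read by a program. The Python sketch below parses a small RSS 2.0 document with the standard library; the two item titles and links are hypothetical, and the function name table_of_contents is ours. Other flavors of RSS, such as the RDF-based 1.0, use namespaces and would need different lookups.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Catalogablog</title>
    <link>http://catalogablog.blogspot.com</link>
    <description>Library cataloging, classification, metadata</description>
    <item>
      <title>Hypothetical post one</title>
      <link>http://catalogablog.blogspot.com/example-1</link>
      <description>Truncated text of the first post...</description>
    </item>
    <item>
      <title>Hypothetical post two</title>
      <link>http://catalogablog.blogspot.com/example-2</link>
      <description>Truncated text of the second post...</description>
    </item>
  </channel>
</rss>"""


def table_of_contents(rss_text):
    """Return the channel title and a list of (post title, link) pairs."""
    channel = ET.fromstring(rss_text).find("channel")
    entries = [(item.findtext("title"), item.findtext("link"))
               for item in channel.findall("item")]
    return channel.findtext("title"), entries


feed_title, entries = table_of_contents(SAMPLE_RSS)
print(feed_title)
for title, link in entries:
    print(" -", title, link)
```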

Outline Processor Markup Language (OPML)

A feature common to many Web logs is the list of similar sites running down one side, a blogroll. A mapping of those links and then the links on each of those sites would provide a view of Web sites dealing with a particular topic. Valuable sites would be referenced more than sites considered tangential or less valuable to the topic. Many tools are being developed to create and map this social network. This mapping may become a part of the semantic web.
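
The notion that valuable sites are referenced more often can be made concrete with a simple count of inbound links. The following Python sketch is only an illustration: the blog names are hypothetical, the function name inbound_link_counts is ours, and real mapping tools use far more elaborate measures.

```python
from collections import Counter


def inbound_link_counts(blogrolls):
    """Count how many other sites list each site in their blogroll."""
    counts = Counter()
    for source, links in blogrolls.items():
        for target in set(links):      # count each source only once
            if target != source:       # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical blogrolls: each key links to the sites in its list.
blogrolls = {
    "blog-a": ["blog-b", "blog-c"],
    "blog-b": ["blog-c"],
    "blog-d": ["blog-c", "blog-a"],
}

counts = inbound_link_counts(blogrolls)
print(counts.most_common())
```

Here blog-c is listed by three of the other sites and so would be treated as the most central to the topic.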

Outline Processor Markup Language, commonly known as OPML, is widely used in the blogging community to store, transfer and process lists of related sites. The OPML standard is useful for much more than that, thesauri and ontologies for example, but the most common use is to store related Web sites. It is a simple XML standard, so it is application and environment neutral.

<opml version="1.0">
  <head>
    <title>Bloglines Subscriptions</title>
    <dateCreated>Sat, 24 Apr 2004 21:31:20 GMT</dateCreated>
    <ownerName>Catalogablog</ownerName>
  </head>
  <body>
    <outline title="Subscriptions">
      <outline title="Catalogablog" htmlUrl="http://catalogablog.blogspot.com" type="rss" xmlUrl="http://catalogablog.blogspot.com/rss/catalogablog.xml" />
      <outline title="What's New at the Lunar and Planetary Institute Library" htmlUrl="http://www.lpi.usra.edu/library/whats_new.shtml" type="rss" xmlUrl="http://www.lpi.usra.edu/library/new.xml" />
    </outline>
  </body>
</opml>
Example of an OPML file

Many tools exist to create or process OPML files. For example, it is possible to upload an OPML file, or have it read by a service, that then makes it available to others or creates the HTML to use as a blogroll. There is an XSLT tool to convert OPML to HTML; another uses Perl and still another Python to do the same. Another tool will export the favorites from MS Internet Explorer as an OPML file. Many aggregators and newsreaders have the ability to export and import OPML files.

Providing an OPML file on your site will allow someone to download the collection of sites you feel are important. It may also allow mapping of the social network you are a part of. Since the files are simple to create, and tools exist to convert them into HTML, an OPML file can also serve as the source of the blogroll for your site.
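
The conversion from OPML to a blogroll is simple enough to sketch in Python with the standard library. This is not the code of any of the tools mentioned above; the function name opml_to_blogroll is ours, and the sketch assumes each site is an outline element carrying title and htmlUrl attributes, as in the example file.

```python
import xml.etree.ElementTree as ET
from html import escape

SAMPLE_OPML = """<opml version="1.0">
  <head><title>Bloglines Subscriptions</title></head>
  <body>
    <outline title="Subscriptions">
      <outline title="Catalogablog" htmlUrl="http://catalogablog.blogspot.com" type="rss" xmlUrl="http://catalogablog.blogspot.com/rss/catalogablog.xml" />
      <outline title="What's New at the Lunar and Planetary Institute Library" htmlUrl="http://www.lpi.usra.edu/library/whats_new.shtml" type="rss" xmlUrl="http://www.lpi.usra.edu/library/new.xml" />
    </outline>
  </body>
</opml>"""


def opml_to_blogroll(opml_text):
    """Build an HTML list with one link per outline that has an htmlUrl."""
    root = ET.fromstring(opml_text)
    items = []
    for outline in root.iter("outline"):
        url = outline.get("htmlUrl")
        if url:  # grouping outlines such as "Subscriptions" have no URL
            title = outline.get("title", url)
            items.append('<li><a href="%s">%s</a></li>'
                         % (escape(url, quote=True), escape(title, quote=False)))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


blogroll = opml_to_blogroll(SAMPLE_OPML)
print(blogroll)
```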

Open Content Syndication Directory Format (OCS)

Another format to carry the same information is Open Content Syndication Directory Format (OCS). The OCS format is more flexible, being designed specifically to hold lists of Web sites, unlike OPML, which was intended as a tool for the display of hierarchies. It uses the Resource Description Framework (RDF) structure, a complex but powerful tool.

The Open Content Directory Format is intended to provide a concise, machine-readable listing of a set of syndicated services. The directory format is capable of supporting multiple sites, each with multiple services. Each service can have multiple formats such as RSS (RDF Site Summary), XHTML, Plain Text, Avantgo or WML format as well as separate publishing schedules or languages.

The major difference between OPML and OCS is that OPML has one title for one link, while OCS can have several links with differing formats for each title. Some aggregators will export in OCS, Syndirella for example, and some blogrolling services will accept OCS files; Syndic8 is one. Although OCS is superior to OPML in many ways, it is not as widely implemented, and far fewer tools exist to create, transform and use OCS data.

Friend of a Friend (FOAF)

Another metadata tool in the social network area is Friend of a Friend (FOAF). "FOAF is all about creating and using machine-readable homepages that describe people, the links between them and the things they create and do." It is commonly used, and tools exist to create, manipulate and read the files. The foaf-a-matic will generate a FOAF metadata file in RDF format to add to your site.
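
For readers who have not seen one, a FOAF file is a short RDF/XML document. The Python sketch below embeds a minimal example and reads the person's name and the people they know. The sample file, including the "Hypothetical Colleague", is an assumption written for this illustration, as is the function name read_foaf. The sketch treats the file as plain XML in one common layout; a general solution would use a real RDF parser.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

SAMPLE_FOAF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>David Bigwood</foaf:name>
    <foaf:weblog rdf:resource="http://catalogablog.blogspot.com"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Hypothetical Colleague</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""


def read_foaf(foaf_text):
    """Return the first person's name and the names of people they know."""
    root = ET.fromstring(foaf_text)
    person = root.find("foaf:Person", NS)
    name = person.findtext("foaf:name", namespaces=NS)
    friends = [known.findtext("foaf:name", namespaces=NS)
               for known in person.findall("foaf:knows/foaf:Person", NS)]
    return name, friends


name, friends = read_foaf(SAMPLE_FOAF)
print(name, "knows", friends)
```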

FOAF metadata will be of less use in an institutional Web log than a personal weblog. Can a public library really be considered a friend of the parks department? A link connecting the two in either OCS or OPML may be valid but it would stretch the concept of friendship. Maybe the Friends of the Library group or an important contributor could justify a FOAF file pointing to the library but a reciprocal FOAF file pointing back would be nonsense.

The linking and information generated by FOAF metadata is interesting. It is being used on IRC to describe communities in chat rooms. It can be mapped to show relationships and connections between groups of interest. In the UK the SMS interface to Plink provides FOAF information to mobile phone users.

XHTML Friends Network (XFN)

XHTML Friends Network (XFN) is more flexible but less widely used. It has a much richer set of relationships, but they still exist on the personal rather than the institutional level. It is possible to designate a link as pointing to a parent, child, sweetheart or spouse rather than just a friend.

XFN does not exist as a separate file. Adding tags to an existing blogroll allows each link to have the designated attribute attached. There are some tools to create, read and aggregate XFN friendly sites.

<a href="http://openstacks.net/os/" rel="colleague" />
Example of XFN markup added to a link
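
Because XFN lives in the rel attribute of ordinary links, reading it takes very little code. The Python sketch below collects the relationship values from a small blogroll. The class name XFNParser and the two example.org and example.com links are assumptions for this illustration; the sketch does not check that the values belong to the XFN vocabulary.

```python
from html.parser import HTMLParser


class XFNParser(HTMLParser):
    """Collect (href, [relationship values]) for links that carry rel."""

    def __init__(self):
        super().__init__()
        self.relationships = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") and attrs.get("rel"):
            # rel may hold several space-separated values, e.g. "friend met"
            self.relationships.append((attrs["href"], attrs["rel"].split()))


BLOGROLL = """
<ul>
<li><a href="http://openstacks.net/os/" rel="colleague">openstacks.net</a></li>
<li><a href="http://example.org/" rel="friend met">A hypothetical friend</a></li>
<li><a href="http://example.com/">A link with no XFN value</a></li>
</ul>
"""

parser = XFNParser()
parser.feed(BLOGROLL)
for href, values in parser.relationships:
    print(href, values)
```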

Blogchalk

The simplest metadata to add to weblogs is Blogchalk. It also seems to have very little use. There is one search engine designed to read it, the Blogchalk Search Engine, but it seems to have stopped development in 2002. Many of the sites in the search engine no longer exist or have not been updated in a couple of years. There is a tool to create the two lines and the image that the metadata consists of. One line is a META tag placed in the HEAD section of the site, and the other is placed somewhere visible to readers. The information that can be provided includes nation, state, city, neighborhood, languages, gender, age, and two interests. The nation is selected from a drop-down menu; however, the state lacks this standardizing tool. So it is possible for users to enter TX, Tex., and Texas, for example, reducing the possibility of comprehensive searching on that field.
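
The cost of an uncontrolled field is easy to demonstrate. In the Python sketch below, a search service that wanted to find every Texas weblog would need a table of variants like this one for every state, something a drop-down menu would have made unnecessary. The table and the function name normalize_state are our own assumptions, not part of Blogchalk.

```python
# A tiny authority table mapping variant forms to one preferred form.
STATE_VARIANTS = {
    "tx": "Texas",
    "tex": "Texas",
    "texas": "Texas",
}


def normalize_state(value):
    """Map a free-text state entry to its preferred form when known."""
    key = value.strip().rstrip(".").lower()
    # Unknown entries pass through unchanged rather than being guessed at.
    return STATE_VARIANTS.get(key, value.strip())


entries = ["TX", "Tex.", "Texas", " texas "]
normalized = {normalize_state(entry) for entry in entries}
print(normalized)
```

All four entries collapse to the single form Texas, which is what makes comprehensive searching on the field possible.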

Metadata for individual posts

Some of the blogging tools provide for keywords or categories to be associated with individual posts. For example, Movable Type and Nucleus both use categories. You can add a new category as necessary. These categories then provide faceted access to posts.

Easy News Topics (ENT)

The Resource Description Framework (RDF) and XML Topic Maps (XTM) are both powerful and flexible tools to describe individual posts. Both are very complex and very infrequently used at this level. Maybe in time tools will be developed to make adding this metadata to individual items easy, but that time has yet to arrive. One attempt to create a simpler yet still robust tool is Easy News Topics (ENT).

ENT is an addition to RSS 2.0 that allows topics or subject headings to be added to individual posts. The power of ENT lies in the ability to point to a cloud, a source of topics. This allows a controlled vocabulary to be used, with links back to the authority. That in turn makes it possible to merge synonyms and to distinguish among identical terms with different meanings. For example, when the topic Mars points to the Astronomy Thesaurus it can be distinguished from the same term pointing to clouds for candy bars, classical gods or offshore oil platforms.
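
The effect of pointing a topic at a cloud can be sketched without reproducing the ENT syntax itself. In the Python example below, each post carries a topic together with the address of the cloud it comes from, and posts are grouped on the pair rather than on the bare word. The cloud addresses, post titles and the function name group_by_topic are all hypothetical.

```python
from collections import defaultdict


def group_by_topic(posts):
    """Group post titles by (cloud, topic) so identical words stay apart."""
    groups = defaultdict(list)
    for title, cloud, topic in posts:
        # The cloud address is part of the key; the topic is case-folded.
        groups[(cloud, topic.lower())].append(title)
    return dict(groups)


ASTRONOMY = "http://example.org/astronomy-thesaurus"    # hypothetical cloud
CONFECTIONERY = "http://example.org/confectionery-terms"  # hypothetical cloud

posts = [
    ("Rover images released", ASTRONOMY, "Mars"),
    ("Dust storm season", ASTRONOMY, "mars"),
    ("New candy flavors", CONFECTIONERY, "Mars"),
]

groups = group_by_topic(posts)
for key, titles in groups.items():
    print(key, titles)
```

The two astronomy posts are brought together while the candy bar post stays separate, even though all three use the same word.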

Few tools or services currently exist to make use of Easy News Topics. K-Collector is one; it is a news aggregator and blogging tool built by the people responsible for the standard. This tool allows writers to add ENT metadata easily to their posts and to read it from other RSS feeds using the standard. This product is from a for-profit company and is currently in the testing phase.

Conclusion

At the present time the side effects of adding metadata will most likely be more beneficial than the number of visitors it brings to your site. The ability to influence the development of emerging Web standards and bring the concerns of libraries to the attention of standard developers is much more important than a few extra hits. The exposure to other metadata standards by cataloging personnel will bring new insights into the MARC/AACR/ISBD standards and may help those develop in a manner most useful in this new information environment. It may also provide them the ability to use less expensive metadata in some instances to the benefit of the organization. Finally, it will build much needed bridges between the library and tech communities. Metadata just might bring a few patrons to your site, some who will be impressed by your knowledgeable use of current technology.
