USGS - science for a changing world

USGS Thesaurus and Enterprise Web Document Catalog

Frequently Asked Questions about the USGS Thesaurus and catalog

What's a thesaurus?
A thesaurus is a type of controlled vocabulary, a collection of terms.

Terms represent concepts, but it is the concepts themselves and their relationships, not the terms, that constitute the thesaurus.

Terms are related to one another in three different ways:

Hierarchy
A term always has an "is a" relationship with its broader term (BT); a narrower term (NT) can always be said to be "a type of", "a part of", or "an instance of" the parent term.
Preference
For a given concept, one term is chosen as the preferred term or label, and is referred to as the descriptor. Other terms that refer to the same concept are referred to as lead-in terms or non-preferred terms. A non-preferred term is not necessarily a synonym of the preferred term.
Generic relationships
Where concepts are related in some way that cannot be expressed as an "is a" sentence, the thesaurus simply connects one term to another without specifying the nature of the relationship. This is different from more elaborate knowledge-management systems such as topic maps or ontologies in which such generic relationships are always identified and categorized.
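The three kinds of relationships described above can be pictured as a small data structure. The following is a minimal sketch, not the working group's implementation: the class name `Concept`, its field names, and the sample entries are all hypothetical and are not copied from the actual USGS Thesaurus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One thesaurus concept, labeled by its preferred term (descriptor)."""
    descriptor: str
    broader: Optional[str] = None                       # BT: hierarchy
    lead_ins: List[str] = field(default_factory=list)   # preference: non-preferred terms
    related: List[str] = field(default_factory=list)    # generic, untyped links


def narrower_terms(thesaurus: Dict[str, Concept], descriptor: str) -> List[str]:
    """Return the descriptors whose broader term (BT) is `descriptor`."""
    return sorted(c.descriptor for c in thesaurus.values() if c.broader == descriptor)


def broader_chain(thesaurus: Dict[str, Concept], descriptor: str) -> List[str]:
    """Walk BT links upward from a descriptor to its top term."""
    chain = []
    current = thesaurus[descriptor].broader
    while current is not None:
        chain.append(current)
        current = thesaurus[current].broader
    return chain


# Illustrative entries only (assumed for this sketch).
EXAMPLE = {
    c.descriptor: c
    for c in [
        Concept("natural hazards"),
        Concept("earthquakes", broader="natural hazards",
                lead_ins=["seismic events"], related=["faults"]),
        Concept("landslides", broader="natural hazards", lead_ins=["mudslides"]),
        Concept("faults"),
    ]
}

print(narrower_terms(EXAMPLE, "natural hazards"))
print(broader_chain(EXAMPLE, "earthquakes"))
```

Note that the related-term link from "earthquakes" to "faults" carries no label saying what kind of relationship it is, which matches the untyped generic relationship described above.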
Is there only one thesaurus?
No. It is helpful to use different controlled vocabularies for the purposes they serve best. Enterprise Web currently uses the following thesauri:
Type     Thesaurus name                                     Interface
theme    USGS Thesaurus                                     PHP
feature  Alexandria Digital Library Feature Type Thesaurus  PHP
place    Common geographic areas                            PHP
We are experimenting with the following thesauri:
taxa       Integrated Taxonomic Information System (ITIS)  Web
lithology  Lithclass 6.2                                   PHP
place      Coastal and Marine Geology gazetteer            PHP
place      U.S. National Parks                             PHP
place      U.S. Wildlife Refuges                           PHP
But the Science Topics page has only about 15 categories!
The introductory page for Science Topics is a static list whose arrangement does not mirror the structure of the thesaurus itself. Instead, it is designed to direct people into the thesaurus without requiring them to understand the structure of the thesaurus beforehand. Consequently that list can be modified to reflect commonly-requested terms or subjects temporarily of popular concern without distorting the logical structure of the thesaurus itself.
What is the purpose of the USGS Thesaurus?
The USGS Thesaurus is specifically intended to help people outside USGS find information on USGS web sites without specific knowledge of the organizational structure and operations of the USGS. For those inside USGS, the thesaurus provides a source of consistent index terms that spans the full range of USGS activities. Such terms can be used to refine or clarify labels and to support internet search, and the relationships among them suggest linkages across programs.

Thesaurus interfaces such as Science Topics are not intended to replace traditional search or browse interfaces. In concert with the new "USGS by State" and "About USGS" sites, they supplement existing navigational aids to USGS web information.

Who has worked on the USGS Thesaurus?
The USGS Thesaurus Working Group is composed of specialists in library and information sciences, communications, the natural sciences, scientific software development, and data management. Its purpose is to create and maintain controlled vocabularies, use those vocabularies to create catalogs and indexes, and develop methodology that will help people find and understand online USGS information resources. The group is associated with the USGS Enterprise Web project and coordinates its work with other project tasks as appropriate.
Name              Organization     Expertise
Linda Broussard   Biology-Library  Life sciences, records management
Wendy Danchuk     Hydrology        Cartography, publications
Jeff Dietterle    GIO-EWeb
Dave Govoni       GIO-EWeb         Earth sciences, Information architecture
Irena Kavalek     GIO-Library      Cataloging & indexing
Peter Schweitzer  Geology          Earth science, software development

Former personnel

The following people have worked with the group at various times in the past. Their influence is substantial.

Name              Organization     Expertise
USGS employees
Hylan Beydler     Geography-MCMC   Land characterization
Nancy Blair       GIO-Library      Library coordination, cataloging & indexing
Pamela Callais    GIO-Library      Cataloging & indexing
Liz Ciganovich    Water-CAPP       Publications
Carmelo Ferrigno  GIO-EWeb         Information architecture & design
Karen Kaye        Biology          Information architecture
Celso Puente      Water            Hydrology
Gary Waggoner     Biology-CBI      Life sciences
Gail Wendt        Communications   Hydrology, communication, publications
Consultants and outside reviewers
Linda Hill        Alexandria Digital Library, UC Santa Barbara
Gail Hodge        Information International Associates, Inc.
Candy Schwartz    Graduate School of Library and Information Sciences, Simmons College
Jessica Milstead  The JELEM Company
Amy Warner        Lexonomy Information Architecture Consulting
How was the thesaurus developed? What other vocabularies did you consult?

Philosophy

Search alone is not sufficient to help people find information. Applications intended to help people find information must also help people understand the scientific, technical, and business context in which it is meaningful. People do not in any usable sense find information without knowing what it is they have found and how it relates to other information.

Design goals

  1. The USGS Thesaurus is designed to conform with a recognized standard, ANSI/NISO Z39.19. This standard has been in widespread use throughout the information science community for many years.
  2. The thesaurus is broad and shallow. It is not intended to enumerate or distinguish the fine details of USGS science, and it is not intended to duplicate detailed search within a scientific database on a particular topic that would ordinarily be provided by a web site developer.
  3. The thesaurus is explicitly intended for use in a web browsing environment. Consequently it is strictly hierarchical. No term has more than one broader term; alternative broader terms are shown as related terms instead. Also the number of top terms is intentionally kept small to enable browse interfaces to function well.
  4. The thesaurus is monolingual. Foreign-language equivalents are possible in principle but have not been incorporated into the current design.
  5. The thesaurus is intended to cover only those facets of information for which other controlled vocabularies were either not available or were not optimal for categorizing USGS information. Consequently the thesaurus does not include place names, types of named geographic features, detailed biological taxonomy, chemical and mineral names, USGS publication series names, or names of organizational units and programs.
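Design goal 3 (strict hierarchy, with no term having more than one broader term) lends itself to a mechanical check. The sketch below assumes the hierarchy is available as a list of (term, broader term) pairs; that representation, the function names, and the sample terms are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_multiple_parents(bt_pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return each term that was given more than one broader term."""
    parents = defaultdict(set)
    for term, broader in bt_pairs:
        parents[term].add(broader)
    return {t: sorted(p) for t, p in parents.items() if len(p) > 1}


def top_terms(bt_pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return terms used as a broader term that have no broader term themselves."""
    pairs = list(bt_pairs)
    has_parent = {term for term, _ in pairs}
    return sorted({broader for _, broader in pairs} - has_parent)


# Hypothetical pairs: "floods" has been given two broader terms.
PAIRS = [
    ("earthquakes", "natural hazards"),
    ("floods", "natural hazards"),
    ("floods", "water resources"),
]

print(find_multiple_parents(PAIRS))
print(top_terms(PAIRS))
```

Under the design described above, a violation such as the one flagged for "floods" would be resolved by keeping one broader term and recording the alternative as a related term.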

Development methods

Specialists recognize two different strategies for building controlled vocabularies: top-down, in which terms and their relationships are defined intuitively prior to their direct application in an indexing situation; and bottom-up, in which terms and relationships are added to the vocabulary in the process of indexing. But the same specialists also recognize that most vocabularies are developed using a combination of these two abstract approaches. We developed the USGS thesaurus using this combined strategy. Beginning by simply listing lots of important terms, we grouped those terms using a card-sorting procedure, and then refined the hierarchy with intuitive processes (that is, by relying on what we know). Subsequent revisions have occurred by group deliberation.

Preliminary development of the thesaurus was conducted by a contractor using commercial software (MultiTES). Subsequent development and revision have occurred in a web-based database application that the group developed to meet the specific needs of this project.

Review of existing controlled vocabularies

We examined many similar controlled vocabularies of various types before and during this process. Examples are the GEOREF thesaurus produced by the American Geological Institute, the CERES thesaurus (http://ceres.ca.gov/thesaurus/), the Geographic Names Information System (GNIS), the Integrated Taxonomic Information System (ITIS), the categorization scheme used in the Marine Realms Information Bank (http://mrib.usgs.gov/), and numerous smaller or more specialized vocabularies such as glossaries of scientific and technical terms presented on USGS web sites.

Why isn't the thesaurus divided first by USGS disciplines or regions or programs?
  1. They would imply an isolation of scientific focus that does not accurately reflect the cross-disciplinary work of our researchers.
  2. They mean different things to people inside and outside USGS.
  3. They change. Every year, some programs change names, combine with others, drop out, reappear, or change in scope or focus.
  4. They reflect political and organizational concerns of USGS, DoI, the executive and legislative branches of the government, and sometimes of state and local governments, rather than the concerns of scientists or citizens outside of the political process and organizational politics.
  5. USGS has developed a specialized interface (About USGS) to explain how the Bureau is organized and how it carries out its work as an agency of the government.

We do not deny that the names of funded programs and organizational units are important. We assert instead that this sort of information answers different questions than the ones to which our work is directed. Descriptions of our organization and programs answer the question "How does USGS describe its own organization now and how does it logically group its research and monitoring work now?"

We view these issues from a longer-term perspective. Programs and organizations within USGS have changed and will continue to change through time. What matters in describing the results of our scientific research and monitoring activities is what we studied, how we studied it, and what we found.

But it is possible to describe USGS organizational structures using a formal thesaurus (here is an example) and this can be seen as one way to categorize some of our information, in particular organizational web sites and pages that describe specific funded program activities. We assert that this is not the only way to categorize USGS science, and that a bureaucratic view is especially unsuitable for use by people outside the government.

Why don't you just let the biologists develop the biological terms, the hydrologists develop the hydrological terms, and so on?
Experts in knowledge organization systems repeatedly emphasize the need for consistency in the choice of terms, in how the terms are related to one another, and in how they are explained. Past experience and, indeed, a survey of the existing USGS web sites lead us to believe firmly that we can attain consistency only by using an interdisciplinary team that takes an expansive view of its task and draws on expertise in library and information sciences to bring the disparate perspectives of the scientific discipline specialists together.
What should I do when the term I use isn't in the thesaurus?
First make sure that it really isn't there. Try variants of spelling or punctuation in the topic search box, and look at the other controlled vocabularies. For example, types of geographic features will generally not appear in the USGS thesaurus because they are contained in the Alexandria Digital Library Feature Type Thesaurus, which we use as well.

When you find a concept that does not seem to be represented in the thesaurus, or a variant of phrasing or punctuation that you feel is common enough to be a lead-in term, please contact the thesaurus working group of the Enterprise Web project. The thesaurus editorial group can be reached at GS_Thesaurus@usgs.gov.

Note that some words or phrases may appear in the thesaurus as "lead-in" terms, meaning terms that a user might enter to find an appropriate descriptor. Lead-in terms will generally refer to the same concept as the descriptor, and may be synonyms. However in some cases a lead-in term is a more specific concept that hasn't been designated a separate category. These lead-in terms are sometimes referred to as "non-preferred" terms in the literature of the library community.
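The way a lead-in term leads a user to a descriptor, and the advice to try variants of spelling or punctuation, can be illustrated with a small lookup. This is a sketch under stated assumptions: the normalization rule (lowercase, punctuation treated as spaces), the function names, and the sample terms are hypothetical and do not describe how the topic search box actually works.

```python
import re
from typing import Dict, List, Optional


def normalize(text: str) -> str:
    """Lowercase and turn runs of punctuation or spaces into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(descriptors: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each normalized descriptor and lead-in term to its descriptor."""
    index: Dict[str, str] = {}
    for descriptor, lead_ins in descriptors.items():
        index[normalize(descriptor)] = descriptor       # descriptors always win
        for term in lead_ins:
            index.setdefault(normalize(term), descriptor)
    return index


def resolve(index: Dict[str, str], query: str) -> Optional[str]:
    """Return the descriptor for a query, or None if nothing matches."""
    return index.get(normalize(query))


# Illustrative vocabulary only (assumed for this sketch).
INDEX = build_index({
    "ground water": ["groundwater", "ground-water"],
    "landslides": ["mudslides"],
})

print(resolve(INDEX, "Ground-Water"))
print(resolve(INDEX, "mudslides"))
print(resolve(INDEX, "volcanoes"))
```

A query that resolves to nothing, like the last one here, is the case in which it makes sense to check the other controlled vocabularies before proposing a new term.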

How is the thesaurus stored? Can I get a copy of it?
The USGS thesaurus and the other controlled vocabularies we use are stored in a relational database. The structure of the database is described in detail here. This database system can be accessed by using a web browser or with relational database software (such as Microsoft Access) using an ODBC driver.
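To give a feel for how a thesaurus can live in a relational database, here is a minimal sketch using SQLite from Python. The table names, column names, and sample rows are assumptions made for illustration; they are not the actual database structure referred to above.

```python
import sqlite3

# Hypothetical schema: one row per descriptor, one row per lead-in term.
SCHEMA = """
CREATE TABLE term (
    code   INTEGER PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE,
    parent INTEGER REFERENCES term(code),  -- the single broader term, if any
    scope  TEXT                            -- scope note
);
CREATE TABLE lead_in (
    name TEXT NOT NULL,
    code INTEGER NOT NULL REFERENCES term(code)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO term (code, name, parent) VALUES (?, ?, ?)",
    [(1, "natural hazards", None), (2, "earthquakes", 1), (3, "landslides", 1)],
)
conn.execute("INSERT INTO lead_in (name, code) VALUES (?, ?)", ("mudslides", 3))


def narrower_names(db, descriptor):
    """List the narrower terms of a descriptor, alphabetically."""
    rows = db.execute(
        "SELECT child.name FROM term AS child "
        "JOIN term AS parent ON child.parent = parent.code "
        "WHERE parent.name = ? ORDER BY child.name",
        (descriptor,),
    ).fetchall()
    return [row[0] for row in rows]


def descriptor_for(db, lead_in_term):
    """Return the descriptor that a lead-in term points to, or None."""
    row = db.execute(
        "SELECT term.name FROM lead_in JOIN term ON lead_in.code = term.code "
        "WHERE lead_in.name = ?",
        (lead_in_term,),
    ).fetchone()
    return row[0] if row else None


print(narrower_names(conn, "natural hazards"))
print(descriptor_for(conn, "mudslides"))
```

Storing a single `parent` column per term is one simple way to reflect a strictly hierarchical design; a real schema would also need tables for related terms and catalog records.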

Alphabetical listings of the thesauri are generated dynamically:

ITIS is maintained separately by the USDA.

How and when will the thesaurus be revised?
To maintain continuity and consistency, revisions must be presented to the group through an established process and must be consistent with and reflect the nature and purpose of the thesaurus. Revisions will be considered by the Thesaurus Working Group at regular meetings. Revisions may be delegated to subgroups to complete, but such delegation will be done explicitly by the group.

The version currently online is version 2.0. Revisions and changes are sparing but occur continuously.

How should changes be proposed?
Proposed changes to the thesaurus, catalog, or Science Topics interface are dramatically more likely to meet with the approval of the editorial board if they are written with an understanding of the system and its components. Generally changes are of the following types:
Create or modify a descriptor
The descriptor is the text labelling a concept. We will need a good definition or scope statement in order to understand how the new or modified descriptor should be used.
Modify a descriptor's scope note
Scope notes can be modified to reflect USGS interests in the topic or to correct misstatements in existing notes. The text should be informative for the public.
Add a lead-in term for a descriptor
Since many scientific concepts are referred to by many different terms, those that are in common use can be associated with the descriptor so that people who enter the text in a search interface might find the category.
Create a see-also link (relate terms)
Concepts that aren't of the same type or aren't in an "is a" relationship can be linked as related terms (RT); this enables additional connections to be made by users.
Create or remove a catalog record for a web site
Catalog records should be created for informative web pages or sites that address important issues or provide useful scientific information. When suggesting a site, feel free to suggest index terms that might be assigned to it.
Assign or remove an index term from a catalog record
Items appear in the Science Topics web site under a given descriptor when the catalog record has been assigned the index term or one of its narrower terms (more specifically any term in the hierarchy below the descriptor). Catalog records are normally assigned the most specific index term that applies, but these can be changed easily.
Changes in the behavior or appearance of the interface
The Science Topics interface is dynamic in the sense that its content is drawn from the database whenever users click its links. For consistency its behavior and appearance are kept constant when the user navigates the hierarchy. Technical details of its operation are available for those who would like to provide constructive comments or suggest improvements in its operation.
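The rule given above for index terms (a catalog record appears under a descriptor when it has been assigned that term or any term below it in the hierarchy) can be sketched as follows. The dictionary layouts, function names, and sample records are hypothetical.

```python
from typing import Dict, List, Set


def descendants(broader_of: Dict[str, str], descriptor: str) -> Set[str]:
    """Return `descriptor` plus every term below it in the hierarchy."""
    found = {descriptor}
    changed = True
    while changed:  # repeat until no new narrower terms turn up
        changed = False
        for term, parent in broader_of.items():
            if parent in found and term not in found:
                found.add(term)
                changed = True
    return found


def records_for(descriptor: str,
                broader_of: Dict[str, str],
                index_terms: Dict[str, Set[str]]) -> List[str]:
    """List catalog records indexed with `descriptor` or any narrower term."""
    wanted = descendants(broader_of, descriptor)
    return sorted(title for title, terms in index_terms.items() if terms & wanted)


# Illustrative hierarchy (term -> broader term) and catalog (title -> index terms).
BROADER_OF = {
    "earthquakes": "natural hazards",
    "aftershocks": "earthquakes",
    "landslides": "natural hazards",
}
CATALOG = {
    "Aftershock forecast overview": {"aftershocks"},
    "Landslide basics": {"landslides"},
    "Streamflow data": {"streamflow"},
}

print(records_for("natural hazards", BROADER_OF, CATALOG))
print(records_for("earthquakes", BROADER_OF, CATALOG))
```

Because records are normally given the most specific term that applies, a lookup like this is what lets a record indexed under a narrow term still surface when a user browses a broader category.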

Catalog

What documents will be in the catalog?
Generally we have tried to put into the catalog web resources that describe important scientific concepts or report results of important research of USGS. In some cases we also include home pages of USGS organizational units if they provide links to studies that are important to the organization.

The initial collection of records in the catalog was chosen specifically to include a wide variety of different types of USGS information resources. Consequently the results for a given term may include web portals or educational materials alongside highly specialized scientific reports, sites with lots of graphics or data and sites with very few. This diversity has been helpful to the development group by challenging us to assign index terms fairly and consistently. Because it takes a while to index a document well, we tried to restrict our attention to relevant, high-quality information resources. As a result we expect that many of the current records will stay in the catalog. However we believe there are some that may be out of date or for some other reason need to be revisited by their originating organizations, and we anticipate some may disappear.

A related question is "who decides what is to be cataloged". In general, collection development for the catalog, as in a library, is expected to rest with the library professionals in our group. However we are open to suggestion and discussion regarding resources that might be included in or removed from the collection.

How are results ranked in Science Topics?
The Thesaurus development group has long recognized an undesirable arbitrariness in the manner in which results are shown to users of the Science Topics interface of the USGS home page.

With the assistance of a few members of the USGS Web Advisory Group or their designees, the thesaurus group has come to agreement on a practical solution to this problem.

Original implementation:

There is no objective measure of relevance, so the database and application software we use cannot automatically arrange results in order of decreasing relevance.

As a rough approximation, the Science Topics interface arranged results in order according to how well the title of an entry matched the category name. Specifically, web sites appeared higher when

  1. title matches the category name,
  2. title begins with the category name,
  3. title contains the category name,
  4. title matches any of the lead-in terms,
  5. title begins with any of those terms,
  6. title contains any of those terms.

All other records were listed alphabetically by title.
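The six title-match rules and the alphabetical fallback can be sketched as a sort key. This is an illustration rather than the code used by Science Topics: case-insensitive matching and alphabetical ordering within each rank are assumptions, and the function names and sample titles are hypothetical.

```python
from typing import List, Sequence


def title_rank(title: str, category: str, lead_ins: Sequence[str]) -> int:
    """Return 1-6 for the first title-match rule that applies, else 7."""
    t = title.lower()
    c = category.lower()
    leads = [term.lower() for term in lead_ins]
    if t == c:
        return 1
    if t.startswith(c):
        return 2
    if c in t:
        return 3
    if any(t == term for term in leads):
        return 4
    if any(t.startswith(term) for term in leads):
        return 5
    if any(term in t for term in leads):
        return 6
    return 7


def order_results(titles: Sequence[str], category: str,
                  lead_ins: Sequence[str]) -> List[str]:
    """Sort titles by match rank, then alphabetically by title."""
    return sorted(titles, key=lambda s: (title_rank(s, category, lead_ins), s.lower()))


TITLES = [
    "Recent earthquakes map",
    "Earthquakes",
    "Seismic events in Alaska",
    "Earthquakes and faults",
    "Volcano watch",
    "Alaska geology",
]

for title in order_results(TITLES, "earthquakes", ["seismic events"]):
    print(title)
```

The sketch makes the weakness of the approach easy to see: a title earns its place purely from its wording, with no regard for the quality or importance of the site behind it.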

In sympathy with concerns voiced by highly-placed people within the discipline offices, we created a single special category of catalog entries. Those records declared to be of high priority by a single representative within each science discipline (designated by the chief scientist of the discipline) would be given rank zero and thus would appear before any other records in the results lists of all terms assigned to those entries.

Concerns with the original implementation:

With only one category of high-priority records, there could only be a small number of items so designated.

For records other than those designated high-priority, the ranking criterion (based on words in the title) is simplistic and therefore might not reflect other values by which the sites could be judged.

In early 2005 when the high-priority records methodology was announced, each discipline was asked to provide a small number (6 or so) of sites that should be so designated. Exactly one discipline (WRD) provided a response. For the other disciplines we assumed the home pages of the scientific programs would be the high-priority pages.

Alternative methods considered by the group

  1. Rank results manually (all results for all terms)
  2. Rank results manually, but selectively and sparingly
  3. Create additional special categories
  4. Categorize web sites based on their meeting specified criteria

The group settled on an implementation of method 2, with the following standard operating procedures:

  1. We specify the ranks. We're willing to take input from others but this task has to be done by this group (as a practical matter, this changes the database and so is something only valid database users can do).
  2. Order will be modified only if the number of results is large. In our current implementation I've taken "large" to be more than 20, since that causes the results to be put on more than one page.
  3. Only the top 20 spots can be assigned manually. Everything else is arranged using the current word-match algorithm.
  4. While it's reasonable to place "better" resources higher in the top 20, the important factor is to put the right things into the top 10. We should not entertain extensive argument about the exact ranks within the top 10.
  5. We don't expect to be deluged with requests for changes. Many terms have fewer than 20 records, and most terms for which there are many records are actually poor discriminators (terms that indicate very common concerns or activities).
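Procedures 2 and 3 above can be sketched as a thin layer on top of the word-match ordering. The sketch assumes that manual ranks are stored as integers (lower means earlier) and that manually ranked items simply fill the leading positions; the function name, the `manual` dictionary, and the sample titles are hypothetical.

```python
from typing import Dict, List

PAGE_SIZE = 20  # "large" means more than 20 results; only 20 spots are manual


def apply_manual_ranks(auto_ordered: List[str],
                       manual: Dict[str, int],
                       page_size: int = PAGE_SIZE) -> List[str]:
    """Put manually ranked titles first, but only when the list is large."""
    if len(auto_ordered) <= page_size:
        return list(auto_ordered)  # small lists keep the word-match order
    pinned = sorted(
        (title for title in auto_ordered if title in manual),
        key=lambda title: manual[title],
    )[:page_size]
    pinned_set = set(pinned)
    rest = [title for title in auto_ordered if title not in pinned_set]
    return pinned + rest


# 25 placeholder titles already arranged by the word-match algorithm.
AUTO = [f"site {i:02d}" for i in range(25)]
MANUAL = {"site 24": 1, "site 10": 2}

print(apply_manual_ranks(AUTO, MANUAL)[:4])
print(apply_manual_ranks(["a", "b"], {"b": 1}))
```

Everything that is not manually ranked keeps the position the word-match algorithm gave it relative to the other unranked items.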

Consequences of this approach

The main consequence of this approach is that we are able to address concerns with the order of results that are raised by USGS people. We have provided guidelines that limit the procedure to cases where it matters most, and we don't anticipate trouble that we cannot solve by discussion with the site owners.

Where should key words be included in order to ensure that our pages are picked up?

The wording of the question indicates there might be a misunderstanding lurking beneath, so I want to clarify one point first. The thesaurus is not really intended to improve the action of external search engines. The USGS search engine might be tuned to look at keywords in HTML pages (the <meta> tags in the HTML header), but its function will probably continue to be focused on providing full-text search. [USGS guidelines for use of keywords in HTML metadata are currently undergoing revision.] Index terms can indeed be put into web pages, and that may be helpful, but it is not the primary focus of our work to detect those index terms and act on them. Instead, we believe the power of the thesaurus can only be exploited by web interfaces designed specifically to work with index terms and catalog records.

That said, some general guidance can be offered. First and foremost is to specify page titles well. Use clear terms and do not assume that every reader knows the scope within which your page appears. Include county and state in the title if your page is about something in a specific county or state.

Second, consider using preferred terms from the thesaurus if they fit your meaning. Use of terms that are designated as descriptors or lead-in terms in the thesaurus will make it more likely that a search of the USGS web will connect what you have written with what other people have written about the same subject. This applies primarily to headings on the page and the <title> element of the HTML header.

Third, if you manage a glossary or other online vocabulary, consider which terms in your vocabulary link to thesaurus terms and how they relate. Link where it is practical, clarify where your usage is different, and engage the thesaurus working group in conversation about terms on whose meaning or usage we appear to disagree.

Why don't you showcase my (our) program or project?
Part of the problem, in my opinion, is that we have two rather different purposes for putting information on the web:
  • helping people understand the earth, things in it, and processes occurring on it
  • helping people understand and appreciate our work for the Nation

I think both of these motivations are honorable, and as I sit in an SIR program facing a $25M cut next FY, the desire for the second goal to be met is palpable in our hallways.

My chief concern is that these are rather different kinds of information, and I think they shouldn't be mixed up too much, because when someone is trying to understand some process like landslides or invasive species, it's a real distraction to have to keep reading "advertisements" from the organization that's providing you with the information. A cleaner separation of this sort of information will serve everyone's interests better. And I do believe that when we establish a reputation for giving people the right information about natural phenomena, those who understand it will ask how we are working to provide the information, keep it current, and push the boundaries of our understanding through research. But I think for many people this happens only after they get from us what they need to do their own work. To me "relevance" is just that--our results help them do what they need to do, as opposed to just understanding what we do or did.

So if I could step onto the soap-box briefly, I'd exhort our web designers to make their sites and pages answer the two types of questions clearly but separately, and maybe do so by calling out those questions explicitly.

What guidance should we give our research project leads to help them better configure their information so that it complements the Bureau page?
First and foremost, decide whether the page or site is about us or is about the things we study. If you want to build a site that explains our programs or projects, go ahead but don't stop with that. Write pages that help people understand the things you've studied; what they are, how they work, how they relate to other things, both natural and manmade. But do so without overloading the information with advertisements for the USGS.

Second, write good titles and labels. You need to strike a balance between assuming that people have visited many of your pages and assuming that they've dropped in without knowing anything about us. Plain language helps, and with practice, works well even for complex scientific concepts.

Don't assume people know the organizational context of your work. If you have a site describing samples, say what kind of samples they are, and possibly identify the program for which the samples were collected. Don't just call the site "Sample Information"--USGS collects lots of samples, not just yours.

Third, identify the main points where people can learn from your site. Apply metadata to those few pages, and point them out to the thesaurus team so we can catalog them.

URL: http://geo-nsdi.er.usgs.gov/thesaurus/catalog/faq/index.shtml
Page Contact Information: Peter Schweitzer
Page Last Modified: Tuesday, 20-Nov-2007 11:10:47 EST