USGS Thesaurus and Enterprise Web Document Catalog
Terms represent concepts, but it is the concepts themselves and their relationships, not the terms, that constitute the thesaurus.
Terms are related to one another in three different ways:
| Type | Thesaurus name | Interface |
|---|---|---|
| theme | USGS Thesaurus | PHP |
| feature | Alexandria Digital Library Feature Type Thesaurus | PHP |
| place | Common geographic areas | PHP |
| We are experimenting with the following thesauri: | ||
| taxa | Integrated Taxonomic Information System (ITIS) | Web |
| lithology | Lithclass 6.2 | PHP |
| place | Coastal and Marine Geology gazetteer | PHP |
| place | U.S. National Parks | PHP |
| place | U.S. Wildlife Refuges | PHP |
Thesaurus interfaces such as Science Topics are not intended to replace traditional search or browse interfaces. In concert with the new "USGS by State" and "About USGS" sites, they supplement existing navigational aids to USGS web information.
| Name | Organization | Expertise |
|---|---|---|
| Linda Broussard | Biology-Library | Life sciences, records management |
| Wendy Danchuk | Hydrology | Cartography, publications |
| Jeff Dietterle | GIO-EWeb | |
| Dave Govoni | GIO-EWeb | Earth sciences, Information architecture |
| Irena Kavalek | GIO-Library | Cataloging & indexing |
| Peter Schweitzer | Geology | Earth science, software development |
The following people have worked with the group at various times in the past. Their influence is substantial.
| Name | Organization | Expertise |
|---|---|---|
| USGS employees | ||
| Hylan Beydler | Geography-MCMC | Land characterization |
| Nancy Blair | GIO-Library | Library coordination, cataloging & indexing |
| Pamela Callais | GIO-Library | Cataloging & indexing |
| Liz Ciganovich | Water-CAPP | Publications |
| Carmelo Ferrigno | GIO-EWeb | Information architecture & design |
| Karen Kaye | Biology | Information architecture |
| Celso Puente | Water | Hydrology |
| Gary Waggoner | Biology-CBI | Life sciences |
| Gail Wendt | Communications | Hydrology, communication, publications |
| Consultants and outside reviewers | ||
| Linda Hill | Alexandria Digital Library, UC Santa Barbara | |
| Gail Hodge | Information International Associates, Inc. | |
| Candy Schwartz | Graduate School of Library and Information Sciences, Simmons College | |
| Jessica Milstead | The JELEM Company | |
| Amy Warner | Lexonomy Information Architecture Consulting | |
Specialists recognize two different strategies for building controlled vocabularies: top-down, in which terms and their relationships are defined intuitively prior to their direct application in an indexing situation; and bottom-up, in which terms and relationships are added to the vocabulary in the process of indexing. But the same specialists also recognize that most vocabularies are developed using a combination of these two abstract approaches. We developed the USGS thesaurus using this combined strategy. Beginning by simply listing lots of important terms, we grouped those terms using a card-sorting procedure, and then refined the hierarchy with intuitive processes (that is, by relying on what we know). Subsequent revisions have occurred by group deliberation.
A contractor carried out preliminary development of the thesaurus using commercial software (MultiTES). Subsequent development and revision have occurred in a web-based database application that the group developed to meet the specific needs of this project.
We examined many similar controlled vocabularies of various types before and during this process. Examples are the GEOREF thesaurus produced by the American Geological Institute, the CERES thesaurus (http://ceres.ca.gov/thesaurus/), the Geographic Names Information System (GNIS), the Integrated Taxonomic Information System (ITIS), the categorization scheme used in the Marine Realms Information Bank (http://mrib.usgs.gov/), and numerous smaller or more specialized vocabularies such as glossaries of scientific and technical terms presented on USGS web sites.
We do not deny that the names of funded programs and organizational units are important. We assert instead that this sort of information answers different questions than the ones to which our work is directed. Descriptions of our organization and programs answer the question "How does USGS describe its own organization now and how does it logically group its research and monitoring work now?"
We view these issues from a longer-term perspective. Programs and organizations within USGS have changed and will continue to change through time. What matters in describing the results of our scientific research and monitoring activities is what we studied, how we studied it, and what we found.
But it is possible to describe USGS organizational structures using a formal thesaurus (here is an example) and this can be seen as one way to categorize some of our information, in particular organizational web sites and pages that describe specific funded program activities. We assert that this is not the only way to categorize USGS science, and that a bureaucratic view is especially unsuitable for use by people outside the government.
When you find a concept that does not seem to be represented in the thesaurus, or a variant of phrasing or punctuation that you feel is common enough to be a lead-in term, please contact the thesaurus working group of the Enterprise Web project. The thesaurus editorial group can be reached at GS_Thesaurus@usgs.gov.
Note that some words or phrases may appear in the thesaurus as "lead-in" terms, meaning terms that a user might enter to find an appropriate descriptor. Lead-in terms will generally refer to the same concept as the descriptor, and may be synonyms. However, in some cases a lead-in term is a more specific concept that hasn't been designated a separate category. These lead-in terms are sometimes referred to as "non-preferred" terms in the literature of the library community.
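To make the relationship between lead-in terms and descriptors concrete, here is a minimal sketch in Python. It is not the working group's database application; the dictionary layout, the function names `build_index` and `resolve`, and the sample terms are all assumptions made for illustration.

```python
def build_index(thesaurus):
    """Map each descriptor and each of its lead-in terms to the descriptor.

    `thesaurus` is a dict of descriptor -> list of lead-in terms
    (a hypothetical layout chosen for this example).
    """
    index = {}
    for descriptor, lead_ins in thesaurus.items():
        index[descriptor.casefold()] = descriptor
        for term in lead_ins:
            index[term.casefold()] = descriptor
    return index


def resolve(term, index):
    """Return the descriptor that a user's term leads to, or None if absent."""
    return index.get(term.strip().casefold())


# Hypothetical entries, not taken from the USGS Thesaurus.
thesaurus = {
    "earthquakes": ["seismic events", "tremors"],
    "volcanoes": ["volcanos"],
}
index = build_index(thesaurus)

print(resolve("Tremors", index))    # earthquakes
print(resolve("volcanos", index))   # volcanoes
print(resolve("glaciers", index))   # None
```

A term that resolves to `None` is the kind of gap worth reporting: it may be a missing concept or a common variant that deserves to become a lead-in term.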
Alphabetical listings of the thesauri are generated dynamically:
ITIS is maintained separately by the USDA.
The version currently online is version 2.0. Revisions and changes are sparing but occur continuously.
The initial collection of records in the catalog was chosen specifically to include a wide variety of different types of USGS information resources. Consequently, the results for a given term may include web portals or educational materials alongside highly specialized scientific reports, sites with lots of graphics or data and sites with very few. This diversity has been helpful to the development group by challenging us to assign index terms fairly and consistently. Because it takes a while to index a document well, we tried to restrict our attention to relevant, high-quality information resources. As a result, we expect that many of the current records will stay in the catalog. However, we believe there are some that may be out of date or for some other reason need to be revisited by their originating organizations, and we anticipate some may disappear.
A related question is "Who decides what is to be cataloged?" In general, collection development for the catalog, as for a library, is expected to rest with the library professionals in our group. However, we are open to suggestion and discussion regarding resources that might be included in or removed from the collection.
With the assistance of a few members of the USGS Web Advisory Group or their designees, the thesaurus group has come to agreement on a practical solution to this problem.
Original implementation:
There is no objective measure of relevance, so the database and application software we use cannot automatically arrange results in order of decreasing relevance.
As a rough approximation, the Science Topics interface arranged results in order according to how well the title of an entry matched the category name. Specifically, web sites appeared higher when
In sympathy with concerns voiced by highly-placed people within the discipline offices, we created a single special category of catalog entries. Those records declared to be of high priority by a single representative within each science discipline (designated by the chief scientist of the discipline) would be given rank zero and thus would appear before any other records in the results lists of all terms assigned to those entries.
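The ordering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Science Topics code: the exact title-matching criteria are not given here, so the sketch substitutes a simple word-overlap score, and the record fields (`title`, `high_priority`) and sample records are hypothetical.

```python
import re


def words(text):
    """Lower-case alphanumeric words in a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def title_match(title, category):
    """Fraction of the category name's words found in the title (assumed measure)."""
    category_words = set(words(category))
    if not category_words:
        return 0.0
    return len(category_words & set(words(title))) / len(category_words)


def order_results(records, category):
    """Put high-priority records (rank zero) first, then sort by decreasing title match."""
    def key(record):
        rank = 0 if record.get("high_priority") else 1
        return (rank, -title_match(record["title"], category), record["title"].lower())
    return sorted(records, key=key)


# Hypothetical catalog records, for illustration only.
records = [
    {"title": "Water resources of the example basin"},
    {"title": "Ground water atlas"},
    {"title": "Program home page", "high_priority": True},
    {"title": "Ground water quality"},
]
for record in order_results(records, "ground water"):
    print(record["title"])
```

In this sketch the rank-zero record appears first even though its title shares no words with the category name, which is the behavior the high-priority category was meant to provide.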
Concerns with the original implementation:
With only one category of high-priority records, there could only be a small number of items so designated.
For records other than those designated high-priority, the ranking criterion (based on words in the title) is simplistic and therefore might not reflect other values by which the sites could be judged.
In early 2005 when the high-priority records methodology was announced, each discipline was asked to provide a small number (6 or so) of sites that should be so designated. Exactly one discipline (WRD) provided a response. For the other disciplines we assumed the home pages of the scientific programs would be the high-priority pages.
Alternative methods considered by the group
The group settled on an implementation of method 2, with the following standard operating procedures:
Consequences of this approach
The main consequence of this approach is that we are able to address concerns with the order of results that are raised by USGS people. We have provided guidelines that limit the procedure to cases where it matters most, and we don't anticipate trouble that we cannot solve by discussion with the site owners.
The wording of the question indicates there might be a misunderstanding lurking beneath, so I want to clarify one point first. The thesaurus is not really intended to improve the action of external search engines. The USGS search engine might be tuned to look at keywords in HTML pages (the <meta> tags in the HTML header), but its function will probably continue to be focused on providing full-text search. [USGS guidelines for use of keywords in HTML metadata are currently undergoing revision.] Index terms can indeed be put into web pages, and that may be helpful, but it is not the primary focus of our work to detect those index terms and act on them. Instead, we believe the power of the thesaurus can only be exploited by web interfaces designed specifically to work with index terms and catalog records.
That said, some general guidance can be offered. First and foremost is to specify page titles well. Use clear terms and do not assume that every reader knows the scope within which your page appears. Include county and state in the title if your page is about something in a specific county or state.
Second, consider using preferred terms from the thesaurus if they fit your meaning. Use of terms that are designated as descriptors or lead-in terms in the thesaurus will make it more likely that a search of the USGS web will connect what you have written with what other people have written about the same subject. This applies primarily to headings on the page and the <title> element of the HTML header.
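As a rough way to check a page against this advice, the following Python sketch lists which thesaurus terms occur in a page's `<title>` element and headings. It uses only the standard library's `html.parser`. The class name, the function `terms_used`, the sample page, and the term list are assumptions for illustration, and the matching is a plain whole-word comparison rather than anything the USGS search engine is known to do.

```python
import re
from html.parser import HTMLParser


class TitleAndHeadingParser(HTMLParser):
    """Collect the text of the <title> element and of <h1>-<h6> headings."""

    TRACKED = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._open = []   # tracked tags that are currently open
        self.texts = []   # text found inside tracked tags

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open and data.strip():
            self.texts.append(data.strip())


def terms_used(html, terms):
    """Return, sorted, the terms that appear as whole words in the title or headings."""
    parser = TitleAndHeadingParser()
    parser.feed(html)
    text = " ".join(parser.texts).casefold()
    return sorted(
        t for t in terms
        if re.search(r"\b" + re.escape(t.casefold()) + r"\b", text)
    )


# Hypothetical page and term list, for illustration only.
page = """<html><head><title>Landslides in Example County, Example State</title></head>
<body><h1>Landslide hazards</h1><p>Earthquakes are mentioned only in body text.</p></body></html>"""
terms = ["landslides", "earthquakes", "hazards"]

print(terms_used(page, terms))   # ['hazards', 'landslides']
```

A term that appears only in body text, like "earthquakes" in the sample page, is not reported, since the guidance above applies primarily to headings and the `<title>` element.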
Third, if you manage a glossary or other online vocabulary, consider which terms in your vocabulary link to thesaurus terms and how they relate. Link where it is practical, clarify where your usage is different, and engage the thesaurus working group in conversation about terms on whose meaning or usage we appear to disagree.
I think both of these motivations are honorable, and as I sit in an SIR program facing a $25M cut next FY, the desire for the second goal to be met is palpable in our hallways.
My chief concern is that these are rather different kinds of information, and I think they shouldn't be mixed up too much, because when someone is trying to understand some process like landslides or invasive species, it's a real distraction to have to keep reading "advertisements" from the organization that's providing you with the information. A cleaner separation of this sort of information will serve everyone's interests better. And I do believe that when we establish a reputation for giving people the right information about natural phenomena, those who understand it will ask how we are working to provide the information, keep it current, and push the boundaries of our understanding through research. But I think for many people this happens only after they get from us what they need to do their own work. To me "relevance" is just that--our results help them do what they need to do, as opposed to just understanding what we do or did.
So if I could step onto the soap-box briefly, I'd exhort our web designers to make their sites and pages answer the two types of questions clearly but separately, and maybe do so by calling out those questions explicitly.
Second, write good titles and labels. You need to strike a balance between assuming that people have visited many of your pages and assuming that they've dropped in without knowing anything about us. Plain language helps, and with practice, works well even for complex scientific concepts.
Don't assume people know the organizational context of your work. If you have a site describing samples, say what kind of samples they are, and possibly identify the program for which the samples were collected. Don't just call the site "Sample Information"--USGS collects lots of samples, not just yours.
Third, identify the main points where people can learn from your site. Apply metadata to those few pages, and point them out to the thesaurus team so we can catalog them.