USGS - science for a changing world

USGS Thesaurus and Enterprise Web Document Catalog

Catalog documentation

Peter N. Schweitzer (U.S. Geological Survey, Reston VA 20192)

The catalog database is currently managed using the MySQL relational database management system. Access to the catalog is built using PHP, a server-side interpreted scripting language in which the code that carries out database searches and other such functions is contained in Web pages.

This document describes separately the structure of the database underlying the catalog and the functional interfaces that enable people to request information contained in it. Functional interfaces can be flexibly designed and extended without necessarily affecting the structure of the tables, fields, and relationships among them. They need not reside on or be served from the machine on which the database is stored, although in the current version they do.

Structure of the database

The database stores two different kinds of information:

It is the relationships among these that gives the catalog its power to answer questions. Relationships among terms within a thesaurus express knowledge of the scientific and technical concepts that are the subject of the thesaurus. Relationships between available information resources and thesaurus terms categorize those resources in ways that are likely to be meaningful to people who want to understand the scientific information that USGS provides. Using both of these types of relationships will help people understand the information we can provide in the context of USGS scientific disciplines, research methods, and topics of investigation.

Understanding the examples

Examples are shown using PHP, an interpreted language designed to run on a Web server in response to users selecting hypertext links on Web pages. The uses shown here are intentionally simple but are syntactically correct. Most are drawn, with minor modification, from the catalog interface code itself.

Here are some general features of the language:
// Comments begin with two slashes and continue
// to the end of the current line

// Variable names begin with $

$i = 0;

// Function calls are indicated by the use of parentheses.

$row = mysql_fetch_row ($result);

// Statements end with semicolon.  Compound statements
// are delineated with braces.  Braces may be omitted
// if a single statement is used.

if ($value == 5) {
    $name = "Bob";
    $address = "10 Downing Street";
    }
else
    $name = "Nancy";

// Array elements are indicated with square brackets.

$code = $row[0];

// Array indexes can be variables

$code[$i] = $row[0];
The code examples here are intended to show how information is retrieved from the database, and do not include statements that display information to the user. Consequently most of these examples show the information stored in local arrays, which could then be presented to the user in a variety of ways.

Controlled vocabularies

The thesaurus table describes formal thesauri. It identifies the thesauri that are available and, for each thesaurus, the database tables that contain its terms, both preferred and non-preferred, and the nonhierarchical relationships.

tagIntegerUnique identifier for this thesaurus
nameText (255)Name of the thesaurus
editionText (16)Edition of the thesaurus
dateText (16)Date of the thesaurus edition
tblnameText (16)Name of the table in which terms are stored
codetypeText (8)Type of unique identifier: "number" or "alpha"
contactIntegerPerson responsible for the thesaurus, links to contact:tag
nonprefText (16)Name of the table in which non-preferred terms are stored
reltermText (16)Name of the table in which non-hierarchical relationships among terms are stored
mdateText (16)Last modification date
mtimeText (16)Last modification time
useridText (16)User making modification (email id)

In this table, values of tag are unique.

Examples

// Given the name of a thesaurus, determine its identifying code
// and the names of the database tables that will be used in SQL
// statements retrieving its terms

$name = "USGS Thesaurus";

$query = "select tag,tblname,nonpref,relterm from thesaurus where name='".addslashes($name)."'";
if (($result = mysql_query ($query,$dbh)) != False) {
	$n = mysql_num_rows ($result);

	// Since it's possible to have more than one version of a thesaurus
	// in the catalog, we specify that we want the last (most recent)
	// version.

	mysql_data_seek ($result,$n-1);

	$row = mysql_fetch_row ($result);
	$th_code    = $row[0];
	$th_name    = $name;
	$th_table   = $row[1];
	$th_uftable = $row[2];
	$th_rttable = $row[3];
	mysql_free_result ($result);
	}

Preferred thesaurus terms are stored in the table named in the field tblname. That table will have these fields:

code Unique identifier for the term
name Text of the term (the descriptor)
parent Unique identifier for the parent term
scope Scope note, including definition (if any)

In this table, values of code are unique.

Examples

(All examples use PHP and its built-in MySQL call interface.)
// Obtain information about a given term:

$query = "select code,name,parent,scope from term where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $row = mysql_fetch_row ($result);
    $code   = $row[0];
    $name   = $row[1];
    $parent = $row[2];
    $notes  = $row[3];
    mysql_free_result ($result);
    }

// Obtain the narrower terms of a given term:

$query = "select code,name from term where parent=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $ntcode[$i] = $row[0];
        $ntname[$i] = $row[1];
        }
    mysql_free_result ($result);
    }

// Obtain all of the broader terms of a given term:

$parent = $code;
while ($parent != 0) {
    $query = "select code,name,parent from term where code=$parent";
    if (($result = mysql_query ($query,$dbh)) != False) {
        $row = mysql_fetch_row ($result);
        $pcode[$depth] = $row[0];
        $pname[$depth] = $row[1];
        $parent        = $row[2];
        mysql_free_result ($result);
        }
    }

Non-preferred terms are texts representing concepts that are subsumed in the preferred terms. Ordinarily each non-preferred term points to only one preferred term. Our thesaurus contains a few non-preferred terms that represent pre-coordination of two preferred terms, for example geologic maps combines the concepts of geology and maps and atlases; the non-preferred term geologic maps thus points to both of these preferred terms. Conjunctions like this are primarily used as aids in composing search terms. Non-preferred terms are stored in the table named in the field nonpref. That table has these fields:

code Unique identifier for the preferred term
name Text of the non-preferred term
also Unique identifier for the additional preferred term to which this non-preferred term refers
visible Whether this non-preferred term should ordinarily be shown to users

Examples

// Given a non-preferred term, find the preferred term to which it points
$query = "select code,name from term where name like '%$searchtext%'";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $tcode[$i] = $row[0];
        $tname[$i] = $row[1];
        }
    mysql_free_result ($result);
    }

// Find all non-preferred terms for which this term
// is one of the preferred descriptors
$uf = array();
$query = "select name from nonpref where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $uf[$i] = $row[0];
        }
    mysql_free_result ($result);
    }


// With the added complexity of the 'also' field, it is
// practical to determine the fuller meaning of the
// non-preferred term, but a more complex data structure
// must be used to keep related information together:

class NonPreferredTerm {
    var $name;
    var $acode;
    var $aname;
    function NonPreferredTerm ($the_name,$the_also) {
        $this->name  = $the_name;
        $this->acode = $the_also;
        $this->aname = '';
        }
    }

$uf = array();
$query = "select name,also from $nonpref where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    if ($n > 0)
        for ($i=0; $i < $n; $i++) {
            $row = mysql_fetch_row ($result);
            $name = $row[0];
            $also = $row[1];
            $uf[$i] = new NonPreferredTerm ($name,$also);
            if ($also != '') {
                $q = "select name from $table where code=$also";
                if (($r = mysql_query ($q,$dbh)) != False) {
                    if (mysql_num_rows ($r) > 0) {
                        $rr = mysql_fetch_row ($r);
                        $uf[$i]->aname = $rr[0];
                        }
                    mysql_free_result ($r);
                    }
                }
            }
    mysql_free_result ($result);
    }

Non-hierarchical term relationships are stored in the table named in the field relterm. That table has fields

a Unique identifier of a term
b Unique identifier of related term

Examples

// For a given term, find the related terms

$query = "select b from relterm where a=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $rtcode[$i] = $row[0];

        /* Look up the term record using its code */

        $q = "select name from term where code=$code";
        if (($r = mysql_query ($q,$dbh)) != False) {
            $rr = mysql_fetch_row ($r);
            $rtname[$i] = $rr[0];
            mysql_free_result ($r);
            }
        }
    mysql_free_result ($result);
    }

In addition to geographic overlaps, the database also contains a table specifying the bounding rectangles for each area (an area may have many bounding rectangles) and a table indicating which areas are adjacent. At this writing these tables are not used in the catalog interfaces.

Table bbox: Bounding coordinates for geographic areas

codeUnique identifier for the geographic area
xminMinimum longitude
yminMinimum latitude
xmaxMaximum longitude
ymaxMaximum latitude

Table adjacent: Adjacent geographic areas

area1Unique identifier for one geographic area
area2Unique identifier for an adjacent area

To facilitate development and use of the thesaurus, thesaurus terms may have indexer notes associated with them (table notes):

tag Unique identifier for this note
thesaurus Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag
term Term that is the subject of the note (descriptor)
indexer User authorization name
note Note entered by indexer. Largest is currently 378 bytes
mdate Last modification date
mtime Last modification time
userid User making modification (email id)

Examples

// Get the indexer notes associated with a thesaurus term

$query = "select tag,indexer,mdate,mtime,note from notes where thesaurus=$thcode and term='$name'";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $note_tag[$i]  = $row[0];
        $indexer[$i]   = $row[1];
        $note_date[$i] = $row[2];
        $note_time[$i] = $row[3];
        $note_text[$i] = $row[4];
        }
    mysql_free_result ($result);
    }

Authority lists

While technically controlled vocabularies, these are simply limited sets of terms, without relationships among the terms.

Names of USGS publication series (table series)

tag Unique identifier for this series
name Series name
mdate Last modification date
mtime Last modification time
useridUser making modification (email id)

In this table, values of tag are unique.

Organizations that published the information resource (table publisher)

tag Unique identifier for this publisher
name Name of the publisher
mdate Last modification date
mtime Last modification time
useridUser making modification (email id)

In this table, values of tag are unique.

Documents and their assigned index terms

This document refers to a "catalog" using the analogy of a library card catalog; each card in the catalog describes a book on the shelf in the library. In the same manner, each Web document is described by a row of the table (cite) which contains bibliographic information. The fields of this table generally match elements of the FGDC metadata standard:

tag Unique identifier for this resource
title Title of the resource
pubdate Publication date
sername Publication series, links to series:tag
issue Issue identification
publish Publisher, links to publisher:tag
onlink Online linkage (URL)
email Email address of the page maintainer
descript Description of the document
lang Language of the page (2-letter code)
created Creation date of the page (YYYYMMDD)
revised Revision date of the page (YYYYMMDD)
expires Expiration date of the page (YYYYMMDD)
reviewed Date the page was last reviewed (YYYYMMDD)
mdate Last modification date of this record
mtime Last modification time of this record
userid User making modification to this record (email id)

In this table, values of tag are unique, and values of onlink are also unique.

People or organizations to be cited as authors, editors, or sponsors of an information resource are identified in the associative table originator

cite Unique identifier for the information resource, links to cite:tag
name Name of the person or organization
sequence Priority order of this originator, with respect to others
role Role of this originator: author, compiler, editor, and so on

Examples

// Obtain the citation information for a resource

$query = "select tag,title,pubdate,sername,issue,publish,onlink,email,descript,mdate,mtime,userid from cite where tag=$cite";
if (($result = mysql_query ($query,$dbh)) != False) {
    $row = mysql_fetch_row ($result);
    $tag      = $row[0];
    $title    = $row[1];
    $pubdate  = $row[2];
    $sername  = $row[3];
    $issue    = $row[4];
    $publish  = $row[5];
    $onlink   = $row[6];
    $email    = $row[7];
    $descript = $row[8];
    $mdate    = $row[9];
    $mtime    = $row[10];
    $userid   = $row[11];
    mysql_free_result ($result);

    // Get originators in sequence

    $query = "select name,role from originator where cite=$tag order by sequence";
    if (($result = mysql_query ($query,$dbh)) != False) {
        $n = mysql_num_rows ($result);
        for ($i=0; $i < $n; $i++) {
            $row = mysql_fetch_row ($result);
            $name[$i] = $row[0];
            $role[$i] = $row[1];
            }
        mysql_free_result ($result);
        }

    // Get series name from authority list

    $series_name = '';
    $query = "select name from series where tag=$sername";
    if (($result = mysql_query ($query,$dbh)) != False) {
        $row = mysql_fetch_row ($result);
        $series_name = $row[0];
        mysql_free_result ($result);
        }

    // Get publisher name from authority list

    $publisher_name = '';
    $query = "select name from publisher where tag=$publish";
    if (($result = mysql_query ($query,$dbh)) != False) {
        $row = mysql_fetch_row ($result);
        $publisher_name = $row[0];
        mysql_free_result ($result);
        }
    }

The assignment of an index term to a document is indicated by a row of the table keywords:

cite Unique identifier for the information resource, links to cite:tag
term Term used to categorize the resource
thesaurus Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag
code Unique identifier for the term within the thesaurus term table
mdate Last modification date
mtime Last modification time
userid User making modification (email id)

Examples

// Given a document, retrieve the index terms assigned to it

$query = "select term from keywords where cite=$cite and thesaurus=$thcode order by term";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $term[$i] = $row[0];
        }
    mysql_free_result ($result);
    }

// Given an index term, find the documents to which that
// term has been assigned

$query = "select k.cite,k.term,k.thesaurus,c.tag,c.title,c.onlink from keywords k,cite c where k.thesaurus=$thcode and k.term='$thisterm' and k.cite=c.tag";
if (($result = mysql_query ($query,$dbh)) != False) {
    $n = mysql_num_rows ($result);
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $cite[$i]   = $row[0];
        $title[$i]  = $row[4];
        $onlink[$i] = $row[5];
        }
    mysql_free_result ($result);
    }

To improve the responsiveness of the interface, the database has two tables that summarize the documents related to a term. The table inside contains records identifying an index term using its thesaurus and the term's code, and giving the number of records that have been assigned that term or any of the terms that are narrower than it (children, grandchildren, and so on, in the term hierarchy). The table within identifies terms in the same way but provides a reference to each document categorized using that term or any of the terms that are narrower than it. By consulting these tables, the interface can quickly present all of the documents that are relevant to a given thesaurus concept. This idea relies firmly on the notion that hierarchical relationships in the thesauri are "is a" relationships ("is a part of", "is a type of").

Table inside

thesaurus Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag
code term code
count Number of records tagged with that term or any of its narrower terms

There will be one record for each combination of thesaurus and code.

Table within

thesaurus Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag
code term code
cite Identifies a document tagged with that term or any of its narrower terms, links to cite:tag
rank Initial relevance ranking of the document for this term

There will be many records for each combination of thesaurus and code.

Examples

// Given an index term, find the documents to which that
// term or any of its narrower terms (recursively) has
// been assigned

$query = "select cite from within where thesaurus=$thcode and code='$code'";
if (($result = mysql_query ($query,$dbh)) != False) {
    for ($i=0; $i < $n; $i++) {
        $row = mysql_fetch_row ($result);
        $doc[$i] = $row[0];
        }
    mysql_free_result ($result);
    }

Accessibility FOIA Privacy Policies and Notices

Take Pride in America logo USA.gov logo U.S. Department of the Interior | U.S. Geological Survey
URL: http://geo-nsdi.er.usgs.gov/thesaurus/catalog/documentation.html
Page Contact Information: Peter Schweitzer
Page Last Modified: Wednesday, 15-Mar-2006 17:51:47 EST