USGS Thesaurus and Enterprise Web Document Catalog
Peter N. Schweitzer (U.S. Geological Survey, Reston VA 20192)
The catalog database is currently managed using the MySQL relational database management system. Access to the catalog is built using PHP, a server-side interpreted scripting language in which the code that carries out database searches and other such functions is contained in Web pages.
This document describes separately the structure of the database underlying the catalog and the functional interfaces that enable people to request information contained in it. Functional interfaces can be flexibly designed and extended without necessarily affecting the structure of the tables, fields, and relationships among them. They need not reside on or be served from the machine on which the database is stored, although in the current version they do.
The database stores two different kinds of information:
Understanding the examplesExamples are shown using PHP, an interpreted language designed to run on a Web server in response to users selecting hypertext links on Web pages. The uses shown here are intentionally simple but are syntactically correct. Most are drawn, with minor modification, from the catalog interface code itself. Here are some general features of the language:
// Comments begin with two slashes and continue
// to the end of the current line
// Variable names begin with $
$i = 0;
// Function calls are indicated by the use of parentheses.
$row = mysql_fetch_row ($result);
// Statements end with semicolon. Compound statements
// are delineated with braces. Braces may be omitted
// if a single statement is used.
if ($value == 5) {
$name = "Bob";
$address = "10 Downing Street";
}
else
$name = "Nancy";
// Array elements are indicated with square brackets.
$code = $row[0];
// Array indexes can be variables
$code[$i] = $row[0];
The code examples here are intended to show how information is
retrieved from the database, and do not include statements that
display information to the user. Consequently most of these
examples show the information stored in local arrays, which
could then be presented to the user in a variety of ways.
|
The thesaurus table describes formal thesauri. It identifies the thesauri that are available and, for each thesaurus, the database tables that contain its terms, both preferred and non-preferred, and the nonhierarchical relationships.
| tag | Unique identifier for this thesaurus | |
| name | Name of the thesaurus | |
| edition | Edition of the thesaurus | |
| date | Date of the thesaurus edition | |
| tblname | Name of the table in which terms are stored | |
| codetype | Type of unique identifier: "number" or "alpha" | |
| contact | Person responsible for the thesaurus, links to contact:tag | |
| nonpref | Name of the table in which non-preferred terms are stored | |
| relterm | Name of the table in which non-hierarchical relationships among terms are stored | |
| mdate | Last modification date | |
| mtime | Last modification time | |
| userid | User making modification (email id) |
In this table, values of tag are unique.
Examples
// Given the name of a thesaurus, determine its identifying code
// and the names of the database tables that will be used in SQL
// statements retrieving its terms
$name = "USGS Thesaurus";
$query = "select tag,tblname,nonpref,relterm from thesaurus where name='".addslashes($name)."'";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
// Since it's possible to have more than one version of a thesaurus
// in the catalog, we specify that we want the last (most recent)
// version.
mysql_data_seek ($result,$n-1);
$row = mysql_fetch_row ($result);
$th_code = $row[0];
$th_name = $name;
$th_table = $row[1];
$th_uftable = $row[2];
$th_rttable = $row[3];
mysql_free_result ($result);
}
|
Preferred thesaurus terms are stored in the table named in the field tblname. That table will have these fields:
| code | Unique identifier for the term |
| name | Text of the term (the descriptor) |
| parent | Unique identifier for the parent term |
| scope | Scope note, including definition (if any) |
In this table, values of code are unique.
Examples(All examples use PHP and its built-in MySQL call interface.)
// Obtain information about a given term:
$query = "select code,name,parent,scope from term where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
$row = mysql_fetch_row ($result);
$code = $row[0];
$name = $row[1];
$parent = $row[2];
$notes = $row[3];
mysql_free_result ($result);
}
// Obtain the narrower terms of a given term:
$query = "select code,name from term where parent=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$ntcode[$i] = $row[0];
$ntname[$i] = $row[1];
}
mysql_free_result ($result);
}
// Obtain all of the broader terms of a given term:
$parent = $code;
while ($parent != 0) {
$query = "select code,name,parent from term where code=$parent";
if (($result = mysql_query ($query,$dbh)) != False) {
$row = mysql_fetch_row ($result);
$pcode[$depth] = $row[0];
$pname[$depth] = $row[1];
$parent = $row[2];
mysql_free_result ($result);
}
}
|
Non-preferred terms are texts representing concepts that are subsumed in the preferred terms. Ordinarily each non-preferred term points to only one preferred term. Our thesaurus contains a few non-preferred terms that represent pre-coordination of two preferred terms, for example geologic maps combines the concepts of geology and maps and atlases; the non-preferred term geologic maps thus points to both of these preferred terms. Conjunctions like this are primarily used as aids in composing search terms. Non-preferred terms are stored in the table named in the field nonpref. That table has these fields:
| code | Unique identifier for the preferred term |
| name | Text of the non-preferred term |
| also | Unique identifier for the additional preferred term to which this non-preferred term refers |
| visible | Whether this non-preferred term should ordinarily be shown to users |
Examples
// Given a non-preferred term, find the preferred term to which it points
$query = "select code,name from term where name like '%$searchtext%'";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$tcode[$i] = $row[0];
$tname[$i] = $row[1];
}
mysql_free_result ($result);
}
// Find all non-preferred terms for which this term
// is one of the preferred descriptors
$uf = array();
$query = "select name from nonpref where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$uf[$i] = $row[0];
}
mysql_free_result ($result);
}
// With the added complexity of the 'also' field, it is
// practical to determine the fuller meaning of the
// non-preferred term, but a more complex data structure
// must be used to keep related information together:
class NonPreferredTerm {
var $name;
var $acode;
var $aname;
function NonPreferredTerm ($the_name,$the_also) {
$this->name = $the_name;
$this->acode = $the_also;
$this->aname = '';
}
}
$uf = array();
$query = "select name,also from $nonpref where code=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
if ($n > 0)
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$name = $row[0];
$also = $row[1];
$uf[$i] = new NonPreferredTerm ($name,$also);
if ($also != '') {
$q = "select name from $table where code=$also";
if (($r = mysql_query ($q,$dbh)) != False) {
if (mysql_num_rows ($r) > 0) {
$rr = mysql_fetch_row ($r);
$uf[$i]->aname = $rr[0];
}
mysql_free_result ($r);
}
}
}
mysql_free_result ($result);
}
|
Non-hierarchical term relationships are stored in the table named in the field relterm. That table has fields
| a | Unique identifier of a term |
| b | Unique identifier of related term |
Examples
// For a given term, find the related terms
$query = "select b from relterm where a=$term";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$rtcode[$i] = $row[0];
/* Look up the term record using its code */
$q = "select name from term where code=$code";
if (($r = mysql_query ($q,$dbh)) != False) {
$rr = mysql_fetch_row ($r);
$rtname[$i] = $rr[0];
mysql_free_result ($r);
}
}
mysql_free_result ($result);
}
|
In addition to geographic overlaps, the database also contains a table specifying the bounding rectangles for each area (an area may have many bounding rectangles) and a table indicating which areas are adjacent. At this writing these tables are not used in the catalog interfaces.
Table bbox: Bounding coordinates for geographic areas
| code | Unique identifier for the geographic area |
| xmin | Minimum longitude |
| ymin | Minimum latitude |
| xmax | Maximum longitude |
| ymax | Maximum latitude |
Table adjacent: Adjacent geographic areas
| area1 | Unique identifier for one geographic area |
| area2 | Unique identifier for an adjacent area |
To facilitate development and use of the thesaurus, thesaurus terms may have indexer notes associated with them (table notes):
| tag | Unique identifier for this note |
| thesaurus | Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag |
| term | Term that is the subject of the note (descriptor) |
| indexer | User authorization name |
| note | Note entered by indexer. Largest is currently 378 bytes |
| mdate | Last modification date |
| mtime | Last modification time |
| userid | User making modification (email id) |
Examples
// Get the indexer notes associated with a thesaurus term
$query = "select tag,indexer,mdate,mtime,note from notes where thesaurus=$thcode and term='$name'";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$note_tag[$i] = $row[0];
$indexer[$i] = $row[1];
$note_date[$i] = $row[2];
$note_time[$i] = $row[3];
$note_text[$i] = $row[4];
}
mysql_free_result ($result);
}
|
While technically controlled vocabularies, these are simply limited sets of terms, without relationships among the terms.
Names of USGS publication series (table series)
| tag | Unique identifier for this series |
| name | Series name |
| mdate | Last modification date |
| mtime | Last modification time |
| userid | User making modification (email id) |
In this table, values of tag are unique.
Organizations that published the information resource (table publisher)
| tag | Unique identifier for this publisher |
| name | Name of the publisher |
| mdate | Last modification date |
| mtime | Last modification time |
| userid | User making modification (email id) |
In this table, values of tag are unique.
This document refers to a "catalog" using the analogy of a library card catalog; each card in the catalog describes a book on the shelf in the library. In the same manner, each Web document is described by a row of the table (cite) which contains bibliographic information. The fields of this table generally match elements of the FGDC metadata standard:
| tag | Unique identifier for this resource |
| title | Title of the resource |
| pubdate | Publication date |
| sername | Publication series, links to series:tag |
| issue | Issue identification |
| publish | Publisher, links to publisher:tag |
| onlink | Online linkage (URL) |
| Email address of the page maintainer | |
| descript | Description of the document |
| lang | Language of the page (2-letter code) |
| created | Creation date of the page (YYYYMMDD) |
| revised | Revision date of the page (YYYYMMDD) |
| expires | Expiration date of the page (YYYYMMDD) |
| reviewed | Date the page was last reviewed (YYYYMMDD) |
| mdate | Last modification date of this record |
| mtime | Last modification time of this record |
| userid | User making modification to this record (email id) |
In this table, values of tag are unique, and values of onlink are also unique.
People or organizations to be cited as authors, editors, or sponsors of an information resource are identified in the associative table originator
| cite | Unique identifier for the information resource, links to cite:tag |
| name | Name of the person or organization |
| sequence | Priority order of this originator, with respect to others |
| role | Role of this originator: author, compiler, editor, and so on |
Examples
// Obtain the citation information for a resource
$query = "select tag,title,pubdate,sername,issue,publish,onlink,email,descript,mdate,mtime,userid from cite where tag=$cite";
if (($result = mysql_query ($query,$dbh)) != False) {
$row = mysql_fetch_row ($result);
$tag = $row[0];
$title = $row[1];
$pubdate = $row[2];
$sername = $row[3];
$issue = $row[4];
$publish = $row[5];
$onlink = $row[6];
$email = $row[7];
$descript = $row[8];
$mdate = $row[9];
$mtime = $row[10];
$userid = $row[11];
mysql_free_result ($result);
// Get originators in sequence
$query = "select name,role from originator where cite=$tag order by sequence";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$name[$i] = $row[0];
$role[$i] = $row[1];
}
mysql_free_result ($result);
}
// Get series name from authority list
$series_name = '';
$query = "select name from series where tag=$sername";
if (($result = mysql_query ($query,$dbh)) != False) {
$row = mysql_fetch_row ($result);
$series_name = $row[0];
mysql_free_result ($result);
}
// Get publisher name from authority list
$publisher_name = '';
$query = "select name from publisher where tag=$publish";
if (($result = mysql_query ($query,$dbh)) != False) {
$row = mysql_fetch_row ($result);
$publisher_name = $row[0];
mysql_free_result ($result);
}
}
|
The assignment of an index term to a document is indicated by a row of the table keywords:
| cite | Unique identifier for the information resource, links to cite:tag |
| term | Term used to categorize the resource |
| thesaurus | Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag |
| code | Unique identifier for the term within the thesaurus term table |
| mdate | Last modification date |
| mtime | Last modification time |
| userid | User making modification (email id) |
Examples
// Given a document, retrieve the index terms assigned to it
$query = "select term from keywords where cite=$cite and thesaurus=$thcode order by term";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$term[$i] = $row[0];
}
mysql_free_result ($result);
}
// Given an index term, find the documents to which that
// term has been assigned
$query = "select k.cite,k.term,k.thesaurus,c.tag,c.title,c.onlink from keywords k,cite c where k.thesaurus=$thcode and k.term='$thisterm' and k.cite=c.tag";
if (($result = mysql_query ($query,$dbh)) != False) {
$n = mysql_num_rows ($result);
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$cite[$i] = $row[0];
$title[$i] = $row[4];
$onlink[$i] = $row[5];
}
mysql_free_result ($result);
}
|
To improve the responsiveness of the interface, the database has two tables that summarize the documents related to a term. The table inside contains records identifying an index term using its thesaurus and the term's code, and giving the number of records that have been assigned that term or any of the terms that are narrower than it (children, grandchildren, and so on, in the term hierarchy). The table within identifies terms in the same way but provides a reference to each document categorized using that term or any of the terms that are narrower than it. By consulting these tables, the interface can quickly present all of the documents that are relevant to a given thesaurus concept. This idea relies firmly on the notion that hierarchical relationships in the thesauri are "is a" relationships ("is a part of", "is a type of").
Table inside
| thesaurus | Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag |
| code | term code |
| count | Number of records tagged with that term or any of its narrower terms |
There will be one record for each combination of thesaurus and code.
Table within
| thesaurus | Identifier for the thesaurus from which the term was drawn, links to thesaurus:tag |
| code | term code |
| cite | Identifies a document tagged with that term or any of its narrower terms, links to cite:tag |
| rank | Initial relevance ranking of the document for this term |
There will be many records for each combination of thesaurus and code.
Examples
// Given an index term, find the documents to which that
// term or any of its narrower terms (recursively) has
// been assigned
$query = "select cite from within where thesaurus=$thcode and code='$code'";
if (($result = mysql_query ($query,$dbh)) != False) {
for ($i=0; $i < $n; $i++) {
$row = mysql_fetch_row ($result);
$doc[$i] = $row[0];
}
mysql_free_result ($result);
}
|