USGS - science for a changing world

USGS Thesaurus and Enterprise Web Document Catalog

A parser for formal thesauri

tp is a program that parses a formal thesaurus in an indented alphabetical listing, interprets it as a hierarchy of terms, and creates several potentially useful re-expressions of the thesaurus.

Usage

tp input_file [-xml xml_file] [-sql sql_file] [-txt txt_file]

Input

The input file should be plain text. Data consist of term entries separated by blank lines. Each entry begins with the term (descriptor) flush left. Other elements of the entry are indented. Each element of an entry other than the descriptor is prefaced by a role indicator conforming to ANSI/NISO Z39.19:

Indicator  Role
BT:        Broader term
NT:        Narrower term
RT:        Related term
UF:        Indicates a non-preferred term for which this is the preferred term
UF+:       Indicates non-preferred terms for which this is part of a combination of preferred terms
USE:       For non-preferred term, indicates the preferred term to use instead
US+:       For non-preferred term, indicates a set of preferred terms to use instead
SN:        Scope note
DF:        Definition (non-standard)

Example entry

geologic history
  DF:   Record (and inferred reconstruction) of the origin
        and development of the Earth since its formation.
  UF:   chronostratigraphy
        geohistory
  BT:   Earth characteristics
  NT:   biostratigraphy
        Earth history
        lithostratigraphy
  RT:   geology
        paleontology
        paleoseismology
        stratigraphy
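
To make the entry layout concrete, here is a minimal sketch in Python of how such an entry could be read into a dictionary. It is not the code tp uses. The names parse_entries, INDICATORS, and ELEMENT_RE are hypothetical, and the sketch assumes that continuation lines of DF and SN elements are wrapped text, while continuation lines of the other elements each name one more term.

```python
import re

# Role indicators taken from the table above.
INDICATORS = ("BT", "NT", "RT", "UF", "UF+", "USE", "US+", "SN", "DF")

# An indented line that opens a new element: indicator, colon, value.
ELEMENT_RE = re.compile(r"^\s+([A-Z]+\+?):\s*(.*)$")


def parse_entries(text):
    """Parse thesaurus text into {descriptor: {indicator: [values]}}."""
    entries = {}
    # Entries are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip("\n")):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        descriptor = lines[0].strip()  # the term itself, flush left
        fields = {}
        role = None
        for line in lines[1:]:
            match = ELEMENT_RE.match(line)
            if match and match.group(1) in INDICATORS:
                role = match.group(1)
                fields.setdefault(role, []).append(match.group(2).strip())
            elif role in ("SN", "DF"):
                # Assumption: free-text elements wrap onto the next line.
                fields[role][-1] += " " + line.strip()
            elif role is not None:
                # Assumption: each continuation line is one more term.
                fields[role].append(line.strip())
        entries[descriptor] = fields
    return entries


SAMPLE = """\
geologic history
  DF:   Record (and inferred reconstruction) of the origin
        and development of the Earth since its formation.
  UF:   chronostratigraphy
        geohistory
  BT:   Earth characteristics
  NT:   biostratigraphy
        Earth history
        lithostratigraphy
  RT:   geology
        paleontology
        paleoseismology
        stratigraphy
"""

entry = parse_entries(SAMPLE)["geologic history"]
print(entry["UF"])
print(entry["NT"])
```

The two print calls list the non-preferred terms and the narrower terms of the sample entry, one list element per line of the original element.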

Limitations

Output

XML
Extensible Markup Language (XML) output adheres to the document type definition given below, which is also available as thtree.dtd under Downloads:
<!-- Document Type Definition for a hierarchical thesaurus        -->
<!-- Peter N. Schweitzer (U.S. Geological Survey, Reston VA 20192) -->

<!-- A thesaurus consists of zero or more terms.  A version may   -->
<!-- be indicated.                                                -->

<!ELEMENT thesaurus (term*)>
<!ATTLIST thesaurus version CDATA #IMPLIED>

<!-- A term may have a scope-note, related terms, used-for terms, -->
<!-- and narrower terms.  It will always have a name (the text    -->
<!-- of the descriptor itself) and a unique identifier (used for  -->
<!-- resolving RT references within this document).               -->

<!ELEMENT term (scope-note?, related-term*, used-for*, term*)>
<!ATTLIST term name CDATA #REQUIRED
               id   ID    #REQUIRED>

<!-- Scope notes are just text                                    -->
<!ELEMENT scope-note (#PCDATA)>

<!-- Related terms have name and a reference to another term.     -->
<!-- The term reference is authoritative; the name is optional.   -->
<!ELEMENT related-term EMPTY>
<!ATTLIST related-term name  CDATA #IMPLIED
                       idref IDREF #REQUIRED>

<!-- Used-for terms have the text of the non-preferred term       -->
<!ELEMENT used-for (with*)>
<!ATTLIST used-for name CDATA #REQUIRED>

<!-- If the non-preferred term is best described by using two or  -->
<!-- more preferred terms in combination, then the additional     -->
<!-- preferred terms are specified using "with" members of the    -->
<!-- used-for element.                                            -->

<!ELEMENT with EMPTY>
<!ATTLIST with name  CDATA #IMPLIED
               idref IDREF #REQUIRED>

<!-- end -->
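
As an illustration of how a document following this DTD might be consumed, the sketch below reads a small sample with Python's standard xml.etree.ElementTree module. The sample document is hypothetical: its id values (t1, t2, ...) and version string are invented, and the helper names narrower, related, and walk are not part of tp. Nested term elements are read as narrower terms, and related terms are resolved through idref, which the DTD comments describe as authoritative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the id values are invented for illustration.
SAMPLE_XML = """\
<thesaurus version="example">
  <term name="Earth characteristics" id="t1">
    <term name="geologic history" id="t2">
      <related-term name="geology" idref="t5"/>
      <used-for name="geohistory"/>
      <term name="biostratigraphy" id="t3"/>
      <term name="Earth history" id="t4"/>
    </term>
  </term>
  <term name="geology" id="t5"/>
</thesaurus>
"""

root = ET.fromstring(SAMPLE_XML)

# Index every term by its id so that idref attributes can be resolved.
terms_by_id = {t.get("id"): t for t in root.iter("term")}


def narrower(term):
    """Names of the direct child terms (the narrower terms)."""
    return [child.get("name") for child in term.findall("term")]


def related(term):
    """Names of related terms, resolved through idref rather than name."""
    return [terms_by_id[r.get("idref")].get("name")
            for r in term.findall("related-term")]


def walk(term, depth=0):
    """Yield (depth, name) for a term and all of its descendants."""
    yield depth, term.get("name")
    for child in term.findall("term"):
        yield from walk(child, depth + 1)


geologic_history = terms_by_id["t2"]
print(narrower(geologic_history))
print(related(geologic_history))
for depth, name in walk(root.find("term")):
    print("  " * depth + name)
```

ElementTree does not validate against a DTD, so this sketch only relies on the element and attribute names declared above.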
SQL
Structured Query Language (SQL) output consists of statements that create three database tables and insert the thesaurus information into their rows. The primary table is named term and contains the preferred terms, scope notes, and hierarchical relationships. The table relterm is an associative table that provides related-term links (non-hierarchical relationships). The table nonpref relates non-preferred terms and phrases to descriptors (terms preferred in the thesaurus). The create statements are as follows:
create table term (
    code   int not null primary key,
    name   varchar(128),
    parent int,
    scope  text
    );

create table relterm (
    a int not null,
    b int not null
    );

create table nonpref (
    code int not null,
    name varchar(128)
    );
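
The following sketch shows how these tables might be queried once loaded, using Python's standard sqlite3 module with an in-memory database. Several things here are assumptions rather than documented behavior of tp: that parent holds the code of the broader term, that a top-level term has a null parent, that relterm.a and relterm.b are term codes that may be stored in either direction, and that nonpref.code is the code of the preferred term. The sample rows and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The create statements are copied from the listing above.
conn.executescript("""
create table term (
    code   int not null primary key,
    name   varchar(128),
    parent int,
    scope  text
    );
create table relterm (
    a int not null,
    b int not null
    );
create table nonpref (
    code int not null,
    name varchar(128)
    );
""")

# Hypothetical rows; in real output the codes are assigned by tp.
conn.executemany("insert into term values (?, ?, ?, ?)", [
    (1, "Earth characteristics", None, None),
    (2, "geologic history", 1, None),
    (3, "biostratigraphy", 2, None),
    (4, "geology", None, None),
])
conn.execute("insert into relterm values (2, 4)")
conn.executemany("insert into nonpref values (?, ?)", [
    (2, "chronostratigraphy"),
    (2, "geohistory"),
])


def narrower_terms(code):
    """Names of terms whose parent is the given code (assumed NT link)."""
    rows = conn.execute(
        "select name from term where parent = ? order by name", (code,))
    return [name for (name,) in rows]


def preferred_for(nonpreferred):
    """Descriptors to use in place of a non-preferred term."""
    rows = conn.execute(
        "select term.name from nonpref "
        "join term on term.code = nonpref.code "
        "where nonpref.name = ?", (nonpreferred,))
    return [name for (name,) in rows]


def related_terms(code):
    """Related terms, checking both columns of relterm (an assumption)."""
    rows = conn.execute(
        "select name from term where code in ("
        " select b from relterm where a = ?"
        " union select a from relterm where b = ?)"
        " order by name", (code, code))
    return [name for (name,) in rows]


print(narrower_terms(2))
print(preferred_for("geohistory"))
print(related_terms(4))
```

Checking both columns in related_terms keeps the lookup correct whether each related-term pair is stored once or in both directions.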
TXT
The textual output is an indented list of preferred terms only.
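
A listing of this kind can be sketched in a few lines of Python. The function indented_list and its inputs are hypothetical, and the two-space indent and case-insensitive alphabetical ordering are assumptions; the exact layout tp writes is not specified here.

```python
def indented_list(children, roots, indent="  "):
    """Return lines that list each term beneath its broader term.

    children maps a preferred term to its narrower terms;
    roots lists the terms that have no broader term.
    """
    lines = []

    def visit(name, depth):
        lines.append(indent * depth + name)
        # Assumption: siblings are sorted without regard to case.
        for child in sorted(children.get(name, []), key=str.lower):
            visit(child, depth + 1)

    for root in sorted(roots, key=str.lower):
        visit(root, 0)
    return lines


# Hypothetical hierarchy built from the example entry's BT and NT elements.
children = {
    "Earth characteristics": ["geologic history"],
    "geologic history": ["lithostratigraphy", "Earth history", "biostratigraphy"],
}

lines = indented_list(children, ["Earth characteristics"])
print("\n".join(lines))
```

Non-preferred terms never appear in the children mapping, so they are absent from the listing, in keeping with the description above.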

Downloads

Type                Platform           Download    Size (bytes)
Source code         (all)              tp-src.zip  14K
Executable          Microsoft Windows  tp.exe      31K
Executable          Linux (x86)        tp          31K
DTD for XML output  (all)              thtree.dtd  1.7K

Technical contact

    Peter N. Schweitzer
    Mail Stop 954, National Center
    U.S. Geological Survey
    Reston, VA 20192

    Tel: (703) 648-6533
    FAX: (703) 648-6252
    email: pschweitzer@usgs.gov

U.S. Department of the Interior | U.S. Geological Survey
URL: http://geo-nsdi.er.usgs.gov/thesaurus/parser/tp.shtml
Page Contact Information: Peter Schweitzer
Page Last Modified: Monday, 17-Apr-2006 16:43:47 EDT