USGS Geoscience Data Catalog
With over 600 metadata records, Geo-NSDI is becoming a significant source of well-organized information. One of the consequences of this development is that I now see more of the inconsistencies in the metadata records themselves. But hand-editing 600+ records is a tough job, both because it's time-consuming and because it would be hard to maintain the consistency I seek. Fortunately, many editing tasks can be accomplished using scripts written in Tcl, along with the metadata-handling facilities provided in mq. This report describes several scripts used to evaluate and correct different problems in parseable metadata records. The scripts are provided along with links to intermediate results reported during their operation. The following operations are described:
Most of these metadata records describe data sets that are officially published by USGS in one of its formally-recognized publication series. These series names are normally the values given in the metadata element Series_Name. This element occurs in Citation_Information, which is found in the following elements:
The problem is solved in three steps:
sort is used to sort the names alphabetically. The result is series.1.
uniq -c is used to eliminate duplicate values and rank the values by frequency of occurrence. The result is series.2.
sort -n -r is used to reorder the ranked list with the values appearing most frequently listed first. The result is series.3.
cut -f2 is used to remove the number of occurrences of each value, leaving only the list of unique values, arranged so that those occurring most frequently are listed first. The result is series.4.
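The same count-and-rank idea can be sketched in a few lines of Python. This is only an illustration, not the shell commands actually used here. The function name rank_series_names and the sample values are hypothetical, and ties are broken alphabetically, which may differ from the order sort -n -r produces.

```python
from collections import Counter


def rank_series_names(names):
    """Return the unique names, most frequent first (ties in alphabetical order)."""
    # Count each distinct name, ignoring surrounding whitespace and blank entries.
    counts = Counter(name.strip() for name in names if name.strip())
    # Sort by descending count, then by name so the result is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Drop the counts, keeping only the ordered list of unique names.
    return [name for name, _count in ranked]


if __name__ == "__main__":
    # Made-up sample values standing in for extracted Series_Name values.
    sample = [
        "Open-File Report",
        "U.S. Geological Survey Open-File Report",
        "Open-File Report",
        "Bulletin",
        "Open-File Report",
        "Bulletin",
    ]
    for name in rank_series_names(sample):
        print(name)
```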
Using a text editor, the ordered list of unique series names is rearranged so that names representing the same series are listed consecutively and the non-preferred forms of each name are indented. The result is series_name.
series_fix.tcl reads the "authority file" created in the previous step. It then searches through subdirectories for metadata records (files whose names end with .met). It reads each file, and replaces the value of Series_Name with the preferred form if the value is one of the non-preferred forms. If the value given in the metadata is not specified in the authority list, then the value is not changed but is printed to stderr.
The result is consistency in the values given in Series_Name, so that a given publication series appears with exactly the same name wherever it is found throughout this collection of metadata.
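The replacement logic can be sketched as follows. This is a Python illustration, not the code of series_fix.tcl. It assumes that each preferred form starts in the first column of the authority file and that the indented lines beneath it are its non-preferred forms; the function names and sample values are hypothetical.

```python
import sys


def parse_authority(lines):
    """Map every name form in the authority list to its preferred form."""
    mapping = {}
    preferred = None
    for raw in lines:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        if raw[0] in " \t":
            # Indented line: a non-preferred form of the latest preferred name.
            if preferred is None:
                raise ValueError(f"indented name before any preferred name: {name!r}")
            mapping[name] = preferred
        else:
            # Unindented line: a preferred form, which maps to itself.
            preferred = name
            mapping[name] = name
    return mapping


def fix_series_name(value, mapping):
    """Return (value to store, whether the value was found in the authority list)."""
    key = value.strip()
    if key in mapping:
        return mapping[key], True
    return value, False  # unknown values are left unchanged


if __name__ == "__main__":
    # Made-up authority list for illustration only.
    authority = [
        "U.S. Geological Survey Open-File Report\n",
        "    Open-File Report\n",
        "    USGS Open-File Report\n",
    ]
    table = parse_authority(authority)
    for value in ["Open-File Report", "Some Unlisted Series"]:
        fixed, known = fix_series_name(value, table)
        if not known:
            print(f"not in authority list: {value}", file=sys.stderr)
        print(fixed)
```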
The parameters of the Universal Transverse Mercator projection are well-defined; given a UTM zone number the remaining parameters can be predicted.
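As an illustration of that regularity, the short Python sketch below derives the standard parameter values from a zone number. The function name utm_parameters is hypothetical. The central meridian follows the usual 6-degree zone layout (zone 1 is centered on 177 degrees west), and False_Northing is given as 0.0, the northern-hemisphere convention; southern-hemisphere grids conventionally use 10,000,000 meters instead.

```python
def utm_parameters(zone):
    """Return the standard Transverse Mercator parameters for a UTM zone (1 to 60)."""
    if not 1 <= zone <= 60:
        raise ValueError(f"UTM zone must be between 1 and 60, got {zone}")
    return {
        "Scale_Factor_at_Central_Meridian": 0.9996,
        # Zones are 6 degrees wide; zone 1 is centered on 177 degrees west.
        "Longitude_of_Central_Meridian": -183.0 + 6.0 * zone,
        "Latitude_of_Projection_Origin": 0.0,
        "False_Easting": 500000.0,
        # Northern-hemisphere convention (an assumption of this sketch).
        "False_Northing": 0.0,
    }


if __name__ == "__main__":
    for name, value in utm_parameters(10).items():
        print(f"{name}: {value}")
```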
Errors in the entry of these parameters are not uncommon, however. In some cases the errors are due to the misapplication of the term Universal Transverse Mercator to a map that is stored in Transverse Mercator projection, but in many cases the errors appear to spring from a misunderstanding of the regularity of UTM parameters. In still other cases the actual projection used cannot be ascertained with confidence using the metadata alone (meaning the metadata are ambiguous) and the problem must be resolved by examining the data in detail. In a number of cases, the values of False_Easting and False_Northing appear to be reversed.
Consequently it is feasible and worthwhile to check UTM parameters wherever the projection has been declared. The script utm.tcl searches through subdirectories for metadata records, reading each in turn and examining the contents of the Universal_Transverse_Mercator element. Examining the relevant metadata elements, it carries out the following actions according to the problem found and information available:
| Element | Problem | Action taken |
|---|---|---|
| UTM_Zone_Number | Missing | Complain and don't check Longitude_of_Central_Meridian |
| Transverse_Mercator | Missing | Create using UTM_Zone_Number |
| Scale_Factor_at_Central_Meridian | Missing | Create with value 0.9996 |
| Scale_Factor_at_Central_Meridian | Wrong number | Complain but don't change |
| Scale_Factor_at_Central_Meridian | Not a number | Set value to 0.9996 |
| Longitude_of_Central_Meridian | Missing | Create using UTM_Zone_Number |
| Longitude_of_Central_Meridian | Wrong number | Complain but don't change |
| Longitude_of_Central_Meridian | Not a number | Set value using UTM_Zone_Number |
| Latitude_of_Projection_Origin | Missing | Create with value 0.0 |
| Latitude_of_Projection_Origin | Wrong number | Complain but don't change |
| Latitude_of_Projection_Origin | Not a number | Set value to 0.0 |
| False_Easting | Missing | Create with value 500000 |
| False_Easting | Wrong number | Complain but don't change |
| False_Easting | Not a number | Set value to 500000 |
| False_Northing | Missing | Create with value 0.0 |
| False_Northing | Wrong number | Complain but don't change |
| False_Northing | Not a number | Set value to 0.0 |
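The three-way decision applied to each numeric element (missing, wrong number, not a number) can be sketched in Python as below. This is not the code of utm.tcl. The function name check_value, the action labels, and the comparison tolerance are all assumptions made for illustration.

```python
import math


def check_value(raw, expected, tolerance=1e-6):
    """Classify one element value against its expected UTM value.

    Returns (action, value), where action is one of:
      "create"   - element missing or empty; value is the expected number
      "set"      - text is not a number; value is the expected number
      "complain" - a number, but not the expected one; value is left as found
      "ok"       - matches the expected value within the tolerance
    """
    if raw is None or not raw.strip():
        return "create", expected
    try:
        number = float(raw)
    except ValueError:
        return "set", expected
    if math.isnan(number):
        # float() accepts the text "nan"; treat it as not a number.
        return "set", expected
    if abs(number - expected) > tolerance:
        return "complain", number
    return "ok", number


if __name__ == "__main__":
    # Made-up inputs: a missing value, non-numeric text, a wrong number, a correct one.
    for raw in [None, "unknown", "0.9999", "0.9996"]:
        print(repr(raw), check_value(raw, 0.9996))
```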
The element Unrepresentable_Domain should never be used. But because it is present in the Standard, people use it. They always use it to convey information that should be put somewhere else.
The script unrepresentable.tcl searches through subdirectories for metadata records and detects cases where Unrepresentable_Domain is used. Wherever it is found, its value is printed, along with the file name and line number of the element.
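A detection pass of this kind can be sketched in Python as follows. It is not the code of unrepresentable.tcl. It assumes the records are stored as indented text with one "Element_Name: value" pair per line, so a value continued onto later lines would be reported only by its first line. The function names are hypothetical.

```python
import os
import re

# Matches an indented "Unrepresentable_Domain: value" line (assumed record format).
PATTERN = re.compile(r"^\s*Unrepresentable_Domain:\s*(.*)$")


def find_unrepresentable(lines, filename="<input>"):
    """Return (filename, line number, value) for each Unrepresentable_Domain line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = PATTERN.match(line)
        if match:
            hits.append((filename, lineno, match.group(1).strip()))
    return hits


def scan_tree(root):
    """Yield hits from every file ending in .met beneath the directory root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".met"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    yield from find_unrepresentable(handle, path)


if __name__ == "__main__":
    for filename, lineno, value in scan_tree("."):
        print(f"{filename}:{lineno}: {value}")
```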
By default the script does not save changes to the metadata. If run with fix specified on the command line, changes are made to the metadata. The element is immediately deleted if its value matches any of a number of commonly-used but uninformative values. If the Attribute_Domain_Values containing this Unrepresentable_Domain is left empty, it too is deleted.
The uninformative values that are automatically deleted from the metadata are
Normally the contents of Unrepresentable_Domain belong in Attribute_Definition, but occasionally someone has misunderstood the use of attribute domains and has used this to describe an Enumerated_Domain or a Range_Domain. These cases must be evaluated individually through a careful reading of the enclosing Attribute element.