USGS Geoscience Data Catalog

Using scripts to fix common metadata problems

With over 600 metadata records, Geo-NSDI is becoming a significant source of well-organized information. One of the consequences of this development is that I now see more of the inconsistencies in the metadata records themselves. But hand-editing 600+ records is a tough job, both because it's time-consuming and because it would be hard to maintain the consistency I seek. Fortunately, many editing tasks can be accomplished using scripts written in Tcl, along with the metadata-handling facilities provided in mq. This report describes several scripts used to evaluate and correct different problems in parseable metadata records. The scripts are provided along with links to intermediate results reported during their operation. The following operations are described:

  1. Standardize the names of publication series
  2. Replace missing or incorrect UTM projection parameters
  3. Eliminate Unrepresentable_Domain


  1. Standardize the names of publication series

    Most of these metadata records describe data sets that are officially published by USGS in one of its formally-recognized publication series. These series names are normally the values given in the metadata element Series_Name. This element occurs in Citation_Information, which is found in the following elements:

    • Citation and Cross_Reference (within Identification_Information)
    • Source_Citation (within Data_Quality_Information)

    The problem is solved in three steps:

    1. Find out what names are written in Series_Name.
      1. series_list.tcl searches through subdirectories for metadata records (files whose names end with .met). It reads each file and prints the value of any Series_Name element it finds. The values are printed one per line; output is redirected to a file. The result is series.0.
      2. sort is used to sort the names alphabetically. The result is series.1.
      3. uniq -c is used to collapse duplicate values and prefix each unique value with its number of occurrences. The result is series.2.
      4. sort -n -r is used to reorder the ranked list with the values appearing most frequently listed first. The result is series.3.
      5. cut -f2 is used to remove the number of occurrences of each value, leaving only the list of unique values, arranged so that those occurring most frequently are listed first. The result is series.4.

    2. Choose a preferred form for the name of each series.

      Using a text editor, the ordered list of unique series names is rearranged so that names representing the same series are listed consecutively, with the non-preferred forms of each name indented. The result is series_name.

    3. Replace variant names with a preferred form of each name.

      series_fix.tcl reads the "authority file" created in the previous step. It then searches through subdirectories for metadata records (files whose names end with .met). It reads each file and, if the value of Series_Name is one of the non-preferred forms, replaces it with the preferred form. If the value given in the metadata does not appear in the authority file, the value is left unchanged and is printed to stderr.

    The result is consistency in the values given in Series_Name, so that a given publication series appears with exactly the same name wherever it is found throughout this collection of metadata.
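
The three steps above can be sketched in Python. This is an illustration only, not the Tcl scripts themselves. It assumes the metadata are in the indented "Element_Name: value" text form with one element per line, and it assumes a layout for the authority file (preferred name starting in column 1, non-preferred variants indented on the lines that follow) that the description above does not spell out. All function names are hypothetical.

```python
import sys
from collections import Counter


def series_names(text):
    """Yield the value of every Series_Name element in one metadata record.

    Assumes the indented 'Element_Name: value' text form, one element per line.
    """
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "Series_Name":
            yield value.strip()


def rank_names(names):
    """Return the unique names, most frequent first (like sort | uniq -c | sort -n -r)."""
    return [name for name, _count in Counter(names).most_common()]


def read_authority(text):
    """Map each name in the authority file to its preferred form.

    Assumed layout: a preferred name starts in column 1; the indented
    lines that follow it are non-preferred variants of that name.
    """
    preferred = None
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if preferred is not None:
                mapping[line.strip()] = preferred
        else:
            preferred = line.strip()
            mapping[preferred] = preferred
    return mapping


def fix_series(text, mapping):
    """Replace non-preferred Series_Name values; report unknown values on stderr."""
    out = []
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and name == "Series_Name":
            value = value.strip()
            if value in mapping:
                indent = line[: len(line) - len(line.lstrip())]
                line = f"{indent}Series_Name: {mapping[value]}"
            else:
                print(f"not in authority file: {value}", file=sys.stderr)
        out.append(line)
    return "\n".join(out)
```

Keeping the authority file as plain text means the choice of preferred names stays a human decision made in an editor, while the replacement itself is mechanical and repeatable.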

  2. Replace missing or incorrect UTM projection parameters

    The parameters of the Universal Transverse Mercator projection are well-defined; given a UTM zone number the remaining parameters can be predicted.
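
As a sketch of that regularity, the Python function below derives the remaining parameters from the zone number. The central meridian follows from the standard layout of UTM zones (zone 1 is centered on 177 degrees west and each zone is 6 degrees wide). The false northing of 0 is an assumption that holds for northern-hemisphere zones. The function name and the use of a dictionary are chosen for illustration.

```python
def utm_parameters(zone):
    """Return the Transverse Mercator parameters implied by a UTM zone number.

    Assumes a northern-hemisphere zone (false northing 0); southern-hemisphere
    UTM conventionally uses a false northing of 10,000,000 m.
    """
    if not 1 <= zone <= 60:
        raise ValueError(f"UTM zone must be in 1..60, got {zone}")
    return {
        "Scale_Factor_at_Central_Meridian": 0.9996,
        # Zone 1 is centered on 177 W; each zone is 6 degrees wide.
        "Longitude_of_Central_Meridian": float(6 * zone - 183),
        "Latitude_of_Projection_Origin": 0.0,
        "False_Easting": 500000.0,
        "False_Northing": 0.0,
    }
```

For example, zone 18 gives a central meridian of -75 degrees.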

    Errors in the entry of these parameters are not uncommon, however. In some cases the errors are due to the misapplication of the term Universal Transverse Mercator to a map that is stored in Transverse Mercator projection, but in many cases the errors appear to spring from a misunderstanding of the regularity of UTM parameters. In still other cases the actual projection used cannot be ascertained with confidence using the metadata alone (meaning the metadata are ambiguous) and the problem must be resolved by examining the data in detail. In a number of cases, the values of False_Easting and False_Northing appear to be reversed.

    Consequently it is feasible and worthwhile to check UTM parameters wherever the projection has been declared. The script utm.tcl searches through subdirectories for metadata records, reading each in turn and examining the contents of the Universal_Transverse_Mercator element. It then carries out the following actions, according to the problem found and the information available:

    Element                           Problem       Action taken
    --------------------------------  ------------  ------------------------------------------------------
    UTM_Zone_Number                   Missing       Complain and don't check Longitude_of_Central_Meridian
    Transverse_Mercator               Missing       Create using UTM_Zone_Number
    Scale_Factor_at_Central_Meridian  Missing       Create with value 0.9996
                                      Wrong number  Complain but don't change
                                      Not a number  Set value to 0.9996
    Longitude_of_Central_Meridian     Missing       Create using UTM_Zone_Number
                                      Wrong number  Complain but don't change
                                      Not a number  Set value using UTM_Zone_Number
    Latitude_of_Projection_Origin     Missing       Create with value 0.0
                                      Wrong number  Complain but don't change
                                      Not a number  Set value to 0.0
    False_Easting                     Missing       Create with value 500000
                                      Wrong number  Complain but don't change
                                      Not a number  Set value to 500000
    False_Northing                    Missing       Create with value 0.0
                                      Wrong number  Complain but don't change
                                      Not a number  Set value to 0.0

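
The rules in the table can be expressed compactly. The Python sketch below is not utm.tcl; it is a simplified illustration that takes the elements as a flat dictionary of strings and does not model the enclosing Transverse_Mercator element. Treating an unparseable UTM_Zone_Number the same as a missing one is an assumption of this sketch, as are the function name and the tolerance used to compare numbers.

```python
def check_utm(elements):
    """Apply the table's rules to a flat dict of element name -> string value.

    Returns (corrected copy, list of complaints). Missing or non-numeric
    values are filled in; wrong numbers are reported but left unchanged.
    """
    fixed = dict(elements)
    complaints = []
    expected = {
        "Scale_Factor_at_Central_Meridian": 0.9996,
        "Latitude_of_Projection_Origin": 0.0,
        "False_Easting": 500000.0,
        "False_Northing": 0.0,
    }
    try:
        zone = int(elements["UTM_Zone_Number"])
    except (KeyError, ValueError):
        # Without a zone, the central meridian cannot be checked or created.
        complaints.append("UTM_Zone_Number missing or not an integer")
    else:
        expected["Longitude_of_Central_Meridian"] = float(6 * zone - 183)

    for name, want in expected.items():
        raw = elements.get(name)
        if raw is None:
            fixed[name] = str(want)      # Missing: create
            continue
        try:
            got = float(raw)
        except ValueError:
            fixed[name] = str(want)      # Not a number: set
            continue
        if abs(got - want) > 1e-9:       # Wrong number: complain, don't change
            complaints.append(f"{name}: found {raw}, expected {want}")
    return fixed, complaints
```

Leaving wrong numbers untouched is deliberate: a numeric value that differs from the UTM default may mean the data are really in a plain Transverse Mercator projection, which only a person examining the data can decide.
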
  3. Eliminate Unrepresentable_Domain

    The element Unrepresentable_Domain should never be used. But because it is present in the Standard, people use it. They always use it to convey information that should be put somewhere else.

    The script unrepresentable.tcl searches through subdirectories for metadata records and detects cases where Unrepresentable_Domain is used. Wherever it is found, its value is printed, along with the file name and line number of the element.
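
The detection pass can be sketched in Python as follows. This is an illustration rather than unrepresentable.tcl itself: it assumes the indented "Element_Name: value" text form with the value on the same line as the element name, and the function name is hypothetical. It only reports occurrences and never modifies a file.

```python
import os


def find_unrepresentable(root):
    """Yield (path, line number, value) for each Unrepresentable_Domain found.

    Walks `root` for files whose names end with .met and assumes the
    value appears on the same line as the element name.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".met"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    name, sep, value = line.strip().partition(":")
                    if sep and name == "Unrepresentable_Domain":
                        yield path, number, value.strip()
```

Calling list(find_unrepresentable(".")) from the top of the collection gathers every occurrence for review before anything is changed.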

    By default the script does not save changes to the metadata. If it is run with fix specified on the command line, changes are made to the metadata. The element is immediately deleted if its value matches any of a number of commonly-used but uninformative values. If the Attribute_Domain_Values element that contained this Unrepresentable_Domain is left empty, it too is deleted.

    The uninformative values that are automatically deleted from the metadata are listed in the script itself, which can be edited to include additional uninformative values.

    Normally the contents of Unrepresentable_Domain belong in Attribute_Definition, but occasionally someone has misunderstood the use of attribute domains and has used this to describe an Enumerated_Domain or a Range_Domain. These cases must be evaluated individually through a careful reading of the enclosing Attribute element.

URL: http://geo-nsdi.er.usgs.gov/scripts.shtml
Page Contact Information: Peter Schweitzer
Page Last Modified: Friday, 19-Oct-2007 13:26:30 EDT