HHG to XML Schema Definition Language in PDS4

From The SBN Wiki

As explained in the HHG to the eXtensible Markup Language in PDS4, XML is a syntax standard. It defines how to identify markup tags, but it does not define what the tag names are or how they are to be interpreted. In order to do that, an organization like PDS needs to define tag names and their significance for processing. There are a number of ways to do this.

Document Type Definitions were popular for a long while. They could be embedded into the document or provided in a separate file. But DTDs were not the best fit for PDS, where we have fairly elaborate constraints we want programmatic validators to enforce in all labels.

Schema languages are another approach to this tag-definitions problem. PDS decided to use the XML Schema Definition Language (XSD) for defining the content of labels. It is a large, complex, and powerful system. Fortunately, most data preparers will never need to write XSD schema files from scratch, but it is useful to know how to read them.

What It Is

XSD is itself an XML-based language (that is, it uses a set of tags defined by the XSD standard to mark up the various content requirements and definitions). PDS uses it to translate the PDS4 Information Model into a series of classes, subclasses and attributes with specific content requirements. This schema can then be used to create new labels and to validate the content of the labels created.

XSD includes a set of basic data types, so that you can specify, for example, that the <start_time> attribute has a value that conforms to the ISO time standard format. It also provides ways to restrict these data types, so you can specify that your <photon_count> must be an integer greater than zero but less than the saturation value of your detector.
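As a rough illustration of both ideas, the fragment below declares the two elements named above. It is a minimal sketch and not taken from any PDS schema: it uses the built-in xs:dateTime type for <start_time>, whereas the actual PDS4 schemas define their own date-time types, and the saturation value of 65535 is an assumed number chosen only for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Built-in type: the value must follow the ISO 8601-style
       date-time format defined by XSD (e.g. 2014-09-26T03:15:00Z). -->
  <xs:element name="start_time" type="xs:dateTime"/>

  <!-- Restricted type: an integer greater than zero and less than
       an ASSUMED detector saturation value of 65535. -->
  <xs:element name="photon_count">
    <xs:simpleType>
      <xs:restriction base="xs:integer">
        <xs:minExclusive value="0"/>
        <xs:maxExclusive value="65535"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>

</xs:schema>
```

With a declaration like this, a validator would accept <photon_count>1200</photon_count> and reject both <photon_count>0</photon_count> and <photon_count>12.5</photon_count>.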

XSD is strictly ordered. If your XSD schema file says that <start_time> comes right above <stop_time>, but you have it the other way around in your label, the validator will flag it as an error.
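In XSD, this ordering is typically expressed with an xs:sequence. The sketch below shows what such a definition might look like; the enclosing class name Observation_Times is hypothetical and used only for illustration, and xs:dateTime stands in for whatever type the real schema assigns.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical class: its children must appear in exactly
       the order listed inside xs:sequence. -->
  <xs:element name="Observation_Times">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="start_time" type="xs:dateTime"/>
        <xs:element name="stop_time" type="xs:dateTime"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

A label containing <Observation_Times> with <start_time> followed by <stop_time> would pass. The same two elements with <stop_time> first would fail validation, even though each value is individually well formed.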

What It Is Not

Pretty. XSD specifications tend to be wordy, and sometimes the markup can seem to overwhelm the content. This fades somewhat with familiarity. XML-aware editors also make it easier to navigate and visualize the XSD content.

While XSD is very useful for validating the presence or absence of specific PDS4 classes and attributes, it is not very adept at exclusive-or dependencies ("either this attribute or that one, but not both"), or at validity checks that are contingent on the actual content of one or more elements (e.g., "if the instrument name is 'Wally', then the <mode_ID> must be either 'fast' or 'slow'").

The Basics

Data preparers should not ever need to write or even modify XSD schema files. But you will need to know how to reference them, and if you're coding label-writing procedures you will likely want to know how to get useful information out of them, in particular out of the PDS Master Schema. XML Schema files can be loaded into a browser or any XML-aware editor for viewing. How easy the file is to read depends heavily on how sophisticated your browser or editor is; a survey of those tools is beyond the scope of this quick guide.

XSD files in PDS4 are generally referenced by the XML Namespace they identify. (You can see examples of namespace referencing in HHG to Namespaces in PDS4.) To turn the namespace reference into a reference to a specific schema file, you may need to include schemaLocation hints, set up an XML Catalog file, or do some other sort of configuration. What configuration will work, or work best, depends on the environment you're working in.

Example

Here's what the example from the HHG to Namespaces in PDS4 might look like if you also include the schemaLocation attribute. The schemaLocation contains a list of pairs where the first element in each pair is the namespace URI (identifier) and the second element is the physical location reference:

  <Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                         xmlns:sbn="http://pds.nasa.gov/pds4/sbn/v1" 
                         xmlns:img="http://pds.nasa.gov/pds4/img/v1"   
                         xmlns:bopps="http://pds.nasa.gov/pds4/mission/bopps/v1"  
     xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1   https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1900.xsd
                         http://pds.nasa.gov/pds4/sbn/v1   https://pds.nasa.gov/pds4/sbn/v1/PDS4_SBN_1900.xsd 
                         http://pds.nasa.gov/pds4/img/v1   https://pds.nasa.gov/pds4/img/v1/PDS4_IMG_1900.xsd 
                         http://pds.nasa.gov/pds4/mission/bopps/v1   https://pds.nasa.gov/pds4/mission/bopps/v1/PDS4_BOPPS_1100.xsd">

Because PDS decided to format its namespace URIs to look like URLs, the schemaLocation pairs look a bit redundant. The first element of each pair is the namespace URI. Note that it begins with "http:" and does not include anything that looks like a file name. The second element is a reference to a physical file. It begins with "https:" because these files are now served over HTTPS, and it ends with a specific file name.

This form of schemaLocation is useful when the software can download the file referenced from the remote site, or when a local XML catalog file is being used to catch references like this and translate them to locations in the local file system. See Understanding XML Catalog Files for more information about the latter option.
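For the local-catalog case, a catalog file in the OASIS XML Catalogs format might look something like the sketch below. The local "schemas/" directory is an assumption made for the example, and which entry type your software actually consults ("uri" or "system") depends on the tool, so treat this as a starting point rather than a recipe.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- ASSUMED layout: local copies of the schema files live in a
       "schemas" directory next to this catalog file. -->

  <!-- Tools that treat the schemaLocation value as a URI reference
       look for "uri" entries. -->
  <uri name="https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1900.xsd"
       uri="schemas/PDS4_PDS_1900.xsd"/>
  <uri name="https://pds.nasa.gov/pds4/sbn/v1/PDS4_SBN_1900.xsd"
       uri="schemas/PDS4_SBN_1900.xsd"/>

  <!-- Tools that treat it as a system identifier look for
       "system" entries instead. -->
  <system systemId="https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1900.xsd"
          uri="schemas/PDS4_PDS_1900.xsd"/>

</catalog>
```

Each entry catches one remote reference from the schemaLocation list and redirects it to a file on the local disk, so validation can run without network access.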