Difference between revisions of "Schema Referencing in PDS4 Labels"

From The SBN Wiki

Revision as of 16:15, 15 April 2015

There are several ways to tie schema documents to the XML files they define in order to validate the documents and take advantage of schema-aware editors; but in general these methods are not compatible with each other. In other words, XML editors need to pick the method they want to use, and then use it consistently. Trying to change methods generally involves changing software settings and/or editing the schema references in the XML files.

The PDS4 schema library is relatively complex and interlinked. That is, the PDS4 dictionary schemas (the ones that define the core PDS and discipline namespaces as well as the mission dictionaries) cross-reference each other. For any particular software environment to resolve all schema references reliably, it is important that the same technique be used in all dictionary schemas and all label files, regardless of source. This must also be done in an environment-agnostic way, or you will have to edit schema files each time you try to run validation on a new machine, or even in a new directory in the same disk space.

This page describes how to set up your PDS4 labels to be consistent with the PDS schema library and remove environmental dependence from your schema references as far as possible. This method is very strongly recommended to PDS4 data preparers. In fact, your node consultant may insist on it in order to have consistent and reliable validation of your deliveries.


Preliminaries

PDS-controlled namespaces will almost always be defined by a pair of related schema files: an XML Schema (.xsd) file to define the class structures and general data types; and a Schematron (.sch) file to define enumerated value lists and conditional structure relationships (e.g., you must use PDS attribute A or PDS attribute B, but not both). You will need to tie your labels to both of these files. The Schematron file will be referenced in the XML prolog; the XSD file will be referenced in the document root tag (<Product_Observational>, for example).
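As a rough sketch of where the two kinds of reference sit, the skeleton below puts the Schematron reference in the prolog and the XSD reference in the document root tag. It reuses the core PDS URIs shown later on this page; the label body is omitted, so this fragment illustrates placement only and would not validate as it stands.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Prolog: the Schematron (.sch) file is referenced here -->
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>
<!-- Document root tag: the XML Schema (.xsd) file is referenced here -->
<Product_Observational
    xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1 http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.xsd">
  <!-- label content omitted in this sketch -->
</Product_Observational>
```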

Note: Schema File vs. Namespace

URIs (Uniform Resource Identifiers) are used to identify both namespaces and the files that define those namespaces. While it is easy, given the notational conventions described below, to conflate these two things, they are and remain very different concepts to your software. The namespace URI is a logical identifier - it refers to the concept of the dictionary, irrespective of minor version changes. That is, version 1.3 of the PDS core namespace, for example, has exactly the same URI as version 1.5 of the same namespace. (Version 2.0, though, would be a different namespace.)

The schema URIs, however, must resolve to physical files. It is the schema URIs that control the version of the namespace actually applied to the label for editing assistance and for validation.

The practical upshot for PDS4 labels is that when you are referencing a schema file, your URI will contain a file name. When you are referencing a namespace, it will not. And in order to allow for reasonable transportability, file system references will be replaced by URI references that can be resolved through an XML Catalog file.


Schematron (SCH) References

Schematron references are placed in the prolog of the document following the XML declaration. Schematron files are referenced by xml-model processing instructions. (The prolog is everything before the document root tag; processing instructions are delimited by the character pairs <? and ?>, the same as for the XML declaration.)

The xml-model processing instruction is the focus of a relatively new (first proposed in 2010; last revised 2012) W3C standard "Associating Schemas with XML Documents". It exists to provide an explicit link between an XML document and a schema that defines its valid content. PDS uses the xml-model processing instruction to associate Schematron-type schema files, specifically, with a label. (The XSD schema files are associated through schemaLocation declarations.)

If your software (your editor, for example) has implemented the "Associating Schemas" standard, then you should use one of these two forms for xml-model in your PDS4 labels:

<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>

or:

<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>

which adds a bit of optional information. Here's what's going on:

href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"
href is required, and to be compliant with the "Associating Schemas" standard, it must be a URI that maps to a physical file. However, because the value of href is a URI, editors that implement the XML Catalog standard along with the schema association standard should use any relevant XML catalog entries to help resolve the href reference. You should keep that in mind when formulating your XML catalog entries. Also, note that the href value must be a single URI, so you will need to include one xml-model statement for every Schematron file you wish to associate with the label. Start each xml-model statement on a new line to avoid confusion and trouble down the line.
schematypens="http://purl.oclc.org/dsdl/schematron"
The optional schematypens attribute gives any software that cares to check a hint about what kind of schema it can expect to find when it resolves the href URI to a physical file. The namespace shown here is the official namespace URI for ISO Schematron - the version used in PDS4 dictionaries. In the absence of schematypens, any particular processing routine would have to try to decipher the referenced file type by something like file extension or the initial content inside the file.
Note for Eclipse Users: The Eclipse editor and its Schematron plug-in have a couple of significant limitations:
  1. The href value must be a physical file location relative to the label in the current disk space. Web references and URIs will not resolve, even with XML catalog file entries available, and absolute file references don't seem to work, either. This is a major drawback with Eclipse if you need schema references that are environment-independent.
  2. The presence of a schematypens pseudo-attribute will be flagged as an error.
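To illustrate the one-URI-per-statement rule described above, a prolog that associates two Schematron files might look like the sketch below. The second file name, PDS4_DISP_1100.sch, is an assumption: it simply mirrors the display dictionary .xsd name used elsewhere on this page, so check the actual file name for the dictionary version you are using. The comment at the end shows what an Eclipse-friendly form would presumably look like, given the limitations noted above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- One xml-model statement per Schematron file, each on its own line -->
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<!-- Eclipse variant (assumes the .sch file sits in the same directory as the label;
     relative path, no schematypens):
     <?xml-model href="PDS4_PDS_1201.sch"?> -->
<!-- the document root tag follows here -->
```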

There are other optional pseudo-attributes for xml-model that are unlikely, at least as of this writing, to show up in PDS4 labels, but they do at least have a format definition in the "Associating Schemas" standard. The ones you're most likely to see include:

  • type: The value should be a content-type descriptor like those you would find in an HTTP header.
  • charset: The value specifies a character set using standard abbreviations like "US-ASCII" or "UTF-8".
  • title: The value is the title of the schema document being referenced by href.


XML Schema (XSD) References

It is possible to reference .xsd files from various places in your label, but editing and debugging these references tends to be a lot easier when you've got them all in one place. So we recommend you put all your .xsd file references in a schemaLocation list inside the document root tag. For PDS4 labels, the document root tag will be one of the <Product_*> tags.


xsi:schemaLocation

The xsi:schemaLocation attribute that we'll be using inside the document root tag belongs to the XMLSchema-instance namespace, which is in turn defined by the XML Schema standard. There are two attributes from this namespace you may encounter regularly in PDS labels: xsi:schemaLocation in the document root tag, and xsi:nil, which you may see or use when setting some label values to nil in particular circumstances.

Note that, in order to reference elements from the XMLSchema-instance namespace, you have to tell your validators that you plan to do so. You do this the same way you tell your validators about other namespaces (see below). You don't have to provide a defining schema for the XMLSchema-instance namespace, though, because if your software implements that standard, then the definition will be coded into the system already.
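A minimal sketch of such a declaration follows. The xmlns:xsi line is what makes the xsi: prefix usable in the rest of the label. The disp prefix is an assumed convention for the display dictionary namespace, and example_element is a placeholder invented for this illustration, not a real PDS4 element.

```xml
<Product_Observational
    xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:disp="http://pds.nasa.gov/pds4/disp/v1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- 'example_element' is a placeholder name used only to show xsi:nil syntax -->
  <example_element xsi:nil="true"/>
</Product_Observational>
```

Note that no schema file is listed for the xsi namespace itself; only the prefix declaration is needed.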

To link to the relevant .xsd files, you assign a string value to xsi:schemaLocation. This string contains pairs of (namespace URI, schema file URI) strings, with no parentheses or commas in the actual value. You can use blanks and line breaks freely within the value to keep things visually organized. Here's a typical list of (namespace, schema) pairs from a prototype label developed for a Deep Impact spectral image:

xsi:schemaLocation=
  "http://pds.nasa.gov/pds4/pds/v1            http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.xsd
   http://pds.nasa.gov/pds4/disp/v1           http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.xsd
   http://pds.nasa.gov/pds4/sp/v1             http://pds.nasa.gov/pds4/sp/v1/PDS4_SP_1100.xsd
   http://pds.nasa.gov/pds4/geom/v0           http://pds.nasa.gov/pds4/geom/v0/PDS4_GEOM_0520.xsd
   http://pds.nasa.gov/pds4/sbn/v0            http://pds.nasa.gov/pds4/sbn/v0/sbnDD_0100.xsd
   http://pds.nasa.gov/pds4/mission/epoxi/v0  http://pds.nasa.gov/pds4/mission/epoxi/v0/epoxiDD_0100.xsd"

The string in the first column, above, is the URI for the namespace. The string in the second column is a URI that will, with the help of either a web connection or an XML Catalog file, resolve to a physical file that can be loaded into the editor or validator. The first pair refers to the core PDS namespace. Here's what's going on:

http://pds.nasa.gov/pds4/pds/v1
This is the URI of version 1 of the PDS core namespace. This string will be the same in every PDS4 label you see until there's a Version 2.0.0.0 of the Information Model.
http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.xsd
This reference points to the file that contains the definition of the core PDS namespace that will be used for the elements in this label. These sorts of URIs for PDS4 schema files will resolve to a physical file if you simply reference it via the HTTP protocol. In other words, if you put this string into your web browser, you will see the schema file displayed. That is not a requirement of the URI standard, but rather a convenience that PDS has chosen to implement for its PDS4 schema collection.


Some things to note for your xsi:schemaLocation list:

  • The PDS core namespace definition should come first. This is because the other PDS-controlled namespaces reference the PDS core namespace, but might be referencing a different version of it, most likely an earlier version. You cannot have two definitions of the same namespace in force at the same time, and if your software encounters multiple definitions it will typically take the first and warn you about later ones (some programs will let you change that behaviour through preferences, so you should check for that option when you first configure a new editor or validator). You should not depend on discipline dictionaries loading the version of the core schema you want to use. For a start, it makes it difficult to get the right value into the required <information_model_version> element.
  • You should reference all the discipline namespace dictionaries you are using in the label in your xsi:schemaLocation list, even though, technically, you don't have to load any dictionary that is referenced by another dictionary. (The geometry dictionary references the display dictionary, for example, and so will cause the Display Discipline namespace to be loaded whether you include it in your list or not.) The file name for these PDS-controlled namespaces contains the encoded version number, which provides useful information during the development process and can help in trouble-shooting validation errors.
  • You should use the HTTP-style URI for the file references shown above for your schema files wherever possible, because these references can be easily trapped and resolved by simple XML Catalog file entries. This, in turn, makes it possible to validate the same label in different environments without having to change anything in the label itself. (Note that while Eclipse users can make use of this method for XML Schema files, they will not have this luxury with their Schematron references until someone writes a better plug-in.)
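The kind of catalog entry referred to in the last point might look something like the sketch below, written in the OASIS XML Catalogs format. The directory /home/user/pds4_schemas/ is a hypothetical location; the sketch assumes you keep a local copy of the schema files laid out in the same directory structure as the URIs. Whether a given tool consults uri-type or system-type entries when resolving a schemaLocation reference varies, so both rewrite forms are shown; check the documentation for your editor or validator.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Hypothetical local mirror: change only the file: prefix on each machine -->
  <rewriteURI    uriStartString="http://pds.nasa.gov/pds4/"
                 rewritePrefix="file:///home/user/pds4_schemas/"/>
  <rewriteSystem systemIdStartString="http://pds.nasa.gov/pds4/"
                 rewritePrefix="file:///home/user/pds4_schemas/"/>
</catalog>
```

With entries like these, a reference to http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.xsd would be redirected to file:///home/user/pds4_schemas/pds/v1/PDS4_PDS_1201.xsd. Moving to a new environment then means editing the catalog file, not the labels.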