Setting up a New PDS4 XML Label

From The SBN Wiki
Revision as of 19:21, 13 April 2015 by Raugh (talk | contribs) (Creation - Safety Save)

There is a fair amount of set-up required at the top of a PDS4 XML label to reference the schemas that define the various namespaces that will be used. There are also potential variations in how this set-up is done that will depend mainly on the working environment of the person who created the label. This page discusses the standard methods and variations you'll find for establishing these connections in labels, and how they work in common environments and tools.

As an example, we'll be using this label file: di_its_example.xml. It's a prototype label developed with the version 1.2 PDS information model to test things like referencing local dictionaries and validation with the PDS4 Validation Tool.

The XML Prolog

The prolog of an XML document may contain the XML declaration, processing instructions, comments, and a document type definition. In XML 1.0, all these things are optional; in XML 1.1 the XML declaration is required. Every PDS4 label will contain an XML declaration, which PDS requires, and at least one processing instruction, because processing instructions are what connect the label to the Schematron part of the namespace definitions. In general, a PDS4 label will have one processing instruction for each PDS-controlled namespace it references.

Here's the prolog from our sample file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/sp/v1/PDS4_SP_1100.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/geom/v0/PDS4_GEOM_0520.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/sbn/v0/sbnDD_0100.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/mission/epoxi/v0/epoxiDD_0100.sch"?>

We'll examine this piece by piece.
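Before going through the pieces, it can help to see how a generic XML parser views a prolog like this one. The following is a minimal sketch using Python's standard xml.dom.minidom module, not part of any PDS tool. The label text is a shortened stand-in: it keeps two of the xml-model lines from the sample and uses a placeholder root element (Product_Observational here is only an assumption for illustration). The helper name prolog_instructions is likewise hypothetical.

```python
from xml.dom import minidom

# Shortened stand-in for a label: two xml-model lines from the sample prolog
# plus a placeholder root element (an assumption, not a real label body).
LABEL = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.sch"?>
<Product_Observational/>
"""


def prolog_instructions(xml_bytes):
    """Return (target, data) for each processing instruction before the root element."""
    doc = minidom.parseString(xml_bytes)
    found = []
    for node in doc.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            break  # the prolog ends where the root element begins
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
    return found


for target, data in prolog_instructions(LABEL):
    print(target, data)
```

Note that the XML declaration does not show up in this list: although it looks like a processing instruction, the parser treats it separately.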

XML Declaration

The first line is the XML declaration. It defines the XML standard version the label adheres to, and also defines the character set to be used. It has the format of a processing instruction, but very specific content requirements. It must be the very first thing in the file - not even white space may precede it. The XML declaration in our example file is:

<?xml version="1.0" encoding="UTF-8"?>

Here's what's going on:

version="1.0"
A version number is required in the XML declaration. This one declares that the label follows version 1.0 of the W3C XML recommendation. XML parsers will assume version 1.0 if they get a document without an XML declaration, but PDS requires that you include the declaration, both for the XML version and for the character encoding that follows it. For PDS4 purposes, the version could equally well be "1.1". (See the XML Primer for PDS4 page on this wiki if you'd like to know a little more about the version differences.) You can also use single quotes around the version number rather than double quotes, if you prefer. Which quote style you choose is not significant, and it can vary through the label.
encoding="UTF-8"
You must also specify which character encoding you will be using in the label. The default value inserted here by various label generators will depend on both your OS and your software. For PDS4 purposes, you should be using "UTF-8". Simply changing the value in the XML declaration, however, will likely not cause your label editing software to start writing the file in a different encoding. You'll need to search through your preferences to change that.
Other values you might see here include:
  • "ISO-8859-1" - This is the single-byte "Latin" codepage that maps directly onto the first 256 Unicode characters, which in turn include the 128 ASCII characters. For labels that contain only ASCII characters, "ISO-8859-1" and "UTF-8" produce identical bytes. Most PDS4 labels will likely fall into that category, but any that contain accented letters or symbols like degree signs will present problems, because UTF-8 encodes every character above 127 as two or more bytes.
  • "ISO-8859-x" - The related ISO-8859-* code pages contain characters from non-English alphabets in the higher (above 127) locations. There may be characters repeated among these code pages, but they may appear in different places in the different code pages. For archiving, this presents a major problem - so if your software is using one of these code pages, you will definitely need to change a preference or setting somewhere along the line to make sure you end up with a UTF-8 compatible character set for archiving. Note that whether this is even possible, let alone how to do it, will vary wildly with your editor software.
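
As a small illustration of why the declared encoding has to match the bytes actually written to disk, the sketch below uses only Python's built-in codecs. It compares how plain ASCII text and a degree sign are encoded, and how one byte value reads under two different ISO-8859 code pages. The helper name is_valid_utf8 and the sample strings are assumptions chosen for the example.

```python
degree = "\u00b0"  # DEGREE SIGN, a typical non-ASCII character in label text

# Pure ASCII text produces identical bytes in ISO-8859-1 and UTF-8.
ascii_text = "di_its_example"
same_for_ascii = ascii_text.encode("iso-8859-1") == ascii_text.encode("utf-8")

# Above code point 127 the two encodings diverge.
latin1_bytes = degree.encode("iso-8859-1")  # one byte
utf8_bytes = degree.encode("utf-8")         # two bytes


def is_valid_utf8(raw):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same byte value means different characters in different code pages.
byte = b"\xe4"
as_latin1 = byte.decode("iso-8859-1")
as_cyrillic = byte.decode("iso-8859-5")

print(same_for_ascii, latin1_bytes, utf8_bytes)
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
print(as_latin1, as_cyrillic)
```

A file saved as ISO-8859-1 but declared as UTF-8 will therefore stop parsing at the first such character, even though an all-ASCII label would have passed.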

In addition, you might also see a standalone declaration in an XML declaration. It would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

A standalone value of "yes" would indicate that everything you need to know about what tags are used in the document is included inside the document. This will never happen in PDS4 labels (the markup is defined by the XSD and SCH schema files, which are external to the label and referenced elsewhere), so if you see a standalone attribute in the XML declaration of a PDS4 label it should have a value of "no", which is also the default. You may see a value of "yes" in an XML file submitted as an archive product. Any XML file included in the archive requires some sort of formal document structure definition, and in some cases it might be convenient to include that definition as a Document Type Definition (DTD) inside the XML file rather than as a separate DTD or schema file.

Finally, white space not inside quotes is not significant in your XML declaration. White space includes blanks, tabs, and line breaks. So this would also be valid:

<?xml
version="1.0"
encoding="UTF-8"
standalone="no"?>
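
To check how a parser reads the declaration, here is a minimal sketch using Python's standard xml.parsers.expat module. It reports the three values from the declaration for both the one-line and the multi-line form, and shows that white space in front of the declaration is rejected. The function names read_xml_declaration and parses, and the placeholder root element, are assumptions made for the example.

```python
import xml.parsers.expat


def read_xml_declaration(xml_bytes):
    """Return the version, encoding, and standalone values expat reports."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (not given)
        found.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(xml_bytes, True)
    return found


def parses(xml_bytes):
    """Return True if expat accepts the document, False if it raises an error."""
    try:
        read_xml_declaration(xml_bytes)
        return True
    except xml.parsers.expat.ExpatError:
        return False


ROOT = b"<Product_Observational/>"  # placeholder root element (assumption)

one_line = b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>' + ROOT
multi_line = b'<?xml\nversion="1.0"\nencoding="UTF-8"\nstandalone="no"?>\n' + ROOT

print(read_xml_declaration(one_line))
print(read_xml_declaration(multi_line))
print(parses(b" " + one_line))  # a leading blank before the declaration
```

Both forms should yield the same three values, while the version with a leading blank fails to parse.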


xml-model Processing Instruction

The xml-model processing instruction is the focus of a relatively new (first proposed in 2010; last revised 2012) W3C standard "Associating Schemas with XML Documents". It exists to provide an explicit link between an XML document and the schema that defines its valid content. Processing instructions are delimited by the character pairs <? and ?> (same as for the XML declaration).

PDS uses the xml-model processing instruction to associate Schematron-type schema files, specifically, with a label. (The XSD schema files are associated through namespace declarations.) Typically you'll see one of two forms for this instruction in PDS4 labels:

<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>

as above, or:

<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>

which adds a bit of optional information. Here's what's going on:

href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"
The href pseudo-attribute is required. (It is a "pseudo-attribute" because "xml-model" isn't really an XML tag, so what looks like an attribute isn't really one.) To be compliant with the "Associating Schemas" standard, its value must be an Internationalized Resource Identifier (IRI) - that is, a logical reference that can be resolved to a physical location...somehow. However, implementations vary, and you will find that, rather than a general namespace IRI, PDS4 labels will always contain a reference to a specific physical file. There is a very PDS-specific reason for this: PDS-controlled namespaces are not defined by a single file. Each one is defined by an XSD schema file together with a Schematron file, and PDS uses the xml-model instruction specifically and only to link the Schematron file. This has to be done via an explicit file name rather than a more general namespace reference because the XML Catalog file standard routinely used to convert namespaces to physical references necessarily returns a single value; it is not possible to map one namespace IRI to two different physical files.
schematypens="http://purl.oclc.org/dsdl/schematron"
The optional schematypens pseudo-attribute gives any software that cares to check a hint about what kind of schema it can expect to find when it decodes the href value to a physical file. The namespace shown here is the official namespace IRI for ISO Schematron. In the absence of schematypens, any particular processing routine would have to try to decipher the referenced file type by something like file extension or the initial content inside the file.
Note for Eclipse Users: The Eclipse editor and its Schematron plug-in have a couple of significant limitations:
  1. The href value must be a physical file location on the local disk, given relative to the label. Web references and URIs will not resolve, even with XML catalog file entries available.
  2. The presence of a schematypens pseudo-attribute will be flagged as an error.
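
The pseudo-attributes described above can be pulled out of the instruction text with a few lines of code. The following is a rough sketch, not the method used by any PDS tool: it takes the text that follows "xml-model" in the instruction, extracts name="value" pairs with a regular expression, and then decides what kind of schema is referenced, falling back on the file extension when schematypens is absent. The function names and the returned labels are assumptions for illustration, and the pattern ignores details such as character references inside values.

```python
import re

SCHEMATRON_NS = "http://purl.oclc.org/dsdl/schematron"

# Matches name="value" or name='value'; white space around "=" is allowed.
PSEUDO_ATTR = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_data):
    """Return the pseudo-attributes of an xml-model instruction as a dict."""
    attrs = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(pi_data):
        attrs[name] = double_quoted or single_quoted
    return attrs


def guess_schema_kind(pi_data):
    """Use schematypens when present; otherwise guess from the file extension."""
    attrs = parse_pseudo_attributes(pi_data)
    if attrs.get("schematypens") == SCHEMATRON_NS:
        return "schematron"
    if "schematypens" in attrs:
        return "other"
    if attrs.get("href", "").lower().endswith(".sch"):
        return "schematron (guessed from extension)"
    return "unknown"


short_form = 'href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"'
long_form = short_form + ' schematypens="http://purl.oclc.org/dsdl/schematron"'

print(parse_pseudo_attributes(long_form))
print(guess_schema_kind(short_form))
print(guess_schema_kind(long_form))
```

The fallback branch is exactly the kind of guesswork that schematypens is meant to make unnecessary.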

There are other optional pseudo-attributes for xml-model that are unlikely, at least as of this writing, to show up in PDS4 labels, but they do at least have a format definition in the "Associating Schemas" standard. The ones you're most likely to see include:

  • type: The value should be a content-type descriptor like those you would find in an HTTP header.
  • charset: The value specifies a character set using standard abbreviations like "US-ASCII" or "UTF-8".
  • title: The value is the title of the schema document being referenced by href.