Setting up a New PDS4 XML Label

From The SBN Wiki
Revision as of 20:48, 3 August 2017 by Akash (→‎The XML Prolog: internal link)

There is a fair amount of set-up required at the top of a PDS4 XML label to reference the schemas that define the various namespaces that will be used. There are also potential variations in how this set-up is done that will depend mainly on the working environment of the person who created the label. This page discusses the standard methods and variations you'll find for establishing these connections in labels, and how they work in common environments and tools.

As an example, we'll be using this label file: di_its_example.xml (download .zip). It's a prototype label developed with the version 1.2 PDS information model to test things like referencing local dictionaries and validation with the PDS4 Validation Tool.

The XML Prolog

Main article: Anatomy of the XML Prolog

The prolog of an XML document may contain the XML declaration, processing instructions, comments, and a document type definition. In XML 1.0, all these things are optional; in XML 1.1 the XML declaration is required. All PDS4 labels will contain an XML declaration, which PDS requires, and at least one processing instruction, because processing instructions create the connections to the Schematron part of the namespace definitions. In fact, PDS4 labels will, in general, have one processing instruction for each PDS-controlled namespace referenced in the label.

Here's the prolog from our sample file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href=""?>
<?xml-model href=""?>
<?xml-model href=""?>
<?xml-model href=""?>
<?xml-model href=""?>
<?xml-model href=""?>

We'll examine this piece by piece.

XML Declaration

The first line is the XML declaration. It defines the XML standard version the label adheres to, and also defines the character set to be used. It has the format of a processing instruction, but very specific content requirements. It must be the very first thing in the file - not even white space may precede it. The XML declaration in our example file is:

<?xml version="1.0" encoding="UTF-8"?>

Here's what's going on:

A version number is required in your XML declaration. This one declares that the label follows the W3C XML recommendation version 1.0. XML parsers will assume version 1.0 if they get a document without an XML declaration, but PDS requires that you include this statement not only for the XML version, but also for the character set which follows it. For PDS4 purposes, the version could equally well be "1.1". (See the XML Primer for PDS4 page on this wiki if you'd like to know a little more about the version differences.) Also, you can use single quotes around the version number rather than double quotes, if you prefer. Which quote style you choose is not significant, and it can vary through the label.
You must also specify which character encoding standard you will be using in the label. The default value inserted here by various label generators will depend on both your OS and your software. For PDS4 purposes, you should be using "UTF-8". Simply changing the value in the XML declaration, however, will likely not cause your label editing software to start using a different code page. You'll need to search through your preferences to change that.
Other values you might see here include:
  • "ISO-8859-1" - This is the single-byte "Latin-1" code page that maps directly onto the first 256 Unicode code points, which in turn include the 128 ASCII characters. The two encodings produce identical bytes only for those 128 ASCII characters, because UTF-8 encodes code points 128 through 255 as two bytes each. So for labels that contain only ASCII characters, "ISO-8859-1" is equivalent to "UTF-8". Most PDS4 labels will likely fall into that category, but any that contain accented letters or symbols like the degree sign will present problems.
  • "ISO-8859-x" - The related ISO-8859-* code pages contain characters from non-English alphabets in the higher (above 127) locations. There may be characters repeated among these code pages, but they may appear in different places in the different code pages. For archiving, this presents a major problem - so if your software is using one of these code pages, you will definitely need to change a preference or setting somewhere along the line to make sure you end up with a UTF-8 compatible character set for archiving. Note that whether this is even possible, let alone how to do it, will vary wildly with your editor software.
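
The practical difference between these encodings can be checked directly. The following is a minimal sketch in Python; the sample strings and the same_bytes helper are made up for illustration. It compares the bytes each encoding produces for ASCII-only text and for text containing a degree sign, then tries to read the ISO-8859-1 bytes as UTF-8.

```python
# Compare how ISO-8859-1 and UTF-8 encode the same text.
ascii_text = "Exposure duration"
degree_text = "Phase angle: 63°"  # contains U+00B0 DEGREE SIGN (code point 176)


def same_bytes(text):
    """Return True when both encodings produce identical bytes for text."""
    return text.encode("iso-8859-1") == text.encode("utf-8")


print(same_bytes(ascii_text))   # ASCII-only text
print(same_bytes(degree_text))  # text with a character above 127

# Bytes written as ISO-8859-1 are not necessarily readable as UTF-8.
latin1_bytes = degree_text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```

A file saved in one of the ISO-8859 code pages but declared as "UTF-8" can therefore fail to parse at the first character above 127, which is why the editor setting matters and not just the declaration.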

In addition, you might also see a standalone declaration in an XML declaration. It would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

A standalone value of "yes" would indicate that everything you need to know about what tags are used in the document is included inside the document. This will never happen in PDS4 labels (the markup is defined by the XSD and SCH schema files external to the label and referenced elsewhere), so if you see a standalone attribute in an XML declaration in a PDS4 label it should have a value of "no", which is also the default. You may see a value of "yes" in an XML file submitted as an archive product, since any XML files included in the archive will require some sort of formal document structure definition, and in some cases it might be convenient to include it as a Document Type Definition (DTD) inside the XML file rather than as a separate DTD or schema file.

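Finally, white space not inside quotes is not significant in your XML declaration. White space includes blanks, tabs, and line breaks. So a declaration that is split across several lines, with the version and encoding separated by tabs or extra blanks, would also be valid.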
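
Both whitespace rules (free inside the declaration, forbidden before it) can be tried out by handing a parser the two cases. This is a minimal sketch using Python's standard-library ElementTree parser; the root element and the parses helper are invented for the illustration and are not part of any PDS4 label.

```python
import xml.etree.ElementTree as ET

# An XML declaration spread over several lines, using line breaks,
# blanks, and a tab between its parts.
multi_line = b'<?xml\n    version="1.0"\n\tencoding="UTF-8"\n?>\n<root/>'

# The same kind of document with a single blank before the declaration.
leading_blank = b' <?xml version="1.0" encoding="UTF-8"?>\n<root/>'


def parses(data):
    """Return True if the parser accepts data as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses(multi_line))     # white space inside the declaration
print(parses(leading_blank))  # white space before the declaration
```

The first document is accepted and the second is rejected, because the declaration is no longer the very first thing in the file.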


xml-model Processing Instruction

Processing instructions are delimited by the character pairs <? and ?> (same as for the XML declaration). The xml-model processing instruction is the focus of a relatively new (first proposed in 2010; last revised 2012) W3C standard "Associating Schemas with XML Documents". It exists to provide an explicit link between an XML document and the schema or schemas that define its valid content.

PDS uses the xml-model processing instruction to associate Schematron-type schema files, specifically, with a label. (The XSD schema files are associated through namespace declarations.) Typically you'll see one of two forms for this instruction in PDS4 labels:

<?xml-model href=""?>

as above, or:

<?xml-model href="" schematypens=""?>

which adds a bit of optional information. Here's what's going on:

The href pseudo-attribute (since "xml-model" isn't really an XML tag, its XML attribute isn't really an attribute) is required, and to be compliant with the "Associating Schemas" standard, it must be an Internationalized Resource Identifier (IRI) - that is, a logical reference that can be resolved to a physical location...somehow. However, implementations vary, and you will find that, rather than an IRI, PDS4 labels will always contain a reference to a physical file. There is a very PDS-specific reason for this: PDS-controlled namespaces are not defined by a single file. PDS is using the xml-model instruction specifically and only for Schematron file linking. This has to be done via an explicit file name rather than a more general namespace reference because the XML Catalog file standard routinely used to convert namespaces to physical references necessarily returns a single value; it is not possible to map one namespace IRI to two different physical files. However, because the value of href is an IRI, editors that implement the XML Catalog standard along with the schema association standard should use any relevant XML catalog entries to help resolve the href reference. You should keep that in mind when formulating your XML catalog entries.
Also, because PDS4 label files frequently reference multiple namespaces (as in the example file), and because the href value must be a single IRI, you will need to include one xml-model processing instruction for every Schematron file you wish to associate with the label. Start each processing instruction on a new line to avoid confusion and trouble down the line.
The optional schematypens pseudo-attribute gives any software that cares to check a hint about what kind of schema it can expect to find when it transforms the IRI in the href value to a physical file. The namespace shown here is the official namespace IRI for ISO Schematron. In the absence of schematypens, any particular processing routine would have to try to decipher the referenced file type by something like file extension or the initial content inside the file.
Note for Eclipse Users: The Eclipse editor and its Schematron plug-in have a couple of significant limitations:
  1. The href value must be a physical file location relative to the label in the current disk space. Web references and URIs will not resolve, even with XML catalog file entries available.
  2. The presence of a schematypens pseudo-attribute will be flagged as an error.
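
To see which Schematron files a label actually points at, the xml-model instructions can be read back out of the prolog. The following is a rough sketch using Python's standard-library minidom; the label text, the two .sch file names, and the xml_model_instructions helper are hypothetical, and the schematypens value is the ISO Schematron namespace. Because pseudo-attributes are not real XML attributes, the parser hands them over as one string, so the sketch splits them with a regular expression.

```python
import re
from xml.dom import Node, minidom

# Hypothetical label: the .sch file names are placeholders, not real PDS files.
LABEL = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="core_rules.sch"?>
<?xml-model href="mission_rules.sch" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<Product_Observational/>
"""

# Pseudo-attributes look like name="value" or name='value'.
PSEUDO_ATTR = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def xml_model_instructions(data):
    """Return one dict of pseudo-attributes per xml-model instruction in the prolog."""
    doc = minidom.parseString(data)
    found = []
    for node in doc.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            break  # the prolog ends where the root element begins
        if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE and node.target == "xml-model":
            attrs = {}
            for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(node.data):
                attrs[name] = double_quoted or single_quoted
            found.append(attrs)
    return found


for attrs in xml_model_instructions(LABEL):
    print(attrs)
```

Each instruction yields its own dictionary, which mirrors the one-instruction-per-Schematron-file convention described above.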

There are other optional pseudo-attributes for xml-model that are unlikely, at least as of this writing, to show up in PDS4 labels, but they do at least have a format definition in the "Associating Schemas" standard. The ones you're most likely to see include:

  • type: The value should be a content-type descriptor like those you would find in an HTTP header.
  • charset: The value specifies a character set using standard abbreviations like "US-ASCII" or "UTF-8".
  • title: The value is the title of the schema document being referenced by href.

Other Prolog Elements

The only other prolog components you should ever find in a PDS4 label would be white space (blank lines) and XML comments (delimited by <!-- and -->), neither of which is required. In XML document files inside the archive, you may find a Document Type Definition (DTD) declaration. A DTD declaration will open with "<!DOCTYPE", and may consist of a reference to an external definition (perhaps a standard DTD like the DocBook DTD), or a series of type definition statements for elements, attributes, and all the other sorts of things that PDS uses XML Schema files to define.

Document Root

Following the prolog, including any white space or comments, the next thing in the document is the root element tag. For PDS4 labels this tag will begin with "Product_" and will define the type of archive product being labelled. It will also contain a number of declarations to apply to the rest of the root tag content. Here's the root from our example file:

<Product_Observational xmlns=""

Here's what's going on:

The xmlns attribute provides the IRI (permanent, logical identifier) of a namespace. It can also be used, as it is farther down, to assign an abbreviation to a namespace for easier reference within the label. In fact, if your label references more than one namespace, you must provide unique abbreviations for all but one of them. The one without an abbreviation becomes the default namespace for the document. (You can assign abbreviations for all namespaces, if you like - allowing one namespace to have no required abbreviation is a notational convenience.)
Note that the value is not a file name - it is an actual IRI. This is somewhat problematic in PDS, because it takes two schemas (XSD and Schematron) to fully define any version of a namespace, and because of versioning issues with the PDS core namespace that we'll come back to later.
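
How a parser applies a default namespace and an abbreviated one can be sketched with Python's standard library. The namespace IRIs, the msn abbreviation, and the child element names below are placeholders invented for this illustration; they are not the values used in the example label.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace IRIs and element names, invented for illustration.
LABEL = """<Product_Observational
    xmlns="http://example.org/ns/core"
    xmlns:msn="http://example.org/ns/mission">
  <title>Example</title>
  <msn:Observation_Parameters/>
</Product_Observational>"""

root = ET.fromstring(LABEL)

# ElementTree reports each tag as {namespace IRI}local_name, which shows
# the namespace every element ended up in.
for element in root.iter():
    print(element.tag)
```

Elements written without an abbreviation land in the default namespace, while the element carrying the msn abbreviation lands in the other one. The parser never tries to fetch anything from the IRI; it is used purely as an identifier.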