Setting up a New PDS4 XML Label

From The SBN Wiki
Revision as of 18:18, 13 April 2015 by Raugh (Creation - Safety Save)

There is a fair amount of set-up required at the top of a PDS4 XML label to reference the schemas that define the various namespaces that will be used. There are also potential variations in how this set-up is done that will depend mainly on the working environment of the person who created the label. This page discusses the standard methods and variations you'll find for establishing these connections in labels, and how they work in common environments and tools.

As an example, we'll be using this label file: di_its_example.xml. It's a prototype label developed with the version 1.2 PDS information model to test things like referencing local dictionaries and validation with the PDS4 Validation Tool.

The XML Prolog

The prolog of an XML document may contain the XML declaration, processing instructions, comments, and a document type definition. In XML 1.0, all these things are optional; in XML 1.1 the XML declaration is required. Every PDS4 label contains an XML declaration, which PDS requires, and at least one processing instruction, because processing instructions create the connections to the Schematron part of the namespace definitions. In fact, PDS4 labels will, in general, have one processing instruction for each PDS-controlled namespace referenced in the label.

Here's the prolog from our sample file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/sp/v1/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/geom/v0/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/sbn/v0/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/mission/epoxi/v0/current.sch"?>
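As a rough illustration of how a tool sees these prolog lines, the following Python sketch lists the Schematron references in a label using only the standard library. The label text is a shortened copy of the prolog above, and the root element name `Placeholder_Root` and the function name `schematron_references` are assumptions made for this example; they are not part of any PDS tool.

```python
import re
from xml.dom import minidom
from xml.dom.minidom import Node

# Shortened copy of the prolog above. Placeholder_Root is a hypothetical
# root element, added only so that the document is well formed.
LABEL = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/current.sch"?>
<?xml-model href="http://pds.nasa.gov/pds4/sp/v1/current.sch"?>
<Placeholder_Root/>
"""

def schematron_references(label_bytes):
    """Return the href of every xml-model processing instruction in the prolog."""
    document = minidom.parseString(label_bytes)
    hrefs = []
    for node in document.childNodes:
        if node.nodeType == Node.ELEMENT_NODE:
            break  # the prolog ends where the root element begins
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            # The body of a processing instruction is plain text, so the
            # href pseudo-attribute has to be picked out by hand.
            match = re.search(r"""href\s*=\s*(["'])(.*?)\1""", node.data)
            if match:
                hrefs.append(match.group(2))
    return hrefs

for href in schematron_references(LABEL):
    print(href)
```

Note that the XML declaration itself is not reported as a processing instruction, even though it looks like one.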

We'll examine this piece by piece.

XML Declaration

The first line is the XML declaration. It identifies the version of the XML standard the label adheres to and the character encoding used. It has the format of a processing instruction, but with very specific content requirements. It must be the very first thing in the file - not even white space may precede it. The XML declaration in our example file is:

<?xml version="1.0" encoding="UTF-8"?>

Here's what's going on:

version="1.0"
The version number is required in your XML declaration. This one declares that the label follows the W3C XML recommendation version 1.0. XML parsers will assume version 1.0 if they get a document without an XML declaration, but PDS requires that you include this statement, not only for the XML version but also for the character encoding that follows it. For PDS4 purposes, the version could equally well be "1.1". (See the XML Primer for PDS4 page on this wiki if you'd like to know a little more about the version differences.) You can also use single quotes around the version number rather than double quotes, if you prefer. Which quote style you choose is not significant, and it can vary through the label.
encoding="UTF-8"
You must also specify which character encoding standard you will be using in the label. The default value inserted here by various label generators will depend on both your OS and your software. For PDS4 purposes, you should be using "UTF-8". Simply changing the value in the XML declaration, however, will likely not make your label editing software save the file in a different encoding. You'll need to search through its preferences to change that.
Other values you might see here include:
  • "ISO-8859-1" - This is the single-byte "Latin" codepage that maps directly onto the first 256 Unicode characters, which in turn include the 128 ASCII characters. For labels that contain only those 128 ASCII characters, "ISO-8859-1" and "UTF-8" produce identical bytes. Most PDS4 labels will likely fall into that category, but any that contain accented characters or symbols like degree signs will present problems, because UTF-8 encodes every character above 127 as two or more bytes.
  • "ISO-8859-x" - The related ISO-8859-* code pages contain characters from non-English alphabets in the higher (above 127) locations. There may be characters repeated among these code pages, but they may appear in different places in the different code pages. For archiving, this presents a major problem - so if your software is using one of these code pages, you will definitely need to change a preference or setting somewhere along the line to make sure you end up with a UTF-8 compatible character set for archiving. Note that whether this is even possible, let alone how to do it, will vary wildly with your editor software.
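The differences between these encodings are easy to see at the byte level. The following Python sketch is only an illustration: the sample strings and the helper name `is_valid_utf8` are made up for this example. It shows that ASCII-only text is identical in ISO-8859-1 and UTF-8, that a degree sign is not, and that one byte value stands for different characters in different ISO-8859-* code pages.

```python
# Hypothetical label text: one version is pure ASCII, the other contains
# a degree sign (U+00B0), which lies above code point 127.
ASCII_TEXT = "phase angle = 63 deg"
DEGREE_TEXT = "phase angle = 63\u00b0"

# Pure ASCII text is byte-for-byte identical in both encodings.
print(ASCII_TEXT.encode("iso-8859-1") == ASCII_TEXT.encode("utf-8"))

# The degree sign is one byte in ISO-8859-1 but two bytes in UTF-8.
latin1_bytes = DEGREE_TEXT.encode("iso-8859-1")
utf8_bytes = DEGREE_TEXT.encode("utf-8")
print(latin1_bytes[-1:], utf8_bytes[-2:])

def is_valid_utf8(data):
    """Report whether a byte string can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A file saved as ISO-8859-1 but declared as UTF-8 fails this check.
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))

# The same byte decodes to different characters in different code pages.
same_byte = b"\xe9"
print(same_byte.decode("iso-8859-1") == same_byte.decode("iso-8859-5"))
```

A check along the lines of `is_valid_utf8` is a cheap way to catch a label whose declaration says "UTF-8" while the editor actually saved it in a different code page.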

In addition, you might also see a standalone declaration in an XML declaration. It would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

A standalone value of "yes" would indicate that everything needed to interpret the markup in the document is included inside the document itself. This will never happen in PDS4 labels (the markup is defined by the XSD and SCH schema files external to the label and referenced elsewhere), so if you see a standalone attribute in an XML declaration in a PDS4 label, it should have a value of "no", which is also the default. You may see a value of "yes" in an XML file submitted as an archive product: any XML file included in the archive requires some sort of formal document structure definition, and in some cases it might be convenient to include that definition as a Document Type Definition (DTD) inside the XML file rather than as a separate DTD or schema file.
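If you want to check what a given file actually declares, the expat parser in the Python standard library reports the three parts of the XML declaration through its `XmlDeclHandler` callback. The sketch below is a minimal example, not a PDS tool; the function name `read_xml_declaration` and the root element `Placeholder_Root` are assumptions made for illustration.

```python
from xml.parsers import expat

def read_xml_declaration(label_bytes):
    """Return the version, encoding, and standalone values of the XML declaration.

    Returns an empty dict if the document has no XML declaration.
    """
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (omitted).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no", -1: None}[standalone]

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(label_bytes, True)
    return found

# Placeholder_Root is a hypothetical root element used to keep the sample short.
sample = b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><Placeholder_Root/>'
print(read_xml_declaration(sample))
```

For a PDS4 label, the `standalone` entry should come back as either `"no"` or `None` (attribute omitted).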

Finally, white space not inside quotes is not significant in your XML declaration. White space includes blanks, tabs, and line breaks. So this would also be valid:

<?xml
version="1.0"
encoding="UTF-8"
standalone="no"?>
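The white space and quoting rules for the XML declaration can be checked with any XML parser. The following Python sketch uses the standard library parser to compare a multi-line declaration with mixed quote styles against one preceded by a single blank. The helper name `is_well_formed` and the root element `Placeholder_Root` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Line breaks between the parts of the declaration, and a mix of single
# and double quotes. Placeholder_Root is a hypothetical root element.
MULTI_LINE = b"""<?xml
version="1.0"
encoding='UTF-8'
standalone="no"?>
<Placeholder_Root/>
"""

# A single blank in front of the declaration.
LEADING_BLANK = b' <?xml version="1.0" encoding="UTF-8"?><Placeholder_Root/>'

def is_well_formed(label_bytes):
    """Report whether the parser accepts the document."""
    try:
        ET.fromstring(label_bytes)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(MULTI_LINE))     # accepted
print(is_well_formed(LEADING_BLANK))  # rejected: declaration is not at the start
```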