Anatomy of the XML Prolog

From The SBN Wiki

The prolog of an XML document comprises everything from the start of the file up to the start tag of the document's root element. It may contain the XML declaration, processing instructions, comments, and a document type (DOCTYPE) declaration. In XML 1.0, all of these are optional; in XML 1.1 the XML declaration is required.

All PDS4 labels will contain both an XML declaration, which PDS requires, and at least one processing instruction, because processing instructions create the connections to the Schematron parts of namespace definitions. In general, a PDS4 label will have one processing instruction for each PDS-controlled namespace referenced in the label.

Example PDS4 Label Prolog

Here's a sample prolog from an early prototype label that references the PDS4 core namespace, four discipline dictionaries, and a mission dictionary:

   <?xml version="1.0" encoding="UTF-8"?>
   <?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>
   <?xml-model href="http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.sch"?>
   <?xml-model href="http://pds.nasa.gov/pds4/sp/v1/PDS4_SP_1100.sch"?>
   <?xml-model href="http://pds.nasa.gov/pds4/geom/v0/PDS4_GEOM_0520.sch"?>
   <?xml-model href="http://pds.nasa.gov/pds4/sbn/v0/sbnDD_0100.sch"?>
   <?xml-model href="http://pds.nasa.gov/pds4/mission/epoxi/v0/epoxiDD_0100.sch"?>

We'll examine this piece by piece.

XML Declaration

The first line is the XML declaration. It states which version of the XML standard the label adheres to and which character encoding the label uses. It has the format of a processing instruction, but with very specific content requirements. It must be the very first thing in the file: not even white space may precede it. The XML declaration in our example prolog is:

<?xml version="1.0" encoding="UTF-8"?>

Here's what's going on:

version="1.0"
A version number is required in your XML declaration. This one declares that the label follows the W3C XML recommendation, version 1.0. XML parsers will assume version 1.0 if they get a document without an XML declaration, but PDS requires that you include this statement, not only for the XML version but also for the character encoding that follows it. For PDS4 purposes, the version could equally well be "1.1". (See the XML Standards Primer for PDS4 page on this wiki if you'd like to know a little more about the version differences.) You can also use single quotes around the version number rather than double quotes, if you prefer. Which quote style you choose is not significant, and it can vary through the label.
encoding="UTF-8"
You must also specify which character encoding standard you will be using in the label. The default value stuck in here by various label generators will depend on your software. For PDS4 purposes, you should be using "UTF-8". (The XML standard also requires that all conformant software implement UTF-8 support.) Simply changing the value in the XML declaration, however, will likely not cause your label editing software to start using a different encoding. You'll need to search through your preferences to change that.
The valid values that might appear here are defined by the IANA Official Names for Character Sets standard. Common values you might see include:
  • "ISO-8859-1" - This is the single-byte "Latin" code page that maps the first 256 Unicode characters, which in turn include the 128 ASCII characters, to single-byte values. It is not equivalent to either US-ASCII or UTF-8 for characters beyond the ASCII range (byte values above 127).
  • "ISO-8859-x" - The related ISO-8859-* code pages contain characters from non-English alphabets in the higher (above 127) locations. There may be non-English characters in common among these code pages, but they will likely appear in different places in the different code pages.
  • "US-ASCII" - Once again, the first 128 byte values (0 through 127) correspond to the ASCII character set, but anything beyond that may vary from other encodings, both in content and position.
In all these cases, if you know there are no bytes with values greater than 127 then you can change the encoding value to "UTF-8". But if there are any higher-order characters in the file you will need to convert the file to UTF-8 prior to archiving. (Some editors can do this as a "Save-as" function; some treat it as a font-related conversion.)
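The check and conversion described above can be sketched in Python. This is a minimal illustration, not a PDS tool: the function names are hypothetical, and the conversion assumes the input really is ISO-8859-1, which the code cannot verify for you.

```python
def has_high_bytes(data: bytes) -> bool:
    """Return True if any byte has a value greater than 127 (outside ASCII)."""
    return any(byte > 127 for byte in data)


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8.

    Assumption: the input really is ISO-8859-1. Every byte sequence decodes
    without error as ISO-8859-1, so a wrong guess produces wrong characters
    rather than an exception.
    """
    return data.decode("iso-8859-1").encode("utf-8")


plain = b"<title>Comet nucleus</title>"
accented = b"<title>Caf\xe9</title>"  # 0xE9 is e-acute in ISO-8859-1

print(has_high_bytes(plain))     # pure ASCII: only the encoding value needs changing
print(has_high_bytes(accented))  # high bytes present: the file must be converted
print(latin1_to_utf8(accented))
```

After a conversion like this, the encoding value in the XML declaration still has to be edited to "UTF-8" so that it matches the bytes actually in the file.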

In addition, you might also see a standalone declaration in an XML declaration. It would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

A standalone value of "yes" would indicate that everything you need to know about what tags are used in the document is included inside the document. This will never happen in PDS4 labels (the markup is defined by the XSD and SCH schema files, which are external to the label and referenced elsewhere), so if you see a standalone attribute in the XML declaration of a PDS4 label, it should have a value of "no", which is also the default. You may see a value of "yes" in an XML file submitted as an archive product: any XML file included in the archive requires some sort of formal document structure definition, and in some cases it might be convenient to include that definition as a Document Type Definition (DTD) inside the XML file rather than as a separate DTD or schema file.

Finally, white space not inside quotes is not significant in your XML declaration. White space includes blanks, tabs, and line breaks. So this would also be valid:

   <?xml 
      version="1.0"
      encoding="UTF-8"
      standalone="no"?>
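One way to see both rules at work (white space inside the declaration is ignored, white space before it is an error) is to hand the declaration to a parser. The sketch below uses Python's standard xml.parsers.expat module; the function name and the <label/> root element are placeholders invented for the illustration.

```python
from xml.parsers import expat


def read_xml_declaration(data: bytes) -> dict:
    """Return the version, encoding, and standalone values of the XML declaration."""
    found = {}

    def on_declaration(version, encoding, standalone):
        # expat reports standalone as 1 ("yes"), 0 ("no"), or -1 (omitted).
        found["version"] = version
        found["encoding"] = encoding
        found["standalone"] = {1: "yes", 0: "no", -1: None}[standalone]

    parser = expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(data, True)
    return found


# <label/> is a placeholder root element so that each document is well formed.
compact = b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><label/>'
spread = b'<?xml \n   version="1.0"\n   encoding="UTF-8"\n   standalone="no"?><label/>'

print(read_xml_declaration(compact))
print(read_xml_declaration(spread))

try:
    read_xml_declaration(b" " + compact)  # a single blank before the declaration
except expat.ExpatError as error:
    print("rejected:", error)
```

Both forms of the declaration yield the same three values, while the copy with a leading blank is rejected by the parser.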

xml-model Processing Instruction

Processing instructions are delimited by the character pairs <? and ?> (the same as for the XML declaration). The xml-model processing instruction is the subject of a relatively new (first published in 2010; last revised in 2012) W3C Working Group Note, "Associating Schemas with XML Documents". It exists to provide an explicit link between an XML document and the schema or schemas that define its valid content. The "Schema Referencing in PDS4 Labels" page on this wiki provides complete documentation and instructions on formulating <?xml-model?> processing instructions for PDS4 labels.
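To see which schemas a label points to, the xml-model instructions in its prolog can be listed with a few lines of Python. This is a rough sketch using the standard xml.parsers.expat module; the function name, the regular expression for pulling out href, and the <label/> root element are assumptions made for the illustration, not part of any PDS software.

```python
import re
from xml.parsers import expat


def xml_model_hrefs(data: bytes) -> list:
    """Return the href of each xml-model instruction found before the root element."""
    hrefs = []
    in_prolog = True

    def on_instruction(target, text):
        # text is everything between the target name and the closing ?>
        if in_prolog and target == "xml-model":
            match = re.search(r"""href\s*=\s*(["'])(.*?)\1""", text)
            if match:
                hrefs.append(match.group(2))

    def on_element_start(name, attributes):
        nonlocal in_prolog
        in_prolog = False  # the prolog ends where the root element begins

    parser = expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_instruction
    parser.StartElementHandler = on_element_start
    parser.Parse(data, True)
    return hrefs


# <label/> stands in for the real root element of a label.
label = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-model href="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1201.sch"?>\n'
    b'<?xml-model href="http://pds.nasa.gov/pds4/disp/v1/PDS4_DISP_1100.sch"?>\n'
    b'<label/>'
)

for href in xml_model_hrefs(label):
    print(href)
```

The XML declaration itself is not reported as a processing instruction, even though it has the same delimiters, so only the two xml-model lines are listed.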

Other Prolog Elements

The only other prolog components you should ever find in a PDS4 label would be white space (blank lines) and XML comments (delimited by <!-- and -->). Neither of these is required, of course.

In XML document files inside the archive, you may find a Document Type Definition (DTD) declaration. A DTD declaration will open with "<!DOCTYPE", and may consist of a reference to an external definition (perhaps a standard DTD like the DocBook DTD), or a series of type definition statements for elements, attributes, and all the other sorts of things that PDS uses XML Schema files to define.
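Both forms of DTD declaration can be recognized programmatically. The following is a minimal sketch, again using Python's standard xml.parsers.expat module; the function name, the note element, and the note.dtd file name are hypothetical examples. By default expat does not fetch the external DTD, so the referenced file does not need to exist for this check.

```python
from xml.parsers import expat


def find_doctype(data: bytes):
    """Return details of the DOCTYPE declaration, or None if the prolog has none."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append({
            "root": name,
            "system_id": system_id,  # location of an external DTD, if any
            "public_id": public_id,
            "internal_subset": bool(has_internal_subset),
        })

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(data, True)
    return found[0] if found else None


# A reference to an external definition (note.dtd is a made-up file name).
external = b'<!DOCTYPE note SYSTEM "note.dtd"><note/>'
# Type definition statements carried inside the document itself.
internal = b'<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]><note/>'

print(find_doctype(external))
print(find_doctype(internal))
print(find_doctype(b"<note/>"))
```

A PDS4 label should produce None here, since its structure is defined by external XSD and SCH files rather than by a DTD.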