Some Things You Should Know About XML Before You Start

From The SBN Wiki
Revision as of 11:16, 23 July 2014 by Raugh (XML Syntax: Added case section)

Following are some things that you should be aware of before you embark on creating your first PDS4 XML label. This is especially true if you're coming from the PDS3 world, where things like case and keyword order were largely irrelevant; or the HTML world, where closing tags can be treated fairly cavalierly.

XML Syntax

Case Sensitivity

XML is case-sensitive: <Begin>, <begin>, and <BEGIN> are all different tags to an XML parser. Even so, if you are defining new tags (as you do when you create a local data dictionary), you should never use case alone to distinguish two tags.
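As a quick illustration of the point above, the following sketch uses Python's standard xml.etree.ElementTree module to show that a parser keeps the three spellings distinct, and that a closing tag differing only in case is rejected. The helper name case_mismatch_is_rejected and the <root> wrapper element are invented for this example.

```python
import xml.etree.ElementTree as ET

# Three sibling elements whose names differ only by case.
doc = ET.fromstring("<root><Begin/><begin/><BEGIN/></root>")
tags = [child.tag for child in doc]
print(tags)  # ['Begin', 'begin', 'BEGIN']


def case_mismatch_is_rejected() -> bool:
    """Return True if <Begin>...</begin> fails to parse (hypothetical helper)."""
    try:
        ET.fromstring("<Begin>text</begin>")
    except ET.ParseError:
        return True
    return False


print(case_mismatch_is_rejected())  # True
```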

XML Tags Must Be Closed

Unlike HTML, all XML tags must be closed. So this is not valid:

<p>This paragraph is never closed.

XML Tags Must Be Closed in Order

All XML tags must be opened and closed in strict Last Opened - First Closed order. That is, all tags opened inside one tag must be closed before you close the outside tag. So this is valid:

<em>This is <strong>OK</strong></em>

But this is not:

<em>This is <strong>NOT VALID</em></strong>
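The closing and nesting rules above can be checked mechanically. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; is_well_formed is a name chosen for this example, not part of any PDS tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Properly nested: the inner tag closes before the outer one.
print(is_well_formed("<em>This is <strong>OK</strong></em>"))          # True
# Closed out of order.
print(is_well_formed("<em>This is <strong>NOT VALID</em></strong>"))  # False
# Never closed at all.
print(is_well_formed("<em>This tag is never closed"))                  # False
```

Note that this tests well-formedness only; it says nothing about whether the document is valid against a schema.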

Character Restrictions

You may not use the less than (<) or ampersand (&) characters directly in your text fields - an XML parser will always assume these begin a tag or an entity reference (a stand-in for a character that is not available for one reason or another). The greater than (>) character is technically legal in text (except in the sequence "]]>"), but it is conventionally escaped as well. Use the entity references for these characters:

Use "&lt;" for the '<' character.
Use "&gt;" for the '>' character.
Use "&amp;" for the '&' character.

You must make these substitutions in every text field where you want to use these characters. So, for example, in a table field description of a PDS4 label you might see something like this:

This field is set to "-999" when the observed counts are &gt; 10000.

Rather than this:

This field is set to "-999" when the observed counts are > 10000.

A bare '>' like this one is tolerated by conformant parsers, but a bare '<' or '&' in the same position is a syntax error, so escaping all three consistently is the safer habit.

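When you are writing code to deal with XML text fields, remember that the entity references must be decoded before you work with the values. A conformant XML parser does this for you; if you read the label as plain text, you will need to decode them yourself.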
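The round trip between raw text and entity references might look like the following sketch, which assumes Python's standard xml.sax.saxutils and xml.etree.ElementTree modules. The sample string and the <description> wrapper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Hypothetical description text containing all three special characters.
raw = "Counts > 10000 & flags < 4"

# Encode before writing the text into a label.
escaped = escape(raw)
print(escaped)  # Counts &gt; 10000 &amp; flags &lt; 4

# Decode by hand when the label was read as plain text.
print(unescape(escaped) == raw)  # True

# A parser decodes the entity references automatically.
parsed_text = ET.fromstring(f"<description>{escaped}</description>").text
print(parsed_text == raw)  # True
```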

End-of-Line Characters

XML does not require a specific form of line break, so you can use whatever is convenient (carriage return, linefeed, or a combination) when creating an XML file. XML parsers will do the right thing largely because they're parsing on tags, not records - so whatever line break you're using is just whitespace to XML.

When writing code to process XML files, if you are using a conformant XML parser all line breaks will be normalized to linefeed characters (unless you specifically prevent this). If you're not using a conformant parser, you'll need to read the documentation to determine what whitespace processing it does on end-of-line characters, if any. In any event, you will need to worry about appropriate output carriage control for those tags that should preserve whitespace in their values (mainly description, note, and comment fields), and modify the line breaks accordingly when that matters.
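To see the normalization described above, the sketch below feeds mixed line endings to Python's standard xml.etree.ElementTree parser (built on expat) and then converts the result for output. The sample content and the choice of carriage return plus linefeed as the output convention are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# One text value containing CR+LF, a lone CR, and a lone LF.
data = b"<description>line one\r\nline two\rline three\nline four</description>"

text = ET.fromstring(data).text
print(repr(text))  # 'line one\nline two\nline three\nline four'

# If the output must use a different convention, convert explicitly.
crlf_text = text.replace("\n", "\r\n")
print(repr(crlf_text))
```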

XML Schema (XSD)

XML Schema is Strictly Ordered

The XML Schema definition language (XSD) is strictly ordered. That is, attributes and classes must appear in the order in which they are defined in the XSD file. While it is possible to circumvent this, it is difficult and it can have a serious negative impact on validation. So unless otherwise indicated, you should assume that you must put classes and attributes in the order illustrated.

Note that while schema-aware editors can tell you whether any particular class or attribute is a valid choice, they tend to sort the options alphabetically - so it can be very difficult to guess which order attributes should be in for a large class. If you get an error message that an attribute is not valid at a particular place when you know the attribute does belong in the class, then it is almost certainly an ordering error. Check the XSD or the PDS4 Information Model for correct ordering.
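When chasing a suspected ordering error, a small script can point at the first element that is out of sequence. This is only a rough sketch and not a substitute for schema validation: EXPECTED_ORDER is an illustrative list that you would copy from the XSD for the class in question, first_out_of_order is a name invented here, and the sketch ignores XML namespaces, which real PDS4 labels use.

```python
import xml.etree.ElementTree as ET

# Illustrative order only; take the real sequence from the XSD.
EXPECTED_ORDER = ["logical_identifier", "version_id", "title"]


def first_out_of_order(parent, expected_order):
    """Return the tag of the first child that appears out of sequence, or None."""
    position = 0
    for child in parent:
        if child.tag not in expected_order:
            continue  # tags outside the list are not checked by this sketch
        index = expected_order.index(child.tag)
        if index < position:
            return child.tag
        position = index
    return None


good = ET.fromstring(
    "<Identification_Area><logical_identifier>x</logical_identifier>"
    "<version_id>1.0</version_id><title>t</title></Identification_Area>"
)
bad = ET.fromstring(
    "<Identification_Area><title>t</title>"
    "<logical_identifier>x</logical_identifier>"
    "<version_id>1.0</version_id></Identification_Area>"
)
print(first_out_of_order(good, EXPECTED_ORDER))  # None
print(first_out_of_order(bad, EXPECTED_ORDER))   # logical_identifier
```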