XML Primer for PDS4

From The SBN Wiki
Revision as of 19:01, 5 November 2012 by Raugh (talk | contribs) (Added XML Schema)

This page lists the XML standards applicable to PDS4 labels and processing. It provides links to the standards themselves, and some brief overview points about each.


XML

XML has two official (that is, W3C Recommendation) versions, 1.0 and 1.1.

Overview

The XML Standard defines the overall syntax for XML files, which are called documents by the XML Standard - that is, PDS4 labels are XML documents. The XML Standard covers the following topics relevant to PDS4 labels:

  • Syntax for elements and attributes - the use of '<' and '>' around tag names; the requirement for closing tags; comment and processing instruction format; etc.
  • Character set - Allowed characters for names and content (with reference to the Unicode standard)
  • The XML declaration, which must appear at the very start of an XML document (it is required in XML 1.1 and recommended in XML 1.0).
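
As an illustration of the points above, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree to parse a small, made-up document. The element names (Label, title) and the version attribute are hypothetical and are not taken from any PDS4 schema.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML document (element and attribute names are hypothetical).
# The XML declaration comes first, with nothing before it.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment: dropped by this parser by default -->
<Label version="1">
  <title>Example &amp; demo</title>
</Label>
"""

root = ET.fromstring(DOC)
title = root.find("title")

print(root.tag)     # the element name of the root
print(root.attrib)  # its attributes, as a dict
print(title.text)   # character content, with &amp; decoded to &


def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# A missing closing tag for <title> makes the document ill-formed.
print(is_well_formed(b"<Label><title>oops</Label>"))
```

Note that this only checks well-formedness (the syntax rules of the XML Standard). It says nothing about whether the element names are the right ones, which is the job of a schema.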

It also includes syntax for writing Document Type Definitions (DTDs), which are used to define the element and attribute names and format constraints. In PDS4, however, we will be using XML Schema to do that rather than DTDs.

The XML standard does not define the element names themselves. Any application or set of applications planning to make use of XML must either define its own set of element names and associated data types, or use one of the publicly defined systems (like DocBook, which is an XML mark-up language used for creating books and articles).

Version 1.0 vs. Version 1.1

For most purposes within the PDS, the distinction between version 1.0 and 1.1 of the XML standard can be ignored. The major differences are:

  • 1.1 expands the allowed character set for names to accommodate expansion of the Unicode standard since version 1.0.
  • 1.1 has looser character constraints on names, in anticipation of future expansion of the Unicode standard. Where version 1.0 prohibited everything that wasn't explicitly allowed, version 1.1 allows anything that isn't explicitly prohibited.
  • 1.1 expands line-end conventions to include Unicode conventions (and a couple others)
  • 1.1 defines "full normalization" constraints. These only come into play in the US when working with documents converted from word processing environments where typographic ligatures or letters with diacritical marks might be transcoded as either a single Unicode character or a sequence of the individual characters that must be used to compose the final character. If you don't know what that means and you want to, try this Wikipedia article on "Unicode equivalence" as a starting point.
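
The equivalence issue behind that last point can be seen in a short Python sketch using the standard unicodedata module. It only demonstrates the underlying Unicode behaviour with the NFC normalization form. It is not an implementation of the XML 1.1 "full normalization" rules, which are defined in the XML 1.1 specification itself.

```python
import unicodedata

# Two encodings of the same visible text, a lower-case e with an acute accent:
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# They look identical on screen but do not compare equal as strings.
print(precomposed == decomposed)

# After normalizing both to NFC (composed form) they are the same string.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)
print(len(decomposed), len(nfc_b))
```

A tool that compares element content or names without normalizing first could treat the two spellings as different values.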

While that last point is not likely to come up in a PDS4 label context in the US, it may be relevant to international organizations looking to adopt PDS4 standards and tools for local use. In this case, it may well be worth the effort to ensure that all software used is working to the XML 1.1 standard.


XML Schema

The XML Schema Definition Language (XSD) has two official (W3C Recommendation) versions, 1.0 and 1.1, each of which comes in two parts: Part 1 (Structures) and Part 2 (Datatypes).

The 1.0 version also has an associated Primer, which should be largely applicable to both versions.


Overview

XML Schema Definition Language (XSD) describes an XML language that can be used to define elements, attributes, and content constraints for a set of XML documents. It effectively defines a new XML "language" by defining element and attribute names and placing constraints and requirements on content and usage.
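
Because an XSD file is itself an XML document, its declarations can be read with any XML parser. The following Python sketch lists the element declarations in a small schema. The schema, its target namespace, and its element names are all made up for illustration and are not PDS4 definitions. The sketch also simplifies: it treats minOccurs="0" as "optional" and ignores everything else a real schema can express (references, groups, choices, and so on).

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace

# A made-up schema: the names and target namespace are hypothetical.
XSD = b"""<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.org/demo"
           elementFormDefault="qualified">
  <xs:element name="Label">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="comment" type="xs:string" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="version" type="xs:positiveInteger" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

schema = ET.fromstring(XSD)


def describe_elements(schema_root):
    """Return (name, type, is_optional) for every xs:element declaration."""
    rows = []
    for el in schema_root.iter(f"{{{XS}}}element"):
        optional = el.get("minOccurs") == "0"  # simplification, see lead-in
        rows.append((el.get("name"), el.get("type"), optional))
    return rows


for name, type_, optional in describe_elements(schema):
    print(name, type_, "optional" if optional else "required")
```

The Label declaration reports a type of None because its type is defined inline rather than by a type attribute.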

XSD is used by PDS to define the structure of PDS4 labels and enforce content requirements. If you are creating or editing PDS4 labels, you will need to know how to reference XSD files from the labels; how to use the XSD files to validate your labels; and at some point you'll probably want to know how to get structural information out of XSD files so you can see what you can optionally include in your labels. You will probably not have to write your own schema files, unless you really want to.
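
As a sketch of what "referencing XSD files from the labels" looks like in practice, the Python below reads the xsi:schemaLocation attribute from a small document. The xsi namespace and the schemaLocation attribute are defined by the XML Schema standard. The document itself, its namespace, and the schema URL are hypothetical placeholders and not real PDS4 values.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# A made-up label: the namespace and schema URL are placeholders.
LABEL = b"""<?xml version="1.0" encoding="UTF-8"?>
<Label xmlns="http://example.org/demo"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://example.org/demo http://example.org/schemas/demo.xsd">
  <title>Example</title>
</Label>
"""


def schema_locations(label_bytes):
    """Map each namespace URI to the schema location hinted for it."""
    root = ET.fromstring(label_bytes)
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    # schemaLocation holds whitespace-separated (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


print(schema_locations(LABEL))
```

This only reports which schema files a label points to. Actually validating the label against them requires a schema-aware validator, which the Python standard library does not provide.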

Version 1.0 vs. Version 1.1

Most of the differences between XSD 1.0 and XSD 1.1 would not be visible to an end-user who is simply referencing schemas for writing and validating labels. Because of a number of improvements to validation processing, if you have the option of using XSD 1.1 validation (it might be called "XML Schema 1.1" in your software), it's probably a good idea to use it.

If you are going to be writing or editing XSD files directly, XSD 1.1 added the ability to include assertion clauses, written in XPath syntax, directly in your XSD files. This can be useful in a PDS context if, for some reason, you are writing a data dictionary from scratch and want to avoid having to write a separate Schematron file just to enforce a couple of trivial co-existence constraints. As of this writing, this seems very unlikely to happen.
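
To make "co-existence constraint" concrete, the Python sketch below hand-checks one such rule: if a start_time element is present, a stop_time element must be present too. The comment shows roughly how the rule might be expressed as an XSD 1.1 assertion. The element names are hypothetical, and the function is a plain Python stand-in for illustration; it does not perform XSD 1.1 or Schematron validation.

```python
import xml.etree.ElementTree as ET

# In XSD 1.1 such a rule might be written inside a complex type as:
#   <xs:assert test="not(start_time) or stop_time"/>
# i.e. "if start_time is present, stop_time must be present too".
# The element names here are hypothetical.


def satisfies_coexistence(xml_text: str) -> bool:
    """Hand-written equivalent of the assertion above (no schema involved)."""
    obs = ET.fromstring(xml_text)
    has_start = obs.find("start_time") is not None
    has_stop = obs.find("stop_time") is not None
    return (not has_start) or has_stop


both = "<Observation><start_time>t0</start_time><stop_time>t1</stop_time></Observation>"
start_only = "<Observation><start_time>t0</start_time></Observation>"
neither = "<Observation/>"

print(satisfies_coexistence(both))        # both present: rule satisfied
print(satisfies_coexistence(start_only))  # start without stop: rule violated
print(satisfies_coexistence(neither))     # neither present: rule satisfied
```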