XML Primer for PDS4

From The SBN Wiki
Revision as of 18:20, 5 November 2012 by Raugh (talk | contribs) (Creation)

This page lists the XML standards applicable to PDS4 labels and processing. It provides links to the standards themselves, and some brief overview points about each.


XML

XML has two official (that is, W3C Recommendation) versions: 1.0 and 1.1.

Overview

The XML Standard defines the overall syntax for XML files, which it calls documents. PDS4 labels are therefore XML documents. The XML Standard covers the following topics relevant to PDS4 labels:

  • Syntax for elements and attributes - the use of '<' and '>' around tag names; the requirement for closing tags; comment and processing instruction format; etc.
  • Character set - Allowed characters for names and content (with reference to the Unicode standard)
  • The XML declaration, which must come first in the document whenever it is present. XML 1.1 requires it; XML 1.0 strongly recommends it.
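
The syntax points above can be sketched with Python's standard-library parser. This is only an illustration: the element and attribute names (label, title, id) are invented for the example and do not come from any PDS4 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, label-like document. The names used here are invented for
# illustration and are not taken from any PDS4 schema.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment: skipped by the default parser -->
<label id="example">
  <title>Sample</title>
  <empty_element/>
</label>
"""

# The XML declaration is the first thing in the document; the parser then
# exposes element names, attributes, and text content.
root = ET.fromstring(DOC)
title_text = root.find("title").text
print(root.tag, root.attrib["id"], title_text)

# A missing closing tag makes the document ill-formed, so parsing fails.
try:
    ET.fromstring(b"<label><title>Sample</label>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

A parser that accepts the first document and rejects the second is checking well-formedness only. It says nothing about whether the element names are the right ones, which is the job of a DTD or schema.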

It also includes syntax for writing Document Type Definitions (DTDs), which are used to define the element and attribute names and format constraints. In PDS4, however, we will be using XML Schema to do that rather than DTDs.

Version 1.0 vs. Version 1.1

For most purposes within the PDS, the distinction between version 1.0 and 1.1 of the XML standard can be ignored. The major differences are:

  • 1.1 expands the allowed character set for names to accommodate expansion of the Unicode standard since version 1.0.
  • 1.1 has looser character constraints on names, in anticipation of future expansion of the Unicode standard. Where version 1.0 prohibited everything that wasn't explicitly allowed, version 1.1 allows anything that isn't explicitly prohibited.
  • 1.1 expands line-end conventions to include the Unicode line separator (U+2028) and the NEL character (U+0085) used on some mainframe systems.
  • 1.1 defines "full normalization" constraints. These only come into play in the US when working with documents converted from word processing environments, where typographic ligatures or letters with diacritical marks might be transcoded either as a single Unicode character or as a sequence of the individual characters that compose it. For background, the Wikipedia article on "Unicode equivalence" is a good starting point.
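
The single-character versus composed-sequence issue in the last point can be seen with Python's standard unicodedata module. This is a minimal sketch of Unicode equivalence only. It does not implement the XML 1.1 full-normalization check, and the sample characters are chosen purely for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal code point by code point.
same_before = precomposed == decomposed

# Normalizing to NFC composes the sequence into the single code point.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed
print(same_before, same_after)

# A typographic ligature such as U+FB01 ("fi" as one glyph) is untouched by NFC;
# only the compatibility form NFKC expands it into two letters.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature, nfkc_ligature)
```

Software that compares names or values without normalizing first may treat the two spellings of "é" as different strings, which is the kind of mismatch normalization constraints are meant to prevent.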

While that last point is not likely to come up in a PDS4 label context in the US, it may be relevant to international organizations looking to adopt PDS4 standards and tools for local use. In this case, it may well be worth the effort to ensure that all software used is working to the XML 1.1 standard.