XML Primer for PDS4

From The SBN Wiki
Revision as of 19:29, 3 August 2017 by Akash (talk | contribs) (categories)

This page lists the XML standards applicable to PDS4 labels and processing. It provides links to the standards themselves, and some brief overview points about each.

The XML development effort has a design philosophy that is modular in approach. Standards are developed to be useful across contexts, and when a part of a larger standard appears to have broader application than the original context, it is split off into a separate standard for development, so that later work can reference an existing standard and avoid reinventing the wheel. That is why, for example, the Namespaces in XML standard is not just part of the XML 1.0 standard: the concept of a namespace is also applicable in schema files, catalog files, and other types of applications.

As a side-effect, though, it seems like you have to know half a dozen different standards to get anything done in XML - thus this summary page.


XML

XML has two official (that is, W3C Recommendation) versions, 1.0 and 1.1.

Overview

The XML Standard defines the overall syntax for XML files, which are called documents by the XML Standard - that is, PDS4 labels are XML documents. The XML Standard covers the following topics relevant to PDS4 labels:

  • Syntax for elements and attributes - the use of '<' and '>' around tag names; the requirement for closing tags; comment and processing instruction format; etc.
  • Character set - Allowed characters for names and content (with reference to the Unicode standard)
  • XML declaration - When present, it must appear at the very start of the document. XML 1.1 requires it; XML 1.0 only recommends it.

It also includes syntax for writing Document Type Definitions (DTDs), which are used to define the element and attribute names and format constraints. In PDS4, however, we will be using XML Schema to do that rather than DTDs.

The XML standard does not define the element names themselves. Any application or set of applications planning to make use of XML must either define its own set of element names and associated data types, or use one of the publicly defined systems (like DocBook, which is an XML mark-up language used for creating books and articles).
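As a rough illustration of the syntax the XML standard governs, the sketch below parses a tiny document with Python's standard library. The element and attribute names (`label`, `title`, `version`) are invented for this example and are not PDS4 names.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML document (element and attribute names are hypothetical).
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- A comment: skipped by the parser by default -->
<label version="1">
  <title>Example</title>
  <empty_element/>
</label>
"""

root = ET.fromstring(DOC)
print(root.tag)                  # name of the root element
print(root.attrib)               # attributes as a dictionary
print(root.find("title").text)   # text content of the <title> child

# A missing closing tag means the document is not well-formed,
# so the parser rejects it instead of guessing.
try:
    ET.fromstring(b"<label><title>Example</label>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

This only checks well-formedness, that is, the rules of the XML standard itself. Whether the element names are the right ones is a separate question that the XML standard leaves to DTDs or schemas.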

Version 1.0 vs. Version 1.1

For most purposes within the PDS, the distinction between version 1.0 and 1.1 of the XML standard can be ignored. The major differences are:

  • 1.1 expands the allowed character set for names to accommodate expansion of the Unicode standard since version 1.0.
  • 1.1 has looser character constraints on names, in anticipation of future expansion of the Unicode standard. Where version 1.0 prohibited everything that wasn't explicitly allowed, version 1.1 allows anything that is not explicitly prohibited.
  • 1.1 expands line-end conventions to include Unicode conventions (and a couple of others).
  • 1.1 defines "full normalization" constraints. These only come into play in the US when working with documents converted from word processing environments where typographic ligatures or letters with diacritical marks might be transcoded as either a single Unicode character or a sequence of the individual characters that must be used to compose the final character. If you don't know what that means and you want to, try this Wikipedia article on "Unicode equivalence" as a starting point.

While that last point is not likely to come up in a PDS4 label context in the US, it may be relevant to international organizations looking to adopt PDS4 standards and tools for local use. In this case, it may well be worth the effort to ensure that all software used is working to the XML 1.1 standard.
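The equivalence problem behind the normalization point can be sketched in a few lines of Python. This shows only the underlying Unicode behavior, using the standard `unicodedata` module; it is not an implementation of the XML 1.1 normalization check.

```python
import unicodedata

composed = "\u00e9"      # "é" stored as one precomposed character
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings display identically but differ character by character.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing to NFC (the composed form) makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)
```

Software that compares names or values without normalizing first will treat the two spellings as different strings, which is the situation the normalization constraints are meant to address.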


XML Schema

The XML Schema Definition Language (XSD) has two official (W3C Recommendation) versions, 1.0 and 1.1, each of which comes in two parts (Part 1: Structures and Part 2: Datatypes).

The 1.0 version also has an associated Primer, which should be largely applicable to both versions.


Overview

XML Schema Definition Language (XSD) describes an XML language that can be used to define elements, attributes, and content constraints for a set of XML documents. With XSD you can effectively create a new XML-based "language" by defining element and attribute names and placing constraints and requirements on content and usage.

XSD is used by PDS to define the structure of PDS4 labels and enforce content requirements. If you are creating or editing PDS4 labels, you will need to know how to reference XSD files from the labels; how to use the XSD files to validate your labels; and at some point you'll probably want to know how to get structural information out of XSD files so you can see what you can optionally include in your labels. You will probably not have to write your own schema files, unless you really want to.
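Because XSD files are themselves XML, structural information can be pulled out of them with an ordinary XML parser. The following is a minimal sketch under simplifying assumptions: the schema is a made-up one with hypothetical element names, and the helper `list_elements` (also hypothetical) looks only at `name` and `minOccurs`, ignoring `ref` declarations, named types, and the other mechanisms a real schema is likely to use.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"   # the XML Schema namespace

# A tiny, made-up schema; the names here are hypothetical, not taken from PDS4.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="label">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="comment" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def list_elements(schema_root):
    """Return (name, is_optional) for every xs:element declaration found."""
    found = []
    for decl in schema_root.iter(f"{{{XS}}}element"):
        optional = decl.get("minOccurs") == "0"
        found.append((decl.get("name"), optional))
    return found

schema = ET.fromstring(XSD)
elements = list_elements(schema)
for name, optional in elements:
    print(name, "(optional)" if optional else "(required)")
```

For anything beyond a quick look, a schema-aware XML editor will give a more complete picture than a hand-written script like this one.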

Version 1.0 vs. Version 1.1

Most of the differences between XSD 1.0 and XSD 1.1 would not be visible to an end-user who is simply referencing schemas for writing and validating labels. Because of a number of improvements to validation processing, if you have the option of using XSD 1.1 validation (it might be called "XML Schema 1.1" in your software), it's probably a good idea to use it.

If you are going to be writing or editing XSD files directly, XSD 1.1 added the ability to include assertion clauses, written in XPath syntax, directly in your XSD files. This can be useful in a PDS context if, for some reason, you are writing a data dictionary from scratch and want to avoid having to write a separate Schematron file just to enforce a couple of trivial co-existence constraints. As of this writing, this seems very unlikely to happen.


Namespaces in XML

The full W3C Recommendation is available here:

Overview

The namespaces standard defines how namespaces are referenced within XML documents. This is the standard that reserves the "xmlns" attribute for defining short-hand prefixes for namespaces within XML documents. It also specifies that namespace identifiers must follow the Internationalized Resource Identifiers (IRI) format. PDS uses Uniform Resource Locators (URLs - a subset of IRIs) for namespace identifiers within the PDS system.
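The relationship between a prefix and its namespace identifier can be seen in a short Python sketch. The namespace URLs and element names below are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URLs and element names, for illustration only.
DOC = """<a:label xmlns:a="http://example.org/ns/alpha"
         xmlns:b="http://example.org/ns/beta">
  <a:title>Example</a:title>
  <b:title>Another</b:title>
</a:label>
"""

root = ET.fromstring(DOC)

# ElementTree replaces each prefix with the full namespace identifier in braces,
# so the two <title> elements are clearly different names.
tags = [child.tag for child in root]
print(tags)

# The prefix used in a search is only local short-hand; it does not have to
# match the prefix in the document, but the namespace identifier does.
ns = {"x": "http://example.org/ns/beta"}
found = root.find("x:title", ns).text
print(found)
```

The point of the example is that the prefix is just an abbreviation. Two documents can use different prefixes for the same namespace and still be talking about the same elements.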

Version 1.0 vs. Version 1.1

There is an earlier version of the namespace standard, but for PDS purposes the differences are not significant. Apart from errata, the substantive changes from 1.0 to 1.1 are:

  • Version 1.1 provides a way to un-declare prefixes.
  • Version 1.1 defines namespace names as being IRIs, rather than URIs.

PDS applications are unlikely to ever be so complex as to require un-defining namespace prefixes, and PDS has made a policy decision to use URIs rather than IRIs (the character set constraints are tighter for URIs than IRIs). So Version 1.0 of the namespace standard should be completely sufficient for PDS work.



Other X-standards

In keeping with the modular standards concept, there are other XML standards that are being used and that you might hear referenced, but which you may not have to worry about unless and until you are doing some detailed PDS4 development or validation work that requires them. These include:

XML Catalog
This OASIS standard defines a special catalog file format for use in translating logical references, like URIs, to physical locations. It also defines the method for resolving references according to the information in the catalog file. If you do any serious work with PDS4 labels, you will likely set up an XML catalog file to resolve schema references. You can read the "Understanding XML Catalog Files" page on this wiki for some history, some explanation, and some specific advice on the likely most useful parts of the standard for PDS work.
XPath
This W3C recommendation provides syntax for selecting specific tags ("nodes" in the XML-speak of the standard) either by their syntactic relationship to other parts of the document or by their values. This syntax is used in Schematron files to find and test various PDS attributes for conditions that are difficult or impossible to dictate via XSD definitions, and also for defining and validating enumerated value lists for attributes that are restricted to specific values. If you're planning to write your own Schematron rules, you'll need to become familiar with this standard.
XInclude
This W3C recommendation formally defines a way to include the contents of an external file into an XML file being processed (that is, the traditional programmer's "include" concept). This will probably come up in PDS4 eventually, but not for the early builds.
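To give a feel for what an XML catalog does, here is a deliberately simplified Python sketch of catalog lookup. It handles only exact-match `uri` entries. The real standard defines more entry types and resolution rules, and in practice your XML tools do the resolving for you. The schema URL, the local path, and the `resolve` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# A small catalog file; the URL and the local path are hypothetical.
CATALOG = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://example.org/schemas/demo.xsd" uri="local/schemas/demo.xsd"/>
</catalog>
"""

def resolve(catalog_root, reference):
    """Return the mapped location for an exact <uri> match, else the reference unchanged."""
    for entry in catalog_root.iter(f"{{{CATALOG_NS}}}uri"):
        if entry.get("name") == reference:
            return entry.get("uri")
    return reference

catalog = ET.fromstring(CATALOG)
mapped = resolve(catalog, "http://example.org/schemas/demo.xsd")
unmapped = resolve(catalog, "http://example.org/schemas/other.xsd")
print(mapped)
print(unmapped)
```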

In all these cases, your friendly, neighborhood PDS node consultant should be able to provide you with examples or templates and additional advice.
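For a first taste of XPath-style selection, the sketch below uses the limited XPath subset built into Python's `xml.etree.ElementTree`. Full XPath, as used in Schematron, is considerably richer. The document and its element names are made up, and the enumerated-value check is written by hand in Python rather than as a Schematron rule.

```python
import xml.etree.ElementTree as ET

# Made-up document; the element names are hypothetical.
DOC = """<label>
  <file><name>a.dat</name><kind>binary</kind></file>
  <file><name>b.txt</name><kind>text</kind></file>
</label>
"""
root = ET.fromstring(DOC)

# Select by position in the tree: every <name> directly under a <file>.
names = [n.text for n in root.findall("./file/name")]
print(names)

# Select by value: each <file> whose <kind> child has the text "text".
text_files = [f.find("name").text for f in root.findall("./file[kind='text']")]
print(text_files)

# A hand-written check of an enumerated value list, similar in spirit
# to what a Schematron rule would express.
allowed = {"binary", "text"}
bad = [k.text for k in root.findall(".//kind") if k.text not in allowed]
print(bad)
```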