Understanding XML Catalog Files

From The SBN Wiki
Revision as of 16:19, 20 February 2013 by Raugh (talk | contribs)

XML catalog files use some terminology that can be fairly opaque to those new to XML. Following is an explanation of the key terms used in the XML Catalog standard and their relevance to the PDS4 context.

Identifiers: Public vs. System

The concepts of public identifier and system identifier predate XML.  Both were key in the pre-OASIS world of SGML (the ancestor of XML). These identifiers allowed a document author to reference a file external to the author's own document.  Typically, this would be a Document Type Definition (DTD). DTDs predate schemas, but did the same sort of job - defining the valid content of an SGML file.  Standard DTDs were developed to provide interoperability between systems.  Perhaps the most widely known is the DTD that defines the DocBook documentation system.

At this stage of the game, the distinction between public and system identifiers was clear and simple: The public identifier was a globally unique, permanent and invariant identifier assigned to a resource, like the DocBook DTD.  The format of the public identifier was defined as part of the ISO 8879 (SGML) standard as the Formal Public Identifier (FPI) format, and there was (presumably still is) at least one registration authority to assign namespaces to ensure that unique FPIs can be formulated by diverse organizations. So the public identifier was clearly a logical identification of a resource.

In this regime, the system identifier was always a physical location - a reference to a file on disk, for example.  The SGML standard required that at least one of the two identifiers was present, but did not require both.
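As a rough illustration of the two identifier types, a document type declaration carrying both might look like the sketch below. The public identifier is the FPI of the DocBook 4.5 SGML DTD; the system identifier is a made-up local path, assumed purely for illustration.

```xml
<!-- The public identifier (logical name) comes first, -->
<!-- followed by the system identifier (physical location). -->
<!-- The local path is hypothetical and would differ on every machine. -->
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.5//EN"
                      "/usr/local/share/sgml/docbook/4.5/docbook.dtd">
<book>
  <title>Sample</title>
</book>
```

The public identifier means the same thing on any system, while the system identifier is only useful on the machine where that path exists.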

Enter Catalog Files

At this point, the public identifier was a logical reference that could not easily be resolved, but at least it was transportable, unlike the system identifier.  To address this problem, the SGML Open project, which eventually became OASIS, developed the first catalog-type standard (OASIS Technical Resolution 9401:1997) to map public identifiers to system identifiers in an external ("catalog") file, which could be referenced by applications.

Now, in this pre-XML world, this was a pretty straightforward task.  The public identifier was always a logical reference, and the system identifier was always a physical reference to a locally accessible file.  So a DocBook author, for example, could include both types of identifier in his source files as he was preparing them, and when he sent them out into the world the receivers could set their applications to ignore the system identifiers in the document and instead translate the public identifiers using their own catalog files.  In other words, the application could choose whether the public or system identifiers should be "preferred" - a term that will come back later with much reduced significance for XML.

Time Passes...

SGML begat XML, the SGML Open group became OASIS Open, and URIs have largely supplanted FPIs.  In XML documents, the public identifier is optional, while the system identifier is usually required (to identify things like namespaces and imported files).  But in XML, these references are also required to be URIs, which are themselves logical pointers.  So in the XML regime, the system identifier does not necessarily point to a physical location.

OK, it might point to a physical location - some URIs do.  But in general URIs are not required to be resolvable in themselves, so you can't count on someone else's URI being directly resolvable to a physical file. That is why XML documents may include schemaLocation attributes - to indicate the physical location of the files needed to define namespaces or to be imported into the current document.
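A minimal sketch of a schemaLocation attribute on a label's root element follows. The namespace URI and schema file name are illustrative assumptions, not an official listing of PDS4 schema locations.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- xsi:schemaLocation holds pairs: a namespace URI (a logical name) -->
<!-- followed by a hint for where the schema file can be fetched. -->
<!-- Namespace and file name below are assumed for illustration only. -->
<Product_Observational
    xmlns="http://pds.nasa.gov/pds4/pds/v1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pds.nasa.gov/pds4/pds/v1
                        http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1000.xsd">
  <!-- label content omitted -->
</Product_Observational>
```

Either URI in the pair is a candidate for remapping to a local copy.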

XML Catalog Standard

So OASIS rolled up its sleeves and beefed up the early mapping standard to become the XML Catalog 1.0 standard, to address both SGML and XML mapping needs.  The catalog file maps the values of public identifiers, system identifiers and URIs generally to (other) URIs that actually do resolve to a physical file.  It will do this for anything your application considers to be an external id (either a public identifier or a system identifier), as well as for any other URIs it encounters.  A few things to keep in mind when reading or writing catalog entries:

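  • For plain <uri> entries, the XML Catalog standard states that the first matching entry is the one applied - anything else will be ignored.  (For prefix- and suffix-based entries, the standard selects the longest match rather than the first, though applications may differ in how strictly they follow this.)  So when you're writing your translation elements, the safe habit is to put the most specific matches first, and the more general matches later.  For example, if you're trying to match a URI that ends in a file name, put that element before any element that matches just the path.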
  • Applications can choose to be more or less picky about URI formatting in your catalog files.  According to the XML Catalog standard, your URIs should be URI-encoded, meaning that certain characters (like blanks) must be escaped and encoded, and protocols should use proper syntax for directory referencing.  The oXygen editor, for example, is fairly lenient about URI formatting in the catalog file.  Other applications may not be so forgiving.
  • One of the consequences of the evolution from DTD and external (public/system) identifiers to XML and URIs is that the distinction between public and system identifiers is largely moot.  The external identifiers in our PDS XML documents - the references to the XML Schema and XML Schema-Instance namespaces, for example - are not required to have system identifiers (the definitions are "built-in", as it were). Since everything else falls under the "URI" rubric, our XML Catalog files tend to contain only URI-type mappings.
  • As a result, applications may be more or less lenient about discriminating between public/system identifiers and general URIs when matching strings and applying mappings.  So for some applications, using a system identifier mapping rather than a URI mapping will still translate all occurrences of the matching URI, even if it technically isn't being used as a system identifier.  In other applications, your mileage may vary.
  • It is possible to write very complex catalog files, with elements for including additional files or branching from one catalog file to another. Most PDS data preparers and users don't need any of those complications.  The standard set-up and a few simple URI mapping parameters will do the job for most of us.
  • Catalog files are not transportable.  They are the epitome of environment-specific configuration. So when following someone else's example, be particularly careful about the file specification URIs you'll be translating to - they will depend critically on your local file system.
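To make the points about URI encoding and environment-specific paths concrete, here is a small sketch of a catalog entry whose target directory name contains a blank. The schema URL and the Windows-style local path are hypothetical values chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- The directory "My Schemas" contains a blank, so it is written as %20. -->
  <!-- Forward slashes and the file:/// prefix are used even for a drive path. -->
  <!-- Both URIs are assumed values; the target depends on your own file system. -->
  <uri name="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1000.xsd"
       uri="file:///C:/My%20Schemas/pds/v1/PDS4_PDS_1000.xsd"/>
</catalog>
```

A lenient application might accept an unescaped blank here, but the encoded form is the safer choice if the catalog will be read by more than one tool.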

XML Catalog File Elements

So here's what you need to know to write or edit an XML Catalog file.

Every catalog file will begin with the usual XML declaration (<?xml ... ?>) and, typically, a <!DOCTYPE> declaration, followed by the <catalog> tag which begins the catalog information proper and identifies the namespace associated with the XML Catalog standard.  These can be copied verbatim from any valid catalog file; if you use an XML Catalog generation tool, these will be provided for you. The <catalog> tag may have a prefer attribute with a value of either "public" or "system".  As explained above, for PDS purposes this preference setting is meaningless - we'll only be mapping URIs, not external identifiers.
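For reference, the opening lines of a catalog file might look like the sketch below. It uses the public and system identifiers of the XML Catalogs 1.1 DTD and the OASIS catalog namespace; the prefer value is an arbitrary choice here, since it has no effect on URI mappings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog
  PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN"
         "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public">
  <!-- mapping elements go here -->
</catalog>
```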

Between the <catalog> and </catalog> tags, these are the tags that will likely be most useful and most common in catalog files supporting PDS labels:

<uri name="name_string" uri="physical_reference"/>

The <uri> element does a straight one-to-one mapping from the URI indicated by name to the URI given by uri.  So name_string is what appears in the XML file, and physical_reference is the actual location of the file that contains the answer (the namespace definition, the XML fragment to be included, etc.). This must be resolvable. For most of our users this will resolve to a file on the local file system, so it will begin with the string "file:///". It could also resolve to a web location if that's the way you roll, in which case it will likely begin with something like "http:" or "ftp:".  The URIs should both be URI-encoded, for safety.
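A minimal sketch of a one-to-one mapping follows, assuming a schema URL as it might appear in a label and a hypothetical local copy under /home/user/pds4/schemas.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- name: the URI exactly as it appears in the XML document (assumed value). -->
  <!-- uri:  the resolvable location of the local copy (hypothetical path). -->
  <uri name="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1000.xsd"
       uri="file:///home/user/pds4/schemas/pds/v1/PDS4_PDS_1000.xsd"/>
</catalog>
```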

<rewriteURI uriStartString="old_prefix" rewritePrefix="new_prefix"/>

The <rewriteURI> element can be used to map many URIs at once, based on a common initial substring in those URIs. For example, say you have reproduced the PDS schema directories in a local repository.  You could then map all your PDS namespace references at once by replacing the "http://pds.nasa.gov/pds" part of every namespace URI with a reference to the root directory of your schema repository. As with the <uri> element, the URI created must be resolvable.  Here old_prefix is the prefix as it appears in the XML file; new_prefix is the replacement that turns that into a resolvable reference.
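The prefix substitution can be sketched as follows, assuming the local repository mirrors the remote directory layout. Both the start string and the local root directory are illustrative assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Any URI starting with the uriStartString has that prefix swapped -->
  <!-- for the rewritePrefix; the rest of the URI is kept unchanged. -->
  <!-- With these assumed values, -->
  <!--   http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1000.xsd -->
  <!-- becomes -->
  <!--   file:///home/user/pds4/schemas/pds/v1/PDS4_PDS_1000.xsd -->
  <rewriteURI uriStartString="http://pds.nasa.gov/pds4/"
              rewritePrefix="file:///home/user/pds4/schemas/"/>
</catalog>
```

Keeping the trailing slash on both the start string and the replacement prefix keeps the rewritten path well formed.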

<uriSuffix uriSuffix="uri_suffix" uri="physical_reference"/>

The <uriSuffix> element matches based on the end of the URI string - so if the URI in the XML document ends in uri_suffix, then the entire URI is mapped to the physical_reference (which must, of course, be resolvable).  Note that this is not at all like <rewriteURI>, which effectively does a string substitution on the URI from the XML document. <uriSuffix> matches based on the suffix only, but then expects to map this to a complete, new URI. (One of the few differences between the XML Catalog 1.0 and 1.1 standards is the addition of this element in the 1.1 standard.)
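A small sketch of suffix matching follows, with a hypothetical schema file name and local path. Note that the whole URI is replaced, not just the matched suffix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Any URI ending in "/PDS4_PDS_1000.xsd", whatever its host or path, -->
  <!-- is mapped to this one complete URI (assumed local location). -->
  <uriSuffix uriSuffix="/PDS4_PDS_1000.xsd"
             uri="file:///home/user/pds4/schemas/pds/v1/PDS4_PDS_1000.xsd"/>
</catalog>
```

Because this element belongs to the 1.1 standard, an application that implements only XML Catalog 1.0 may not honor it.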

<delegateURI uriStartString="prefix_string" catalog="physical_reference"/>

The <delegateURI> element lets you hand off URI translation for a set of URIs to a different catalog file.  This can be useful if you're working in a fairly complex environment where some of your URI translations are stable and some aren't (or some are in production mode and others in development). This could also be used to set up a hierarchy of public and private XML catalogs. When a URI in the XML document starts with the prefix_string, the URI will be handed off to the catalog file indicated by the physical_reference for processing. (Note, though, that under the standard a matching <uri>, <rewriteURI> or <uriSuffix> entry in the same catalog is applied before any delegation is considered, so make sure no other entry also matches the URIs you mean to delegate.)
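As a sketch of delegation, the catalog below hands every URI under an assumed development prefix to a second catalog file. The prefix and both file paths are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- URIs under the (assumed) development prefix are resolved -->
  <!-- by a separate catalog file kept alongside the main one. -->
  <delegateURI uriStartString="http://pds.nasa.gov/pds4/dev/"
               catalog="file:///home/user/pds4/catalogs/dev-catalog.xml"/>
  <!-- Everything else under the common prefix is rewritten locally. -->
  <rewriteURI uriStartString="http://pds.nasa.gov/pds4/pds/"
              rewritePrefix="file:///home/user/pds4/schemas/pds/"/>
</catalog>
```

The two start strings are chosen so that they do not overlap, which avoids any question of which entry takes precedence.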

There are analogous elements to the above for mapping public identifiers and system identifiers, as well as a <group> element for providing default preferences and base URIs for these elements, and a <nextCatalog> element for explicitly passing control to another catalog file (rather than letting your application work through a predefined list).  In addition, all the elements listed above will take an xml:base attribute to specify a base URI, so that relative URIs can be turned into absolute URIs.  For most PDS uses, where all required URIs are also required to be absolute and the public/system preference is not applicable, these are not necessary.  If you think you might need or want them, read the standard carefully and have at it.
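For readers who do want these extras, a minimal sketch of a <group> with an xml:base attribute and a <nextCatalog> entry might look like this. All paths and file names are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- xml:base lets the uri values inside the group be written as relative paths. -->
  <group xml:base="file:///home/user/pds4/schemas/">
    <!-- Resolves to file:///home/user/pds4/schemas/pds/v1/PDS4_PDS_1000.xsd -->
    <uri name="http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1000.xsd"
         uri="pds/v1/PDS4_PDS_1000.xsd"/>
  </group>
  <!-- Consulted only if nothing in this catalog file matches. -->
  <nextCatalog catalog="file:///home/user/pds4/catalogs/site-catalog.xml"/>
</catalog>
```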


Here are some links to the various standards mentioned above:


This page was originally written by Anne Raugh for the JPL OODT Wiki PDS4 pages, which are behind the JPL firewall and accessible only by account and password. She copied them here to make them generally accessible.