XML Schema Regular Expressions - Basics

From The SBN Wiki

Introduction

The XML Schema language has its own regular expression syntax and specialized notation. It is very similar to the syntax used by Perl, but not sufficiently similar that you can plug in even a simple Perl regular expression and expect it to work as it would in a Perl script. The full syntax available for use in XML Schema 1.1 documents is defined in Appendix G of the W3C Standard: https://www.w3.org/TR/xmlschema11-2/

Data dictionary designers might encounter XML Schema regular expressions if they make use of the <pattern> option in defining their local attributes. Patterns are used by PDS to define some specific data types in the pds: core namespace, so label designers and creators may also encounter them in schema files while investigating validation failures.

What follows is not an exhaustive guide to regular expressions in XML Schema. Rather, it is a quick review of the features most likely to be encountered in the pds: schema or to be useful to dictionary writers. It also assumes you already have familiarity with regular expressions in some other context. More complete descriptions are readily available on the web, should you wish to delve deeper. As of this writing, PDS does not constrain the content of regular expressions you might be creating.

Before You Even Start

Here are a couple of things you should be aware of before you start:

  • All regular expressions in XML Schema are implicitly anchored to both the beginning and the end of the string being compared, so a pattern must match every character from the first through the last in order to succeed. The standard anchor characters (^ and $) are not anchors in this syntax; they will cause match failures or syntax errors if you try to use them that way.
  • When defining patterns, do not enclose them in delimiters. So if you are trying to define a <pattern> facet for a data type in your local dictionary, do this:
<pattern>[A-Z][a-z]+</pattern>
not this:
<pattern>/[A-Z][a-z]+/</pattern>
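
The effect of the implicit anchoring can be sketched in Python. This is only an approximation: Python's re module is not an XML Schema regular expression engine, and the helper name matches_whole_value is invented here for illustration. The assumption is that re.fullmatch, which succeeds only when the pattern covers the entire string, behaves like the anchored matching described above for a simple pattern such as this one.

```python
import re

def matches_whole_value(pattern: str, value: str) -> bool:
    # re.fullmatch succeeds only if the pattern covers the entire string,
    # which approximates the implicit anchoring of XML Schema patterns.
    return re.fullmatch(pattern, value) is not None

# The pattern is written bare, with no enclosing delimiters.
pattern = "[A-Z][a-z]+"

print(matches_whole_value(pattern, "Saturn"))      # the whole value is covered
print(matches_whole_value(pattern, "Saturn V"))    # the trailing " V" is not covered
print(re.search(pattern, "Saturn V") is not None)  # an unanchored search still finds "Saturn"
```

The last line shows what an unanchored search would do with the same input, which is the behavior you should not expect from a <pattern> facet.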

The Usual

These points should be familiar from using regular expressions in other contexts:

  • Single characters match themselves literally unless they are special characters.
  • Character classes are defined using '[]'. The hyphen ('-') can be used to indicate a range and the caret ('^') negates the class if it is the first character in the class.
  • The wildcard character '.' matches any single character other than a linefeed or carriage return.
  • The quantifier character '?' means "zero or one time"; '*' means "zero or more times"; '+' means "one or more times".
  • The quantifier expression "{n}" means "exactly n times"; "{n,}" means "n or more times"; "{n,m}" means "at least n but not more than m times".
  • Parentheses ('()') can be used to group.
  • A vertical bar indicates alternation - "(a|b)" matches either a or b. The parentheses here are grouping characters, not literals; they confine the alternation to the part of the pattern they enclose.
  • The escape character is '\'.
  • A space character is taken as literal - that is, it matches exactly one space character.
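
As a quick illustration of how these pieces combine, here is a minimal Python sketch. The identifier format is hypothetical (it is not a PDS convention), and Python's re.fullmatch is used only as a stand-in for XML Schema's whole-value matching; the constructs shown are ones that behave the same way in both syntaxes.

```python
import re

# Hypothetical identifier format: two to four capital letters, a hyphen,
# one or more digits, and an optional "a" or "b" suffix.
pattern = "[A-Z]{2,4}-[0-9]+(a|b)?"

def is_valid(value: str) -> bool:
    # The pattern must cover the whole value, as it would in XML Schema.
    return re.fullmatch(pattern, value) is not None

for value in ["AB-12", "ABCD-7b", "A-12", "ABCDE-12", "AB-12c", "AB -12"]:
    print(repr(value), is_valid(value))
```

Only the first two values are accepted. The others fail on the letter count, the suffix, or the literal space, respectively.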

In XML Schema

The following conventions are not necessarily specific to XML Schema regular expression syntax, but they are not quite as universal as the preceding conventions.

Character Entities

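Character entities, like "&lt;" for "<" or "&amp;" for "&", can be used to match the character they represent. Note that XML itself predefines only five entities (&lt;, &gt;, &amp;, &apos; and &quot;); any other named entity, such as "&copy;", must be declared in the file's DTD before a parser will accept it. So if you wanted to make sure your copyright notice field started with a copyright symbol (©) and the year, and the "&copy;" entity is declared, this is what you would include in your dictionary definition file: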

<pattern>&copy; 20[0-9]{2}.*</pattern>

Specifically, this requires that the value starts with a copyright symbol, followed by one space, the characters "20", two more digits, and then zero or more unspecified characters through to the end of the value (remember, XSD anchors all patterns at both ends, so the pattern must match the entire value it is testing). This would match "© 2017" and "© 2020 NASA PDS", but not "Copyright 2010".
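
The three cases just described can be checked with a small Python sketch. It assumes the entity has already been resolved to the copyright sign (U+00A9) by the XML parser, and it uses re.fullmatch as an approximation of XML Schema matching; the function name is_valid_notice is made up for this example.

```python
import re

COPYRIGHT = "\u00a9"  # the character the entity stands for

# The pattern as the regular expression engine would see it,
# after the XML parser has replaced the entity with the character.
pattern = COPYRIGHT + " 20[0-9]{2}.*"

def is_valid_notice(value: str) -> bool:
    # The whole value must be covered by the pattern.
    return re.fullmatch(pattern, value) is not None

print(is_valid_notice(COPYRIGHT + " 2017"))
print(is_valid_notice(COPYRIGHT + " 2020 NASA PDS"))
print(is_valid_notice("Copyright 2010"))
```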

Character References

XML character references will also be matched as the character they correspond to. Character references begin with "&#" followed by a decimal number and a semicolon (';'), or "&#x" followed by a hexadecimal number and a semicolon. The number part corresponds to the Unicode code point for the character. So the character reference for a blank space, for example, is "&#x20;".
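
The reason this works is that the XML parser replaces each character reference with the character itself before the pattern ever reaches the regular expression engine. The following Python sketch shows that step using the standard library's xml.etree.ElementTree parser; the pattern and the test value are invented for illustration, and re.fullmatch again only approximates XML Schema matching.

```python
import re
import xml.etree.ElementTree as ET

# The same pattern written with a decimal and a hexadecimal reference to a space.
decimal = ET.fromstring("<pattern>[A-Z]+&#32;[0-9]+</pattern>").text
hexadecimal = ET.fromstring("<pattern>[A-Z]+&#x20;[0-9]+</pattern>").text

print(repr(decimal))      # the reference has become a literal space
print(repr(hexadecimal))  # identical to the decimal form

# The resolved pattern then matches a value containing an ordinary space.
print(re.fullmatch(hexadecimal, "ABC 123") is not None)
```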

Escape Sequences

The following sequences each match one character of the type indicated:

\n
matches a linefeed character
\r
matches a carriage-return character
\t
matches a horizontal tab character
\d
matches any decimal digit (\D is the negation)
\s
matches any whitespace character (\S is the negation)

All types of quantifiers can be used with these escape sequences - so "\n+" matches one or more linefeed characters in a row.
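
Here is a hedged Python sketch that combines several of these escapes with quantifiers. The value layout (a date, one or more linefeeds, then two tab-separated words) is hypothetical. Be aware that Python's escapes are close to, but not identical with, the XML Schema ones: in XML Schema "\s" covers only space, tab, linefeed and carriage return, while Python's "\s" also accepts other whitespace characters, so treat the result as indicative only.

```python
import re

# Hypothetical value: a date, one or more linefeeds, then two tab-separated words.
pattern = r"\d{4}-\d{2}-\d{2}\n+\S+\t\S+"

def is_valid(value: str) -> bool:
    # Whole-value match, approximating XML Schema's implicit anchoring.
    return re.fullmatch(pattern, value) is not None

print(is_valid("2017-03-09\nalpha\tbeta"))    # one linefeed
print(is_valid("2017-03-09\n\nalpha\tbeta"))  # "\n+" allows several in a row
print(is_valid("2017-3-09\nalpha\tbeta"))     # "\d{2}" needs two digits
print(is_valid("2017-03-09 alpha\tbeta"))     # a space is not a linefeed
```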

There are a few more of these escapes defined by the XML Schema specification to cover cases like characters that might be part of a valid XML name. Check the XML Schema 1.1 standard Appendix G if you're looking for something along these lines.

Block Escapes

Block escapes provide a method for matching against contiguous blocks of Unicode characters. For PDS purposes, there is really only one of these to be concerned with - the BasicLatin block, which spans the ASCII code page. If you want to constrain your values to use only ASCII characters, use a block escape that looks like this:

<pattern>\p{IsBasicLatin}+</pattern>
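
Python's re module does not support the "\p{IsBasicLatin}" block escape, so the sketch below substitutes an explicit range covering the same code points (U+0000 through U+007F). That substitution is an assumption made for illustration, not the XML Schema syntax itself, and the function name is_basic_latin is invented here.

```python
import re

# Stand-in for \p{IsBasicLatin}+ : one or more characters in U+0000..U+007F.
ascii_only = r"[\x00-\x7F]+"

def is_basic_latin(value: str) -> bool:
    # Every character of the value must fall inside the block.
    return re.fullmatch(ascii_only, value) is not None

print(is_basic_latin("Cassini-Huygens"))  # all ASCII
print(is_basic_latin("Caf\u00e9"))        # U+00E9 is outside the block
print(is_basic_latin(""))                 # "+" requires at least one character
```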

Bells and Whistles

  • You are unlikely to need it, but the XML Schema 1.1 regular expression syntax allows you to define a character class as one set of characters minus another set. So if you took an unnatural dislike to the number '4', you could disallow it from your numeric value with a pattern like this:
<pattern>[0-9-[4]]+</pattern>
  • If you specify multiple patterns when defining a data type, then a value of that type needs to match just one of those patterns, not all of them. If the value doesn't match any of the patterns, it is flagged as invalid.
  • The XML Schema standard defines some escape categories that match characters based on their defined Unicode characteristics, like "An uppercase letter", or "Numeric digit". These are not generally useful in a PDS context because PDS requires English as the standard language and European digits for labels and documentation - so values with non-English letters and other numeric digits are prohibited by the larger context. (If you have a text field that should allow non-ASCII characters, then declare it as a UTF-8 type rather than an ASCII one, and of course then you might want to expand your escape usage accordingly.)
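
The first two points above can be sketched in Python, with some caveats. Python's re module has no character class subtraction, so the "no fours" class is rewritten here as an equivalent pair of ranges; the helper matches_any and the two sample patterns are hypothetical and exist only to illustrate the "any one pattern is enough" rule.

```python
import re

# Equivalent of [0-9-[4]]+ written without subtraction: digits 0-3 and 5-9.
no_fours = "[0-35-9]+"

print(re.fullmatch(no_fours, "1235") is not None)  # no 4 present
print(re.fullmatch(no_fours, "1245") is not None)  # contains a 4

def matches_any(patterns, value):
    # A value is valid if at least one pattern matches it in full.
    return any(re.fullmatch(p, value) is not None for p in patterns)

# Hypothetical pair of patterns declared for a single data type.
patterns = ["[A-Z]{3}", "[0-9]{4}"]

for value in ["ABC", "2017", "AB12"]:
    print(repr(value), matches_any(patterns, value))
```

"ABC" and "2017" each satisfy one of the two patterns and are accepted; "AB12" satisfies neither and would be flagged as invalid.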