Difference between revisions of "HHG to the eXtensible Markup Language in PDS4"

From The SBN Wiki

Revision as of 19:56, 3 August 2017

XML is the acronym for the eXtensible Markup Language, developed and published as an official recommendation of the World Wide Web Consortium ("W3C"). Recommendations of the W3C are equivalent to standards from other international bodies, such as ISO and IEEE.

What It Is

XML is a syntax standard intended for use primarily in text files. It provides extremely generic rules for identifying markup within text - that is, tags that indicate something about how the enclosed content should be processed.

If you're familiar with HTML and its tags, then an XML document will look familiar. But unlike HTML, XML is very strict: opening and closing tag names must match exactly, including case, and tags must be closed in the proper order. Most HTML is not valid XML.
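The strictness described above can be seen with any conforming XML parser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not anything defined by PDS4.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Common HTML habits that an XML parser rejects (sample strings are made up).
html_unclosed = "<p>Hello<br>world</p>"    # <br> is never closed
html_mixed_case = "<P>Hello, world.</p>"   # opening and closing tags differ in case

# The same kind of content written to XML rules.
xml_strict = "<p>Hello<br/>world</p>"

for sample in (html_unclosed, html_mixed_case, xml_strict):
    print(sample, "->", is_well_formed(sample))
```

Browsers quietly repair the first two samples; an XML parser is required to stop at the first error instead.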

What It Is Not

XML is not a language itself. It is not code, nor is it a processor. It does not even define the actual markup - it only defines a standard way to differentiate between content and markup.

Basic Requirements

  • An XML document should start with an XML declaration, which looks something like this:
      <?xml version="1.0" encoding="UTF-8"?>
This tells the processor software that this is an XML document, that it follows version 1.0 of the XML standard, and that the character encoding used in the document is UTF-8 (an encoding of Unicode). The declaration has the same <?...?> form as a processing instruction, and is often loosely called one.
  • An XML element consists of an opening tag, perhaps some content, and a closing tag.
  • An opening tag has the syntax <tag-name>, where tag-name is composed of letters, digits, and other printing characters as defined by the XML standard. The name must begin with a letter or an underscore. It may not contain whitespace of any kind. Tag names are always case-sensitive.
  • An opening tag may also have attributes following the tag name. Attributes have the form att-name="att-value". Attribute values must always be quoted, though you can use either single or double quotes.
  • A closing tag has the form </tag-name>. Closing tags never have attributes, even if the opening tag did.
  • If a tag has no content, the opening and closing tag can be combined using the shorthand notation <tag-name/>.
  • Tags may be nested but they may not be interleaved. <bold><italic>Hello, world.</italic></bold> is valid; <bold><italic>Hello, world.</bold></italic> is not.
  • Note that the XML standard does not define tag names - it only limits the character set and defines how to tell tags from content.
  • The XML comment marker begins with <!-- and ends with -->. Everything in between is to be ignored by processors. Comments can start anywhere outside a tag where whitespace would be valid; they extend over line breaks, but they cannot be nested.
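Several of the rules in this list can be checked directly against a parser. The sketch below uses Python's standard-library xml.etree.ElementTree; the tag names (note, list, item, doc) and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Single and double quotes around an attribute value parse to the same value.
double_quoted = ET.fromstring('<note lang="en">Hello</note>')
single_quoted = ET.fromstring("<note lang='en'>Hello</note>")
same_attributes = double_quoted.attrib == single_quoted.attrib

# <item/> is shorthand for <item></item>: both yield an element with no content.
shorthand = ET.fromstring("<list><item/></list>")[0]
longhand = ET.fromstring("<list><item></item></list>")[0]
same_empty_element = (shorthand.tag, shorthand.text or "", len(shorthand)) == (
    longhand.tag, longhand.text or "", len(longhand))

# Comments are skipped by the default parser, so only <a> remains as a child.
with_comment = ET.fromstring("<doc><!-- ignored\nby processors --><a>1</a></doc>")
children_after_comment = [child.tag for child in with_comment]

# Snippets that break one of the rules raise ParseError.
rejected = {}
for label, text in {
    "interleaved tags": "<bold><italic>Hello, world.</bold></italic>",
    "unquoted attribute": "<note lang=en>Hello</note>",
    "attribute on closing tag": '<note>Hello</note lang="en">',
    "case mismatch": "<Note>Hello</note>",
}.items():
    try:
        ET.fromstring(text)
        rejected[label] = False
    except ET.ParseError:
        rejected[label] = True

print(same_attributes, same_empty_element, children_after_comment)
print(rejected)
```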

An Example and Some Terminology

Here is a piece of an XML file containing film descriptions:

    <title>Bedtime for Bonzo</title>
    <director>Frederick de Cordova</director>
    <screenplayBy>Lou Breslow</screenplayBy>
    <screenplayBy>Val Burton</screenplayBy>
    <storyBy>Ted Berkman</storyBy>
    <storyBy>Raphael Blau</storyBy>
    <starring role="Male Lead">Ronald Reagan</starring>
    <starring role="Female Lead">Diana Lynn</starring>

In the example above, there are seven tags:


The <starring> tag also has one attribute, role, used to indicate the leading man and leading lady.

Everything between the opening and closing tags (but not including those tags) is considered content. The content may include other tags and their content. For example, the content of the <title> tag is the string "Bedtime for Bonzo". The content between the enclosing <movie> tag and its end tag includes nine sets of tags and also the content of those tags.

The opening tag (with attributes, if any), the closing tag, and all the content in between constitute an element of the XML document. So the <movie> element contains nine other elements. The first <starring> element contains the string "Ronald Reagan" as well as the attribute role and its value string, "Male Lead".
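To make the terms tag, content, element, and attribute concrete, here is a minimal sketch that parses the film example with Python's standard-library xml.etree.ElementTree. It rests on one assumption: the excerpt is wrapped in a <movie> element, which the text refers to but the excerpt does not show. Only the child elements visible in the excerpt are included.

```python
import xml.etree.ElementTree as ET

# The excerpt from the example, wrapped in <movie> (an assumption; see above).
MOVIE_XML = """
<movie>
    <title>Bedtime for Bonzo</title>
    <director>Frederick de Cordova</director>
    <screenplayBy>Lou Breslow</screenplayBy>
    <screenplayBy>Val Burton</screenplayBy>
    <storyBy>Ted Berkman</storyBy>
    <storyBy>Raphael Blau</storyBy>
    <starring role="Male Lead">Ronald Reagan</starring>
    <starring role="Female Lead">Diana Lynn</starring>
</movie>
""".strip()

movie = ET.fromstring(MOVIE_XML)

# Each child is an element: a tag name, text content, and any attributes.
for child in movie:
    print(child.tag, repr(child.text), child.attrib)

title = movie.find("title")          # first <title> element
first_star = movie.find("starring")  # first <starring> element
print(title.text)
print(first_star.text, first_star.attrib["role"])
```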

PDS4 Terminology and Usage

PDS4 uses XML to represent logical structures from the PDS4 information model. So while all elements are equal to an XML parser, when you are dealing with a PDS label you will encounter information model-based terminology.

Specifically, in PDS4 labels, an XML element that includes other XML elements in its content, as <movie> does above, is called a class. An XML element that does not include other elements in its content (like <title> in the example above) is called an attribute. PDS rarely uses XML attributes, like role="Male Lead" for the <starring> element, but there is still potential for confusion.

When looking at PDS4 labels, remember that the primary structural elements you are seeing are going to be called classes, sub-classes, and attributes. When we mean an XML-attribute, like role, we'll say "XML-attribute".
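The distinction between the two vocabularies can be sketched in a few lines of Python. The helper name pds4_kind is hypothetical and not part of any PDS4 tool; it applies only the rule stated above (an element with child elements is a class, one without is an attribute), again using a shortened film example with an assumed <movie> wrapper.

```python
import xml.etree.ElementTree as ET


def pds4_kind(element):
    """Name an element in PDS4 terms (hypothetical helper)."""
    # An element that contains other elements is a "class";
    # an element with no child elements is an "attribute".
    return "class" if len(element) > 0 else "attribute"


movie = ET.fromstring(
    "<movie>"
    "<title>Bedtime for Bonzo</title>"
    '<starring role="Male Lead">Ronald Reagan</starring>'
    "</movie>"
)
starring = movie.find("starring")

print(pds4_kind(movie))                   # PDS4 term for <movie>
print(pds4_kind(movie.find("title")))     # PDS4 term for <title>
print(pds4_kind(starring))                # PDS4 term for <starring>
print(starring.attrib)                    # the XML-attributes of <starring>
```

Note that <starring> is an attribute in PDS4 terms while also carrying an XML-attribute, role, which is exactly the overlap the terminology is meant to disambiguate.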