HHG to the eXtensible Markup Language in PDS4

From The SBN Wiki

Latest revision as of 22:16, 21 June 2018

XML is the acronym for the eXtensible Markup Language, developed and published as an official recommendation of the World Wide Web Consortium ("W3C"). Recommendations of the W3C are equivalent to standards from other international bodies, such as ISO and IEEE.

What It Is

XML is a syntax standard intended for use primarily in text files. It provides extremely generic rules for identifying markup within text - that is, tags that indicate something about how the enclosed content should be processed.

If you're familiar with HTML and its tags, then an XML document will look familiar. But unlike HTML, XML is very strict about things like case-matching, order of markup tags, and such. Most HTML is not valid XML.

What It Is Not

XML is not a language itself. It is not code, nor is it a processor. It does not even define the actual markup - it only defines a standard way to differentiate between content and markup.

The Basics

  • Any XML document should start with an XML declaration (which resembles a processing instruction), and which looks something like this:
      <?xml version="1.0" encoding="UTF-8"?>
This tells the processor software that this is an XML document, that it follows version 1.0 of the XML standard, and that the document's character encoding is UTF-8 (one of the standard encodings of Unicode).
  • An XML element consists of an opening tag, perhaps some content, and a closing tag.
  • An opening tag has the syntax <tag-name>, where tag-name is composed of letters, digits, and other printing characters as defined by the XML standard. The name must begin with a letter or an underscore (a colon is also permitted, but it is reserved for namespace use). It may not contain whitespace of any kind. Tag names are always case-sensitive.
  • An opening tag may also have attributes following the tag name. Attributes have the form att-name="att-value". Attribute values must always be quoted, though you can use either single or double quotes.
  • A closing tag has the form </tag-name>. Closing tags never have attributes, even if the opening tag did.
  • If an element has no content, the opening and closing tags can be combined using the shorthand notation <tag-name/>.
  • Tags may be nested but they may not be interleaved. <bold><italic>Hello, world.</italic></bold> is valid; <bold><italic>Hello, world.</bold></italic> is not.
  • Note that the XML standard does not define tag names - it only limits the character set and defines how to tell tags from content.
  • The XML comment marker begins with <!-- and ends with -->. Everything in between is to be ignored by processors. Comments can start anywhere outside a tag where whitespace would be valid; they extend over line breaks, but they cannot be nested.
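
Several of the rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of PDS4 or of any PDS tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each sample illustrates one rule from the list above.
samples = {
    # Properly nested tags are accepted.
    "nested": "<bold><italic>Hello, world.</italic></bold>",
    # Interleaved tags are rejected.
    "interleaved": "<bold><italic>Hello, world.</bold></italic>",
    # Tag names are case-sensitive: </title> does not close <Title>.
    "case_mismatch": "<Title>Bedtime for Bonzo</title>",
    # Attribute values may use single quotes...
    "single_quoted": "<starring role='Male Lead'>Ronald Reagan</starring>",
    # ...but they must be quoted.
    "unquoted": "<starring role=Male>Ronald Reagan</starring>",
    # Shorthand for an element with no content.
    "empty": "<note/>",
    # A comment between tags is ignored by the parser.
    "commented": "<note><!-- ignored by processors --></note>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

A parser that only checks these syntax rules says nothing about whether the tag names themselves are meaningful, which is consistent with the point that the XML standard does not define tag names.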

An Example and Some Terminology

Here is a piece of an XML file containing film descriptions:

  <movie>
    <title>Bedtime for Bonzo</title>
    <firstRelease>1951</firstRelease>
    <director>Frederick de Cordova</director>
    <screenplayBy>Lou Breslow</screenplayBy>
    <screenplayBy>Val Burton</screenplayBy>
    <storyBy>Ted Berkman</storyBy>
    <storyBy>Raphael Blau</storyBy>
    <starring role="Male Lead">Ronald Reagan</starring>
    <starring role="Female Lead">Diana Lynn</starring>
  </movie>

In the example above, there are seven tags:

<movie>
<title>
<firstRelease>
<director>
<screenplayBy>
<storyBy>
<starring>

The <starring> tag also has one attribute, role, used to indicate the leading man and leading lady.

Everything between the opening and closing tags (but not including those tags) is considered content. The content may include other tags and their content. For example, the content of the <title> tag is the string "Bedtime for Bonzo". The content between the <movie> tag and its end tag includes nine sets of tags and also the content of those tags.

The opening tag (with attributes, if any), the closing tag, and all the content in between constitute an element of the XML document. So the <movie> element contains nine other elements. The first <starring> element contains the string "Ronald Reagan" as well as the attribute role and its value string, "Male Lead".
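
The counts and terms above can be confirmed by parsing the example. This is a small sketch using Python's standard xml.etree.ElementTree module; the variable names are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

MOVIE_XML = """
<movie>
  <title>Bedtime for Bonzo</title>
  <firstRelease>1951</firstRelease>
  <director>Frederick de Cordova</director>
  <screenplayBy>Lou Breslow</screenplayBy>
  <screenplayBy>Val Burton</screenplayBy>
  <storyBy>Ted Berkman</storyBy>
  <storyBy>Raphael Blau</storyBy>
  <starring role="Male Lead">Ronald Reagan</starring>
  <starring role="Female Lead">Diana Lynn</starring>
</movie>
""".strip()

movie = ET.fromstring(MOVIE_XML)

# Elements directly contained in <movie>.
children = list(movie)

# Distinct tag names in the whole fragment, in order of first appearance.
tag_names = []
for element in movie.iter():
    if element.tag not in tag_names:
        tag_names.append(element.tag)

# Content of <title>, and content plus attribute of the first <starring>.
title_text = movie.find("title").text
first_star = movie.find("starring")

print(len(children))   # 9 elements inside <movie>
print(len(tag_names))  # 7 distinct tags
print(title_text)      # Bedtime for Bonzo
print(first_star.text, "/", first_star.get("role"))  # Ronald Reagan / Male Lead
```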

PDS4 Terminology and Usage

Main article: XML Primer for PDS4

PDS4 uses XML to represent logical structures from the PDS4 information model. So while all elements are equal to an XML parser, when you are dealing with a PDS label you will encounter information-model-based terminology. It is extremely easy to confuse the two, because there is overlap where the same term means different things in each context.

Specifically, in PDS4 labels an XML element that includes other XML elements in its content, as <movie> does above, is called a class. An XML element that does not include other elements in its content (like <title> in the example above) is called an attribute. PDS rarely uses XML-attributes, like role="Male Lead" for the <starring> element, but there is still potential for confusion.

When looking at PDS4 labels, remember that the primary structural elements you are seeing are going to be called classes, sub-classes, and attributes. When we mean an XML-attribute, like role, we'll say "XML-attribute".
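
To make the vocabulary concrete, here is a rough sketch that applies the rule of thumb above to a shortened version of the movie example, using Python's standard xml.etree.ElementTree module. The function name pds4_kind is hypothetical, and the sample is not a real PDS4 label. In an actual label the distinction comes from the PDS4 information model rather than from inspecting the structure, so treat this only as an illustration of the terminology.

```python
import xml.etree.ElementTree as ET

# Shortened movie example; NOT a real PDS4 label.
SAMPLE = """<movie>
  <title>Bedtime for Bonzo</title>
  <firstRelease>1951</firstRelease>
  <starring role="Male Lead">Ronald Reagan</starring>
</movie>"""


def pds4_kind(element: ET.Element) -> str:
    """Apply the rule of thumb: an element that contains other elements is a
    'class'; an element that contains none is an 'attribute'."""
    return "class" if len(element) > 0 else "attribute"


root = ET.fromstring(SAMPLE)

# For every element: its tag, its PDS4-style kind, and its XML-attributes.
report = [(el.tag, pds4_kind(el), dict(el.attrib)) for el in root.iter()]

for tag, kind, xml_attributes in report:
    print(f"{tag}: {kind}, XML-attributes: {xml_attributes}")
```

Here <starring> is an attribute in the PDS4 sense, while role is an XML-attribute of that element: two different things that happen to share a name.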