Difference between revisions of "PDS4 Character Data Type Definitions"

From The SBN Wiki
(Safety Save)
Line 3: Line 3:
  Definitions for binary (hardware) formats used in data files are on the [[PDS4 Binary Data Type Definitions]] page.
  
- '''Last update:''' ''2015-05-18, A.C.Raugh''; Master Schema version 1.4.0.0
+ '''Last update:''' ''2020-07-22, A.C.Raugh''; Master Schema version 1.13.0.0
  
  == ASCII Representations ==
Line 86: Line 86:
  
  ==== ASCII_Real ====
- : This data type is a synonym for the XML Schema type ''xs:double''.  It accepts values representable in a 64-bit IEEE754 floating point format.  It includes simple floating point values as well as exponential notation (i.e., powers of 10), as well as the special constants ''INF'' for positive infinity, ''-INF'' for negative infinity, and ''NaN'' for "Not a Number".  Case counts for these special values.
+ : This data type is a synonym for the XML Schema type ''xs:double''.  It accepts values representable in a 64-bit IEEE754 floating point format.  It includes simple floating point values as well as exponential notation (i.e., powers of 10). It will '''''not''''' accept the special constants ''INF'' for positive infinity, ''-INF'' for negative infinity, or ''NaN'' for "Not a Number".
- : '''Usage Note:''' The special constants for +/- infinity and NaN ''should not appear'' in archival data - either in labels or in data tables. In labels, declare attributes as nil or omit them entirely; in data tables, define a numeric constant to use as a flag for missing data.
+ : '''Usage Note:''' For data in archival table products, use the ''<Special_Constants>'' class to define flag values for various conditions found in the data.
+ 
+ ==== ASCII_Short_String_Collapsed ====
+ : Use this data type when defining local dictionary attributes in which you want white space in the value to be normalized (leading/trailing whitespace removed, all other runs of whitespace collapsed to a single blank character) by applications that read the metadata. Normalized values must be less than 256 bytes long. This is the data type used for most text-valued attributes that do not require long, free-form text. This data type has no application outside of dictionary creation; use '''ASCII_String''' for text fields in table data.
  
  ==== ASCII_String ====
  : This data type is based on the XML Schema type ''xs:token'' constrained to the ASCII character set, and corresponds to a non-empty string of ASCII characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables.
+ 
+ ==== ASCII_Text_Preserved ====
+ : Use this data type when defining local dictionary attributes in which you want the original line breaks and spacing to be preserved in applications that read the metadata. Typically this is only desirable in free-format text fields of some length, where paragraph breaks and indenting are needed for readability. This data type has no application outside of dictionary creation; use '''ASCII_String''' for text fields in table data.
  
  ==== ASCII_Time ====
Line 100: Line 106:
  
  == UTF-8 Representations ==
+ 
+ ==== UTF8_Short_String_Collapsed ====
+ : This is the UTF-8 version of the '''ASCII_Short_String_Collapsed''' data type, used in defining metadata values in local dictionaries. Use '''UTF8_String''' for UTF-8 fields in data tables.
  
  ==== UTF8_String ====
- : This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables. Note that UTF-8 characters may be more than one byte long.  Care should be taken when dealing with UTF-8 data in fixed-width tables to ensure that "bytes" and not "characters" are use to calculate locations and value lengths.
+ : This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables. Note that UTF-8 characters may be more than one byte long.  Care should be taken when dealing with UTF-8 data in fixed-width tables to ensure that "bytes" and not "characters" are used to calculate locations and value lengths.
+ 
+ ==== UTF8_Text_Preserved ====
+ : This is the UTF-8 version of the '''ASCII_Text_Preserved''' data type, used in defining metadata values in local dictionaries. Use '''UTF8_String''' for UTF-8 fields in data tables.
Revision as of 20:32, 22 July 2020

Following is a glossary of data type definitions for values expressed as strings of characters, extracted from the PDS4 Information Model and master schema. They are used to describe fields defined in local and discipline dictionaries as well as values included in data objects (tables and arrays, for example).

Definitions for binary (hardware) formats used in data files are on the PDS4 Binary Data Type Definitions page.

Last update: 2020-07-22, A.C.Raugh; Master Schema version 1.13.0.0

ASCII Representations

ASCII_AnyURI

Use this for fields that are intended to be interpreted as Uniform Resource Identifiers (URIs). PDS restricts these strings to the ASCII character set, so you should URL-encode any non-ASCII characters in your URIs.

ASCII_Bibcode

Use this type for defining attributes and table fields that represent bibcodes - Bibliographic Reference Codes, used by the ADS, SIMBAD, and NED databases to assign unique codes to literature references.

ASCII_Boolean

This corresponds exactly to the XML Schema data type of "boolean". Valid values are "true", "false", "1" (one), and "0" (zero).

ASCII_Date

This type accepts date information in either the YYYY-MM-DD format or the YYYY-DDD format. Mixing the two formats in one field is strongly discouraged: convert all dates to a single format and use the specific ASCII_Date_YMD or ASCII_Date_DOY type instead.
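
Converting day-of-year dates to year-month-day form before archiving could look like the following Python sketch. The function name `doy_to_ymd` is hypothetical and not part of any PDS tool; the sketch assumes the input already has the YYYY-DDD shape.

```python
import calendar
from datetime import date, timedelta

def doy_to_ymd(value: str) -> str:
    """Convert a 'YYYY-DDD' string to 'YYYY-MM-DD' (hypothetical helper)."""
    year_text, doy_text = value.split("-")
    year, doy = int(year_text), int(doy_text)
    # Reject day numbers that do not exist in the given year.
    days_in_year = 366 if calendar.isleap(year) else 365
    if not 1 <= doy <= days_in_year:
        raise ValueError(f"day {doy} does not exist in year {year}")
    return (date(year, 1, 1) + timedelta(days=doy - 1)).isoformat()

print(doy_to_ymd("2020-204"))  # 2020-07-22
```

Going in this direction keeps a mixed column uniform without losing information, since both formats carry the same calendar day.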

ASCII_Date_DOY

This data type is for dates in the YYYY-DDD format (i.e., year followed by day of year).
Usage Note: The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not tell you, for example, that "1999-366" is not a valid date.
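
The gap between a range-only check and a real calendar check can be illustrated with a short Python sketch. The regular expression is an assumed stand-in for the schema's numerical range test, not the actual PDS4 pattern, and both function names are hypothetical.

```python
import calendar
import re

# Assumed approximation of a range-only check: four-digit year, day 001-366.
DOY_PATTERN = re.compile(
    r"[0-9]{4}-(00[1-9]|0[1-9][0-9]|[12][0-9]{2}|3[0-5][0-9]|36[0-6])"
)

def matches_doy_pattern(value: str) -> bool:
    """True when the string merely looks like YYYY-DDD."""
    return DOY_PATTERN.fullmatch(value) is not None

def is_real_doy_date(value: str) -> bool:
    """True only when the day number exists in that particular year."""
    if not matches_doy_pattern(value):
        return False
    year, day = int(value[:4]), int(value[5:])
    return day <= (366 if calendar.isleap(year) else 365)

print(matches_doy_pattern("1999-366"))  # True: passes the range check
print(is_real_doy_date("1999-366"))     # False: 1999 was not a leap year
```

A stricter check of this kind has to be run separately from schema validation if invalid day numbers matter for your data.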

ASCII_Date_Time

Like the ASCII_Date type above, this type allows the date portion to be in either YYYY-MM-DD or YYYY-DDD format, followed by the ISO-formatted (Thh:mm:ss.sss) time. Mixing the two date formats in one field is strongly discouraged: convert all values to a single format and use the specific ASCII_Date_Time_YMD or ASCII_Date_Time_DOY type instead.

ASCII_Date_Time_DOY

This data type is identical to the ASCII_Date_Time type, except that the date portion must be in the day-of-year format.

ASCII_Date_Time_DOY_UTC

This data type is identical to the ASCII_Date_Time_DOY type, except that the value must have the Z appended to the end to indicate that the value is a UTC date and time.

ASCII_Date_Time_YMD

This data type is identical to the ASCII_Date_Time type, except that the date portion must be in the year-month-day format.

ASCII_Date_Time_YMD_UTC

This data type is identical to the ASCII_Date_Time_YMD type, except that the value must have the Z appended to the end to indicate that the value is a UTC date/time.

ASCII_Date_YMD

This data type is identical to the ASCII_Date type, except that the date must be in the year-month-day format.

ASCII_Directory_Path_Name

Use this data type for path information. It is constrained to use only the ASCII character set.
Usage Note: All paths in PDS4 labels should be specified using Unix-style notation, and should never be absolute (so they should never begin with either a device identifier or a slash character). This will also typically be true for paths that appear in archival tables, but check with your PDS node if this presents a problem. The schema validation does not enforce these constraints. You should also not assume that fields with this data type include a trailing slash character.
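
Because schema validation does not enforce these path conventions, a separate check may be useful. The Python sketch below is one possible reading of the usage note; the function names are hypothetical, and treating any colon as a sign of a device identifier is an assumption of this sketch.

```python
import posixpath

def check_label_path(path: str) -> list:
    """Return a list of problems found in a path value (hypothetical helper)."""
    problems = []
    if not path.isascii():
        problems.append("non-ASCII character")
    if path.startswith("/"):
        problems.append("absolute path")
    if "\\" in path:
        problems.append("backslash separator")
    if ":" in path:  # assumption: a colon signals a device identifier
        problems.append("possible device identifier")
    return problems

def join_path_and_file(directory: str, file_name: str) -> str:
    """Join without assuming the directory value ends in a slash."""
    return posixpath.join(directory, file_name)

print(check_label_path("/data/raw"))                 # ['absolute path']
print(join_path_and_file("data/raw", "image.dat"))   # data/raw/image.dat
print(join_path_and_file("data/raw/", "image.dat"))  # data/raw/image.dat
```

Using `posixpath` rather than `os.path` keeps the Unix-style separator regardless of the platform the code runs on.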

ASCII_DOI

This is a string corresponding to a DOI of the form "10.string/string", where string can be any sequence of one or more non-whitespace characters.
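
The form described above can be transcribed into a regular expression. The Python sketch below follows the prose description only; it is not the pattern from the PDS4 schema, and the sample DOI is made up for illustration.

```python
import re

# "10.string/string", where each string is one or more non-whitespace characters.
DOI_PATTERN = re.compile(r"10\.\S+/\S+")

def is_doi_like(value: str) -> bool:
    """True when the value is ASCII and has the 10.string/string shape."""
    return value.isascii() and DOI_PATTERN.fullmatch(value) is not None

print(is_doi_like("10.1234/example-id"))      # True (made-up DOI)
print(is_doi_like("doi:10.1234/example-id"))  # False: extra prefix
```

Note that the value holds the bare DOI, so resolver prefixes such as "doi:" or a URL do not fit this shape.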

ASCII_File_Name

This data type is a string representing a file name without path information. The characters are constrained to be in the ASCII subset.
Usage Note: Do not assume that any validator will check for file existence unless it specifically claims to do so. Schema validation is very simple and will not, for example, tell you that you have included path information (as indicated by the presence of a slash character), or included values that would be problematic for some or all operating systems (like the asterisk or question mark characters).

ASCII_File_Specification_Name

This data type is for file names with path information. It is effectively the concatenation of the ASCII_Directory_Path_Name and ASCII_File_Name, with an additional slash character as needed. The Usage Notes for those data types apply here as well.

ASCII_Integer

This data type is based on the XML Schema xs:long data type. Values are constrained to be integers in the range -2^63 to (+2^63 - 1). You may include a "+" or "-" sign.
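
A range check of this kind can be sketched in Python as follows. The function name is hypothetical. The explicit pattern is there because Python's `int()` accepts forms such as "1_000" or surrounding whitespace that an integer field would presumably not allow.

```python
import re

INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")
MIN_VALUE = -2**63
MAX_VALUE = 2**63 - 1

def is_ascii_integer(text: str) -> bool:
    """True when text is an optionally signed integer inside the 64-bit range."""
    if INTEGER_PATTERN.fullmatch(text) is None:
        return False
    return MIN_VALUE <= int(text) <= MAX_VALUE

print(is_ascii_integer("9223372036854775807"))  # True: largest allowed value
print(is_ascii_integer("9223372036854775808"))  # False: one past the maximum
```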

ASCII_LID

This data type is intended to hold PDS4 Logical Identifier (LID) values, without version numbers. All LIDs must begin with "urn:" and may contain only lowercase letters, digits, the characters ('.','-','_'), and the ':' to separate parts of the identifier.

ASCII_LIDVID

This data type represents the concatenation of a PDS4 Logical Identifier (LID) with a Version Identifier (VID), with a double colon ("::") between them. Version identifiers have the required form M.m, where M is the major version number and m is the minor version number.

ASCII_LIDVID_LID

This data type accepts either ASCII_LID or ASCII_LIDVID values.
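
The three identifier types can be told apart with a small classifier. The Python sketch below is built from the prose descriptions on this page, not from the schema patterns. It assumes that LID segments between colons are non-empty and that the minor version number is any run of digits; the sample identifiers are made up.

```python
import re

# Assumption: "urn" followed by one or more non-empty, colon-separated segments.
LID_PATTERN = re.compile(r"urn(:[a-z0-9._-]+)+")
# Assumption: major version without leading zeroes, minor version any digits.
VID_PATTERN = re.compile(r"(0|[1-9][0-9]*)\.[0-9]+")

def classify_identifier(value: str):
    """Return 'LID', 'LIDVID', or None for values matching neither form."""
    lid, separator, vid = value.partition("::")
    if LID_PATTERN.fullmatch(lid) is None:
        return None
    if separator == "":
        return "LID"
    if VID_PATTERN.fullmatch(vid) is None:
        return None
    return "LIDVID"

print(classify_identifier("urn:nasa:pds:example_bundle"))       # LID
print(classify_identifier("urn:nasa:pds:example_bundle::1.0"))  # LIDVID
print(classify_identifier("URN:NASA:PDS:EXAMPLE_BUNDLE"))       # None
```

A field typed as ASCII_LIDVID_LID would accept any value for which this function returns something other than `None`.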

ASCII_MD5_Checksum

Values of this data type must contain exactly 32 hexadecimal digits.
Usage Note: Do not assume that validators will do a checksum check with this value unless they specifically claim to do so.
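
Since a validator may check only the shape of the value, comparing it against the file contents is a separate step. A minimal Python sketch, assuming that both upper and lower case hex digits are acceptable and using hypothetical function names:

```python
import hashlib
import re

MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")

def is_md5_string(value: str) -> bool:
    """Shape check only: exactly 32 hexadecimal digits."""
    return MD5_PATTERN.fullmatch(value) is not None

def file_matches_checksum(path: str, expected: str) -> bool:
    """Recompute the MD5 digest of a file and compare it to the stored value."""
    if not is_md5_string(expected):
        return False
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

For example, `file_matches_checksum("image.dat", checksum_from_label)` would return `False` for a file that was altered after the label was written; both argument names here are placeholders.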

ASCII_NonNegative_Integer

This data type includes integers in the range 0 to (2^64 - 1). You may not include a "+" sign for values of this type.

ASCII_Numeric_Base16

This data type is a synonym for the XML Schema type xs:hexBinary. Hex digits above 9 may be upper or lower case.

ASCII_Numeric_Base2

This data type is constrained to contain only the digits '1' and '0'.
Usage Note: There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you do not use this data type in either labels or data files.

ASCII_Numeric_Base8

This data type is constrained to contain only the digits '0' through '7'.
Usage Note: There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you do not use this data type in either labels or data files.
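
The ambiguity described in these usage notes is easy to reproduce. The Python sketch below (the function name is hypothetical) lists every value a bare digit string could stand for:

```python
def possible_values(digits: str) -> dict:
    """Map each base in which the digit string is valid to the value it denotes."""
    results = {}
    for name, base in (("binary", 2), ("octal", 8), ("decimal", 10)):
        try:
            results[name] = int(digits, base)
        except ValueError:
            # The string contains a digit that does not exist in this base.
            pass
    return results

print(possible_values("101"))  # {'binary': 5, 'octal': 65, 'decimal': 101}
print(possible_values("17"))   # {'octal': 15, 'decimal': 17}
```

Without the declared data type at hand, a reader of the data file has no way to pick among these readings.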

ASCII_Real

This data type is a synonym for the XML Schema type xs:double. It accepts values representable in a 64-bit IEEE754 floating point format. It includes simple floating point values as well as exponential notation (i.e., powers of 10). It will not accept the special constants INF for positive infinity, -INF for negative infinity, or NaN for "Not a Number".
Usage Note: For data in archival table products, use the <Special_Constants> class to define flag values for various conditions found in the data.
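
Software that reads these values should not rely on a language's default float parser, which often accepts infinities and NaN. The Python sketch below is an approximation based on the description above, with a hypothetical function name; rejecting values that overflow to infinity is a choice of this sketch.

```python
import math
import re

# Plain or exponential notation only; no INF, -INF, or NaN.
REAL_PATTERN = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?")

def parse_ascii_real(text: str) -> float:
    """Parse a real value, rejecting special constants and overflow."""
    if REAL_PATTERN.fullmatch(text) is None:
        raise ValueError(f"not a plain or exponential real: {text!r}")
    value = float(text)
    if not math.isfinite(value):
        raise ValueError(f"outside the 64-bit floating point range: {text!r}")
    return value

print(parse_ascii_real("1.5e3"))  # 1500.0
```

Calling `parse_ascii_real("NaN")` raises `ValueError`, whereas Python's built-in `float("NaN")` would succeed.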

ASCII_Short_String_Collapsed

Use this data type when defining local dictionary attributes in which you want white space in the value to be normalized (leading/trailing whitespace removed, all other runs of whitespace collapsed to a single blank character) by applications that read the metadata. Normalized values must be less than 256 bytes long. This is the data type used for most text-valued attributes that do not require long, free-form text. This data type has no application outside of dictionary creation; use ASCII_String for text fields in table data.
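
The normalization described above can be sketched in Python as follows. The function names are hypothetical, and the sketch treats space, tab, carriage return, and line feed as whitespace, which is the XML definition.

```python
import re

XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")

def collapse_whitespace(text: str) -> str:
    """Trim the ends and reduce each inner whitespace run to one blank."""
    return XML_WHITESPACE_RUN.sub(" ", text).strip(" ")

def fits_short_string(text: str) -> bool:
    """True when the normalized value is ASCII and under 256 bytes."""
    collapsed = collapse_whitespace(text)
    # For ASCII text, the character count equals the byte count.
    return collapsed.isascii() and len(collapsed) < 256

print(collapse_whitespace("  two\n   words\t"))  # two words
```

The length limit applies after normalization, so a value padded with extra blanks in the label can still fit.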

ASCII_String

This data type is based on the XML Schema type xs:token constrained to the ASCII character set, and corresponds to a non-empty string of ASCII characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input. This data type is used for describing fields in character tables.

ASCII_Text_Preserved

Use this data type when defining local dictionary attributes in which you want the original line breaks and spacing to be preserved in applications that read the metadata. Typically this is only desirable in free-format text fields of some length, where paragraph breaks and indenting are needed for readability. This data type has no application outside of dictionary creation; use ASCII_String for text fields in table data.

ASCII_Time

This data type is for values that hold a 24-hour clock time in the standard hh:mm:ss.ssss format. The string may optionally end in a Z to indicate a UTC time. The string may be truncated at the appropriate point for the actual precision; omit the ':' separator when there is no value to the right of it. Both 00:00 and 24:00 are valid values.

ASCII_VID

This data type corresponds to a PDS4 Version Identifier (VID). It is a two-part version number of the form N.n, where both N and n are present and non-negative. The major version number (N) may be zero, but may not contain leading zeroes for values greater than zero. So "0.1" is valid, but "01.1" is not.


UTF-8 Representations

UTF8_Short_String_Collapsed

This is the UTF-8 version of the ASCII_Short_String_Collapsed data type, used in defining metadata values in local dictionaries. Use UTF8_String for UTF-8 fields in data tables.

UTF8_String

This data type is based on the XML Schema type xs:token and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input. This data type is used for describing fields in character tables. Note that UTF-8 characters may be more than one byte long. Care should be taken when dealing with UTF-8 data in fixed-width tables to ensure that "bytes" and not "characters" are used to calculate locations and value lengths.
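
The difference between byte and character positions shows up as soon as a record contains a multi-byte character. The Python sketch below assumes field locations are 1-based byte offsets; the function name and the sample record are made up for illustration.

```python
def extract_field(record: bytes, location: int, length: int) -> str:
    """Slice a field by byte position first, then decode it as UTF-8."""
    start = location - 1  # assumption: locations are 1-based byte offsets
    return record[start:start + length].decode("utf-8")

# "G\u00f6tz" is 4 characters but 5 bytes, padded here to a 10-byte field.
record = "G\u00f6tz".encode("utf-8").ljust(10) + b"00042\r\n"

print(extract_field(record, 11, 5))      # 00042
# Slicing the decoded text by character position lands one place off:
print(repr(record.decode("utf-8")[10:15]))  # '0042\r'
```

Slicing the raw bytes before decoding keeps every later field aligned, no matter how many multi-byte characters precede it.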

UTF8_Text_Preserved

This is the UTF-8 version of the ASCII_Text_Preserved data type, used in defining metadata values in local dictionaries. Use UTF8_String for UTF-8 fields in data tables.