Difference between revisions of "PDS4 Character Data Type Definitions"

From The SBN Wiki
(Added UTF-8 types)
 
(14 intermediate revisions by 2 users not shown)
Line 3: Line 3:
 
Definitions for binary (hardware) formats used in data files are on the [[PDS4 Binary Data Type Definitions]] page.
 
Definitions for binary (hardware) formats used in data files are on the [[PDS4 Binary Data Type Definitions]] page.
  
'''Last update:''' ''2014-07-17, A.C.Raugh''; Master Schema version 1.2.0.1
+
'''Last update:''' ''2020-07-22, A.C.Raugh''; Master Schema version 1.13.0.0
  
 
== ASCII Representations ==
 
== ASCII Representations ==
Line 9: Line 9:
 
==== ASCII_AnyURI ====
 
==== ASCII_AnyURI ====
 
: Use this for fields that are intended to be interpreted as Uniform Resource Identifiers (URIs).  PDS restricts these strings to the ASCII character set, so you should URL-encode any non-ASCII characters in your URIs.
 
: Use this for fields that are intended to be interpreted as Uniform Resource Identifiers (URIs).  PDS restricts these strings to the ASCII character set, so you should URL-encode any non-ASCII characters in your URIs.
 +
 +
==== ASCII_BibCode ====
 +
: Use this type for defining attributes and table fields that represent ''bibcodes'' - Bibliographic Reference Codes, used by the ADS, SIMBAD, and NED databases to assign unique codes to literature references.
  
 
==== ASCII_Boolean ====
 
==== ASCII_Boolean ====
 
: This corresponds exactly to the XML Schema data type of "boolean". Valid values are "true", "false", "1" (one), and "0" (zero).
 
: This corresponds exactly to the XML Schema data type of "boolean". Valid values are "true", "false", "1" (one), and "0" (zero).
 
==== ASCII_Date ====
 
: This data type holds a calendar date only (that is, without a time element), in either the standard ''YYYY-MM-DD'' format or the alternate ''YYYY-DDD'' format. The date may be preceded by a negative sign for negative UTC years (year -1 UTC is year 2 B.C.). You may truncate the value to the appropriate accuracy (so day and month may be excluded if only the year is known).  Do not include the hyphen separator for fields that are omitted (that is, use ''1977'', not ''1977--'').  The '''Z''' suffix indicating a UTC date is prohibited in these values.
 
: '''Usage Note:''' The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not warn you, for example, that 2001-02-29 is not a valid date.
 
  
 
==== ASCII_Date_DOY ====
 
==== ASCII_Date_DOY ====
: This data type is identical to the '''ASCII_Date''' type except that the date ''must'' be in the day-of-year format.
+
: This data type is for dates in the ''YYYY-DDD'' format (i.e., year followed by day of year).
 
: '''Usage Note:''' The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not tell you, for example, that "1999-366" is not a valid date.
 
: '''Usage Note:''' The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not tell you, for example, that "1999-366" is not a valid date.
 
==== ASCII_Date_Time ====
 
: This data type is the most general way to indicate a calendar date with a time.  It is in one of the standard formats ''YYYY-MM-DD'''''T'''''hh:mm:ss.ssss'' or ''YYYY-DOY'''''T'''''hh:mm:ss.ssss''.  It can be truncated to the appropriate accuracy, even back to just a year value.  You may precede the year with a negative sign for negative UTC years (the year -1 UTC is the same year as 2 B.C.).  You may also include a 'Z' at the end of the string to indicate the value is a UTC date/time.
 
: '''Usage Note:''' The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not warn you, for example, that 2001-02-29 is not a valid date.
 
  
 
==== ASCII_Date_Time_DOY ====
 
==== ASCII_Date_Time_DOY ====
 
: This data type is identical to the '''ASCII_Date_Time''' type, except that the date portion ''must'' be in the day-of-year format.
 
: This data type is identical to the '''ASCII_Date_Time''' type, except that the date portion ''must'' be in the day-of-year format.
  
==== ASCII_Date_Time_UTC ====
+
==== ASCII_Date_Time_DOY_UTC ====
: This data type is identical to the '''ASCII_Date_TIme''' type, except that the value ''must'' have the '''Z''' appended to the end to indicate that the value is a UTC date/time.
+
: This data type is identical to the '''ASCII_Date_Time_DOY''' type, except that the value ''must'' have the '''Z''' appended to the end to indicate that the value is a UTC date and time.
  
 
==== ASCII_Date_Time_YMD ====
 
==== ASCII_Date_Time_YMD ====
 
: This data type is identical to the '''ASCII_Date_Time''' type, except that the date portion ''must'' be in the year-month-day format.
 
: This data type is identical to the '''ASCII_Date_Time''' type, except that the date portion ''must'' be in the year-month-day format.
 +
 +
==== ASCII_Date_Time_YMD_UTC ====
 +
: This data type is identical to the '''ASCII_Date_Time_YMD''' type, except that the value ''must'' have the '''Z''' appended to the end to indicate that the value is a UTC date/time.
  
 
==== ASCII_Date_YMD ====
 
==== ASCII_Date_YMD ====
Line 52: Line 50:
  
 
==== ASCII_Integer ====
 
==== ASCII_Integer ====
: This data type is a direct synonym for the XML Schema ''xs:int'' data type, so values are constrained to be in the range -2147483648 and 2147483647 (singed 32-bit integers, in hardware terms).
+
: This data type is based on the XML Schema ''xs:long'' data type. Values are constrained to be integers in the range -2^63 to (+2^63 - 1).  You may include a "+" or '-' sign.
  
 
==== ASCII_LID ====
 
==== ASCII_LID ====
: This data type is intended to hold PDS4 Logical Identifier (LID) values, without version numbers.  It is constrained to be at least 14 characters long and to use only ASCII characters.
+
: This data type is intended to hold PDS4 Logical Identifier (LID) values, without version numbers.  All LIDs must begin with "urn:" and may contain only lowercase letters, digits, the characters ('.','-','_'), and the ':' to separate parts of the identifier.
: '''Usage Note:''' No format checking is done on these values, so schema validation cannot warn you, for example, that "URN:NASA:PDS:MYBUNDLE" is invalid (because it violates the PDS4 Standards lowercase requirements). Do not assume a data object validator is doing format checking of LID values unless it explicitly claims to.
 
  
 
==== ASCII_LIDVID ====
 
==== ASCII_LIDVID ====
: This data type represents the concatenation of a PDS4 Logical Identifier (LID) with a Version Identifier (VID), with a double colon ("::") between them. It is constrained to be at least 19 characters long and to use only ASCII characters.
+
: This data type represents the concatenation of a PDS4 Logical Identifier (LID) with a Version Identifier (VID), with a double colon ("::") between them. Version identifiers have the required form ''M.m'', where ''M'' is the major version number and 'm' is the minor version number.
: '''Usage Note:''' No format checking is done on these values, so schema validation cannot warn you, for example, that "URN:NASA:PDS:MYBUNDLE::1.0" is invalid (because it violates the PDS4 Standards lowercase requirements). Do not assume a data object validator is doing format checking of LIDVID values unless it explicitly claims to.
 
  
 
==== ASCII_LIDVID_LID ====
 
==== ASCII_LIDVID_LID ====
: This data type accepts either '''ASCII_LID''' or '''ASCII_LIDVID''' values.  See the '''Usage Notes''' for those data types.
+
: This data type accepts either '''ASCII_LID''' or '''ASCII_LIDVID''' values.  
  
 
==== ASCII_MD5_Checksum ====
 
==== ASCII_MD5_Checksum ====
Line 70: Line 66:
  
 
==== ASCII_NonNegative_Integer ====
 
==== ASCII_NonNegative_Integer ====
: This data type includes integers in the range 0 to 9223372036854775807.  You may include a "+" sign, if so moved.
+
: This data type includes integers in the range 0 to (2^64 - 1).  You may ''not'' include a "+" sign for values of this type.
  
 
==== ASCII_Numeric_Base16 ====
 
==== ASCII_Numeric_Base16 ====
Line 77: Line 73:
 
==== ASCII_Numeric_Base2 ====
 
==== ASCII_Numeric_Base2 ====
 
: This data type is constrained to contain only the digits '1' and '0'.
 
: This data type is constrained to contain only the digits '1' and '0'.
: '''Usage Note:''' There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 7 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you '''''do not use this data type'' in either labels or data files.'''
+
: '''Usage Note:''' There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you '''''do not use this data type'' in either labels or data files.'''
  
 
==== ASCII_Numeric_Base8 ====
 
==== ASCII_Numeric_Base8 ====
 
: This data type is constrained to contain only the digits '0' through '7'.
 
: This data type is constrained to contain only the digits '0' through '7'.
: '''Usage Note:''' There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 7 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you '''''do not use this data type'' in either labels or data files.'''
+
: '''Usage Note:''' There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you '''''do not use this data type'' in either labels or data files.'''
  
 
==== ASCII_Real ====
 
==== ASCII_Real ====
: This data type is a synonym for the XML Schema type ''xs:double''.  It accepts values representable in a 64-bit IEEE754 floating point format.  It includes simple floating point values as well as exponential notation (i.e., powers of 10), as well as the special constants ''INF'' for positive infinity, ''-INF'' for negative infinity, and ''NaN'' for "Not a Number".  Case counts for these special values.
+
: This data type is a synonym for the XML Schema type ''xs:double''.  It accepts values representable in a 64-bit IEEE754 floating point format.  It includes simple floating point values as well as exponential notation (i.e., powers of 10). It will '''''not''''' accept the special constants ''INF'' for positive infinity, ''-INF'' for negative infinity, or ''NaN'' for "Not a Number".
: '''Usage Note:''' The special constants for +/- infinity and NaN ''should not appear'' in archival data - either in labels or in data tables. In labels, declare attributes as nil or omit them entirely; in data tables, define a numeric constant to use as a flag for missing data.
+
: '''Usage Note:''' For data in archival table products, use the ''<Special_Constants>'' class to define flag values for various conditions found in the data.
  
 
==== ASCII_Short_String_Collapsed ====
 
==== ASCII_Short_String_Collapsed ====
: This data type is based on the XML Schema type ''xs:token'' and contains a string of 1-255 ASCII characters.  Whitespace should be collapsed on input. It is used to define short, unformatted string values for label attributes.  (In data files, use the '''ASCII_String''' type.)
+
: Use this data type when defining local dictionary attributes in which you want white space in the value to be normalized (leading/trailing whitespace removed, all other runs of whitespace collapsed to a single blank character) by applications that read the metadata. Normalized values must be less than 256 bytes long. This is the data type used for most text-valued attributes that do not require long, free-form text. This data type has no application outside of dictionary creation; use '''ASCII_String''' for text fields in table data.
: '''Usage Note:''' Do not assume that your XML parser will necessarily collapse whitespace for you when handling strings of this data type.  Even a schema-aware parser cannot do that if it cannot find the referenced schema.
 
 
 
==== ASCII_Short_String_Preserved ====
 
: This data type is based on the XML Schema type ''xs:string'' and is constrained to be 1-255 ASCII characters long. Whitespace is preserved in these strings.
 
: '''Usage Note:''' The byte count limit makes the whitespace preservation property of this data type problematic, even for defining values of label attributes.  For this reason, SBN recommends you '''''do not use this data type'''''. If you need to preserve formatting, use the '''ASCII_Text_Preserved''' type.
 
  
 
==== ASCII_String ====
 
==== ASCII_String ====
: This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of ASCII characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables.
+
: This data type is based on the XML Schema type ''xs:token'' constrained to the ASCII character set, and corresponds to a non-empty string of ASCII characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables.  
 
 
==== ASCII_Text_Collapsed ====
 
: This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of ASCII characters of unlimited length. Whitespace should be collapsed on input.  This data type is used to define long, unformatted text string values for label attributes.
 
: '''Usage Note:''' Do not assume that your XML parser will necessarily collapse whitespace for you when handling strings of this data type.  Even a schema-aware parser cannot do that if it cannot find the referenced schema.  Also, long strings of unformatted text are not, in general, user-friendly.  SBN recommends that you '''''do not use this data type''''' in your mission dictionary definitions.
 
  
 
==== ASCII_Text_Preserved ====
 
==== ASCII_Text_Preserved ====
: This data type is based on the XML Schema type ''xs:string'' and corresponds to a non-empty string of ASCII characters of unlimited length in which whitespace is preserved. This data type is used to define long, formatted text block values for label attributes (like comments and descriptions).
+
: Use this data type when defining local dictionary attributes in which you want the original line breaks and spacing to be preserved in applications that read the metadata. Typically this is only desirable in free-format text fields of some length, where paragraph breaks and indenting are needed for readability. This data type has no application outside of dictionary creation; use '''ASCII_String''' for text fields in table data.
  
 
==== ASCII_Time ====
 
==== ASCII_Time ====
Line 114: Line 101:
  
 
==== UTF8_Short_String_Collapsed ====
 
==== UTF8_Short_String_Collapsed ====
: This data type is based on the XML Schema type ''xs:token'' and contains a string of UTF-8 characters up to 255 bytes long.  Whitespace should be collapsed on input. It is used to define short, unformatted string values for label attributes that require access to the entire UTF-8 character set (for non-ASCII characters and symbols, e.g.).  (In data files, use the '''UTF8_String''' type.)
+
: This is the UTF-8 version of the '''ASCII_Short_String_Collapsed''' data type, used in defining metadata values in local dictionaries. Use '''UTF8_String''' for UTF-8 fields in data tables.
: '''Usage Note:''' Do not assume that your XML parser will necessarily collapse whitespace for you when handling strings of this data type.  Even a schema-aware parser cannot do that if it cannot find the referenced schema.
 
 
 
==== UTF8_Short_String_Preserved ====
 
: This data type is based on the XML Schema type ''xs:string'' and is constrained to be a string of UTF-8 characters up to 255 bytes long.  Whitespace is preserved in these strings.
 
: '''Usage Note:''' The byte count limit makes the whitespace preservation property of this data type problematic, even for defining values of label attributes.  For this reason, SBN recommends you '''''do not use this data type'''''. If you need to preserve formatting, use the '''UTF8_Text_Preserved''' type.
 
  
 
==== UTF8_String ====
 
==== UTF8_String ====
: This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables.  
+
: This data type is based on the XML Schema type ''xs:token'' and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input.  This data type is used for describing fields in character tables. Note that UTF-8 characters may be more than one byte long.  Care should be taken when dealing with UTF-8 data in fixed-width tables to ensure that "bytes" and not "characters" are used to calculate locations and value lengths.
  
 
==== UTF8_Text_Preserved ====
 
==== UTF8_Text_Preserved ====
: This data type is based on the XML Schema type ''xs:string'' and corresponds to a non-empty string of UTF-8 characters of unlimited length in which whitespace is preserved. This data type is used to define long, formatted text block values for label attributes (like comments and descriptions) in which the data preparer wants or needs to use non-ASCII characters or symbols.
+
: This is the UTF-8 version of the '''ASCII_Text_Preserved''' data type, used in defining metadata values in local dictionaries. Use '''UTF8_String''' for UTF-8 fields in data tables.

Latest revision as of 17:30, 27 July 2020

Following is a glossary of data type definitions for values expressed as strings of characters, extracted from the PDS4 Information Model and master schema. They are used to describe fields defined in local and discipline dictionaries as well as values included in data objects (tables and arrays, for example).

Definitions for binary (hardware) formats used in data files are on the PDS4 Binary Data Type Definitions page.

Last update: 2020-07-22, A.C.Raugh; Master Schema version 1.13.0.0

ASCII Representations

ASCII_AnyURI

Use this for fields that are intended to be interpreted as Uniform Resource Identifiers (URIs). PDS restricts these strings to the ASCII character set, so you should URL-encode any non-ASCII characters in your URIs.
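
As an illustration of the URL-encoding advice above, here is a minimal Python sketch using the standard library's urllib.parse.quote. The helper name to_ascii_uri, the choice of "safe" characters, and the example.org address are assumptions made for this example, not part of the PDS4 standard.

```python
from urllib.parse import quote

# URI delimiter characters, plus "%" so that existing escapes are left alone.
_URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def to_ascii_uri(uri: str) -> str:
    """Percent-encode (as UTF-8) every character outside the safe set."""
    return quote(uri, safe=_URI_SAFE)


# Hypothetical URI containing non-ASCII characters.
encoded = to_ascii_uri("https://example.org/données/résumé.txt")
print(encoded)
```

The result contains only ASCII characters, so it can be stored in an ASCII_AnyURI field.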

ASCII_BibCode

Use this type for defining attributes and table fields that represent bibcodes - Bibliographic Reference Codes, used by the ADS, SIMBAD, and NED databases to assign unique codes to literature references.

ASCII_Boolean

This corresponds exactly to the XML Schema data type of "boolean". Valid values are "true", "false", "1" (one), and "0" (zero).
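
A reader that needs to turn these four strings into a native boolean could do something like the following Python sketch. The function name parse_ascii_boolean is hypothetical; the sketch treats the values as case-sensitive, as XML Schema does, and does no whitespace handling.

```python
_BOOLEAN_VALUES = {"true": True, "1": True, "false": False, "0": False}


def parse_ascii_boolean(text: str) -> bool:
    """Map one of the four valid strings to a Python bool."""
    try:
        return _BOOLEAN_VALUES[text]
    except KeyError:
        raise ValueError(f"not a valid ASCII_Boolean value: {text!r}") from None
```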

ASCII_Date_DOY

This data type is for dates in the YYYY-DDD format (i.e., year followed by day of year).
Usage Note: The date itself is not validated beyond simple numerical ranges, so PDS schema validation will not tell you, for example, that "1999-366" is not a valid date.
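
Because schema validation stops at simple numerical ranges, a data preparer may want a separate calendar check. The Python sketch below is one way to do it, assuming a non-negative four-digit year and a three-digit day of year; the function name is hypothetical.

```python
import calendar
import re

_DOY_DATE = re.compile(r"([0-9]{4})-([0-9]{3})")


def is_real_doy_date(text: str) -> bool:
    """Return True only if the day of year actually exists in that year."""
    match = _DOY_DATE.fullmatch(text)
    if match is None:
        return False
    year, day_of_year = int(match.group(1)), int(match.group(2))
    days_in_year = 366 if calendar.isleap(year) else 365
    return 1 <= day_of_year <= days_in_year
```

With this check, "1999-366" is rejected (1999 was not a leap year) while "2000-366" is accepted.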

ASCII_Date_Time_DOY

This data type is identical to the ASCII_Date_Time type, except that the date portion must be in the day-of-year format.

ASCII_Date_Time_DOY_UTC

This data type is identical to the ASCII_Date_Time_DOY type, except that the value must have the Z appended to the end to indicate that the value is a UTC date and time.

ASCII_Date_Time_YMD

This data type is identical to the ASCII_Date_Time type, except that the date portion must be in the year-month-day format.

ASCII_Date_Time_YMD_UTC

This data type is identical to the ASCII_Date_Time_YMD type, except that the value must have the Z appended to the end to indicate that the value is a UTC date/time.

ASCII_Date_YMD

This data type is for dates in the YYYY-MM-DD format (i.e., year, month, and day of month).

ASCII_Directory_Path_Name

Use this data type for path information. It is constrained to use only the ASCII character set.
Usage Note: All paths in PDS4 labels should be specified using Unix-style notation, and should never be absolute (so they should never begin with either a device identifier or a slash character). This will also typically be true for paths that appear in archival tables, but check with your PDS node if this presents a problem. The schema validation does not enforce these constraints. You should also not assume that fields with this data type include a trailing slash character.
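
Since the schema does not enforce the path conventions in the usage note, a pre-submission check along these lines may help. This is a rough Python sketch: the function name is hypothetical, and treating a leading "letter plus colon" as a device identifier and any backslash as non-Unix notation are assumptions of this example.

```python
import re

_DEVICE_PREFIX = re.compile(r"[A-Za-z]:")


def looks_like_relative_unix_path(path: str) -> bool:
    """Heuristic check: ASCII, relative, and free of backslashes."""
    if not path or not path.isascii():
        return False
    if path.startswith("/"):            # absolute path
        return False
    if _DEVICE_PREFIX.match(path):      # e.g. "C:" style device identifier
        return False
    if "\\" in path:                    # Windows-style separator
        return False
    return True
```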

ASCII_DOI

This is a string corresponding to a DOI of the form "10.string/string", where string can be any sequence of one or more non-whitespace characters.
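
The form described above translates fairly directly into a regular expression. The Python sketch below mirrors the wording of this definition rather than the pattern in the master schema, which may differ; the function name and the sample DOI are made up.

```python
import re

# "10." + one or more non-whitespace characters + "/" + one or more
# non-whitespace characters.
_DOI = re.compile(r"10\.\S+/\S+")


def looks_like_doi(text: str) -> bool:
    return _DOI.fullmatch(text) is not None
```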

ASCII_File_Name

This data type is a string representing a file name without path information. The characters are constrained to be in the ASCII subset.
Usage Note: Do not assume that any validator will check for file existence unless it specifically claims to do so. Schema validation is very simple and will not, for example, tell you that you have included path information (as indicated by the presence of a slash character), or included values that would be problematic for some or all operating systems (like the asterisk or question mark characters).

ASCII_File_Specification_Name

This data type is for file names with path information. It is effectively the concatenation of the ASCII_Directory_Path_Name and ASCII_File_Name, with an additional slash character as needed. The Usage Notes for those data types apply here as well.

ASCII_Integer

This data type is based on the XML Schema xs:long data type. Values are constrained to be integers in the range -2^63 to (+2^63 - 1). You may include a "+" or "-" sign.
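
Python integers are unbounded, so code that writes or checks values of this type has to apply the 64-bit limits itself. A minimal sketch, with a hypothetical function name:

```python
import re

_INTEGER = re.compile(r"[+-]?[0-9]+")
_MIN_VALUE = -2**63
_MAX_VALUE = 2**63 - 1


def is_ascii_integer(text: str) -> bool:
    """Check the lexical form (optional sign, digits) and the 64-bit range."""
    if _INTEGER.fullmatch(text) is None:
        return False
    return _MIN_VALUE <= int(text) <= _MAX_VALUE
```

The explicit pattern is used because Python's int() is more permissive than the schema: it also accepts surrounding whitespace and underscores.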

ASCII_LID

This data type is intended to hold PDS4 Logical Identifier (LID) values, without version numbers. All LIDs must begin with "urn:" and may contain only lowercase letters, digits, the characters ('.','-','_'), and the ':' to separate parts of the identifier.
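
The character rules stated above can be checked with a short pattern. The following Python sketch covers only those rules ("urn:" prefix, lowercase letters, digits, '.', '-', '_', ':'); it does not check the number or meaning of the colon-separated parts, and the function name is hypothetical.

```python
import re

_LID = re.compile(r"urn:[a-z0-9._:-]+")


def looks_like_lid(text: str) -> bool:
    return _LID.fullmatch(text) is not None
```

For example, "urn:nasa:pds:mybundle" passes, while the uppercase "URN:NASA:PDS:MYBUNDLE" does not.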

ASCII_LIDVID

This data type represents the concatenation of a PDS4 Logical Identifier (LID) with a Version Identifier (VID), with a double colon ("::") between them. Version identifiers have the required form M.m, where M is the major version number and m is the minor version number.
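
Software that needs the two parts separately can split on the double colon. A minimal Python sketch, assuming the version part is simply digits, a period, and digits; the function name is hypothetical and the LID part is not checked further.

```python
import re

_VID = re.compile(r"[0-9]+\.[0-9]+")


def split_lidvid(lidvid: str) -> tuple:
    """Return (lid, vid), or raise ValueError if the value is not LID::M.m."""
    lid, separator, vid = lidvid.partition("::")
    if not separator or not lid or _VID.fullmatch(vid) is None:
        raise ValueError(f"not a LIDVID: {lidvid!r}")
    return lid, vid
```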

ASCII_LIDVID_LID

This data type accepts either ASCII_LID or ASCII_LIDVID values.

ASCII_MD5_Checksum

Values of this data type must contain exactly 32 hexadecimal digits.
Usage Note: Do not assume that validators will do a checksum check with this value unless they specifically claim to do so.
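
If your validator does not verify checksums, you can compute and compare them yourself. The Python sketch below uses the standard hashlib module; the function names are hypothetical, and accepting both upper- and lowercase hex digits is an assumption of this example.

```python
import hashlib
import io
import re

_MD5 = re.compile(r"[0-9A-Fa-f]{32}")


def looks_like_md5(text: str) -> bool:
    """True if the value is exactly 32 hexadecimal digits."""
    return _MD5.fullmatch(text) is not None


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Read a binary stream in chunks and return its MD5 as lowercase hex."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage with an in-memory stream; a real file would be opened with mode "rb".
checksum = md5_of_stream(io.BytesIO(b"example table data\r\n"))
print(checksum, looks_like_md5(checksum))
```

When comparing against a label value, compare case-insensitively, since hexdigest() always returns lowercase.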

ASCII_NonNegative_Integer

This data type includes integers in the range 0 to (2^64 - 1). You may not include a "+" sign for values of this type.

ASCII_Numeric_Base16

This data type is a synonym for the XML Schema type xs:hexBinary. Hex digits above 9 may be upper or lower case.

ASCII_Numeric_Base2

This data type is constrained to contain only the digits '1' and '0'.
Usage Note: There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you do not use this data type in either labels or data files.

ASCII_Numeric_Base8

This data type is constrained to contain only the digits '0' through '7'.
Usage Note: There is no base indicator allowed in the value, so there is no way for a user who sees the value to know whether the string "101" is supposed to represent the value 5 in binary, or the value 65 in octal, or the decimal value 101. Consequently, SBN strongly recommends that you do not use this data type in either labels or data files.
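
The ambiguity described in the usage notes for ASCII_Numeric_Base2 and ASCII_Numeric_Base8 is easy to demonstrate: the same digit string yields three different numbers depending on the base the reader assumes. A short Python illustration:

```python
value = "101"

as_binary = int(value, 2)    # 1*4 + 0*2 + 1
as_octal = int(value, 8)     # 1*64 + 0*8 + 1
as_decimal = int(value, 10)

print(as_binary, as_octal, as_decimal)
```

Nothing in the string itself tells a reader which of the three interpretations was intended.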

ASCII_Real

This data type is a synonym for the XML Schema type xs:double. It accepts values representable in a 64-bit IEEE754 floating point format. It includes simple floating point values as well as exponential notation (i.e., powers of 10). It will not accept the special constants INF for positive infinity, -INF for negative infinity, or NaN for "Not a Number".
Usage Note: For data in archival table products, use the <Special_Constants> class to define flag values for various conditions found in the data.
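
Note that many languages will happily parse the excluded constants: Python's float(), for instance, accepts "inf" and "nan" in any case. The sketch below shows one way to screen values before writing them; the pattern is this example's reading of "simple floating point values as well as exponential notation" and is not copied from the master schema, and the function name is hypothetical.

```python
import re

# Optional sign, digits with an optional fraction (or a bare fraction),
# and an optional power-of-10 exponent. No INF, -INF, or NaN.
_REAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")


def is_ascii_real(text: str) -> bool:
    return _REAL.fullmatch(text) is not None
```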

ASCII_Short_String_Collapsed

Use this data type when defining local dictionary attributes in which you want white space in the value to be normalized (leading/trailing whitespace removed, all other runs of whitespace collapsed to a single blank character) by applications that read the metadata. Normalized values must be less than 256 bytes long. This is the data type used for most text-valued attributes that do not require long, free-form text. This data type has no application outside of dictionary creation; use ASCII_String for text fields in table data.
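
An application that cannot rely on its XML parser to normalize values can apply the normalization described above itself. A minimal Python sketch, assuming "whitespace" means the XML whitespace characters (space, tab, carriage return, line feed); the function names are hypothetical.

```python
import re

_XML_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text: str) -> str:
    """Collapse each whitespace run to one blank, then trim both ends."""
    return _XML_WHITESPACE_RUN.sub(" ", text).strip(" ")


def fits_short_string(text: str) -> bool:
    """Check that the normalized value is non-empty ASCII under 256 bytes."""
    normalized = collapse_whitespace(text)
    return normalized.isascii() and 0 < len(normalized) < 256
```

For example, collapse_whitespace("  Comet\n   67P \t nucleus ") gives "Comet 67P nucleus".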

ASCII_String

This data type is based on the XML Schema type xs:token constrained to the ASCII character set, and corresponds to a non-empty string of ASCII characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input. This data type is used for describing fields in character tables.

ASCII_Text_Preserved

Use this data type when defining local dictionary attributes in which you want the original line breaks and spacing to be preserved in applications that read the metadata. Typically this is only desirable in free-format text fields of some length, where paragraph breaks and indenting are needed for readability. This data type has no application outside of dictionary creation; use ASCII_String for text fields in table data.

ASCII_Time

This data type is for values that hold a 24-hour clock time in the standard hh:mm:ss.ssss format. The string may optionally end in a Z to indicate a UTC time. The string may be truncated at the appropriate point for the actual precision; omit the ':' separator when there is no value to the right of it. Both 00:00 and 24:00 are valid values.

ASCII_VID

This data type corresponds to a PDS4 Version Identifier (VID). It is a two-part version number of the form N.n, where both N and n are present and non-negative. The major version number (N) may be zero, but may not contain leading zeroes for values greater than zero. So "0.1" is valid, but "01.1" is not.
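
The leading-zero rule for the major version number can be expressed as a pattern. In the Python sketch below, the minor version number is assumed to be any run of digits, since the text above does not say whether the same leading-zero rule applies to it; the function name is hypothetical.

```python
import re

# Major: "0", or digits that do not start with zero. Minor: one or more digits.
_VID = re.compile(r"(?:0|[1-9][0-9]*)\.[0-9]+")


def looks_like_vid(text: str) -> bool:
    return _VID.fullmatch(text) is not None
```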

UTF-8 Representations

UTF8_Short_String_Collapsed

This is the UTF-8 version of the ASCII_Short_String_Collapsed data type, used in defining metadata values in local dictionaries. Use UTF8_String for UTF-8 fields in data tables.

UTF8_String

This data type is based on the XML Schema type xs:token and corresponds to a non-empty string of UTF-8 characters (which may include whitespace) of unlimited length. Whitespace should be collapsed on input. This data type is used for describing fields in character tables. Note that UTF-8 characters may be more than one byte long. Care should be taken when dealing with UTF-8 data in fixed-width tables to ensure that "bytes" and not "characters" are used to calculate locations and value lengths.
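
The byte-versus-character distinction can be seen in a small Python example. The record layout here is invented for illustration (an 8-byte name field followed by a 2-byte number field), and the 1-based byte offset used by extract_field is an assumption of this sketch.

```python
def extract_field(record: bytes, offset: int, length: int) -> str:
    """Slice a field out of a fixed-width record by BYTE position (1-based)."""
    start = offset - 1
    return record[start:start + length].decode("utf-8")


text = "Čapek  42"               # 9 characters
record = text.encode("utf-8")    # 10 bytes: "Č" occupies two bytes

name = extract_field(record, 1, 8)
number = extract_field(record, 9, 2)
wrong = text[8:10]               # slicing by characters is off by one here

print(len(text), len(record), repr(name), repr(number), repr(wrong))
```

Slicing the encoded bytes recovers "42" for the second field, whereas slicing the decoded string at the same positions returns only "2".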

UTF8_Text_Preserved

This is the UTF-8 version of the ASCII_Text_Preserved data type, used in defining metadata values in local dictionaries. Use UTF8_String for UTF-8 fields in data tables.