Difference between revisions of "PDS4 field format Conventions"

From The SBN Wiki

Revision as of 18:34, 7 September 2012

The <field_format> attribute found in table field definitions is equivalent to the PDS3 FORMAT keyword in COLUMN definitions. But in PDS4 we will be using a subset of the POSIX I/O conversion specifiers (as used in most modern programming languages with a printf statement or the equivalent), rather than the FORTRAN specifiers used in PDS3.



NOTE: As of this writing, this is a proposal, not a standard. The details have not been discussed by the DDWG, nor has an official standard been written. So check the actual standard before submitting data for review or archiving.



Why bother?

The field_format attribute provides two potential benefits to users and the archive:

  1. In character tables, the field_format specification provides width and precision information that can be used in validating individual values in the table data.
  2. In binary tables, the field_format specification can be used as an output format for converting binary numeric values to a character form without losing or overstating significant digits.
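Both uses can be sketched in a few lines of Python, whose % operator follows printf conventions for the cases shown. This is only an illustration: the helper name fits_width, the sample value, and the "8.2f" format are made up for the example and do not come from any PDS tool or label.

```python
import struct

def fits_width(text, width):
    """True if a character-table value, ignoring blank padding, fits in width.

    Hypothetical helper for illustration only.
    """
    return len(text.strip(" ")) <= width

# Benefit 1: check a value from a character table against a width of 8.
print(fits_width("  273.15", 8))     # True
print(fits_width("123456789", 8))    # False

# Benefit 2: convert a binary double to characters with field_format "8.2f".
raw = struct.pack(">d", 273.15)      # stand-in for 8 bytes read from a binary table
(value,) = struct.unpack(">d", raw)  # big-endian IEEE 754 double
text = "%8.2f" % value
print(repr(text))                    # '  273.15'
```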


The POSIX Standard

The latest edition of the relevant standard is IEEE Std 1003.1-2004. The subset we're selecting is compatible with the 2001 version of the same standard, which in turn defers to the ISO C standard for printf conventions. Specifically, I'm referencing Chapter 5 of the Base Definitions volume: "File Format Notation".

Formation Rule

The formation rule for a field_format value is:

[-]width[.precision]specifier

where square brackets indicate an optional component, and:

width
is the total potential width of the field (i.e., the width of the widest value occurring in the field)
precision
is the number of digits following the decimal point for real numbers (but is otherwise ignored)
specifier
is exactly one of [doxfes]
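The formation rule translates fairly directly into a regular expression. The sketch below is one possible reading of it, not an official validator: the names FIELD_FORMAT_RE and parse_field_format are hypothetical, and requiring the width to be a positive integer with no leading zero is an assumption (in printf a leading "0" is a zero-padding flag, which the rule does not include).

```python
import re

# Hypothetical pattern for the proposed rule: [-]width[.precision]specifier
# Assumption: width is a positive integer with no leading zero.
FIELD_FORMAT_RE = re.compile(r"(-?)([1-9][0-9]*)(?:\.([0-9]+))?([doxfes])")

def parse_field_format(value):
    """Split a field_format value into its parts, or raise ValueError."""
    match = FIELD_FORMAT_RE.fullmatch(value)
    if match is None:
        raise ValueError("not a valid field_format value: %r" % (value,))
    flag, width, precision, specifier = match.groups()
    return {
        "left_justified": flag == "-",
        "width": int(width),
        "precision": None if precision is None else int(precision),
        "specifier": specifier,
    }

print(parse_field_format("-12s"))
print(parse_field_format("10.4f"))
```

Under this reading, values such as "%5d", "+7.2f" or "8g" are rejected, since the leading "%", the "+" flag and the "g" specifier are not part of the proposed subset.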


Breaking this rule down into separate parts...


[-]
In the case of a string-valued field, an initial "-" indicates that the string has been or should be left-justified. This is actually the preferred way to present most string values in character tables, so the field_format value for fields with a data type of ASCII_String will nearly always begin with a "-". Numeric fields should, in general, be decimal-aligned, so the use of "-" for fields with a data type of, for example, ASCII_Integer should be avoided.
width
The width is an integer value indicating the maximum number of characters needed for the complete representation of the largest (in terms of display bytes, not necessarily magnitude) value occurring, or potentially occurring, in the field. This should include bytes for signs, decimal points, and exponents. In the case of string values, it should be the maximum width from the first non-blank character to the last non-blank character. It should not include bytes for field delimiters, which are not considered part of the field.
The width is separated from the precision by a decimal point ("."). If there is no precision specified, the decimal point must be omitted.
precision
The precision is an integer value indicating the number of digits to the right of the decimal point in a floating-point number representation. It should only be used in the field_format values of real-valued data (fields with data types of ASCII_Real or IEEE754Double, for example).
specifier
The specifier indicates the broad data type for display. It will be one of a subset of the conversion specifiers included in the IEEE standard:
d
A decimal integer
o
An unsigned octal integer
x
An unsigned hexadecimal number
f
A floating point number in the format [-]ddd.ddd, where the actual number of digits before and after the decimal point are determined by the preceding width and precision values (note that the width includes the decimal point and any sign).
e
A floating point number in the format [-]d.ddde+/-dd where "+/-" stands for exactly one character (either "+" or "-"), there is always exactly one digit to the left of the decimal point, and the number of digits to the right of the decimal point is determined by the preceding precision value.
s
A string value. Note that strings should generally be left-justified in fixed width character tables and on output from a binary table, so most field_format values ending in "s" should begin with "-".
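To see what each specifier produces, here is a small Python sketch. Python's % operator follows printf conventions for these six specifiers, so prepending "%" to a field_format value gives a usable format string. The helper name render and the sample values are assumptions made for the example.

```python
def render(value, field_format):
    """Format one value with a field_format string by prepending '%'.

    Hypothetical helper for illustration only.
    """
    return ("%" + field_format) % (value,)

print(repr(render(42, "5d")))            # '   42'
print(repr(render(8, "3o")))             # ' 10'
print(repr(render(255, "4x")))           # '  ff'
print(repr(render(3.14159, "8.3f")))     # '   3.142'
print(repr(render(12345.678, "10.3e")))  # ' 1.235e+04'
print(repr(render("abc", "-6s")))        # 'abc   '
```

Every result is exactly as wide as the width in its format, with numbers right-justified and the string left-justified because of the leading "-".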



Variations on the Theme

The proposal is intended to be a very limited subset of the total universe of possible format conversion specifiers. Here are some things I left out that people might want to include (remember, we're talking about formats that will be useful in the archive labels):

"i" conversion specifier
The "i" specifier is identical to the "d" specifier in every way. I chose "d" over "i" because it was mnemonic ("d" for decimal, "o" for octal, "x" for hexadecimal - OK, so it's not a perfect system), but people may prefer the "i" because it looks more like the familiar FORTRAN specifier.
"X" and "E" specifiers
These uppercase specifiers only matter on output - they cause the letters in their formats ("a"-"f" for hexadecimal, "e" for reals) to be uppercase rather than lowercase. Seemed like a pointless complication to me, so I ignored it. Programmers who care can do what they want.
Precision for strings and integers
The POSIX standard does interpret the precision value for strings and integers. For integers it indicates the minimum number of digits to print, and values with fewer than precision digits are zero-padded on the left. For strings it indicates the maximum number of characters to print, so longer strings are truncated; blank padding on the left or right is still controlled by the width and the "-" flag. Seemed to me like these could be problematic in an archive, so I prohibited them.
"+" to force a sign
Numeric specifications may begin with a "+" to indicate that the value should always have an explicit sign. The POSIX standard allows a specifier to begin with both "+" and "-" (so you could have a left-justified number value with an explicit sign). I'd be in favor of allowing "-" for strings only and "+" for numerics only. Not sure I want to allow people the option of left-justifying numbers at all.
"g", "G" and "c" specifiers
The "g|G" specifier switches back and forth between floating point and exponential notation, depending on the magnitude of the output value. I'm not a big fan of that for archive tables, but perhaps it is useful. The 'c' specifier is for outputting a single byte at a time. I think "1s" is good enough for us, and the subtle distinction between "40c" and "40s" is not something I want to spend the next 10 years explaining.
The leading "%"
Format specifiers actually begin with a "%" character. It seemed pointlessly pedantic to require users to supply this, though, so I left it out. On the other hand, it's as pervasive as the format specifiers now, so maybe we should require it to help users adjust to the new system (by constantly reminding them "This is not a FORTRAN specification").
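For anyone weighing these variations, the short Python sketch below shows how a printf-style formatter treats three of the excluded features. It uses Python's % operator, which matches C printf for these cases; the sample values and variable names are invented for the example.

```python
# Precision on an integer: minimum number of digits, zero-padded on the left.
int_precision = "%8.5d" % 42          # '   00042'

# Precision on a string: maximum number of characters, so the value is truncated.
str_precision = "%-8.3s" % "abcdef"   # 'abc     '

# "+" forces an explicit sign; combined with "-" the number is also left-justified.
forced_sign = "%+6d" % 42             # '   +42'
left_signed = "%-+6d" % 42            # '+42   '

# "g" chooses fixed or exponential notation from the magnitude of the value.
g_small = "%g" % 0.00001              # '1e-05'
g_mid = "%g" % 1234.5                 # '1234.5'
g_large = "%g" % 1234567.0            # '1.23457e+06'

for name in ("int_precision", "str_precision", "forced_sign",
             "left_signed", "g_small", "g_mid", "g_large"):
    print(name, repr(globals()[name]))
```

The "g" results show why it is awkward for fixed-width tables: the notation, and the number of digits shown, changes from row to row.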