Filling Out the Table Delimited Data Structure

From The SBN Wiki

Revision as of 21:05, 27 February 2013

The <Table_Delimited> class contains the information needed to parse a string of character bytes in a delimited table format into a table structure in programmatic memory. It is very similar to <Table_Character>, the principal difference being on the I/O end of things - Table_Character, as a fixed-width table, could potentially allow users to access individual rows and fields directly; Table_Delimited allows only serial access to the records and fields.

N.B.: PDS is in the process of defining a standard comprising parsing rules for extracting tabular data from a text file with delimited fields and records. That standard defines the I/O processing for the <Table_Delimited> data.

For additional explanation, see the PDS4 Standards Reference, or contact your PDS node consultants.

Following are the attributes and subclasses you'll find in <Table_Delimited>, in label order.

Note that in the PDS4 master schema, all classes have capitalized names; attributes never do.



If you'd like to give your table a name, do it here.



If you need to reference this <Table_Delimited> from elsewhere in the same label, give it an identifier here. If the identifier uses the same syntax as an average variable name in a typical programming language, you should be OK syntactically.



This is the offset, in bytes, from the beginning of the file to the beginning of the Table_Delimited data. Offsets begin at zero. You must indicate a unit of bytes for this attribute:

    <offset unit="byte">1234567890</offset>



If you know the total length of the Table_Delimited data, including all delimiters, line-break characters, and filler space, you can list it here. You must include a unit of bytes for this value. For example:

    <object_size unit="byte">10240</object_size>
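As a rough illustration of how a reader might use these two values together, the Python sketch below seeks to the byte offset and reads the table bytes. The function name `read_table_bytes` and its parameters are hypothetical and not part of any PDS tool; the sketch assumes the offset and size have already been extracted from the label.

```python
import os
import tempfile


def read_table_bytes(path, offset, object_size=None):
    """Return the raw bytes of the table (hypothetical helper).

    offset      -- zero-based byte offset taken from the label
    object_size -- total table length in bytes, or None to read to end of file
    """
    with open(path, "rb") as handle:
        handle.seek(offset)  # offsets begin at zero
        if object_size is None:
            return handle.read()
        return handle.read(object_size)


# Usage: a 7-byte header followed by a 10-byte delimited table.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"HEADER\n" + b"1,2\r\n3,4\r\n")
table = read_table_bytes(tmp.name, offset=7, object_size=10)
os.remove(tmp.name)
```

Because the size is optional in the label, the sketch falls back to reading to the end of the file when it is not given; that fallback is an assumption, not a documented rule.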



This attribute must have the standard value PDS_DSV V1.0.



This attribute must have the standard value Character.



This is a free-format text field for any additional comments you might care to include at this point.



This attribute must contain the total number of records in the Table_Delimited data.



This attribute must contain the standard value carriage_return line_feed. Note that the data must have carriage-return/linefeed delimited records.



This must have one of the standard values:

  • comma
  • horizontal_tab
  • semicolon
  • vertical_bar

Note that the parsing rules standard will contain additional information about leading and trailing blanks, possible additional delimiters around text fields, the significance of leading and trailing space, and so on.
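To make the serial-access idea concrete, here is a minimal Python sketch that splits table text into records and fields. The mapping from the standard values to actual characters is assumed from their names, and `iter_records` is a hypothetical helper. It deliberately ignores quoting and the treatment of leading and trailing blanks, since those details belong to the parsing rules standard.

```python
# Label keyword -> delimiter character (mapping assumed from the keyword names).
FIELD_DELIMITERS = {
    "comma": ",",
    "horizontal_tab": "\t",
    "semicolon": ";",
    "vertical_bar": "|",
}

RECORD_DELIMITER = "\r\n"  # carriage_return line_feed


def iter_records(text, field_delimiter):
    """Yield each record, in file order, as a list of raw field strings."""
    separator = FIELD_DELIMITERS[field_delimiter]
    pieces = text.split(RECORD_DELIMITER)
    if pieces and pieces[-1] == "":
        pieces.pop()  # drop the empty piece after the final record delimiter
    for line in pieces:
        yield line.split(separator)


sample = "1|alpha\r\n2|beta\r\n"
rows = list(iter_records(sample, "vertical_bar"))
```

The generator hands back one record at a time, which mirrors the serial access that a delimited table allows.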



If the records comprising the Table_Delimited are evenly spaced through some dimension (like time, distance, wavelength, etc.), use this class to define that dimension and spacing. For details on using this class, see Filling Out the Table Character Data Structure - Uniformly_Sampled section.



This class defines the repeating series of fields contained in one complete record of the Table_Delimited data.



The number of fields in the Record.

Note: There is some controversy about what this value should be when Group Fields are present. For the time being, use the data dictionary definition - the number of fields in the record is the total number of scalar values in the record. That is, it is the number of Field definitions directly in the record, plus, for each Group Field, the number of Field definitions in that group multiplied by the repetitions value of that group.
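The counting rule in the note above can be sketched in a few lines of Python. The way the record is represented here (the string "field" for a plain Field, a tuple for a Group Field) is purely an illustrative assumption, and `count_fields` is a hypothetical name.

```python
def count_fields(record):
    """Count scalar values per record under the data dictionary rule.

    `record` is a list whose items are either the string "field" (one plain
    Field definition) or a tuple (fields_in_group, repetitions) describing
    one Group Field.
    """
    total = 0
    for item in record:
        if item == "field":
            total += 1
        else:
            fields_in_group, repetitions = item
            total += fields_in_group * repetitions
    return total


# Two Fields, a Group Field of 3 Fields repeated 4 times, then one more Field.
example = ["field", "field", (3, 4), "field"]
n_fields = count_fields(example)  # 3 plain Fields + 3 * 4 grouped values
```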



The length of the longest single record in the Table_Delimited data, including all fields, all repetitions of group fields, any space between fields, and the record delimiters. You must specify a unit of bytes for this value:

    <record_length unit="byte">1234</record_length>

Records are composed of Fields and Group Fields. A Record must have at least one of those, and can have an arbitrary number of them, in any order (that is, you can have Fields and Group Fields interspersed).

There is currently serious disagreement over the use of Group Fields, in particular over how to determine the correct value for the required <fields> attribute, above. Because of this, SBN data preparers should not use Group Fields until the disagreements have been resolved. Group Fields are never necessary - they are a notational convenience to save writing out large numbers of similar Field definitions.


This class defines a single scalar field.



The name of the field. SBN recommends that this be something fairly human-readable that can be easily turned into a variable name for use in applications, or displayed as a meaningful column heading.



This is the sequential number of the Field definition. This is poorly defined when Group Fields are present, and should probably not be used at all in that case. The field_number is intended to be a help to human readers trying to map field definitions to columns in a print-out of the Table.

Note: The fact that this attribute is optional is potentially problematic in a delimited table. Since there is no <field_location> to use to determine field order, in the absence of this attribute field order in the file must be inferred from the order of the <Field_Delimited> classes. Fortunately, XML is inherently ordered, so the obvious thing to do does the right thing (though Group_Field_Delimited classes increase the complexity and likelihood of unpleasant surprises); but this is a logical hole in the model, which is supposed to be implementation-agnostic.

Until this point is addressed, SBN strongly suggests you include field_sequence_numbers in your Field_Delimited classes.



The type of the values in the field. This must be one of the values listed in the Standard Values Quick Reference.



The greatest number of bytes in the longest instance of this field in the delimited table. You must specify the unit:

    <maximum_field_length unit="byte">12</maximum_field_length>

Note: This value is ambiguous within the context of a <Group_Field_Delimited> class.
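One way to arrive at this value is to scan the parsed records and keep the longest byte length seen in each column. The Python sketch below does that; `maximum_field_lengths` is a hypothetical helper, and it assumes that only the field content is counted (no delimiters or surrounding quotes), which the text above does not spell out.

```python
def maximum_field_lengths(records, encoding="utf-8"):
    """Return the longest byte length seen in each column.

    `records` is an iterable of lists of field strings. Lengths are counted
    in bytes after encoding, not in characters.
    """
    longest = []
    for fields in records:
        for index, value in enumerate(fields):
            size = len(value.encode(encoding))
            if index == len(longest):
                longest.append(size)  # first time this column is seen
            elif size > longest[index]:
                longest[index] = size
    return longest


# "Å" is one character but two bytes in UTF-8.
lengths = maximum_field_lengths([["a", "Å"], ["bcd", "x"]])
```

Counting bytes rather than characters matters as soon as a field holds non-ASCII text, because the label value is given in bytes.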



The value of this attribute is a string representing the read/print format for the data in the field, using a subset of the POSIX print conventions.

Note: The syntax of the content of this field is poorly defined in the current data dictionary.

The SBN has defined a subset of the POSIX standard for use in SBN data sets on the PDS4_field_format_Conventions page.

SBN will require that this attribute be present in all Field definitions. It is used for validation of the Table contents.
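As an informal illustration of validating a value against a print format, the Python sketch below uses Python's built-in printf-style `%` operator. This is only an approximation: the SBN subset of the POSIX conventions may accept or reject formats differently, and the names `render` and `fits` are hypothetical.

```python
def render(field_format, value):
    """Format one value with a printf-style format string such as "%7.3f"."""
    return field_format % value


def fits(field_format, value, maximum_field_length):
    """Check that the formatted value stays within the declared byte length."""
    return len(render(field_format, value).encode("utf-8")) <= maximum_field_length


formatted_float = render("%7.3f", 3.14159)  # width 7, 3 decimal places
formatted_int = render("%5d", 42)           # width 5, right-justified
too_wide = fits("%7.3f", 12345.678, 7)      # the value needs 9 bytes
```

A width in the format is a minimum, not a maximum, so a large value can still overflow the declared field length; that is the case the `fits` check is meant to catch.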



If the value in this field has an associated unit, this is where it goes. This value is case sensitive, and you may use characters from the UTF-8 character set (like the Angstrom symbol) where appropriate.

Note: If a field contains a unitless value, then there should be no <unit> attribute. NEVER include a null unit value, or even worse, this: <unit>N/A</unit>.



If the data in this field are scaled, this attribute should contain the value the data must be multiplied by to get back to the original value. Scaling factors are applied prior to adding any offset.



If the values in the field have been shifted by an offset, this attribute should contain the value that must be added to each field value to get back to the original value. Offsets are added after the scaling factor, if any.
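The order of operations described for scaling and offset can be written as a one-line function. The Python sketch below is only illustrative; the function and parameter names are hypothetical, and the defaults assume that a missing scaling factor means 1 and a missing offset means 0.

```python
def restore_value(stored, scaling_factor=1.0, value_offset=0.0):
    """Recover the original value: multiply by the scale first, then add the offset."""
    return stored * scaling_factor + value_offset


original = restore_value(8, scaling_factor=0.5, value_offset=10.0)

# Order matters: 2 * 3 + 1 is 7, whereas (2 + 1) * 3 would be 9.
ordered = restore_value(2, scaling_factor=3, value_offset=1)
```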



Free-format text describing the content of the field.

Note: While not required, SBN expects to see a useful definition for every Field, as do both reviewers and users. Omit this field at your peril.



This class defines flag values used to indicate that a particular field value is unknown for one reason or another. It is identical to the <Special_Constants> class used in the Array classes. For details, check the Filling Out the Array 2D Data Structure - <Special_Constants> page. Here is a quick list of the special constants available in this class:

  • saturated_constant
  • missing_constant
  • error_constant
  • invalid_constant
  • unknown_constant
  • not_applicable_constant
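When reading a field, a program would typically replace these flag values before doing any arithmetic. The Python sketch below shows one possible approach; `mask_special` is a hypothetical helper, the flag values are made up for the example, and it assumes the field values have already been converted to the same type as the constants.

```python
def mask_special(values, special_constants):
    """Replace any value that matches a declared special constant with None.

    `special_constants` maps constant names (for example "missing_constant")
    to the flag values declared in the label.
    """
    flagged = set(special_constants.values())
    return [None if value in flagged else value for value in values]


# Flag values here are invented for illustration only.
constants = {"missing_constant": -999, "saturated_constant": 65535}
cleaned = mask_special([12, -999, 40, 65535], constants)
```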



If you want to include things like extrema, mean value, and such for all the values that occur in this field through all the records in the table, this is the place to do it. This class is identical for all Field types. For details, see Filling Out the Field Statistics Class. Here is a quick list of the field statistics available in this class:

  • maximum
  • minimum
  • mean
  • standard_deviation
  • median


This class defines a set of Fields that repeats a given number of times in each record. Unlike in fixed-width tables, in a delimited table Group Fields may not be nested.

Note: Unless you have three good reasons, don't use Group Fields in SBN data.

NOTE: While I can easily understand the general case for not allowing nested Group_Fields in any table, I don't see a good reason for allowing them in fixed-width tables and forbidding them in delimited tables.



The number of times the complete set of Fields comprising this <Group_Field_Delimited> repeats.

Note: The minimum value for this field listed in the data dictionary is one, but no product will pass SBN review unless this value is at least two.



The proper way to calculate the value for this attribute is the subject of debate as I type this. For the time being, follow the rules in the data dictionary: Count the number of <Field_Delimited> classes in the group; then multiply by the value of <repetitions> for the group field.
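To see what the multiplication rule counts, it can help to expand a group into the flat list of fields it stands for. The Python sketch below does this; `expand_group` and the `name_N` naming scheme are hypothetical and chosen only for illustration.

```python
def expand_group(field_names, repetitions):
    """List every scalar field a group represents, in record order.

    The complete set of fields repeats as a block, so all fields of
    repetition 1 come before any field of repetition 2.
    """
    return [
        f"{name}_{rep}"
        for rep in range(1, repetitions + 1)
        for name in field_names
    ]


flat = expand_group(["x", "y"], 3)
group_field_count = len(flat)  # 2 fields in the group * 3 repetitions
```

Writing the group out this way also shows why Group Fields are only a convenience: the expanded list could be given as ordinary Field definitions instead.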


As in the Record_Delimited, the Group_Field_Delimited contains a series of Field_Delimited classes. There must be at least one. The <Field_Delimited> classes inside a Group_Field_Delimited have the same structure and constraints (and issues) as those in the Record_Delimited.