Filling Out the Document Format Set Classes

From The SBN Wiki

The <Document_Format_Set> class contains a brief description of one of possibly several different physical forms of the same document. For example, a PDF file is one common format for documents, and generally consists of a single file. A Document_Format_Set containing a single Document_File subclass would be used to label that single file. Alternatively, in PDS3, most documents were presented as an ASCII text file and a series of separate graphics files (PNG, GIF, JPEG) for the figures (graphics and/or images). The combined text and graphics would constitute a single form of the document and would be described in PDS4 using a single Document_Format_Set with multiple Document_File subclasses.

For additional explanation, see the PDS4 Standards Reference, or contact your PDS node consultant.

Following are the attributes and subclasses you'll find in the Document_Format_Set, in label order.

Note that in the PDS4 master schema, all classes have capitalized names; attributes never do.



This class must occur exactly once. It provides information that applies to this physical format of the document taken as a whole.



This is useful if the document format you're describing contains more than one file. The <local_identifier> corresponding to the file that should be considered the starting point for reading the document should be the value of this attribute (you'll have to define a <local_identifier> for that file, of course).

For example, in the ASCII text plus graphics files format, the <Document_File> object for the ASCII text file should contain a <local_identifier> so it can be identified here as the main file for that document format.



The format_type must be set to one of the values "single file" or "multiple file", depending on whether one <Document_File> class follows or more than one, respectively.
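The rule above can be sketched in a few lines of Python. This is only an illustration: the helper name format_type_for is hypothetical, and the exact spelling of the two values is taken from the text above, so check it against the schema you are validating with.

```python
def format_type_for(file_names):
    """Pick the format_type value for the files that make up one document format.

    Hypothetical helper; the value spellings follow the wiki text and
    should be confirmed against the current schema.
    """
    if not file_names:
        raise ValueError("a document format needs at least one file")
    return "single file" if len(file_names) == 1 else "multiple file"


print(format_type_for(["guide.pdf"]))
print(format_type_for(["guide.txt", "fig1.png", "fig2.png"]))
```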



This attribute provides a place for a free-format text description of this particular format of the document. For example, if this format resulted from scanning a paper copy, you can use this attribute to mention that and credit/thank the source.

Description of the document content (that is, the logical content, not the physical file structure) should be in the <Document> class.



This class identifies and describes one of the files comprising this particular physical form of the document. There must be one <Document_File> class for each file in the format described by the <Document_Format_Set> containing this class.



The name of the file being described, without any directory path information (which can be included below in directory_path_name, if needed). The name is case-sensitive.



This attribute holds a simple identifier to be used to cross-reference this file description from elsewhere in the label. This is required for the first file (the file a user should examine first) of a document format that contains multiple files. It is optional in all other cases. Case is significant for this as well.



If you would like to document the creation date of the file named in file_name, this is the place to do it. The date and (optional) time must be in the ISO 8601 standard format.
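If you generate labels with a script, a value in ISO 8601 form can be produced as in the following Python sketch. Several things here are assumptions rather than PDS requirements: the function names are hypothetical, the file's last-modification time is used as a stand-in for its creation time (many file systems do not record the latter), and the time is written in UTC with a trailing "Z". Check the Standards Reference for the exact form expected.

```python
import os
from datetime import datetime, timezone


def iso8601_utc(epoch_seconds):
    """Format seconds since the Unix epoch as an ISO 8601 UTC date-time."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


def file_date_time(path):
    """ISO 8601 string for a file's last-modification time.

    Assumption: modification time is an acceptable proxy for creation time.
    """
    return iso8601_utc(os.path.getmtime(path))
```

If only the date is wanted, the same string can be cut at the "T".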



The size of the file, in bytes. The value must not contain any punctuation ("12345", not "12,345") and should be accurate to the byte.
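A minimal Python sketch for getting this value straight from the file system, so it is exact and free of separators. The function name file_size_value is hypothetical.

```python
import os


def file_size_value(path):
    """Return the exact size of a file in bytes as a plain digit string."""
    # os.path.getsize reports the size in bytes; str() adds no separators.
    return str(os.path.getsize(path))
```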



This is the total number of records in this file. Note that the concept of "record" is not defined. In a flat text file, this is usually taken as the number of lines delimited by carriage control (which can vary). In binary files, this may be something like the count of rows in a binary table, provided the file contains only one data object. In other cases this cannot be defined.
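For the flat text file case, one way to count lines without caring which carriage control was used is sketched below in Python. This is an assumption about what "record" should mean, not a PDS definition: bytes.splitlines() treats LF, CR, and CR-LF as line ends, and a final line with no terminator still counts as a record. The function names are hypothetical.

```python
def count_text_records(data):
    """Count lines in raw file bytes, accepting LF, CR, or CR-LF line ends."""
    return len(data.splitlines())


def count_file_records(path):
    """Read a text file as bytes and count its lines."""
    with open(path, "rb") as handle:
        return count_text_records(handle.read())


print(count_text_records(b"line one\r\nline two\r\n"))
```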



If you prefer to track the MD5 checksum of a data file in its PDS label, here's an attribute to hold it. In this context it must be the MD5 checksum of the indicated file as a whole, not of a part of the file.
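A whole-file MD5 can be computed with Python's standard hashlib module, as in this sketch. The function name and the chunk size are arbitrary choices; reading in chunks just keeps memory use small for large files, and the result is the same as hashing the file in one pass.

```python
import hashlib


def md5_checksum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of an entire file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```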



This attribute provides a place for free-format text to add any additional explanation or credits relevant to this particular file.



If directory path information is needed to find the named file (that is, if it is not in the same directory as the label describing it), the path relative to the label file goes here.

The file must be in either the same directory as the label or a subdirectory of that directory. Paths should follow the Unix/Linux convention and use '/' as a level separator; case is significant. Do not include the file name with the path information.

Note: The details of the format for this value are not defined in the PDS4 Data Dictionary or Standards Reference Release 1.0, nor are there any format constraints in the data type definition in the schemas. Although an "ASCII_Directory_Path_Name" type is defined in the XSD schema, it does not constrain the format of the field beyond the requirement that it contain ASCII, and even then that data type is not used to define the <directory_path_name> attribute.

I think requiring Linux path rules is the intention, but I'm not sure. Notwithstanding, always use Unix/Linux-style paths for data coming into the SBN so we have a consistent base to work with.
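The path rules above (relative to the label, at or below the label's directory, '/' separators, no file name) can be sketched in Python as follows. The function name is hypothetical and both inputs are assumed to be Unix-style paths given relative to the same starting directory. Since the format of the value is not formally defined, returning the path without a trailing '/' is also just an assumption.

```python
from pathlib import PurePosixPath


def directory_path_value(label_path, file_path):
    """Directory of file_path relative to the label's directory, Unix style.

    Returns "" when the file sits next to the label. Raises ValueError when
    the file is not in the label's directory or one of its subdirectories.
    """
    label_dir = PurePosixPath(label_path).parent
    file_dir = PurePosixPath(file_path).parent
    relative = file_dir.relative_to(label_dir)  # ValueError if outside
    return "" if relative == PurePosixPath(".") else relative.as_posix()


print(directory_path_value("docs/guide.xml", "docs/figures/fig1.png"))
```

The file name itself is dropped by taking .parent, and case is left exactly as given.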



This required attribute should have a value from the standard value list you can find on the Standard Values Quick Reference page.