SBN DOI Procedures

From The SBN Wiki

Latest revision as of 19:09, 8 June 2020

This page describes the nominal procedures SBN will follow in reserving and publishing DOIs under the most common circumstances. Uncommon circumstances are, ironically, fairly common in small bodies data, so if you have one please do not hesitate to contact us.

DOI Milestones

The primary goal for tagging archive data sets is to give credit to those who produced a high-quality research data set (PDS archived data sets are refereed publications), as well as to enhance the discoverability of the data in the larger world outside of the PDS archives. With that in mind, DOIs should be part of design considerations for data production, and there are DOI milestones that should be included in planning for review and publication of the data.

Large data production efforts - those which produce one or more complete bundles, typically each with multiple collections, each of those with many thousands of individual products - will generally be governed by a signature document that includes the archive design. This document is often referred to as the System Interface Specification (SIS), Interface Control Document (ICD), or Data Management Plan (DMP). Small data production efforts should still have some less voluminous but equally important design and scheduling document. Key activities related to DOI generation should be included in both the design and the development schedule in the relevant document.

The major DOI-related events that should be included in project schedules are described below.

Determine which collections or bundles will receive DOIs.

This should be part of the archive design. The archive units (collections and bundles) that will be assigned DOIs should be noted in the design or control document. Consequently, authorship is an important criterion to consider for determining collection and bundle boundaries.

In general, DOIs are assigned to either a collection or a bundle, but not both. Remember that PDS DOIs are counted as refereed publications, so anything that looks like it might be artificially inflating an author's refereed publication count should be avoided. If you have a reason for wanting DOIs at both levels, let us know - it may well be appropriate and beneficial, and if so that should be included in the archiving plan.

On rare occasions you might also have a single document or product (like a press-release image) for which you would like a DOI, or an additional high-level product collection over and above what might have been in the original controlling document. SBN can accommodate these requests, with a rapid turn-around time, if needed (a publishing deadline, for example). Collect the metadata and give us a call.

Reserve the DOIs.

DOIs can be reserved with partial metadata. Reserved DOIs are not findable in public databases, but they allow data preparers to include the DOI in the metadata for the archive unit. The metadata XML file used by SBN to submit DOI requests is also a very useful tool for collecting and validating the complete metadata required before the DOI can be published. Note, too, that unpublished DOIs can be expunged if needed: drafting and deleting unpublished DOIs is a relatively painless process.

Metadata requirements for reserving a DOI are covered below.

Complete the metadata.

The intention is to update the PDS4 Information Model and label structures to include as many of the metadata fields as is reasonable, but it is likely that this will never be a completely automatic process. In the meantime, SBN is using the DataCite metadata schema to define XML files containing the needed metadata, with an additional Schematron file to help enforce requirements and consistency in terminology. The sort of data that will likely need to be added manually includes such things as: contributors (beyond those who are included in the author/editor list); affiliations for creators and contributors; ORCIDs, where available; subject keywords relevant to the archive unit; data volumes; and possibly funding information, depending on how NASA evolves on that question. In addition, it will be important to ensure that titles and abstracts can logically stand alone in a non-PDS context (in an ADS listing, for example).

This should be done as part of review preparation, so that the DOIs can be published as soon as the data are accepted for archiving. SBN will not publish a DOI that does not have sufficient metadata. "Sufficient", in the case of refereed archive data, means "rich enough to meet FAIR data principles." (We realize, however, that "sufficient" may require context-dependent interpretation in some cases.)

Metadata Requirements

The ultimate goal is to collect "rich" metadata, which includes citations, author identifiers, subject keywords, and various types of description. But the collection can and generally will take place over time as the data are created and prepared for archiving.

The following descriptions refer to elements in the DataCite metadata schema. The local wiki documentation page for this is "DataCite Schema".

Reserving a DOI

The following metadata elements are required to reserve a draft DOI:

  • <title>. The title should be descriptive enough to stand alone in, say, an ADS results listing. It can change - and should, if the content associated with the DOI changes.
  • <description>. This description is not the abstract and is not for public consumption. It will be replaced with the public abstract at review time. The point of this description is to provide enough information for SBN to bookkeep the DOI - to be able to map it unambiguously to the target bundle/collection/product even if the description of that target changes as the mission progresses; provide a contact point; note estimated delivery dates; and so on.

If these are known or readily available, these elements should also be supplied:

  • <alternateIdentifier> for the PDS4 LIDVID. There's a standard format for this you will see in the XML file provided by SBN, or you can get the details on the "DataCite Schema" page.

Some metadata will be filled in by SBN automatically:

  • <publisher>. This is a constant.
  • <resourceType>. This is dependent on the type of the target of the DOI - bundle, collection, or product.
  • <rights>. This is also a constant.

Additional metadata can be supplied and will be stored in the associated XML file for later use.
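
As a rough illustration of how little is needed at this stage, here is a minimal sketch that assembles a draft record with Python's standard library. It is not the SBN template: the function name, the sample title, note, and LIDVID are hypothetical, the alternateIdentifierType and descriptionType values are placeholders (the real values are in the XML file SBN provides and on the "DataCite Schema" page), and the DataCite namespace declaration is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_draft_record(title, bookkeeping_note, lidvid=None):
    """Assemble the minimal metadata for reserving a draft DOI.

    Element names follow the DataCite schema; attribute values are
    placeholders, not SBN's standard ones.
    """
    resource = ET.Element("resource")

    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title

    # Internal bookkeeping note, replaced by the public abstract at review time.
    descriptions = ET.SubElement(resource, "descriptions")
    note = ET.SubElement(descriptions, "description", descriptionType="Other")
    note.text = bookkeeping_note

    # Optional: the PDS4 LIDVID, if it is already known.
    if lidvid:
        alternates = ET.SubElement(resource, "alternateIdentifiers")
        alternate = ET.SubElement(
            alternates,
            "alternateIdentifier",
            alternateIdentifierType="PDS4 LIDVID",  # placeholder value
        )
        alternate.text = lidvid

    return resource


# Hypothetical example values, for illustration only.
record = build_draft_record(
    title="Example Comet Imaging Calibrated Data Collection",
    bookkeeping_note="Calibrated image collection; contact: data preparer; delivery estimated next quarter.",
    lidvid="urn:nasa:pds:example_bundle:data_calibrated::1.0",
)
print(ET.tostring(record, encoding="unicode"))
```

The <publisher>, <resourceType>, and <rights> elements are deliberately absent, since SBN fills those in.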

Preparing for Review

Most of the DOI metadata should be in place at the time the bundle/collection/product goes to review. If DOIs were not reserved prior to this point, they will be reserved now. Existing DOIs with minimal metadata must have the missing fields supplied before the data go to the reviewers. SBN will supply metadata XML files with any prior supplied or automatically generated metadata filled in.

These metadata elements are required at review time:

  • <creators>. These are the authors (or in some cases, editors). Authors can be individuals, or if preferred the author can be a team. In general, those two options should not be mixed. For each <creator>, you must supply:
      • <creatorName> in the form "surname, first_name middle_name". Use names where available, initials where not.
      • <givenName>. The "first_name middle_name" part of creatorName.
      • <familyName>. The "surname" part of creatorName.
  • <title>. As for draft DOIs (above), this must be a title appropriate for a non-PDS search context.
  • <subject>. Subject keywords, extracted from the Unified Astronomy Thesaurus (UAT - http://astrothesaurus.org/)
  • <alternateIdentifier>. The value in this case is the PDS4 LIDVID.
  • <relatedIdentifier>. This is the mechanism for noting, in the DOI metadata, citations/references from the data set to other data sets and the literature. In deciding what to cite here, think of the data set as a refereed publication and follow the same rules as you would for determining what to cite for an article. Use the "Cites" value for the relationType attribute. Broadly, cite things that were essential to the creation of the data set, but not things that might be essential to using the data set. Be aware that the Reference_List in the PDS label may contain "suggested reading" as well as works that truly should be cited.
    For example, a calibrated data set should cite its source (raw) data set, the paper describing the calibration performed, and the calibration data set (if any). The uncalibrated data set, however, should not cite the calibration paper (in its DOI metadata) because the process described in that paper was not used in creating the uncalibrated data set.
  • <description>. More specifically, this needs to be an abstract appropriate for a non-PDS search context (like the ADS). You should include the expansion of all acronyms in the text, and do not include short-form citations (e.g., "(Raugh, et al. 2020)") unless you also include the DOI referencing that publication, in the form: https://doi.org/doi.
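
A quick self-check before sending a file to SBN might look like the sketch below. It only tests for the presence of the review-time elements listed above. It is not a substitute for the SBN Schematron file or for SBN's quality control, and the function name and sample record are hypothetical. The paths assume no XML namespace, whereas a real DataCite file declares one, so they would need a namespace prefix in practice.

```python
import xml.etree.ElementTree as ET

# Elements required at review time, as DataCite container/element paths.
REVIEW_REQUIRED_PATHS = [
    "creators/creator",
    "titles/title",
    "subjects/subject",
    "alternateIdentifiers/alternateIdentifier",
    "relatedIdentifiers/relatedIdentifier",
    "descriptions/description",
]

# Children that must be supplied for each <creator>.
CREATOR_CHILDREN = ["creatorName", "givenName", "familyName"]


def missing_for_review(resource):
    """Return a list of required review-time items absent from the record."""
    missing = [path for path in REVIEW_REQUIRED_PATHS if resource.find(path) is None]
    for index, creator in enumerate(resource.findall("creators/creator"), start=1):
        for child in CREATOR_CHILDREN:
            node = creator.find(child)
            # Treat an empty element the same as a missing one.
            if node is None or not (node.text or "").strip():
                missing.append(f"creator {index}: {child}")
    return missing


# Hypothetical, deliberately incomplete record.
sample = ET.fromstring(
    "<resource>"
    "<titles><title>Example Title</title></titles>"
    "<creators><creator><creatorName>Doe, Jane</creatorName></creator></creators>"
    "</resource>"
)
for item in missing_for_review(sample):
    print("missing:", item)
```

An empty result only means that the elements exist, not that their content is good enough to publish.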

These metadata elements are strongly encouraged where they are known and applicable at review time:

  • For your creators (authors), also include:
      • <nameIdentifier>. More specifically, the ORCID, where one is available.
  • <contributor>. This is for acknowledging people or groups who made essential contributions to the creation of the target data but who are not in the creator list.
  • <relatedIdentifier>. In this case we seek to document known citations of this DOI target by another DOI target. Use the "IsCitedBy" value for the relationType attribute. Typically this will be used in the uncalibrated data set to tie it to the parallel calibrated data set (which should, in turn, cite the uncalibrated data). On occasion, SBN may make specific recommendations for inclusion of other relation types for less common cases found in incoming data.
  • <size>. Frequently there will be two or more different measures of "size", since the intention here is to give a human an idea of the magnitude of data involved. SBN recommends that two <size> entries be supplied: one in terms of total data volume in MB, GB, etc., and one in terms of the number of products comprising the target. Others may be added at the data preparer's discretion.
  • <format>. As with size, there will typically be two or more descriptions of format. Wherever possible, one of them should indicate the MIME type of the product files. (See the page "DataCite Schema" for examples.) A second description should be more human-friendly, for example "Fixed-width ASCII tables and raster images". For specific PDS formats, such as PDS3 images with attached labels or spectral cubes, it is appropriate to include an additional <format> with a description like "PDS3 .img files". Additional descriptions can be included at the data preparer's discretion.
  • <fundingReference>. This helps grant programs track the result of their funding. NASA Planetary is still trying to find this bandwagon.
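
The two recommended <size> entries can be generated from numbers most pipelines already have. The sketch below is one possible approach, not an SBN tool: the function names are hypothetical, and both the choice of decimal units (1 GB = 1000 MB) and the exact wording of the entries are assumptions.

```python
import xml.etree.ElementTree as ET

UNITS = ["B", "kB", "MB", "GB", "TB"]


def size_entries(total_bytes, product_count):
    """Return two human-readable size strings: data volume and product count."""
    value = float(total_bytes)
    unit = UNITS[0]
    for unit in UNITS:
        # Stop at the first unit that keeps the value under 1000,
        # or at the largest unit available.
        if value < 1000 or unit == UNITS[-1]:
            break
        value /= 1000  # decimal units are an assumption
    return [f"{value:.1f} {unit}", f"{product_count} products"]


def sizes_element(total_bytes, product_count):
    """Wrap the entries in a DataCite-style <sizes> container."""
    sizes = ET.Element("sizes")
    for entry in size_entries(total_bytes, product_count):
        ET.SubElement(sizes, "size").text = entry
    return sizes


# Hypothetical numbers for illustration.
print(ET.tostring(sizes_element(2_500_000_000, 14_302), encoding="unicode"))
```

Additional <size> entries, such as a count of observing nights, can be appended in the same way at the data preparer's discretion.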

Data Publication/Archive Acceptance

The draft DOI will not be published until the data are published. This will happen as a result of a successful review, in which the data are either certified for publication with liens (to be resolved), or accepted for archiving as is (no further editing required). The DOI cannot be published until there is a working landing page (which SBN supplies), but publication should happen as soon after that as possible - one or two working days, if many data sets are being published at once; or within minutes for smaller batches.

Updates to the DOI metadata submitted for review will be accepted. The following fields, considered optional for review, are required for archiving (see descriptions above):

  • <size>
  • <format>

Certification vs. Acceptance

When data are certified for publication (rather than accepted as is), SBN will add a note to the bottom of the abstract description in the metadata indicating that the data are undergoing final editing, with an anticipated completion date. This note will be removed when the final delivery is received and any additional metadata updates have been supplied.

So - How Do We Do This?

The process for collecting DOI metadata for new archive submissions is, for the time being at least, largely manual. Large data preparers may well be able to automate parts of their own metadata collection if they plan to do that as part of pipeline development. The manual process for small data preparers should be manageable as long as it is not squeezed into the night before a review deadline.

Here's the process in a nutshell:

  1. Ask for a draft DOI. This might happen while you're working with SBN on designing your archive and writing the control documents, or it might be an email message to SBN with the minimal data needed to draft a DOI (above).
  2. SBN will reserve a DOI with the metadata supplied.
  3. Either you or SBN (your preference) will create an XML file using the SBN template for DOI metadata and whatever metadata was supplied initially.
  4. You update that XML file to contain the additional metadata required/desired.
  5. Send a copy of the updated file to SBN.
  6. SBN will do quality control on your metadata, request changes if needed, and use it to update the metadata in the DataCite database.
  7. When the data are published, SBN will publish the DOI.

Steps 4-6 can be repeated as often as needed.