DataCite Schema

From The SBN Wiki
Revision as of 19:18, 20 March 2019 by Raugh

Analysis and summary of the DataCite DOI database schema version 4.1 and the accompanying documentation, prepared with PDS data set applications in mind.

Note: Green notes indicate recommended values based on (so far limited) node discussions at SBN. Additions and suggestions are very welcome - please contact Anne Raugh.

<resource>

ROOT

This is the root of the submission file document, and thus required.

Note that the content of the <resource> element is defined under an xs:all group, so that the immediate child nodes can appear in any order. The order here reflects the order in which they are listed in the schema.
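
As a rough illustration of the overall shape, a minimal submission might look like the sketch below. The namespace and schema location are the ones published for the 4.1 kernel. The DOI is the sample value used on this page, and the creator name, title, and year are placeholders rather than real records. Because of the xs:all group, the child elements could equally appear in any other order.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<resource xmlns="http://datacite.org/schema/kernel-4"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <!-- All values below are placeholders for illustration only. -->
  <!-- identifierType is the attribute name defined in the 4.1 schema. -->
  <identifier identifierType="DOI">10.12345/abcde</identifier>
  <creators>
    <creator>
      <creatorName nameType="Personal">Doe, Jane</creatorName>
    </creator>
  </creators>
  <titles>
    <title>Example Data Collection Title</title>
  </titles>
  <publisher>NASA Planetary Data System</publisher>
  <publicationYear>2019</publicationYear>
  <resourceType resourceTypeGeneral="Dataset">PDS4 Archive Collection</resourceType>
</resource>
```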

<identifier>

REQUIRED

This is the assigned DOI identifier. Like every DOI, it must begin with the directory indicator "10."

Required attribute: identifierType. This must have a value of DOI. For example:
     <identifier identifierType="DOI">10.12345/abcde</identifier>

<creators>

REQUIRED

This is the author list or equivalent. Creators should appear in priority order. Both individuals and institutions/organizations can be credited as creators.

If the only list we have is a list of editors, then it goes here, but also indicate in the <contributors> section that each person in the list is a contributor with the role of Editor. This is not an optimal solution, but it will do for now.

<creator>

REQUIRED, repeatable

This class contains the identifying information for one creator. It is repeated for each creator. Note that the metadata schema does not distinguish between "authors" and "editors". If some of the creators listed are editors rather than authors, repeat their creator information in the <contributors> section and indicate a role of "Editor".

Fill out as much of this class as possible with information that was apropos at the time of the publication. So, if you know the given name, include it; if you know the institutional affiliation, include it; if you can find and confirm the correct ORCID, include it; etc. This all helps in linking the data set into the literature and the author's network.

<creatorName>

REQUIRED

The string representing the name as it should appear in citations. For personal names, the format is "Family, Given" (all one string). Full first names are preferred for the metadata, in order to help identify authors unambiguously. (It is also relatively easy to turn a complete name into an initial.) For organizational names, use the full, formal name of the organization.

This tag has one optional XML attribute, nameType, which takes one of these values:

Organizational
Personal

Best practice is to always include this attribute for creatorName.

<givenName>

OPTIONAL

For personal names, this should contain the string corresponding to the given name in the <creatorName> element. This should not be present for organizational names (but that is not enforced by the DataCite schema).

<familyName>

OPTIONAL

For personal names, this should contain the string corresponding to the family name (surname, patronymic, etc.). This should not be present for organizational names (but that is not enforced by the DataCite schema).

Note: For PDS purposes, at least, we should consider where to associate suffixes (Jr., Sr., III, etc.). If DataCite, ADS, or the AAS journals have a convention, we should follow that.

<nameIdentifier>

OPTIONAL, repeatable

This element provides a formal identifier for an individual or organization - for example, an author's ORCID, or an organizational DOI. Only public identifiers should appear here, of course. If there is more than one applicable identifier, this element may be repeated.

Required attribute: nameIdentifierScheme. The value should be the common name or acronym for the identifier given (like "ORCID").
Optional attribute: schemeURI. The value should be the URI of the defining organization or schema (e.g., "http://orcid.org").

Where it can be obtained directly from the creator, or where we can be certain about a creator's ORCID, we should add it to the metadata.

<affiliation>

OPTIONAL, repeatable

This element provides an organizational affiliation (as free-format text) for the creator. It may be repeated. Organizational "creators" may also have affiliations, if that makes sense.
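
Put together, a creators list with one person and one organization might look like the following sketch. The names, the affiliation, and the ORCID value are hypothetical placeholders, not real records.

```xml
<creators>
  <!-- Hypothetical personal creator; the ORCID shown is a dummy value -->
  <creator>
    <creatorName nameType="Personal">Doe, Jane</creatorName>
    <givenName>Jane</givenName>
    <familyName>Doe</familyName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org">0000-0000-0000-0000</nameIdentifier>
    <affiliation>Example University</affiliation>
  </creator>
  <!-- Hypothetical organizational creator: no givenName or familyName -->
  <creator>
    <creatorName nameType="Organizational">Example Mission Science Team</creatorName>
  </creator>
</creators>
```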

<titles>

REQUIRED

This class lists names or titles for the resource being identified. At least one title must be provided. This is the title that will be used to format citations, and it is also the title that users will see returned by various search interfaces. It needs to make sense in a general, non-PDS search context, so avoid acronyms and assuming that users will know that, for example, "Deep Impact" is also the name of a spacecraft and NASA mission.

Remember the primary title has to make sense in the context of an ADS-type search. Users may not know that data sets are included in their result set. The title we put here will be used to formulate citations of the data set as well as helping users to identify data of potential interest. So avoid acronyms (no matter how obvious they seem now) unless they include the full name. For example, "DIF" is not good on its own, but "Deep Impact Flyby (DIF) Spacecraft" is very good - explicit and contains the acronym that knowledgeable users might search on as well.

<title>

REQUIRED, repeatable

This element contains a single title. It may be repeated for alternate titles where appropriate.

Optional attribute: titleType. This must have one of the following values:
  • AlternativeTitle
  • Subtitle
  • TranslatedTitle
  • Other

For PDS purposes, the formal title should always be listed first without a titleType attribute, and additional titles should probably always be either alternatives or translations and identified accordingly through the titleType attribute.

Optional attribute: xml:lang. This should contain one of the standard ISO 2- or 3-letter codes (but this is not validated). Note that this indicates only the language of the associated title string, not the language of the resource.

PDS documents are required to be in English, so we should have no use for this. Our IPDA partners, however, might. In that sort of context, it would be prudent to include the xml:lang attribute for all <title> elements, not just the one identified as the TranslatedTitle.
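
A sketch of a titles block following these recommendations is shown here: the formal title comes first with no titleType, followed by an alternative. Both title strings are invented for illustration. Only the "Deep Impact Flyby (DIF) Spacecraft" phrasing is taken from the discussion above.

```xml
<titles>
  <!-- Formal title: listed first, no titleType attribute -->
  <title xml:lang="en">Deep Impact Flyby (DIF) Spacecraft Example Image Collection</title>
  <!-- Invented alternative title, identified through titleType -->
  <title titleType="AlternativeTitle" xml:lang="en">Example Alternative Title</title>
</titles>
```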

<publisher>

REQUIRED

This element identifies the publisher/distributor/curator of the resource. It is used in creating citations.

For PDS4 data sets that have LIDs that begin with urn:nasa:pds, this should always be NASA Planetary Data System. If we are assigning DOIs for other publishers (like ESA), we should first determine what their publisher title should be, and we should probably be using a DOI prefix that can be unique to that publishing archive.

If we are assigning a DOI and also serving the data for a non-PDS publisher, then we should list our node and facility in the <contributors> section, with a role of DataCurator. (The first time this comes up we should think about this again, just in case.)

<publicationYear>

REQUIRED

The year the resource was made available to the public. This is used in creating citations.

For PDS data sets, this must be the four-digit year in which the data were publicly posted in the format and version associated with this DOI. This may or may not be the same year as listed in the CITATION_DESC field for legacy data sets. When in doubt, assume the DATA_SET_RELEASE_DATE is correct unless you have documentation that proves it is not - and then use the date in that documentation and note the discrepancy and resolution in an additional <description> field (following) with a descriptionType of Other.

This date must agree with the Available date (see below) provided for ADS processing.

There are other date fields in which significant dates can be indicated if needed or desired.
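
For example, the following fragment (to be placed inside <resource>) sketches a publisher and publication year together with a matching Available date. The year and date are placeholders chosen only to show the two values agreeing.

```xml
<publisher>NASA Planetary Data System</publisher>
<publicationYear>2019</publicationYear>
<dates>
  <!-- Placeholder date; its year agrees with publicationYear -->
  <date dateType="Available">2019-03-20</date>
</dates>
```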

<resourceType>

REQUIRED

This element takes a free-format text description of the type of resource associated with the DOI.

Required attribute: resourceTypeGeneral. This must have one of the following values:
  • Audiovisual
  • Collection
  • DataPaper
  • Dataset
  • Event
  • Image
  • InteractiveResource
  • Model
  • PhysicalObject
  • Service
  • Software
  • Sound
  • Text
  • Workflow
  • Other

Best practice generally is to consider the resourceTypeGeneral as the broader term which is then modified by the value string, so that a classification can be formed by concatenating the two with '/'. So, for example:

    <resourceType resourceTypeGeneral="Dataset">PDS4 Data Collection</resourceType>

would read as "Dataset/PDS4 Data Collection".

Best practice for "Text", specifically, is for the value to be taken from the CASRAI dictionary "Output Types" Sub-Element list at http://dictionary.casrai.org/Output_Types.

Here are the values to use for SBN cases:

DOI target                                              resourceTypeGeneral   resourceType
PDS3 data set                                           Dataset               PDS3 Archive Data Set
PDS4 data product                                       Dataset               PDS4 Archive Data Product
PDS4 collection product                                 Dataset               PDS4 Archive Collection
PDS4 bundle product, collections do not have DOIs       Dataset               PDS4 Archive Bundle
PDS4 bundle product, collections have their own DOIs    Collection            PDS4 Archive Bundle


Note: Looks like DataCite would like to be consistent with Dublin Core usage for these terms, and that would make sense for us as well - but decisions here will have consequences elsewhere in the database. Consistency across PDS would be highly desirable here.

<subjects>

OPTIONAL

This element lists keyword-type classifications as are commonly associated with journal articles.


<subject>

OPTIONAL

This element provides a string that corresponds to a keyword or similar classifier for the resource. It may be repeated as desired. Each occurrence should contain only a single taxonomic-type entry, and the taxonomy should be indicated via the optional attributes as far as possible.

Optional attribute: subjectScheme. This should be the name of the taxonomy or authority. There is no controlled value list.
Optional attribute: schemeURI. This should be a reference to the taxonomy definition or reference site.
Optional attribute: valueURI. If there is a URL, for example, for the definition of the specific term being used, include it here.
Optional attribute: xml:lang. Use this attribute to provide the standard ISO abbreviation for the language of the term.

PDS really should find a reference taxonomy to use specifically for this field and the corresponding label fields. There are one or two viable candidates in community use, and it is not at all clear that anyone would benefit from us creating a new one.

Note that for hierarchical taxonomies, a single instance of <subject> should express the entire hierarchy as a single string in the appropriate notation - there is no implied relationship between <subject> elements.
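
A sketch of how this might look is given below. The taxonomy name, its URIs, and the "/" hierarchy notation are all hypothetical, since no reference taxonomy has been chosen. The second subject shows a plain keyword with no scheme.

```xml
<subjects>
  <!-- Hypothetical taxonomy; the whole hierarchy is one string -->
  <subject subjectScheme="Example Taxonomy"
           schemeURI="http://example.org/taxonomy"
           valueURI="http://example.org/taxonomy/comets"
           xml:lang="en">Solar system/Small bodies/Comets</subject>
  <!-- Free keyword without a declared scheme -->
  <subject xml:lang="en">Comet nuclei</subject>
</subjects>
```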

<contributors>

OPTIONAL

This element provides a means for identifying people and organizations, other than the previously identified <creator>, who contributed to the creation, management, curation, distribution, etc., of the resource being described.

<contributor>

OPTIONAL

This element identifies a person or organization who made or makes some contribution to the resource. There is a required attribute to define the type of contribution. The element may be repeated as needed.

Required attribute: contributorType. This must have one of the following values:
  • ContactPerson
  • DataCollector
  • DataCurator
  • DataManager
  • Distributor
  • Editor
  • HostingInstitution
  • Producer
  • ProjectLeader
  • ProjectManager
  • ProjectMember
  • RegistrationAgency
  • RegistrationAuthority
  • RelatedPerson
  • Researcher
  • ResearchGroup
  • RightsHolder
  • Sponsor
  • Supervisor
  • WorkPackageLeader
  • Other

These are all defined in the appendix to the DataCite schema description document.

PDS needs to define its own usage and interpretation of these terms for uniform application across nodes. This should be considered fairly urgent, so we can incorporate this from the beginning in our DOI database info.

<contributorName>

REQUIRED

The name of a single person or organization contributing. As in the case of <creator>, this should be in the format "Family, Given" for personal names, and the formal name for organizations.

Optional attribute: nameType. It must have one of these two values:
  • Personal
  • Organizational

Best practice is to use the optional attribute.

<givenName>

OPTIONAL

The given name of a personal name, analogous to the same element under <creator>.

<familyName>

OPTIONAL

The surname or patronymic of a personal name, analogous to the same element under <creator>.

<nameIdentifier>

OPTIONAL

A formal identifier for a person or organization, such as a personal ORCID or an organizational DOI. It may be repeated if there is more than one applicable identifier.

Required attribute: nameIdentifierScheme. This is the type of the identifier ("ORCID" or "DOI", e.g.).
Optional attribute: schemeURI. This is a URI reference to the identifier definition or defining organization.

<affiliation>

OPTIONAL

This element contains the name of an organization or institution with which the named contributor is affiliated. It is a free-format text field. It should be repeated for each unique affiliation when there is more than one.

In the archiving case, this must be interpreted as "affiliation at the time of publication", which is fine for everything except ContactPerson. We should consider what makes sense for this role if it is used, or if we should simply forbid its use altogether for archival submissions.
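
As an illustration, a contributors block covering the two cases mentioned on this page (an editor repeated from the creators list, and a curating node listed for a non-PDS publisher) might be sketched as follows. All names and the affiliation are hypothetical.

```xml
<contributors>
  <!-- Hypothetical editor, repeated here from the creators list -->
  <contributor contributorType="Editor">
    <contributorName nameType="Personal">Doe, Jane</contributorName>
    <givenName>Jane</givenName>
    <familyName>Doe</familyName>
    <affiliation>Example University</affiliation>
  </contributor>
  <!-- Hypothetical curating node serving data for another publisher -->
  <contributor contributorType="DataCurator">
    <contributorName nameType="Organizational">Example Archive Node</contributorName>
  </contributor>
</contributors>
```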

<dates>

OPTIONAL

This element provides a way to include various significant dates in the DOI database record.

<date>

OPTIONAL

One significant date for the resource. Dates should be in ISO 8601 format and can be to any precision (but this is not schematically enforced). This element may be repeated as needed for each date.

Required attribute: dateType. This indicates the significance of the date and must be one of the following values:
  • Accepted
  • Available
  • Collected
  • Copyrighted
  • Created
  • Issued
  • Other
  • Submitted
  • Updated
  • Valid

These are defined in the DataCite schema description document.

Optional attribute: dateInformation. This should be a very brief clarification of the dateType, where necessary.

PDS needs to consider carefully how to use the dateType values and possible dateInformation, especially in the context of the larger DOI and ADS databases. It's particularly important to get these right.
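
Pending that discussion, the sketch below only illustrates the mechanics: dates given to different precisions, and a dateInformation note on one of them. The dates and the note text are placeholders, not recommended PDS usage.

```xml
<dates>
  <!-- Placeholder dates; precision ranges from a month to a full timestamp -->
  <date dateType="Created">2018-11</date>
  <date dateType="Collected">2005-07-04</date>
  <date dateType="Updated" dateInformation="Labels corrected">2019-06-01T12:00:00Z</date>
</dates>
```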

<language>

OPTIONAL

The natural language of the resource. This is defined as being of type xs:language, which provides syntax validation but does not actually fully enforce that values come from the "IETF BCP 47, ISO 639-1 language code," as specified in the description.

<alternateIdentifiers>

OPTIONAL

This element lists alternate identifiers for the same instance of the resource (as opposed to physically distinct, duplicate copies with their own identifiers). The identifiers should be unique and controlled within some context which should be specified.

From the way this is described, it sounds like the PDS4 LIDVID would be an "alternate identifier". But typically we would want the LIDVID in a citation, so that needs some consideration.

<alternateIdentifier>

OPTIONAL

This element provides one instance of an alternate identifier for the resource. It may be repeated as desired.

Required attribute: alternateIdentifierType. This string must describe the source or context of the identifier.

PDS should probably define standard values for alternateIdentifierType for PDS identifiers and for archive consistency with external identifiers in those cases where listing an external identifier here would be appropriate.
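
If the LIDVID were recorded this way, the entry might look something like the sketch below. Both the alternateIdentifierType string and the LIDVID itself are hypothetical, since no standard values have been defined yet.

```xml
<alternateIdentifiers>
  <!-- Both the type string and the LIDVID are hypothetical -->
  <alternateIdentifier alternateIdentifierType="PDS4 LIDVID">urn:nasa:pds:example_bundle::1.0</alternateIdentifier>
</alternateIdentifiers>
```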

<relatedIdentifiers>

OPTIONAL

This element lists identifiers for other resources related to this resource in some specific way.

<relatedIdentifier>

OPTIONAL

This element is a single related identifier. It can be repeated as needed for additional identifiers. Note that it has half a dozen attributes, only two of which are required, to help in defining the relationship. Standard values are defined in the DataCite schema documentation.

Required attribute: relatedIdentifierType. The value must come from the following list:
  • ARK
  • arXiv
  • bibcode
  • DOI
  • EAN13
  • EISSN
  • Handle
  • IGSN
  • ISBN
  • ISSN
  • ISTC
  • LISSN
  • LSID
  • PMID
  • PURL
  • UPC
  • URL
  • URN
Required attribute: relationType. The value must come from the following list:
  • IsCitedBy
  • Cites
  • IsSupplementTo
  • IsSupplementedBy
  • IsContinuedBy
  • Continues
  • IsNewVersionOf
  • IsPreviousVersionOf
  • IsPartOf
  • HasPart
  • IsReferencedBy
  • References
  • IsDocumentedBy
  • Documents
  • IsCompiledBy
  • Compiles
  • IsVariantFormOf
  • IsOriginalFormOf
  • IsIdenticalTo
  • HasMetadata
  • IsMetadataFor
  • Reviews
  • IsReviewedBy
  • IsDerivedFrom
  • IsSourceOf
  • Describes
  • IsDescribedBy
  • HasVersion
  • IsVersionOf
  • Requires
  • IsRequiredBy

Once again, PDS needs to give some institutional thought to how to use these both in initial submissions, and how to keep them updated when related products are tagged with DOIs.

Optional attribute: resourceTypeGeneral. This attribute is identical to the one of the same name in <resourceType>, above.
These optional attributes should only be used when the value of the relationType attribute is either IsMetadataFor or HasMetadata (this is not validated):
Optional attribute: relatedMetadataScheme. This indicates the ID or name of a metadata definition standard.
Optional attribute: schemeURI. This should be the URI of the named metadata standard.
Optional attribute: schemeType. The DataCite definition is not clear, but this looks like a specific file format type for the referenced metadata standard (such as "XSD").
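
The following sketch shows two related identifiers: a link to a superseded version, and a link to a metadata file using the three metadata-specific attributes. The DOI, the URLs, and the scheme name are hypothetical values chosen for illustration.

```xml
<relatedIdentifiers>
  <!-- Hypothetical DOI of the version this resource supersedes -->
  <relatedIdentifier relatedIdentifierType="DOI"
                     relationType="IsNewVersionOf"
                     resourceTypeGeneral="Dataset">10.12345/abcdd</relatedIdentifier>
  <!-- Hypothetical metadata file; the scheme attributes are meant
       only for HasMetadata and IsMetadataFor relations -->
  <relatedIdentifier relatedIdentifierType="URL"
                     relationType="HasMetadata"
                     relatedMetadataScheme="PDS4"
                     schemeURI="http://example.org/pds4-schema"
                     schemeType="XSD">http://example.org/data/example_label.xml</relatedIdentifier>
</relatedIdentifiers>
```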

<sizes>

OPTIONAL

This element provides unstructured size information. In other words, it is not required to be numeric and there are no syntax constraints on the content.

<size>

OPTIONAL

A single size specification string, like "18GB" or "Three volumes". This element may be repeated as needed or desired.

PDS should consider making systematic use of this field. Might be helpful to users coming at the data from the publication side.

<formats>

OPTIONAL

This class indicates the physical/digital format(s) of the resource.

<format>

OPTIONAL

This element contains a text description of the format. It is not constrained.

Best practice is to use a file extension or MIME type string as the value.

PDS should be more formal about the content here. We also need to think about possible mixed-format products, like single documents that comprise multiple files, some of which are text and some images/graphics, and the best way to describe format for collections (if at all).
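
For reference, a fragment (to be placed inside <resource>) combining sizes and formats might look like the sketch below. The size strings are the examples given above. The MIME types are an illustrative choice, not an agreed PDS convention.

```xml
<sizes>
  <size>18GB</size>
  <size>Three volumes</size>
</sizes>
<formats>
  <!-- MIME type strings; the selection here is illustrative -->
  <format>application/fits</format>
  <format>text/csv</format>
</formats>
```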

<version>

OPTIONAL

A version number associated with the resource.

Best practice is to obtain a new DOI for a major version change.

Because of the traceability and reproducibility concerns involved in research data, PDS should almost certainly forbid the use of this element. That is, it should be a PDS requirement that new versions of PDS products with DOIs should get their own DOIs. The <relatedIdentifiers> element can be used to link the two versions in the DOI database.

<rightsList>

OPTIONAL

This element typically contains only a single <rights> member to indicate the rights management for the resource, although it may contain multiple <rights> elements in complex cases.

<rights>

OPTIONAL

A single rights license with management information (e.g., "Creative Commons", or "GNU General Public License"). This should be as explicit as possible, with a complete management statement where appropriate. Embargo information should also be recorded here.

Optional attribute: rightsURI. This should be the URI to the full text of the license.
Optional attribute: xml:lang. This indicates the language of the license.

PDS data is public domain (which may or may not be worth stating explicitly), but we may need to consider the case of embargoed data if we are planning to reserve DOIs in advance of publication.
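
If a rights statement were included, it might be sketched as follows. The wording and the URI are placeholders and do not represent an agreed PDS statement.

```xml
<rightsList>
  <!-- Wording and URI are placeholders, not an agreed PDS statement -->
  <rights rightsURI="http://example.org/data-use-policy" xml:lang="en">Public domain</rights>
</rightsList>
```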

<descriptions>

OPTIONAL

This element provides a place for additional information that does not fit into other categories.

<description>

OPTIONAL

A single, free-format description of the specified (via attribute) type. This field may be repeated as needed.

It looks like formatting is not preserved for this text, but you may use the <br/> tag to insert a paragraph break.

Best practice is to provide at least one description of some type. It is probably not a good idea to provide multiple <description> elements with the same descriptionType value, but this is not validated.

Required attribute: descriptionType. This indicates the category of information being provided. It must have one of these values:
  • Abstract
  • Methods
  • SeriesInformation
  • TableOfContents
  • TechnicalInfo
  • Other

For PDS, Abstract is a no-brainer. TechnicalInfo might apply if, for example, we want to note that a data collection consists of FITS files, or 3D images, or has both images and an observation summary table.

Optional attribute: xml:lang. This indicates the language of the description being provided, not of the resource.
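
A sketch with one description of each of the two types discussed above, using <br/> for a paragraph break, could look like this. The description text is placeholder wording.

```xml
<descriptions>
  <!-- Placeholder abstract; br marks the paragraph break -->
  <description descriptionType="Abstract" xml:lang="en">
    First paragraph of a placeholder abstract.<br/>
    Second paragraph of the placeholder abstract.
  </description>
  <description descriptionType="TechnicalInfo" xml:lang="en">Images in FITS format with an observation summary table.</description>
</descriptions>
```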

<geoLocations>

OPTIONAL

This element is used to define relationships (either on the creation or application side) between the resource and a defined patch on the surface of the Earth.

<geoLocation>

OPTIONAL

This element defines one specific patch of Earth where the data were taken or on which the resource is focused. It may be repeated as desired.

<geoLocationPlace>

OPTIONAL

A name for the location being defined.

Note that the text provided in the schema describes the points comprising the following elements as having values which are "a single latitude-longitude pair, separated by whitespace". This is not true. In all cases there are tags specifically defining latitude and longitude as separate and distinct elements.

It is also possible to define any single <geoLocation> as being simultaneously a single point, a single box, and any number of polygons. This seems irrational, yet it appears to be deliberate.

<geoLocationPoint>

OPTIONAL

This element specifies a single point on the globe.

<pointLongitude>

REQUIRED

Longitude in degrees in the range +/- 180.

<pointLatitude>

REQUIRED

Latitude in degrees in the range +/- 90.

<geoLocationBox>

OPTIONAL

A box is defined by its four sides - east and west longitude, and north and south latitude.

<westBoundLongitude>

REQUIRED

Westward bounding longitude in degrees in the range +/- 180.


<eastBoundLongitude>

REQUIRED

Eastward bounding longitude in degrees in the range +/- 180.

<southBoundLatitude>

REQUIRED

Southward bounding latitude in degrees in the range +/- 90.

<northBoundLatitude>

REQUIRED

Northward bounding latitude in degrees in the range +/- 90.

<geoLocationPolygon>

OPTIONAL

This element defines an arbitrary polygon as a sequence of points around the perimeter in which the last point must have the same definition as the first point (though this is not validated, nor is the nature of the path). There must be at least 4 points provided.

Oddly, the schema allows this element to be repeated.

<polygonPoint>

REQUIRED

Longitude and latitude of one point on the perimeter of the polygon.

<pointLongitude>

REQUIRED

Longitude in degrees in the range +/- 180.

<pointLatitude>

REQUIRED

Latitude in degrees in the range +/- 90.

<inPolygonPoint>

OPTIONAL

If you are intending to define an area that is larger than half the total surface of the Earth, then you must use this element to define a point somewhere inside the area of interest. Otherwise the smaller enclosed area is assumed to be the area of interest. The actual point can be arbitrary, as long as it is inside the intended area.

<pointLongitude>

REQUIRED

Longitude in degrees in the range +/- 180.

<pointLatitude>

REQUIRED

Latitude in degrees in the range +/- 90.
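
To make the nesting concrete, the sketch below defines a single <geoLocation> with a name, a point, a box, and a small triangular polygon. The place name and all coordinates are made up for illustration. The <inPolygonPoint> is not needed for an area this small and is included only to show where it goes.

```xml
<geoLocations>
  <geoLocation>
    <!-- Place name and coordinates are made up for illustration -->
    <geoLocationPlace>Example Observatory</geoLocationPlace>
    <geoLocationPoint>
      <pointLongitude>-110.5</pointLongitude>
      <pointLatitude>32.0</pointLatitude>
    </geoLocationPoint>
    <geoLocationBox>
      <westBoundLongitude>-111.0</westBoundLongitude>
      <eastBoundLongitude>-110.0</eastBoundLongitude>
      <southBoundLatitude>31.5</southBoundLatitude>
      <northBoundLatitude>32.5</northBoundLatitude>
    </geoLocationBox>
    <geoLocationPolygon>
      <!-- Four points: three corners, then the first repeated to close -->
      <polygonPoint>
        <pointLongitude>-111.0</pointLongitude>
        <pointLatitude>31.5</pointLatitude>
      </polygonPoint>
      <polygonPoint>
        <pointLongitude>-110.0</pointLongitude>
        <pointLatitude>31.5</pointLatitude>
      </polygonPoint>
      <polygonPoint>
        <pointLongitude>-110.5</pointLongitude>
        <pointLatitude>32.5</pointLatitude>
      </polygonPoint>
      <polygonPoint>
        <pointLongitude>-111.0</pointLongitude>
        <pointLatitude>31.5</pointLatitude>
      </polygonPoint>
      <!-- Optional: a point inside the intended area -->
      <inPolygonPoint>
        <pointLongitude>-110.5</pointLongitude>
        <pointLatitude>32.0</pointLatitude>
      </inPolygonPoint>
    </geoLocationPolygon>
  </geoLocation>
</geoLocations>
```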


<fundingReferences>

OPTIONAL

This element identifies sources of funding related to creating or maintaining the resource.

This is not information that PDS has traditionally collected, but as it is used to trace results back to funding in the literature databases, we probably should. In the case of ROSES data preparers, I'd make that "definitely should".

<fundingReference>

OPTIONAL

This element identifies a single source of funding. It may be repeated as needed.

<funderName>

REQUIRED

Name of the funding source. This should be the formal name.

<funderIdentifier>

OPTIONAL

This is a string that uniquely identifies a funding source under some public scheme, like "Crossref Funder" or ISNI.

Required attribute: funderIdentifierType. This string is the source of the corresponding identifier, e.g., "ISNI".

<awardNumber>

OPTIONAL

Grant number or similar code assigned by the funding organization.

Optional attribute: awardURI. This attribute can be used to provide a link to a page at the funding organization website that describes the award/grant program.

<awardTitle>

OPTIONAL

The title on the grant/award - that is, the title of the proposal that was submitted and funded.
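
A complete funding reference might be sketched as follows. The funder identifier, award number, award URI, and award title are hypothetical placeholders. Only the element and attribute names come from the schema description above.

```xml
<fundingReferences>
  <fundingReference>
    <!-- Identifier, award number, URI, and title are hypothetical -->
    <funderName>National Aeronautics and Space Administration</funderName>
    <funderIdentifier funderIdentifierType="ISNI">0000 0000 0000 0000</funderIdentifier>
    <awardNumber awardURI="http://example.org/awards/EX-0001">EX-0001</awardNumber>
    <awardTitle>Example Proposal Title</awardTitle>
  </fundingReference>
</fundingReferences>
```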