Resolving Metadata

Goal:

Given any persistent identifier for a data publication, be able to retrieve all of the available metadata for it without customization for the identifier scheme or source

General Approach: 

Combine one or more of the following approaches: define a common convention for getting metadata via existing resolution mechanisms, provide a (pluggable) resolution service, and/or provide a (pluggable) resolution library

For any of these, we need to define the syntax of what is returned (the following assume json-ld as a basic format, but feel free to propose others)

Common Convention Options:

1) HTTP Accept Header:

Summary: Given an id, resolve it using the current mechanism, then use an Accept: application/json-ld header to retrieve metadata

Example: 10.5967/M0NP22DZ is a DOI; http://doi.org/10.5967/M0NP22DZ can be accessed and will redirect to a URL such as http://example.org/publications/123456, which would usually show a landing page for that publication. Making an HTTP GET call to that same URL with an Accept: "application/json-ld" header could provide the metadata instead.
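
As a rough illustration of this option (a sketch only; it assumes the repository at the redirect target honours the convention), a client call might look like:

    import requests

    # The DOI resolver redirects to the landing page URL; requests follows the
    # redirect automatically, and the Accept header asks that page for JSON-LD.
    resp = requests.get("http://doi.org/10.5967/M0NP22DZ",
                        headers={"Accept": "application/json-ld"})
    metadata = resp.json()  # only works if the repository implements the convention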

Pros:

Consistent with the idea that web calls retrieve different representations of things, with landing pages and metadata being two types of representation for the underlying data publication

Easy to support with most web service libraries, does not add a new URL

~Consistent with the approach used with DOIs (see http://citation.crosscite.org/docs.html)

Cons:

Not directly accessible in a browser (where json-formatted metadata could otherwise be pretty-printed)

Possible that the Accept: application/json-ld header is already in use for other purposes in some systems relevant to NDS?

Possible to confuse with the DOI mechanism (requesting a non-HTML type directly on a DOI returns metadata from CrossRef, while using the Accept header on the redirected URL gives the metadata proposed here)


2) <id>/metadata

Summary: Add /metadata to the landing page URL and return json-ld metadata there

Example: 10.5967/M0NP22DZ is a DOI; http://doi.org/10.5967/M0NP22DZ can be accessed and will redirect to a URL such as http://example.org/publications/123456. http://example.org/publications/123456/metadata would then return the metadata.
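
A minimal sketch of this convention (assuming the landing page URL can be discovered by following the DOI redirect):

    import requests

    # Follow the DOI redirect to discover the landing page URL...
    landing_url = requests.head("http://doi.org/10.5967/M0NP22DZ",
                                allow_redirects=True).url
    # ...then append /metadata, per the convention, to retrieve the json-ld document.
    metadata = requests.get(landing_url.rstrip("/") + "/metadata").json()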

Pros:

Simple, visible

~RESTful

Cons: 

A purely human/NDS convention - unlike using accept headers, there's no web protocol reason to infer/assume that a <url> and <url>/metadata are related.

3) Cool URI patterns:

Summary: The general concept of Cool URIs can be found in "Cool URIs for the Semantic Web". In addition to advocating for HTTP URIs to be used as persistent references, the CoolURI document describes practices that enable one to navigate between a URI representing a thing (such as a person) and a description of it (e.g. a webpage about the person), and to reach descriptions intended for human and for machine reading. Of the 3 practices outlined in section 4, it looks like option 4.3 is most relevant to our use case: using a 303 redirect with content negotiation. In this practice, a URI for the dataset could be called with an HTTP Accept header for "text/html" or "application/json" (for example) and the response would be an HTTP 303 redirect to a URI for the data's landing page or for a json/json-ld metadata document, respectively. This is similar to option 1, adding the idea of a 303 step. It could potentially be seen as a hybrid of options 1 and 2 - the 303 redirects could go to URIs related to the original, as in option 2. The CoolURI document mentions conventions such as having .../landing.html and .../landing.jsonld URIs for the human and machine readable forms, or having .../data/<id> and .../metadata/<id> URIs (making analogies to the people examples in sections 4.2 and 4.3).
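
A minimal server-side sketch of practice 4.3, assuming hypothetical /datasets/<id> URIs and the .../landing.html and .../landing.jsonld convention mentioned above (Flask is used here purely for illustration):

    from flask import Flask, redirect, request

    app = Flask(__name__)

    @app.route("/datasets/<dataset_id>")
    def dataset_uri(dataset_id):
        # 303 See Other: redirect to the representation matching the Accept header.
        accept = request.headers.get("Accept", "text/html")
        if "application/json-ld" in accept or "application/json" in accept:
            return redirect(f"/datasets/{dataset_id}/landing.jsonld", code=303)
        return redirect(f"/datasets/{dataset_id}/landing.html", code=303)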

4) http://nds.org/resolve/<id>:

Summary: Given an id, URL-encode it and send it to a service

Example: doi:10.5967/M0NP22DZ is a DOI; http://doi.org/10.5967/M0NP22DZ is a DOI-specific resolver URL that will redirect to a landing page URL. NDS, or another entity, could run a service that would accept http://nds.org/resolve/doi%3A10.5967%2FM0NP22DZ and/or http://nds.org/resolve/http%3A%2F%2Fdoi.org%2F10.5967%2FM0NP22DZ and either only return metadata (one option), or resolve to the landing page or, with an Accept: application/json-ld header, return the metadata (a second option, combining the convention above with a service), or similar options. The service would use software specific to the identifier type (e.g. DOI, Handle, etc.) and/or the repository/landing page provider (e.g. Globus, Dataverse, SEAD).
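
A sketch of what such a resolver service could look like, with a pluggable registry of identifier-type-specific resolvers (the routes and resolver functions here are hypothetical, and only the DOI case is shown):

    from urllib.parse import unquote

    import requests
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)

    def resolve_doi(value):
        # doi.org redirects to the publisher/repository landing page.
        return requests.head(f"https://doi.org/{value}", allow_redirects=True).url

    # Pluggable registry: one entry per supported identifier scheme.
    RESOLVERS = {"doi": resolve_doi}

    @app.route("/resolve/<path:encoded_id>")
    def resolve(encoded_id):
        pid = unquote(encoded_id)                 # e.g. "doi:10.5967/M0NP22DZ"
        scheme, _, value = pid.partition(":")
        resolver = RESOLVERS.get(scheme.lower())
        if resolver is None:
            abort(404)                            # unknown identifier scheme
        landing_url = resolver(value)
        if "application/json-ld" in request.headers.get("Accept", ""):
            # Second option described above: forward the metadata representation.
            md = requests.get(landing_url, headers={"Accept": "application/json-ld"})
            return md.text, md.status_code, {"Content-Type": "application/json-ld"}
        # Otherwise behave like the underlying resolver and redirect to the landing page.
        return redirect(landing_url, code=303)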

Pros:

Works with any scheme/source as long as someone writes the code for that id type/source 

Can be run as a service by any org (authoritatively by NDS or SHARE, or as a local service by any project)

Cons:

URLs are ugly

Should the service redirect if/when possible, or always forward the requested content itself?

Library Options:

Propose something concrete! Should address language? extensibility mechanism? consistency with service approach?


Proposed Solution:


1) Metadata Retrieval

To be compliant with this specification, repositories must implement the ability to respond to HTTP Accept headers at the landing page they provide for data publication identifiers they mint. At a minimum, a "text/html" header should return human readable content about the data publication, and an "application/json-ld" header should return a JSON-LD formatted document that includes all known metadata about the data publication. Repositories may respond with a 303 redirect rather than serving this content directly from the landing page URL. Repositories may also respond to other Accept headers to provide metadata in other formats (e.g. RDF/Turtle, XML, etc.). Repositories are encouraged to adopt good practices in creating landing page URLs, such as those in the CoolURI proposal (see above).
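
A minimal sketch of a compliant repository endpoint (Flask and the in-memory metadata store are illustrative only; serving a 303 redirect to a separate metadata URL, as in the CoolURI sketch above, would be equally acceptable):

    import json

    from flask import Flask, Response, request

    app = Flask(__name__)

    # Hypothetical store of JSON-LD metadata documents, keyed by local publication id.
    METADATA = {"123456": {"@type": "ore:ResourceMap"}}

    @app.route("/publications/<pub_id>")
    def landing_page(pub_id):
        if pub_id not in METADATA:
            return "Not found", 404
        if "application/json-ld" in request.headers.get("Accept", ""):
            # Serve the metadata representation directly from the landing page URL.
            return Response(json.dumps(METADATA[pub_id]),
                            mimetype="application/json-ld")
        # Default: human readable landing page.
        return "<html><body>Landing page for publication %s</body></html>" % pub_id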

To access metadata given a persistent identifier, clients should resolve the identifier to a landing page URL via the identifier type-specific resolver mechanism (e.g. using doi.org for DOIs), and then request that URL with an "application/json-ld" Accept header. The client should be prepared to follow a 303 redirect response from the server.
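
Put together, a client following this procedure could be as simple as the sketch below (assuming a DOI; requests follows the doi.org redirect and any 303 from the repository automatically):

    import requests

    def get_publication_metadata(doi):
        """Resolve a DOI and retrieve the JSON-LD metadata for the publication."""
        resp = requests.get(f"https://doi.org/{doi}",
                            headers={"Accept": "application/json-ld"})
        resp.raise_for_status()
        return resp.json()

    metadata = get_publication_metadata("10.5967/M0NP22DZ")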

Justification: Group discussion to date has indicated that there are no strong preferences w.r.t. the exact mechanism selected, e.g. between the choices of Accept headers or conventions on URL naming, particularly given the low anticipated cost of adding such a mechanism over existing interfaces. Looking at the CoolURI best practices, it appears that using Accept headers would be more consistent with other semantic web applications and, with the ability to support a 303 redirect, would also allow applications to adopt standard or software-specific URLs for the different representations.

Agreement: 

Issues/Concerns:

2) Metadata format

With an expectation that some groups implementing this specification will work to harmonize their data models and metadata choices, the core specification only defines a means to distinguish three types of entities w.r.t. metadata: the metadata document, the publication, and the component(s) of the publication. Using json-ld formatting, metadata providers may use any vocabulary(ies) to describe these three types of entities in the json-ld document they return. We propose the use of the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) vocabulary to identify and relate these three entities and the adoption of the json-ld serialization of OAI-ORE. This means:

  • The returned json-ld document contains one top-level JSON object that represents the document itself and is of ORE type "ResourceMap"
  • This JSON object has an ORE "describes" relationship to an embedded JSON Object of ORE type "Aggregation"
  • The Aggregation JSON Object has an ORE "aggregates" relationship to a JSON Array of one or more JSON Objects of ORE type "AggregatedResource"

Each of these entities - the ResourceMap, the Aggregation, and the one or more AggregatedResources - may have arbitrary metadata that is serialized as json-ld within the basic structure outlined above.
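
As an illustration of this structure, a returned document might look like the following (the identifiers, context, and dcterms fields are hypothetical examples; providers may use whatever descriptive vocabularies they choose):

    import json

    resource_map = {
        "@context": {"ore": "http://www.openarchives.org/ore/terms/",
                     "dcterms": "http://purl.org/dc/terms/"},
        "@id": "http://example.org/publications/123456/metadata",
        "@type": "ore:ResourceMap",
        "dcterms:creator": "Example Repository",          # metadata about the document
        "ore:describes": {
            "@id": "http://example.org/publications/123456",
            "@type": "ore:Aggregation",
            "dcterms:title": "Example Data Publication",  # metadata about the publication
            "ore:aggregates": [
                {"@id": "http://example.org/publications/123456/file1.csv",
                 "@type": "ore:AggregatedResource",
                 "dcterms:title": "file1.csv"},           # metadata about a component
                {"@id": "http://example.org/publications/123456/file2.csv",
                 "@type": "ore:AggregatedResource"}
            ]
        }
    }

    print(json.dumps(resource_map, indent=2))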

Justification:

While current practices across the NDS consortium and across the range of systems considered in RDA groups can have significant differences in terms of the structure of their data publications, the nature of the resources comprising a publication, and the types of descriptive information about them, there appears to be a consensus that publications from different sources consist of one or more items, and that these are distinct from any document(s) that represent them. It thus makes sense to include in the core specification a means of identifying these entities and of navigating through a returned metadata document to discover them, independent of whether further agreement on the data model or metadata vocabularies used is possible. As an example of the value of doing so, consider the inclusion of metadata in the returned document that identifies a creator, title, creation date, or similar concept. Standardizing how the document, the publication, and the included resources are distinguished in the metadata document allows one to understand whether a given instance of creator/creation date, etc. metadata refers to the document, the publication, or a specific resource within it, all of which could be different. As with the choice of using HTTP Accept headers, it does not appear that any new mechanism is needed: the OAI-ORE specification addresses these issues, has a defined json-ld serialization, and, while it includes more than just the types and relationships above, does not require use of any additional concepts. Further, OAI-ORE can co-exist with other type systems, e.g. allowing an AggregatedResource to also have the type myschema:file, prov:entity, or someDBorg:query, which would indicate model or metadata constraints to anyone knowing those types.

A valid question in deciding on a format for returned metadata (beyond the use of json-ld as a syntax) is whether further consensus on the data model and/or a minimal vocabulary can be reached. This is potentially hard to answer without a concrete counter-proposal, but the discussion involved in putting this proposal together noted that there is no consensus across systems on questions such as whether data models are flat, hierarchical, hierarchical with the potential for multiple parents, or graph-oriented (e.g. a provenance graph, or, in a more complex example, a graph including provenance relationships along with associations with instruments and samples, or with spatial relationships, etc.), or whether published resources are file-like (represented by a single stream of bytes), represent a query or workflow over a base resource, or represent live objects (e.g. streaming sensor data that will be different when retrieved at different times). Similarly, current practices involve the use of multiple vocabularies.

While it seems possible that we could agree on a minimal model and minimal metadata (e.g. requiring that all models must define a default hierarchy and all resources must have basic title, creator, creation date, and type metadata in some vocabulary (e.g. Dublin Core)), it seems like such agreements might take protracted discussion and result in more work for metadata providers without a clear benefit to users. (The alternative here is not to offer no guidance at all, but to make such extensions optional, so that, for example, hierarchical collections can be presented in a standard way while graph-oriented publications are still visible and do not have to be mapped into an artificial hierarchy unless doing so makes sense and is motivated by user interest.)

Agreement:

Issues/Concerns:

One potential issue with OAI-ORE is that, while it includes the notion of hierarchy (Aggregations may aggregate other Aggregations), each Aggregation must be described in its own ResourceMap. This means a publication consisting of nested folders and files could not be fully described in a single returned ResourceMap metadata document if the ORE aggregates relationship is used to define the hierarchy. One can accept that limitation (as in current DataONE practice) or use an alternate term to define the internal structure of publications (e.g. SEAD's current practice of using dcterms:hasPart relationships to define structure over a flat array of ORE 'aggregated' folders and files), so this is not unworkable. Nevertheless, it does embed one (optional) model for structuring data in the core specification, which could cause confusion.
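
As a sketch of the flat-aggregation alternative just described (identifiers are hypothetical, and the exact term used to express structure is the provider's choice), all folders and files appear directly in ore:aggregates while dcterms:hasPart carries the hierarchy, so a single ResourceMap suffices:

    # Flat aggregation: the folder and its file are both aggregated; hasPart
    # links them so nesting can be reconstructed from one metadata document.
    aggregation = {
        "@id": "http://example.org/publications/123456",
        "@type": "ore:Aggregation",
        "ore:aggregates": [
            {"@id": "http://example.org/publications/123456/folderA",
             "@type": "ore:AggregatedResource",
             "dcterms:hasPart": ["http://example.org/publications/123456/folderA/file1.csv"]},
            {"@id": "http://example.org/publications/123456/folderA/file1.csv",
             "@type": "ore:AggregatedResource"}
        ]
    }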