Universally Accessible Data Publications Pilot

Public Pilot Project: (formatted to match the current Pilot Project Request Form (http://goo.gl/forms/uObA1cDJIUE02gqz2))

Status: Work ongoing, latest progress presented at NDS 7th meeting, next steps identified.


Project Title:  Universally Accessible Data Publications Pilot

Contact person for the NDS Pilot submission (First and last name): James Myers

Email address for the contact person: myersjd@umich.edu

Open group email list: https://groups.google.com/forum/#!forum/universally-accessible-data-publications-pilot 

If you have co-leads on this project, provide their names and email addresses.

Sharief Youssef  sharief.youssef@nist.gov

Ray Plante raymond.plante@nist.gov

Additional Participant Contacts:

Rebecca Koskela (DataONE) rkoskela@unm.edu

Kyle Chard (Globus) chard@uchicago.edu

Mike Conway - DFC Project michael.c.conway@gmail.com

Gretchen Greene gretchen.greene@nist.gov

Why is this project pertinent?

Making a broad range of published data accessible across the ecosystem of tools and services is integral to the NDS Vision. This pilot project, open to all members of the NDS community, intends to reach consensus on one or more standard mechanisms by which any application or service can access the full set of data and metadata comprising a data publication, regardless of source, and to work individually and collectively across the NDS Consortium to implement those interfaces within existing tools. Further, the pilot proposes the development of a suite of test data and tools to assess conformance to the agreed standards.

These mechanisms would be an important initial step in enabling data to flow seamlessly across tools and services in a National Data Service ecosystem. Our group anticipates that the proposed work will enable a broad range of further efforts, from assessments of the variation in data types, formats, sizes, purposes, and metadata vocabularies across disciplines and tools, to efforts to define minimal metadata standards that would enable interoperability at the deeper levels required for functionality such as federated faceted search or data fusion. Thus we anticipate both direct and indirect benefits for NDS.

What community does this project serve?

The project's direct community is NDS itself. At the NDS 5 and 6 meetings, a significant number of individuals (10+), representing a broad range of NSF DataNet, DIBBs, and other projects, expressed interest in participating in an interoperability effort within NDS that this pilot would address. These projects span a wide range of disciplines and include projects that provide general, community-agnostic services.

What is the size of the community?

The pilot project is open. The active communities using tools or services provided by the initial project membership are expected to represent hundreds to thousands of users. Projects/organizations that have confirmed their intent to participate include DataONE, Globus, NIST/Materials Registry, and SEAD.

Can you describe the use case or use cases?

The family of use cases targeted by this group includes those in which a researcher, through some means, has found a potentially relevant data publication, identified by a persistent identifier of some type, and wishes to inspect its contents and, if the data is indeed relevant, retrieve all or some of the sub-components of the publication for further processing in other interoperable tools or services. The tools may support inspection, visualization, analysis, modeling, analytics, publication, or other actions relevant to scientific research. The focus of the project would be to define mechanisms to 'resolve' an identifier, regardless of type, to discover a common API or export format from which it would then be possible to identify and retrieve the components of the data publication and the metadata associated with the publication and with specific components. The degree to which the retrieved components and metadata can be interpreted, while important for deep interoperability, is not a focus of the pilot beyond potential tasks to document current/best practices. (A quick example: a researcher seeks a GeoTIFF file for display in a mapping application from within an identified data publication. The result of this pilot would enable the researcher to list the components within the publication and look for metadata, such as Dublin Core title, description, and format information, which, if supplied, would let them identify the correct file to display.)
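
As a rough illustration of this flow in code, the Python sketch below assumes a hypothetical resolver endpoint, response structure, and metadata keys; none of these are defined yet, and settling such details is precisely the work of the pilot.

    # Sketch only: the resolver URL, the JSON structure, and the metadata keys below are
    # hypothetical placeholders for whatever API/format the pilot ultimately agrees on.
    import requests

    RESOLVER = "https://resolver.example.org/resolve"   # hypothetical resolver service

    def get_publication(pid):
        """Resolve a persistent identifier to a machine-readable publication description."""
        resp = requests.get(RESOLVER, params={"id": pid},
                            headers={"Accept": "application/json"})
        resp.raise_for_status()
        return resp.json()   # assumed to enumerate the publication's components and metadata

    def find_geotiffs(publication):
        """Select components whose Dublin Core-style format metadata marks them as (Geo)TIFF."""
        return [c for c in publication.get("components", [])
                if c.get("dc:format") in ("image/tiff", "image/geotiff")]

    pub = get_publication("doi:10.xxxx/example")        # placeholder identifier
    for comp in find_geotiffs(pub):
        print(comp.get("dc:title"), comp.get("accessURL"))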

One important use case within this family would be to allow the transfer of a data publication from one system to another, e.g. to allow a publication to outlive the original software used to publish it. This group would not deliver such a capability directly, but would expect that demonstrations of such a capability could be made using the proposed access mechanisms.

What is the data need being addressed?

The use cases addressed would support a variety of needs, including:

    • the types of data transfer between tools described in the NDS Vision and vision video,
    • the need within NDS Share to be able to support continuing data publication and use as individual software components and compute/storage resources join or leave (i.e. the proposed work would simplify transferring data between resources if/when needed),
    • the expressed desire of NDS participants to make their tools more interoperable without having to make bespoke agreements between individual projects, and
    • the interest of researchers in being able to combine multiple services provided by NDS participants into their overall research workflows.

Which existing tools/platforms/content sources do you aim to use? What part of these can you supply, and what do you need from the NDS?

The pilot participants will supply software, services, and test data developed through their ongoing projects. Given the range of DataNet, DIBBs, repository, and other projects participating in NDS, we anticipate considerable variation in the size/scope/architecture of individual tools. We also anticipate that participants in the pilot project will be developing extensions to their existing products and/or separate tools for generating, inspecting, and parsing information on data publications provided through the API(s) or export format(s) developed during the pilot. 

We anticipate using NDS Labs resources as a place to run some of the participants' applications/services, to share tools developed through the group's work, to store suites of test data and the results of interoperability tests, etc. At present, we anticipate that participants would primarily want to run one or a few related applications during their testing and to be able to read and write a shared test data suite (which may consist of exported/serialized data publications or a service providing an API over test datasets). We anticipate that this use would primarily consist of dockerized components that could be spun up as needed for development and testing rather than large and/or long-lived service instances that would represent an ongoing use of resources. As a starting point, we anticipate that persistent storage on the order of <1 TB would be sufficient for the pilot. (Large, real-world data publications of interest to the pilot are expected to be hosted in external repositories and/or in NDS Share, i.e. based on separate arrangements made between the creators of the publication and NDS.) We would request that NDS Labs provide login credentials and a minimal capability to launch and run applications to pilot participants upon request. We anticipate that some groups may be interested in specific dockerized containers, but the pilot as a whole does not request support for any specific components and expects to rely on the base capability to create and launch containers and data volumes. Best-effort user support for using NDS Labs would be valuable.

The pilot group will also be producing group documentation and could potentially leverage the NDS Wiki if that could be made accessible to group members.

Describe what you propose to do in terms of interfaces, architectures, and other technological/design aspects for both existing and to-be-created components.

The specific API and/or export formats produced by the group, and the details of test data suites and stand-alone software tools and libraries, will be developed through the project's activities. Initial discussions have focused on the RESTful, often JSON- or JSON-LD-based, interfaces that are common to many modern data applications/services, and on the combination of the BagIt and OAI-ORE standards for serializing and structuring data and metadata, which are likewise used by multiple NDS participants. We thus anticipate that these are likely to be among the proposed directions for the pilot's efforts. (Note: the technologies mentioned are useful but not sufficient for creating the level of interoperability proposed.) The pilot does intend to focus tightly on the idea of an exchange API and/or serialization format rather than on the architecture within given software components or the specifics of any service that may support the API and/or format.
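
For concreteness, the sketch below shows one way a BagIt serialization carrying an OAI-ORE map (as JSON-LD) might be produced in Python; the directory layout, file names, and vocabulary terms are illustrative assumptions, not decisions of the group.

    # Sketch of a BagIt-style serialization carrying an OAI-ORE map; the layout,
    # file names, and JSON-LD vocabulary below are illustrative assumptions only.
    import hashlib, json, os

    def write_bag(bag_dir, files, ore_map):
        """files: {relative path: bytes}; ore_map: dict to be serialized as JSON-LD."""
        data_dir = os.path.join(bag_dir, "data")
        os.makedirs(data_dir, exist_ok=True)
        payload = dict(files)
        payload["oremap.jsonld"] = json.dumps(ore_map, indent=2).encode("utf-8")
        manifest = []
        for rel_path, content in payload.items():
            path = os.path.join(data_dir, rel_path)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(content)
            manifest.append((hashlib.sha256(content).hexdigest(), "data/" + rel_path))
        with open(os.path.join(bag_dir, "bagit.txt"), "w") as f:
            f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
        with open(os.path.join(bag_dir, "manifest-sha256.txt"), "w") as f:
            f.writelines("%s  %s\n" % entry for entry in manifest)

    # ORE aggregation describing the publication and its parts (illustrative terms)
    ore_map = {
        "@context": {"ore": "http://www.openarchives.org/ore/terms/",
                     "dc": "http://purl.org/dc/terms/"},
        "@id": "oremap.jsonld",
        "ore:describes": {
            "@id": "urn:example:publication",
            "dc:title": "Example data publication",
            "ore:aggregates": [{"@id": "data/site1.tif", "dc:format": "image/tiff"}]
        }
    }
    write_bag("example-bag", {"site1.tif": b"...GeoTIFF bytes..."}, ore_map)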

Are there other projects that are trying to provide a similar service solution? If so, what are these projects and how is this one different?

We are aware of a working group within RDA (https://rd-alliance.org/group/research-data-repository-interoperability-wg/case-statement/research-data-repository), as well as a wide range of past and current discussions on data interoperability within, for example, EarthCube. There are also specific tools, from identifier minting and discovery services, to RDA products (e.g. the type registry), to vocabulary mapping services (e.g. the EarthCube Geosemantics service). We anticipate that the primary differences in the proposed pilot are its focus on the mechanics of getting/putting data and metadata to read/create data publications, rather than debating minimal metadata standards, and its emphasis on working code that will support an initial level of interoperability across the software and services of NDS participants.

We also anticipate that bespoke interoperability efforts between different DataNet/DIBBs/other projects, being pursued as part of those projects' existing plans, would provide relevant guidance for this effort and could potentially be leveraged in support of the pilot's efforts. There has also been work at a previous NDS Hackathon to demonstrate a 'universal resolver' that could provide data publication metadata in a common JSON format across different identifier schemes which, at a minimum, helps demonstrate the concept.

There were discussion groups at the 6th NDS Workshop involving functionality that would standardize the availability of specific metadata as part of an overall goal (e.g. to enable discovery of a compute-local data copy for efficient data analysis). These groups could coordinate with this effort, defining specific metadata that could/should be sent for their purpose(s), or may end up defining related mechanisms. We would hope that direct links with such groups, as well as indirect links through the TAC Interoperability Task Force, would help harmonize these efforts. Some such efforts may be undertaken as part of this pilot, e.g. for simple things such as defining a way to show human-readable labels, where we might survey current practice and recommend that participants supply a Dublin Core title or RDF label as a short, human-readable name, thus allowing tools to automatically display a useful name.
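
As a trivial example of the kind of convention meant here, a tool might derive a display name by checking a small, agreed set of properties in order; the property names and fallback order below are obvious candidates, not an agreed recommendation.

    # Illustrative only: the property names and fallback order are assumptions,
    # not an agreed convention of the pilot.
    def display_label(metadata, fallback="(untitled)"):
        """Return a short human-readable name for a publication or component,
        given a JSON-LD-style metadata dictionary."""
        for key in ("dc:title", "dcterms:title", "rdfs:label"):
            value = metadata.get(key)
            if value:
                if isinstance(value, list):      # JSON-LD values may be lists...
                    value = value[0]
                if isinstance(value, dict):      # ...or {"@value": ...} objects
                    value = value.get("@value", fallback)
                return str(value)
        return fallback

    print(display_label({"dc:title": "Soil moisture survey, site 7"}))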

If appropriate, include a review of your pertinent previous work, including content repositories, pieces of software or other tools you believe to be important to the execution of your project.

We expect that pilot participants will help develop this background material as the project proceeds. We anticipate that most participants will already have some form of API and/or export format from their prior work and that these in turn will reference underlying standards and software components. Through discussions at NDS 5 and 6, a number of candidate technologies and relevant standardization and implementation efforts have been identified. These include, for example, APIs (OAI-PMH, RESTful/Linked Open Data, SHARE Notify), serialization formats (JSON, JSON-LD, RDF, XML, BagIt), data models (Portland Data Model, OAI-ORE, W3C Provenance), and services (RDA services, BrownDog, Geosemantics). There is also a broad range of potentially relevant RDA groups, including those developing minimal metadata standards, registries, and terminology.

Provide a description of the proposed project activities including a project timeline.

Plan:

The group agreed at NDS 6 to move forward with a formal pilot project proposal. The pilot will be open and will solicit participation from the NDS community at large. This set of wiki pages and a Google Group email list have been established. We expect to send an open invitation to NDS participants based on the initial plans presented here and developed in the wiki.

The initial proposed scope is to address the following 4 tasks:

In the first month, the group will collect documentation of existing APIs and serialization formats in use today. The group will also document the range of options that could be pursued to address the four tasks above. As part of this effort, the group will look at available standards and interoperability mechanisms that have been proposed through RDA (e.g. https://rd-alliance.org/group/research-data-repository-interoperability-wg/case-statement/research-data-repository), EarthCube, or other forums.

The group will plan a series of online discussions to evaluate the options and work to achieve consensus on recommended data interoperability mechanisms. As necessary, group leadership will focus the effort around a minimal set of options, based on consensus across the projects that are willing to implement them and an assessment of whether initial implementations can be developed within 6 months.

The project will leverage common software engineering tools (e.g. GitHub) and the cloud resources available through NDS Labs as applicable (e.g. to store test data and share prototype code and testing tools). The group does not anticipate needing staff time from NDS beyond basic support in using the NDS Labs cloud capabilities. (NDS Labs staff would be welcome to participate in the group, contributing time and effort on the same volunteer basis as other participants.)

Products:
    • Documented mechanism(s) for data/metadata access that will include an internet-accessible programming interface(s) and, potentially, a serialized export format and/or container/library-level mechanisms
    • Two or more interoperable tools/services that demonstrate the ability to read data from multiple compliant sources
    • A suite of test data publications that can be used by cyberinfrastructure developers to test their implementations
    • A tool(s)/service(s) that assesses the compliance of results returned through the defined mechanisms (a sketch of such a check appears after this list)
    • A written report documenting the decision process and final result, and including an initial assessment of potential next steps
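
As a sketch of how the compliance-checking tool listed above might work, the Python code below fetches a publication description from a candidate endpoint and reports missing fields; the endpoint parameters and required field names are placeholders pending the group's decisions.

    # Sketch of a conformance check against whatever access API the group defines;
    # the endpoint parameters and required fields below are placeholders, not the standard.
    import requests

    REQUIRED_PUBLICATION_FIELDS = {"identifier", "components"}   # assumed minimum
    REQUIRED_COMPONENT_FIELDS = {"identifier", "accessURL"}      # assumed minimum

    def check_endpoint(resolve_url, pid):
        """Fetch a publication description and report any missing required fields."""
        resp = requests.get(resolve_url, params={"id": pid},
                            headers={"Accept": "application/json"})
        resp.raise_for_status()
        doc = resp.json()
        problems = ["publication missing '%s'" % field
                    for field in REQUIRED_PUBLICATION_FIELDS if field not in doc]
        for i, comp in enumerate(doc.get("components", [])):
            problems += ["component %d missing '%s'" % (i, field)
                         for field in REQUIRED_COMPONENT_FIELDS if field not in comp]
        return problems

    # Example (hypothetical endpoint):
    # print(check_endpoint("https://repo.example.org/api/resolve", "doi:10.xxxx/example"))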