SEAD Reference Repository API

Resource: NDS 7 Poster describing SEAD's current practices in terms of the UADPP tasks/goals.

SEAD (http://sead-data.net) is a project, originally supported through the NSF DataNet program, that provides services to manage and curate active data and to publish organized collections of data (referred to as Datasets) to longer-term repositories. SEAD's primary APIs support the ongoing management of data and metadata in project spaces and the publication process. However, SEAD has also developed a simple and scalable reference repository that makes it possible to create and serve Datasets ranging from a few files to 100K files and hundreds of GB of data, using only a small Java web application and a basic Java web server/file system. This reference repository shares the open metadata model of SEAD (users can add metadata in any vocabulary) and uses the OAI-ORE json-ld metadata format used in SEAD's publishing services as the long-term serialization format. (When SEAD sends data to other repositories, they can parse the metadata sent and store and serve it via their own mechanisms.) The following sections describe the API developed to serve the metadata of publications in the reference repository (which is used internally to dynamically generate the Dataset's landing page), and the serialization format itself.

Example publication: http://doi.org/10.5967/M0CF9N3H

Web API

The SEAD reference repository includes 5 relevant endpoints:

URL for the landing page
Metadata of the dataset
Metadata for any subcomponent of a dataset
URL to retrieve a file in the dataset
URL to retrieve the complete metadata map (an OAI-ORE map) for the dataset

As can be discovered from its landing page, the example publication above is a ~30GB, 12K-file collection published by Bufe, Aaron.

In SEAD, a publication has an internal identifier. A DOI is minted during publication and resolves to a landing page of the form: https://<repository machine>/refrepository/landingpage.html#<internal pub id>, e.g. https://sen.ncsa.illinois.edu/refrepository/landing.html#tag%3Asead-data.net%2C2015%3ARO_16UNEIBDgnoe2dCBmNFTaQ
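The relationship between an internal identifier and its landing page URL can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the page name defaults to the one in the example URL above, and percent-encoding every reserved character in the identifier is an assumption inferred from that example.

```python
from urllib.parse import quote, unquote


def landing_page_url(host: str, pub_id: str, page: str = "landing.html") -> str:
    """Build a landing page URL; the publication id goes in the URL fragment."""
    # safe="" percent-encodes ':' and ',' as seen in the example URL.
    return f"https://{host}/refrepository/{page}#{quote(pub_id, safe='')}"


def pub_id_from_landing_url(url: str) -> str:
    """Recover the internal publication id from a landing page URL fragment."""
    return unquote(url.split("#", 1)[1])
```

Decoding the fragment of a resolved DOI in this way yields the identifier needed for the API calls described below.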

Minimal metadata is recorded with the DOI in the DataCite profile (Location (URL), Creator, Title, Publisher, Publication year, Resource type (Dataset)).

The landing page makes an ajax call to the API to retrieve the metadata about the publication itself and the direct sub-components of the dataset. The landing page displays a standard subset of the available metadata (which can include anything the researcher(s) decided was important). Further calls are made to retrieve information about individual subcomponents (folders or files) as needed to let users browse through the overall collection.

The first call is to a URL of the form https://<repository machine>/refrepository/api/researchobjects/<internal pub id>/metadata

In this example, the URL is https://sen.ncsa.illinois.edu/refrepository/api/researchobjects/tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ/metadata (click on it to see the json-ld returned - a pretty printing browser plugin is recommended).
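A third-party client could make the same call as the landing page. The following is a minimal sketch using only the Python standard library; the function names and the timeout value are assumptions, and leaving ':' and ',' unencoded in the path simply mirrors the example URL above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def metadata_url(host: str, pub_id: str) -> str:
    """Build the dataset-level metadata URL for a publication id."""
    # ':' and ',' are left unencoded, matching the example URL.
    encoded = quote(pub_id, safe=":,")
    return f"https://{host}/refrepository/api/researchobjects/{encoded}/metadata"


def fetch_metadata(host: str, pub_id: str, timeout: float = 30.0) -> dict:
    """GET the dataset-level json-ld document and parse it into a dict."""
    with urlopen(metadata_url(host, pub_id), timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_metadata("sen.ncsa.illinois.edu", "tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ") would request the example document, provided the server is reachable.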

The return is a single json object with key/value metadata entries whose values may be sub-objects, strings, or arrays. The json object has a json-ld context that maps the keys to formal predicates (e.g. "Title": "http://purl.org/dc/terms/title"). Minimal metadata includes a title, abstract, and one or more creators, as well as generated metadata such as a link to the minted DOI, the SEAD Project Space from which the data was published, the publication date, and basic aggregation statistics (number of files, total size, etc.). There are also a number of 'optional' fields that are usually populated, including a list of Keywords, a list of items in the dataset, a contact person, etc.

In the example here, the Keyword metadata is an array of string values, while the Creator field is an array containing a mixture of objects (where we know the name, email, ORCID identifier and other information about a person) and simple strings (people's names). Fields like Title are simple strings.
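Because a field such as Creator can mix objects and plain strings, client code has to handle both shapes. The sketch below shows one way to do that in Python. The "name" key inside a creator object and the sample values are hypothetical; check them against the json-ld actually returned.

```python
def as_list(value):
    """Wrap a scalar or object in a list; pass lists through; None becomes []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def creator_names(metadata: dict, key: str = "Creator", name_key: str = "name") -> list:
    """Collect display names from a Creator field of mixed strings and objects."""
    names = []
    for entry in as_list(metadata.get(key)):
        if isinstance(entry, str):
            names.append(entry)  # simple string: already a name
        elif isinstance(entry, dict) and name_key in entry:
            names.append(entry[name_key])  # object: pull out the assumed name key
    return names


# Illustrative data only; not taken from the example publication.
sample = {"Title": "Example", "Creator": [{"name": "Doe, Jane"}, "Roe, Richard"]}
print(creator_names(sample))
```

The as_list helper also covers the case where a field holds a single value rather than an array.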

The http://purl.org/dc/terms/hasPart term is used to identify the direct subcomponents of the dataset by their internal SEAD identifiers (which use a tag or urn form depending on SEAD version). In the example, the Dataset contains three top level folders and a spreadsheet (with the other ~12K files arranged in a hierarchy of subfolders beneath the three top-level ones). As a convenience, we also send the basic metadata of the direct subcomponents as an array of json objects using the OAI-ORE aggregates key (more about ORE below) - this simply lets us list the title (and size for files) of the direct children to allow us to dynamically populate a table of contents tree.
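The way the hasPart list and the convenience aggregates array combine into a table of contents can be sketched as follows. All key names used here ("Has Part", "aggregates", "Identifier", "Title", "Size") are assumptions made for illustration; the real keys are defined by the json-ld context of the returned document, and the sample data is invented.

```python
def table_of_contents(metadata, part_key="Has Part", agg_key="aggregates",
                      id_key="Identifier", title_key="Title", size_key="Size"):
    """Return (title, size) rows for the direct children, in hasPart order."""
    by_id = {child[id_key]: child for child in metadata.get(agg_key, [])}
    rows = []
    for part_id in metadata.get(part_key, []):
        child = by_id.get(part_id, {})
        # Fall back to the raw identifier when no convenience entry was sent.
        rows.append((child.get(title_key, part_id), child.get(size_key)))
    return rows


# Illustrative data only: one folder (no size) and one file.
sample = {
    "Has Part": ["id:folder1", "id:file1"],
    "aggregates": [
        {"Identifier": "id:file1", "Title": "summary.xlsx", "Size": 2048},
        {"Identifier": "id:folder1", "Title": "Raw data"},
    ],
}
print(table_of_contents(sample))
```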

There is a second endpoint of the form https://<repository machine>/refrepository/api/researchobjects/<internal pub id>/metadata/<subcomponent id> that provides analogous information for the individual files and folders within a published dataset. In SEAD, both files and folders have basic metadata and may have additional custom metadata and relationships. SEAD only requires minimal info: a title, plus subcomponents (for folders) or size, mimetype, and a bytestream retrieval URL (for files).

https://sen.ncsa.illinois.edu/refrepository/api/researchobjects/tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ/metadata/tag%3Asead-data.net%2C2015%3Acol_jn7sXNVcE718bCXiWEewLw%2Fv2 is an example of a folder within the example publication.

https://sen.ncsa.illinois.edu/refrepository/api/researchobjects/tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ/metadata/tag%3Asead-data.net,2015%3Adata_lseLuBR1WmC4xOO6XRFMTQ%2Fv2 is an example of a file's metadata, which includes a URL from which the file bytes can be retrieved (the ORE similarTo key).
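Building a subcomponent URL and reading the download link from a file's metadata might look like the sketch below. The function names are hypothetical. Fully percent-encoding the subcomponent id (including its '/') follows the folder example above, and the assumption that the key appears literally as "similarTo" in the returned object should be verified against a real response.

```python
from urllib.parse import quote


def subcomponent_url(host: str, pub_id: str, sub_id: str) -> str:
    """Build the metadata URL for one file or folder within a publication."""
    base = f"https://{host}/refrepository/api/researchobjects/{quote(pub_id, safe=':,')}"
    # The subcomponent id contains '/', so it must be encoded to stay one path segment.
    return f"{base}/metadata/{quote(sub_id, safe='')}"


def download_url(file_metadata: dict, key: str = "similarTo"):
    """Return the bytestream URL of a file, or None for folders or missing keys."""
    return file_metadata.get(key)
```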

In the SEAD landing page, an ajax call is made to retrieve the dataset's metadata, which is used to populate the page and show the top level of a table of contents. A user click on any folder causes an ajax call to retrieve the next level of the folder tree, and clicking on any file name makes a call to download the file contents itself. There are also buttons on the page to retrieve the complete serialized publication (often very large) or the complete OAI-ORE metadata map (still potentially 10-100MB for a large collection of 10-100K files). Since the SEAD reference repository only supports open (Creative Commons) licenses, no login is required and the endpoints can be called directly by third parties.

Serialization Format: OAI-ORE + BagIT

The SEAD reference repository, and the related repository used to support the Indiana University SEAD Cloud repository (http://seadva.d2i.indiana.edu:8081/landing-page/home.html), serialize each data publication as a single file, which is accessed by a web application to serve metadata and data components through the API described above. The data publication is archived as a single Zip file, formatted according to the BagIT specification, with metadata documented according to the Open Archives Initiative Object Reuse and Exchange (ORE) specification (serialized as JSON-LD, and integrated with BagIT according to the convention developed within the DataOne project).

In quick summary, BagIT defines a file/folder structure that cleanly separates data (in an <id>/data subdirectory) from metadata (in an <id>/meta subdirectory) and defines a few standard metadata files that report the version of BagIT, basic metadata about the publication, cryptographic hash values for included files to support fixity checks, etc. OAI-ORE provides a vocabulary that can be serialized as rdf or json-ld that cleanly separates the description of a Dataset (an 'Aggregation') from the file describing it (an ORE Map file) and defines terms to indicate that the Aggregation 'aggregates' a set of "Aggregated Resources" (files and folders for SEAD). Both BagIT and OAI-ORE contain more functionality and optional features that can be found in the references given above. DataOne has defined an additional metadata file (pid-mapping.txt) that can be added to a BagIT bag to indicate the mapping between IDs used in the ORE metadata file and the data path structure in the bag (e.g. tag:12356 <dataset id>/data/folder1/file2 indicates that the tag:12356 identifier is used to report metadata about the file located in the indicated subfolder of the bag).
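The pid-mapping.txt convention can be read with a few lines of Python. This sketch assumes one mapping per line with the identifier separated from the path by whitespace, as in the example above; the function name and the sample bag name "mybag" are hypothetical.

```python
def parse_pid_mapping(text: str) -> dict:
    """Map each identifier to its path inside the bag."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"malformed pid-mapping line: {line!r}")
        identifier, path = parts
        mapping[identifier] = path
    return mapping


sample = "tag:12356 mybag/data/folder1/file2\n"
print(parse_pid_mapping(sample))
```

With such a mapping in hand, a server can translate an identifier from the ORE map into the entry to read from the zipped bag.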

To be able to report the metadata for an entire dataset with a hierarchy of folders and files in a single ORE Map file, SEAD has adopted a convention of reporting the Aggregation as containing a flat list of aggregated resources and using the DCterms hasPart relationship to describe the hierarchical structure. This has been done because, while ORE allows an Aggregation to aggregate other Aggregations, it requires each Aggregation to have its own ORE map file. If SEAD represented folders as Aggregations, their contents would need to be described in other files (resulting in as many as ~5K separate map files in the largest datasets that SEAD has published). Instead, we have a single map file that has a single Aggregation containing as many as 100K+ resources, with as many as 5K of those being collections that have DCterms hasPart metadata linking them to child files and folders.
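The folder hierarchy can be recovered from such a flat list by following the hasPart links. The following is a minimal sketch, assuming the aggregated resources have already been indexed by identifier; the key names "Has Part" and "Title", the function name, and the sample data are all hypothetical.

```python
def build_tree(root_parts, resources, part_key="Has Part", title_key="Title"):
    """Nest a flat id -> resource mapping into a tree by following hasPart links."""
    def node(res_id, seen):
        if res_id in seen:
            raise ValueError(f"hasPart cycle at {res_id}")
        res = resources.get(res_id, {})
        # Files have no hasPart entry, so they become leaves.
        children = [node(child, seen | {res_id}) for child in res.get(part_key, [])]
        return {"id": res_id, "title": res.get(title_key, res_id), "children": children}

    return [node(root, frozenset()) for root in root_parts]


# Illustrative flat list: one folder holding one file, plus a top-level file.
flat = {
    "id:folder1": {"Title": "Raw data", "Has Part": ["id:file2"]},
    "id:file1": {"Title": "summary.xlsx"},
    "id:file2": {"Title": "run1.csv"},
}
tree = build_tree(["id:folder1", "id:file1"], flat)
print(tree[0]["children"][0]["title"])
```

Here root_parts stands for the hasPart list of the Aggregation itself, which names the top-level items.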

Analysis

In terms of the proposed NDS mechanisms for returning metadata, SEAD's json-ld payloads are hopefully typical of what one might expect for a json-ld serialization. The specific naming of the endpoints does not follow a useful convention for relating the landing page and metadata endpoints, or for finding the endpoints of specific items in the datasets.

SEAD's use of DCterms hasPart relationships to describe hierarchical content might be a reasonable choice, but Portland Data Model or ORE relationships may be more appropriate for NDS to standardize on. It will be interesting to see whether SEAD's interest in packaging all metadata in a single file, and hence the dual use of ORE and DCterms to describe two related hierarchies, is mirrored in or of interest to other systems.

While the NDS 6 discussion focused more on an API than a serialization format, it does appear that several systems (DataOne, Globus, SEAD, HydroShare) use or are planning to use BagIT + ORE, and thus some combination of these (perhaps with the DataOne extension and/or other additional conventions) might be a useful serialized import/export format for data publications in NDS. (In SEAD, the API essentially breaks the ORE Map file down into small chunks focused on a specific resource, and navigates within the zipped BagIT structure to retrieve the Map file and individual data files for download, so there is a significant correlation between the API and the serialized format actually stored on disk.)