Resource: NDS 7 Poster describing SEAD's current practices in terms of the UADPP tasks/goals.


SEAD (http://sead-data.net) is a project, originally supported through the NSF DataNet program, that provides services to manage and curate active data and to publish organized collections of data (referred to as Datasets) to longer-term repositories. SEAD's primary APIs support the ongoing management of data and metadata in project spaces and the publication process. However, SEAD has also developed a simple, scalable reference repository that can create and serve Datasets ranging from a few files to 100K files and hundreds of GB of data, using only a small Java web application, a basic Java web server, and a file system. This reference repository shares SEAD's open metadata model (users can add metadata in any vocabulary) and uses the OAI-ORE JSON-LD metadata format from SEAD's publishing services as its long-term serialization format. (When SEAD sends data to other repositories, they can parse the metadata they receive and store and serve it via their own mechanisms.) The following sections describe the API developed to serve the metadata of publications in the reference repository (used internally to dynamically generate each Dataset's landing page) and the serialization format itself.

...

https://sen.ncsa.illinois.edu/refrepository/api/researchobjects/tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ/metadata/tag%3Asead-data.net,2015%3Adata_lseLuBR1WmC4xOO6XRFMTQ%2Fv2 is an example of a file's metadata, which includes a URL from which the file bytes can be retrieved (the ORE similarTo key).
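As a rough illustration of how such a URL is put together, the sketch below rebuilds the example above from its two identifiers. The function names (file_metadata_url, download_url) are made up for this page, and the encoding rule is inferred from the single example URL rather than taken from API documentation: the research object id appears unescaped, while the resource id has ":" and "/" percent-encoded and "," left alone. The exact JSON key under which the ORE similarTo term appears is also an assumption.

```python
from urllib.parse import quote

# Base path copied from the example URL above.
API_BASE = "https://sen.ncsa.illinois.edu/refrepository/api/researchobjects"


def file_metadata_url(ro_id: str, resource_id: str, base: str = API_BASE) -> str:
    """Build the metadata URL for one resource inside a published research object.

    Mirrors the example URL: the research object id is used as-is, while the
    resource id is percent-encoded (":" and "/" escaped, "," kept).
    """
    return f"{base}/{ro_id}/metadata/{quote(resource_id, safe=',')}"


def download_url(metadata: dict, key: str = "similarTo"):
    """Return the download location from a parsed metadata document.

    Assumption: the ORE similarTo term is exposed under the plain key
    "similarTo"; pass another key if the JSON-LD context names it differently.
    """
    return metadata.get(key)


url = file_metadata_url(
    "tag:sead-data.net,2015:RO_16UNEIBDgnoe2dCBmNFTaQ",
    "tag:sead-data.net,2015:data_lseLuBR1WmC4xOO6XRFMTQ/v2",
)
print(url)
```

A client would fetch this URL, parse the JSON response, and pass the resulting dictionary to download_url to find where the file bytes live.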

In the SEAD landing page, an ajax call retrieves the dataset's metadata, which is used to populate the page and show the top level of a table of contents. A user click on any folder triggers an ajax call to retrieve the next level of the folder tree, and clicking on any file name makes a call to download the file contents. There are also buttons on the page to retrieve the complete serialized publication (often very large) or the complete OAI-ORE metadata map (still potentially 10-100MB for a large collection of 10-100K files). Since the SEAD reference repository only supports open (Creative Commons) licenses, no login is required and the endpoints can be called directly by third parties.
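The level-by-level loading described here can be sketched as follows. This is only an illustration of the pattern, not the landing page's actual code: the key that lists a folder's children ("hasPart" below) is hypothetical, the ids are made up, and the HTTP call is replaced by an in-memory stand-in so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical key holding the ids of a folder's children; the real ORE map
# may use a different term.
CHILDREN_KEY = "hasPart"


def list_children(fetch: Callable[[str], dict], resource_id: str) -> List[str]:
    """Fetch one resource's metadata and return the ids of its direct children."""
    metadata = fetch(resource_id)
    children = metadata.get(CHILDREN_KEY, [])
    # A single child may be serialized as a bare string rather than a list.
    return [children] if isinstance(children, str) else list(children)


# Stand-in for the metadata endpoint: a tiny in-memory "repository".
FAKE_REPO: Dict[str, dict] = {
    "dataset": {"hasPart": ["folder1", "readme.txt"]},
    "folder1": {"hasPart": "file2"},
    "readme.txt": {},
    "file2": {},
}

calls: List[str] = []


def fake_fetch(resource_id: str) -> dict:
    calls.append(resource_id)  # record which levels were actually requested
    return FAKE_REPO[resource_id]


top_level = list_children(fake_fetch, "dataset")  # initial page load
expanded = list_children(fake_fetch, "folder1")   # user clicks folder1
print(top_level, expanded, calls)
```

Only the levels a user opens are ever requested, which is what keeps the page responsive for collections whose full metadata map runs to tens of megabytes.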

Serialization Format: OAI-ORE + BagIT

The SEAD reference repository, and the related repository used to support the Indiana University SEAD Cloud repository (http://seadva.d2i.indiana.edu:8081/landing-page/home.html), serialize each data publication as a single file, which a web application accesses to serve metadata and data components through the API described above. The data publication is archived as a single Zip file, formatted according to the BagIT specification, with metadata documented according to the Open Archives Initiative Object Reuse and Exchange (ORE) specification (serialized as JSON-LD and integrated with BagIT according to the convention developed within the DataOne project).

In brief, BagIT defines a file/folder structure that cleanly separates data (in an <id>/data subdirectory) from metadata (in an <id>/meta subdirectory) and defines a few standard metadata files that report the version of BagIT, basic metadata about the publication, cryptographic hash values for included files to support fixity checks, etc. OAI-ORE provides a vocabulary, serializable as RDF or JSON-LD, that cleanly separates the description of a Dataset (an 'Aggregation') from the file describing it (an ORE Map file) and defines terms to indicate that the Aggregation 'aggregates' a set of 'Aggregated Resources' (files and folders for SEAD). Both BagIT and OAI-ORE offer further functionality and optional features, which are described in the references given above. DataOne has defined an additional metadata file (pid-mapping.txt) that can be added to a BagIT bag to indicate the mapping between IDs used in the ORE metadata file and the data path structure in the bag (e.g. tag:12356 <dataset id>/data/folder1/file2 indicates that the tag:12356 identifier is used to report metadata about the file located in the indicated subfolder of the bag).
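To make the layout concrete, here is a minimal sketch that resolves an ORE identifier to file bytes via pid-mapping.txt and runs a fixity check against a BagIT manifest. Several details are assumptions made for illustration: the bag id "ds1", the location of pid-mapping.txt under ds1/meta, the manifest name and its SHA-1 algorithm, and the helper names parse_pid_mapping and verify_manifest. The bag is built in memory so the example does not depend on a real publication.

```python
import hashlib
import io
import zipfile


def parse_pid_mapping(text: str) -> dict:
    """Map each identifier to its path in the bag.

    Each non-empty line is "<identifier> <path>"; split on the first run of
    whitespace only, since paths may themselves contain spaces.
    """
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            identifier, path = line.strip().split(None, 1)
            mapping[identifier] = path
    return mapping


def verify_manifest(bag: zipfile.ZipFile, manifest_name: str, algorithm: str) -> list:
    """Return the manifest paths whose checksum does not match the stored bytes."""
    root, _, _ = manifest_name.rpartition("/")
    prefix = f"{root}/" if root else ""  # manifest paths are relative to the bag root
    failures = []
    for line in bag.read(manifest_name).decode("utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.strip().split(None, 1)
        actual = hashlib.new(algorithm, bag.read(prefix + rel_path)).hexdigest()
        if actual != expected.lower():
            failures.append(rel_path)
    return failures


# Build a tiny in-memory bag (assumed layout, made-up content).
payload = b"hello"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("ds1/data/folder1/file2", payload)
    z.writestr("ds1/manifest-sha1.txt",
               f"{hashlib.sha1(payload).hexdigest()} data/folder1/file2\n")
    z.writestr("ds1/meta/pid-mapping.txt", "tag:12356 ds1/data/folder1/file2\n")

with zipfile.ZipFile(buffer) as bag:
    mapping = parse_pid_mapping(bag.read("ds1/meta/pid-mapping.txt").decode("utf-8"))
    file_bytes = bag.read(mapping["tag:12356"])  # identifier -> path -> bytes
    failures = verify_manifest(bag, "ds1/manifest-sha1.txt", "sha1")

print(mapping, failures)
```

Because the identifier-to-path lookup goes through pid-mapping.txt, the ORE metadata never needs to know how files are arranged inside the bag.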

...