Towards a National Data Service

The National Data Service Consortium (NDSC) aims to enhance scientific reproduciblity in the Information Age by implementing interfaces, protocols, and standards within data relevant cyberinfrastructure tools/services towards allowing for an overall infrastructure enabling scientists to more easily search for, publish, link, and overall reuse digital data. Specifically, NDSC has as its mission:

... advancing the frontiers of discovery and innovation by enabling open sharing of data and increase collaboration within and across fields and disciplines. Success will be achieved through coordinated and concentrated efforts, developing an open environment of federated, interoperable, and integrated national-scale services.

...well-defined interfaces that allow for the development of modular, interoperable, and extensible services and tools – which NDS is committed to help assemble, coordinate or pioneer in collaboration with others.

One might think of this in terms of the modern internet, which had its roots in the ARPANET effort ...

... taking into account additional components addressing "Big Data" challenges within the internet, a sort of DATANET:

The above is illustrative and not a comprehensive list of services. There are a variety of components being actively explored and developed, some building on top of others, some not, some interacting with others, some not. In essence, just like with every Internet component there are a variety of options for each Datanet component. Unlike the internet components, however, these selections matter - they effect the user and limit what other components might be utilized/deployed.

For example, in the context of internet technologies, consider a web admin deploying a web server: the admin will pick one server over another for one reason or another, e.g. the open source Apache HTTP server, NGINX, or the commercial Microsoft IIS server. IDEALLY this decision will not affect users, because their browser, also a choice to be made this time on the users end, will work either way. This is the case because web servers and browsers implement similar protocols/standards (e.g. HTTP, HTML, ...).

In collaboration with groups like the Research Data Alliance (RDA) we aim to also achieve this standardization among the actively developed data components being constructed today, allowing them to be more readily put together, and allowing scientists as well as others to more easily:

publish data, link data to papers, cite data, get credit for your data
preserve data for longer periods of time
efficiently and reliably move large amounts of data and/or large numbers of files
share data collections over the web
set data access permissions
search for data
reuse data
find desired data and be able to access that data
run custom tools/software on data, near the data
run compute intensive analysis on data
run analysis on large collections of data
run analysis across any available resources
utilize an extensible suite of reusable data analysis/manipulation tools
preserve, share, and find software/tools associated with the data
publish software, cite software, get credit for your software
create and share workflows over data
ask novel questions spanning all available data

Overall this will accelerate and enable novel science, allowing stories such as these to become the norm rather than the exception:

Scientists have pinned down a molecular process in the brain that helps to trigger schizophrenia by analyzing data from about 29,000 schizophrenia cases, 36,000 controls and 700 post mortem brains. The information was drawn from dozens of studies performed in 22 countries. The authors stressed that their findings, which combine basic science with large-scale analysis of genetic studies, depended on an unusual level of cooperation among experts in genetics, molecular biology, developmental neurobiology and immunology. "This could not have been done five years ago".
Researchers discover that rising atmospheric CO2 is resulting in less nutritious food, requiring 17 years of work across the literature of several disciplines.

... and in terms of the general public and broader impact, perhaps seeding a new kind of internet all can take advantage of:

A child might ask their Apple TV or Amazon Echo the question, "What is the weather going to be like in the afternoon... 1000 years from now?". This will not result in an answer today, but theoretically could with several of the actively developed data components and interoperability between them. For example, ecological models such as ED, SIPNET, or DALEC within workflow engines such as PEcAn could pull data from Ameriflux, NARR, BETYdb, DataONE, and NEON, transfer data via Globus, convert data to model input formats via BrownDog, run models on XSEDE resources, and returns a specific result such as temperature/forecast estimate (likely several results from several models). The result can further include additional information such as "if you want to learn more" which links to summarized versions of publications for the utilized models and datasets with DOIs to published papers or datasets in Zenodo or SEAD or executable versions of the the tools themselves (possibly simplified) within HUBzero or Jupyter lab notebooks.