Science Data Use Cases for Phase I

The focus of the OSN pilot is to prototype a national-scale storage platform and network to meet the needs of the research community at all scales. Rather than researching or developing completely new technologies, efforts will focus on testing, optimizing, and measuring the effectiveness of existing, proven storage systems and services in a variety of science-driven use cases, to help guide a broader discussion about the architecture and technical configuration of a production-quality national storage network. The use cases specified here clarify the requirements and provide metrics against which candidate solutions can be measured. The OSN will need to support an array of technologies and interfaces to meet both the science and engineering use case needs and operational requirements.

  • Connectomics (Harvard University) - The Lichtman Laboratory in the Center for Brain Science and the Molecular and Cellular Biology Department at Harvard University is embarking on a large, data-intensive project to image and reconstruct all the nerve cells and synaptic interconnections in a cubic millimeter of cerebral cortex. An automated pipeline of image acquisition with a fast multibeam scanning electron microscope will generate images at a rate of about 15TB per day, 7 days per week, resulting in a data set that comprises more than 300 million separate images occupying more than 2PB of storage (a back-of-envelope check of this timeline appears after this list). These data will be passed through a series of multi-GPU/CPU processing steps in order to render all 1 billion synapses and all of the nerve-cell wires that make up the neural network. The results of this work will be housed in the NSF-funded Northeast Storage Exchange (NESE), which is physically adjacent to the computing resources at the Massachusetts Green High Performance Computing Center (MGHPCC) that will be processing the data. Availability of the fast and convenient data sharing envisioned by the OSN project will simplify collaboration as the project progresses, increasing the impact of the research.

  • The Critical Zone Observatories (CZO), made up of 10 sites (Boulder Creek, Calhoun, Christina River Basin, Eel River, Intensively Managed Landscapes, Jemez River Basin-Santa Catalina Mountains, Luquillo, Reynolds Creek, Susquehanna Shale Hills, and Southern Sierra), collect data for the study of the chemical, physical, geological, and biological processes that shape the Earth's surface and support most terrestrial life on the planet. The program was created to research what scientists call Earth's Critical Zone, the porous "near surface layer" that extends from the tops of trees down to the deepest water sources. The CZO data, roughly 40TB to date, comprise a diversity of datasets spanning disciplines and spatial and temporal scales, drawn from in situ sensors, field instruments, remote sensing, and models. By researching how organisms, rock, air, water, and soil interact, scientists hope to better understand natural habitats as well as concerns such as food availability and water quality. Understanding these complex interactions requires researchers from a wide range of disciplines, including geosciences, hydrology, microbiology, ecology, soil science, and engineering, each of whom must be able to access and analyze this combined data.

  • Terra Satellite Data Fusion Products are derived from the petabytes of data from the five instruments on the Terra satellite (ASTER, CERES, MISR, MODIS, and MOPITT) that observe the Earth's atmosphere, oceans, and surface to provide scientists with data on the Earth's bio-geochemical and energy systems. Among the most popular NASA datasets, Terra serves the scientific community in areas such as ecology and geoscience, for modeling plant growth and analyzing land usage, as well as government, commercial, and educational studies and needs. The strength of the Terra data has always been rooted in its five complementary instruments and the ability to fuse their data to obtain higher-quality information than any single instrument can provide. This has become increasingly important as the data volume has grown, along with the need to perform large-scale analytics on the data. These fusion products, which combine two or more sensor products, are largely community generated and must be brought together in order for broader communities to be able to make use of them; this requires storage near computational resources, connected by high-speed networks, in order to pull the needed raw data and to generate and share the fused products (a simple regridding sketch, one prerequisite of such fusion, appears after this list).

  • Understanding the roles of sub-mesoscale processes and internal waves in the global ocean through multi-petabyte model and observational data analysis (MIT) – The role of kilometer-scale sub-mesoscale ocean processes and of internal wave processes in feedbacks to global-scale ocean dynamics is a growing area of research. Ocean dynamics at the kilometer scale has emerged as an important piece in improving understanding of the mechanisms by which the ocean modulates Earth's climate and in reasoning about the behavior of marine microbe communities. Many important uncertainties in ocean heat and carbon uptake, which hold deep societal relevance, have components that arise from uncertainty about kilometer-scale dynamics and its interactions. Recent modeling collaborations between MIT, NASA JPL, and NASA Ames have aggressively targeted production of open, shareable sets of simulation output at kilometer-scale resolutions, for use in conjunction with field work and for application to theoretical analysis. These studies combine multi-petabyte (currently 4PB) volumes of simulation output with smaller, but still petabyte-scale, volumes of field data (satellite records of sea-surface height, spectral emissivity and reflectance, in-situ genetic sequence measurements, etc.). The MIT contribution to these sharable data sets will be housed in NESE at the Massachusetts Green High Performance Computing Center (MGHPCC), using iRODS and other frameworks, under the direction of Chris Hill, a co-PI on the NESE project and a co-author of the MIT General Circulation Model. A growing body of distributed research activities in the US and internationally is combining modeling and field work. The ability to scale to 10+PB via the NESE project, combined with the fast and convenient data sharing envisioned by the OSN project, will be transformative for this work, accelerating the research of multiple postdoctoral and graduate student projects regionally, nationally, and worldwide.

  • The HathiTrust Research Center (HTRC) collection contains over 14.7 million volumes comprising over 5.1 billion pages, with a total size of 658TB. The HathiTrust Digital Library (HTDL), one of the largest publicly accessible collections of cultural knowledge in the world, acts as a "photocopy" of some of the great research libraries in the U.S., and as such has both wide and deep relevance to many fields, from Astronomy to Zoology, Homer to Hawking. Approximately 61% of the texts (~3.1 billion pages) are under copyright and cannot be distributed outside of HTRC. HTRC is mandated to provide non-consumptive computational research access to this portion of the collection; its chief challenge is providing cutting-edge computational access to this world-class collection of cultural and scientific knowledge while rigorously adhering to copyright restrictions and ensuring the security of the collection. As high-performance and high-throughput computing go hand-in-hand with much of HTRC's work, situating these data near computational resources benefits a user community that currently includes digital humanists conducting large-scale mining of the HTDL, computational linguists conducting natural language processing (NLP) experiments, and economists and historians of science tracking technology diffusion and innovation over the centuries. The dataset also serves as a rich and valuable resource for data science research, acting as a testbed for machine learning, NLP algorithms, and indexing methodologies. The outputs of such efforts can serve research within ecology and ecological modeling, by providing historical data otherwise not obtainable, and data fusion in areas such as geoscience, where data is heterogeneous and lacking metadata or consistent metadata (a topic being addressed within EarthCube).

  • Machine Learning - Machine learning, an active area of research for some time, has exploded onto the scientific stage in the last few years due to breakthroughs in Deep Learning [Le89, Kri12] coupled with advancements in computing and an increase in available training data. In nearly all scientific domains, machine learning is being explored as a new analysis tool for making novel discoveries. Examples include phenotyping within biology, land feature extraction within geoscience, infrastructure/community impact within engineering, novel material discovery within materials science, as well as exoplanet discovery and the building of faster, more effective software for signal analysis within astronomy. A key and challenging element of contemporary machine learning, especially as applied to specific scientific questions, is the identification of relevant training datasets. Novel techniques within Deep Learning additionally require relatively large training datasets compared to other machine learning approaches. To better enable machine-learning-driven discovery in science, a nexus of high-value, cross-domain data accessible on the proposed OSN would be a significant asset. In collaboration with the BD Hubs, the OSN team will collect a variety of machine learning datasets to be shared on the network.

  • The Sloan Digital Sky Survey (SDSS) data, 6 million images totaling 70TB, is one of the first examples of a large open scientific data set, having been in the public domain for over a decade [Sky02, Sza00]. The project and its archive have changed astronomy forever, having shown that if the data is of high quality, and if it is presented in an intuitive fashion, the community is willing to change its traditional approach and use a virtual telescope. In addition to the data itself, the data management services around the SDSS have grown to include a scripting interface in which Jupyter notebooks can extract data from a virtual data container, holding over 100TB of calibrated, flat-file-based data integrated with the database, and can run machine learning applications over a subset of this data selected by a database query (a minimal query-and-analysis sketch appears after this list). To produce a timely result, such analyses need to be performed on a powerful machine; there is a benefit to locating the dataset near compute resources, such as on an OSN node. The SDSS is the first major open e-science archive. Tracking its usage and changes will help the science community understand the long-term curation of such data sets and see what common lessons can be derived for other disciplines facing similar challenges.

  • The Large Synoptic Survey Telescope (LSST) is the flagship project of ground-based US astronomy for the next 15 years. Its dedicated telescope in Chile will produce over 100PB of raw data, with thousands of scans of the sky. It will reproduce the sky coverage of the SDSS (collected over 8 years) in less than 4 days, opening up the time dimension on an unprecedented scale. The data will be transferred to NCSA for processing into a Tier 0 archive. The science collaborations are organized into many different working groups according to their particular science interests. These working groups will produce a variety of topic-specific, value-added data sets that will be used for subsequent specialized analyses (cosmology, gravitational lensing, quasars, stellar systems, asteroids, variable stars, etc.). The relevant subsets of the raw data need to be moved around the US to the various collaborations (Tier 1 groups) and then shared with an even broader community for final analyses (Tier 2 centers). This considerable data movement is not part of the core MREFC project and will require a dedicated data infrastructure. As the LSST is representative of the data-intensive resources of the future, it is an excellent driver to use in the OSN prototype. LSST will assist in the exploration of on-demand movement of large observational data sets and their analyses, e.g., streaming a subset of simulated LSST data to multiple locations.

  • The CARMA dataset, 50TB in size, contains the archived data for the Berkeley-Illinois-Maryland (BIMA) millimeter array consortium and that of its successor, the Combined Array for Research in Millimeter-wave Astronomy (CARMA), whose consortium included UIUC, UMd, UCB, Caltech, and U. Chicago. The CARMA array ceased operations in 2015, primarily due to the advent of the Atacama Large Millimeter Array (ALMA), but both CARMA and BIMA data remain very useful to the millimeter-interferometry community as both a scientific and technical repository. The datasets contain visibility data in custom binary formats, images in the standard FITS format, and auxiliary data in XML and ASCII text-file formats. The data can be processed using community data analysis packages such as MIRIAD and CASA (a minimal FITS-inspection sketch appears after this list).

  • Integrating National Data and Watershed Models at the Process Scale: Shortcomings in the existing data and computing infrastructure available to support several current approaches to continental water models ([And06], [Kol06], [PIH11], [Tag04]) have motivated recent workshops to evaluate community model development and analysis infrastructure [CHY11], [CSD09], [CUA08], [CUA09], [NOA12]. To address modern continental-scale water science challenges, the community must come together to share resources and collaborate beyond current paradigms. A scalable, reusable, and extensible research data and analysis infrastructure for continental-scale water models is needed. Workflows that integrate existing national datasets and multiphysics models of the terrestrial water cycle provide a new community strategy for assessing the local impacts of climate and land use change in ungauged watersheds. The macroscopic representation of these processes is typically at scales ranging from sub-meter to hundreds of meters. Data in support of physics-based watershed models at these resolutions exist, but not in a form that efficiently supports national-scale numerical computations. Geoscientists may spend weeks or months collecting and harmonizing the digital data necessary for a typical NSF research project, including soils, climate, geology, streams and river networks, etc. The basic data reside on multiple servers with different formats, resolutions, and map projections (a minimal reprojection sketch appears after this list). The researcher must then decompose the topographic data into a quality numerical mesh and project physical parameters and climate forcing data onto each mesh element. The OSN will support harvesting and geo-referencing of the national data, development of the data products that support the modeling, and development of the APIs and web services necessary to make the data accessible for desktop download and model implementation. The proposed OSN cyberinfrastructure capacity and capability, integrated with HydroShare, will enable a variety of re-analysis products by capturing storm data, transforming the way hydrologic science is done.

  • Collaborative Gene Matching (Universities of Maine, New Hampshire, Vermont, Delaware, and Rhode Island): Research teams in these five states engage separately in water sampling, isolation of organisms, and generation of gene sequences. Leveraging high-capacity network pathways that have been deployed over the past five years (catalyzed by CC* funding), these research teams have formed a regional partnership to share data and computing resources. When data sharing between sites was introduced, participants found that unknown organisms in one partner's database were often known organisms in another partner's database. This exchange of information has accelerated the characterization of water samples at all sites. The OSN will facilitate such information sharing nationally, adding new partners and linking similar small-scale partnerships.
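
For the connectomics use case above, a back-of-envelope check of the stated acquisition timeline, using only the figures quoted in the text (about 15TB of images per day, a data set exceeding 2PB); the short Python below is illustrative only.

    # Back-of-envelope check of the connectomics acquisition timeline,
    # using only the figures quoted in the text.
    daily_rate_tb = 15                       # about 15TB of images per day, 7 days/week
    target_pb = 2                            # the full data set exceeds 2PB
    days = target_pb * 1000 / daily_rate_tb  # 1PB = 1000TB
    print(f"~{days:.0f} days (~{days / 30:.1f} months) of continuous imaging")
    # roughly 133 days, i.e. about four and a half months of uninterrupted acquisition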
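
For the Terra data fusion use case, the sketch below illustrates the regridding step that any multi-sensor fusion must perform: interpolating a coarser gridded product onto a finer grid before combining them. It is a minimal illustration on synthetic data, assuming the xarray and scipy packages are available; the arrays "fine" and "coarse" are placeholders rather than real Terra products, and the final averaging is a trivial stand-in for a real fusion algorithm.

    # Minimal regridding sketch on synthetic data (not real Terra products).
    import numpy as np
    import xarray as xr

    # Two synthetic gridded products at different spatial resolutions
    fine = xr.DataArray(
        np.random.rand(180, 360),
        coords={"lat": np.linspace(-89.5, 89.5, 180),
                "lon": np.linspace(-179.5, 179.5, 360)})
    coarse = xr.DataArray(
        np.random.rand(45, 90),
        coords={"lat": np.linspace(-88.0, 88.0, 45),
                "lon": np.linspace(-178.0, 178.0, 90)})

    # Interpolate the coarse product onto the fine grid (uses scipy), then combine
    coarse_on_fine = coarse.interp_like(fine)
    fused = 0.5 * (fine + coarse_on_fine)   # trivial stand-in for a real fusion algorithm
    print(fused.shape)                      # (180, 360)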
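
For the SDSS use case, a minimal sketch of the query-then-analyze pattern described above, assuming network access to the SDSS SkyServer and the astroquery, pandas, and scikit-learn packages; the query and the clustering step are illustrative and are not part of the SDSS pipeline.

    # Pull a small, query-selected subset of SDSS photometry and run a
    # simple machine-learning step over it (illustrative only).
    from astroquery.sdss import SDSS
    from sklearn.cluster import KMeans

    query = """
    SELECT TOP 1000 p.ra, p.dec, p.u, p.g, p.r, p.i, p.z
    FROM PhotoObj AS p
    WHERE p.r BETWEEN 15 AND 17
    """
    table = SDSS.query_sql(query)         # returns an astropy Table
    df = table.to_pandas()

    # Cluster objects by their five-band magnitudes
    features = df[["u", "g", "r", "i", "z"]].values
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(features)
    print(df.assign(cluster=labels).head())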
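
For the CARMA use case, a minimal sketch of inspecting an image product in the standard FITS format, assuming astropy is installed; "carma_map.fits" is a hypothetical file name standing in for an image pulled from the archive.

    # Open a FITS image and inspect its header and pixel data.
    from astropy.io import fits

    with fits.open("carma_map.fits") as hdul:   # hypothetical file name
        hdul.info()                             # list the HDUs in the file
        header = hdul[0].header                 # WCS, frequency, and observing metadata
        data = hdul[0].data                     # image pixels as a NumPy array
        print(header.get("OBJECT"), header.get("BUNIT"), data.shape)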
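
For the watershed modeling use case, the sketch below shows one of the harmonization steps described above: reprojecting a raster layer onto a common coordinate reference system and resolution before meshing. It assumes the rasterio package; "dem.tif" is a hypothetical digital elevation model, and EPSG:5070 (CONUS Albers) is just one reasonable continental-scale choice.

    # Reproject a raster onto a common continental-scale CRS (sketch only).
    import rasterio
    from rasterio.warp import calculate_default_transform, reproject, Resampling

    dst_crs = "EPSG:5070"   # CONUS Albers Equal Area

    with rasterio.open("dem.tif") as src:       # hypothetical input raster
        transform, width, height = calculate_default_transform(
            src.crs, dst_crs, src.width, src.height, *src.bounds)
        meta = src.meta.copy()
        meta.update(crs=dst_crs, transform=transform, width=width, height=height)

        with rasterio.open("dem_albers.tif", "w", **meta) as dst:
            for band in range(1, src.count + 1):
                reproject(
                    source=rasterio.band(src, band),
                    destination=rasterio.band(dst, band),
                    src_transform=src.transform,
                    src_crs=src.crs,
                    dst_transform=transform,
                    dst_crs=dst_crs,
                    resampling=Resampling.bilinear)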
