Polar Geospatial Center (PGC)

At the University of Minnesota, we chose the Polar Geospatial Center (PGC) as an example of a science-driven use case. The Polar Geospatial Center is an NSF/GEO/PLR-funded national center (PLR-1559691) that provides geospatial and logistical support for the NSF-funded polar science community. PGC is also the NSF liaison with the National Geospatial-Intelligence Agency's (NGA) sub-meter commercial imagery program and currently holds a global collection of sub-meter imagery of approximately 3.2 PB, comprised of 7.8 million scenes covering an area of over 2 billion km2, that is growing at a rate of 2-10 TB daily. PGC also works with NGA to coordinate the tasking of three satellites over much of the Poles to address NSF PI science goals. The recent collaboration between PGC, NSF, and NGA has lowered the cost of sub-meter commercial imagery from tens of thousands of dollars to pennies per square kilometer. The imagery is now provided to PGC at no cost, with the expectation that PGC and NSF provide the infrastructure to retrieve, maintain, and transfer it to the science community. This increased access to the rich dataset fundamentally changes how federally funded research can be done. Almost any researcher would find this imagery useful, for applications as broad as coastline erosion, land use/land cover change, the surface expression of earthquakes, and ice mass balance. PGC provides imagery to NSF-funded researchers working globally in three forms: raw imagery; value-added imagery, including orthorectified images, image mosaics, and digital elevation models; and geospatial web services.

Creating High Resolution Digital Elevation Models for the Poles: PGC and its partners at the Ohio State University (OSU) and Cornell University have been funded to produce a high-resolution, publicly available elevation model for the entire Arctic. This project was initiated by the White House to support the United States' chairmanship of the Arctic Council for 2015–2017. The Arctic Council is an intergovernmental forum that promotes cooperation, coordination, and interaction among the Arctic states, communities, and inhabitants. As part of the U.S. chairmanship, the Office of Science and Technology Policy (OSTP) set a goal of producing a high-resolution Arctic Digital Elevation Model through collaboration between NSF and NGA. The DEMs will be provided at no cost to the science community and the public at large, and will fulfill the United States' commitment to its Arctic partners. The ArcticDEM project adapts PGC's DEM production capabilities from small-area, on-demand production to systematically processing and mosaicking the entire Arctic sub-meter stereo commercial imagery archive. The resources for obtaining and maintaining the imagery, tools for post-processing the results, software development, and distribution methods for the DEM have been secured. Through a 19 million node-hour (600 million core-hour) Petascale Computing Resource Allocation grant (ACI-1614673) from the NSF, PGC and its partners will use the Blue Waters HPC infrastructure at NCSA to process NGA commercial stereo imagery into an elevation model of the entire Arctic landmass poleward of 60°N, extended south to include all of Alaska, Greenland, and Kamchatka. This ArcticDEM will be constructed from stereo-photogrammetric DEMs extracted from pairs of sub-meter resolution WorldView satellite imagery and vertically registered using ground control from GPS-surveyed points, coordinated airborne LiDAR surveys by NASA's Operation IceBridge, and historical elevations from the NASA ICESat mission.
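
To make the vertical-registration step concrete, the short sketch below samples a DEM at GPS control-point locations and applies the median vertical offset. It is a minimal illustration only, assuming a GeoTIFF DEM and a simple CSV of control points (the file names and column layout are hypothetical); the production ArcticDEM workflow uses its own tools and also registers against IceBridge LiDAR and ICESat elevations.

    # Minimal sketch of vertical registration of a DEM against GPS control points.
    # File names and column layout are hypothetical; the real ArcticDEM pipeline
    # uses its own tools and additional control sources (IceBridge LiDAR, ICESat).
    import csv

    import numpy as np
    import rasterio

    def register_dem(dem_path: str, control_csv: str, out_path: str) -> float:
        """Shift a DEM vertically so it matches GPS-surveyed control heights."""
        with open(control_csv) as f:
            # expected columns: x, y (in the DEM's CRS) and surveyed height z
            pts = [(float(r["x"]), float(r["y"]), float(r["z"])) for r in csv.DictReader(f)]

        with rasterio.open(dem_path) as src:
            dem_heights = np.array([v[0] for v in src.sample([(x, y) for x, y, _ in pts])])
            profile = src.profile
            data = src.read(1)
            nodata = src.nodata

        gps_heights = np.array([z for _, _, z in pts])
        valid = dem_heights != nodata if nodata is not None else np.isfinite(dem_heights)
        offset = float(np.median(gps_heights[valid] - dem_heights[valid]))

        # Apply the offset everywhere except nodata cells, then write the result.
        registered = np.where(data == nodata, data, data + offset) if nodata is not None else data + offset
        with rasterio.open(out_path, "w", **profile) as dst:
            dst.write(registered.astype(profile["dtype"]), 1)
        return offset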

Enabling high-resolution imagery and terrain web services: Once DEMs are extracted from stereo pairs and imagery is processed into orthorectified scenes and mosaics, the most useful service to NSF researchers is to make these data available through web services that load transparently into standard GIS and remote sensing software. PGC can then update the data continuously as quality improves or as more current imagery becomes available, and scientists do not have to repeatedly download new versions of the topography.
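
As an illustration of how such services would look from the researcher's side, the sketch below requests an image tile through a standard OGC Web Map Service using OWSLib. The endpoint URL, layer name, and output format are hypothetical placeholders, since the eventual PGC service details are not specified here.

    # Sketch of programmatic access to imagery served via an OGC Web Map Service.
    # The endpoint and layer name below are hypothetical placeholders.
    from owslib.wms import WebMapService

    wms = WebMapService("https://example.pgc.umn.edu/wms", version="1.3.0")  # placeholder URL
    print(list(wms.contents))  # discover available layers

    resp = wms.getmap(
        layers=["arctic_orthomosaic"],                 # hypothetical layer name
        styles=[""],
        srs="EPSG:3413",                               # polar stereographic north
        bbox=(-400000, -1400000, -300000, -1300000),   # example extent in metres
        size=(1024, 1024),
        format="image/geotiff",                        # assumes the server offers GeoTIFF
    )
    with open("tile.tif", "wb") as out:
        out.write(resp.read())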

  • We plan to investigate how satellite imagery can be efficiently delivered to NCSA Blue Waters for processing and how the generated 3D terrain data can be stored in a distributed environment.
  • We will investigate how the proposed distributed storage system will enable efficient access to the imagery and terrain data by NSF researchers.

Observational Data from Astronomy

...

The focus of the OSN pilot is to prototype a national-scale storage platform and network to meet the needs of the research community on all scales. Rather than research or development of completely new technologies, efforts will focus on testing, optimizing and measuring the effectiveness of existing proven storage systems and services in a variety of science-driven use cases, to help guide a broader discussion about the architecture and technical configuration of a production quality national storage network. The use cases specified will clarify the requirements with metrics to measure various solutions. The OSN will need to support an array of technologies and interfaces to meet both the science and engineering use case needs and operational requirements.

  • Connectomics (Harvard University) - The Lichtman Laboratory in the Center for Brain Science and the Molecular and Cellular Biology Department at Harvard University is embarking on a large, data-intensive project to image and reconstruct all the nerve cells and synaptic interconnections in a cubic millimeter of cerebral cortex. An automated pipeline of image acquisition with a fast multibeam scanning electron microscope will generate images at a rate of about 15TB per day, 7 days per week, resulting in a data set that comprises more than 300 million separate images occupying more than 2PB of storage. These data will be passed through a series of multi-GPU/CPU processing steps in order to render all 1 billion synapses and all of the nerve cell wires that make up the neural network. The results of this work will be housed in the NSF-funded Northeast Storage Exchange (NESE), which is physically adjacent to the computing resources at the Massachusetts Green High Performance Computing Center (MGHPCC) that will be processing the data. Availability of the fast and convenient data sharing envisioned by the OSN project will simplify collaboration as the project progresses, increasing the impact of the research.

  • The Critical Zone Observatories (CZO), made up of 10 sites (Boulder Creek, Calhoun, Christina River Basin, Eel River, Intensively Managed Landscapes, Jemez River Basin & Santa Catalina Mountains, Luquillo, Reynolds Creek, Susquehanna Shale Hills, and Southern Sierra), collect data for the study of the chemical, physical, geological, and biological processes that shape the Earth's surface and support most terrestrial life on the planet. The program was created to research what scientists call Earth's Critical Zone, the porous "near surface layer" that extends from the tops of the trees down to the deepest water sources. The CZO data, roughly 40TB to date, is made up of a diversity of datasets spanning disciplines, spatial and temporal scales, in situ sensors, field instruments, remote sensing, and models. By researching how organisms, rock, air, water, and soil interact, scientists hope to better understand natural habitats as well as concerns like food availability and water quality. Understanding these complex interactions requires researchers from a wide range of disciplines, including geosciences, hydrology, microbiology, ecology, soil science, and engineering, with researchers in each of these domains having the ability to access and analyze this combined data.

  • Terra Satellite Data Fusion Products are derived from the petabytes of data from the five instruments on the Terra satellite (ASTER, CERES, MISR, MODIS, and MOPITT) that observe the Earth’s atmosphere, oceans, and surface to provide scientists with data on the Earth’s bio-geochemical and energy systems. Among the most popular NASA datasets, Terra serves the scientific community in areas such as ecology and geoscience towards modeling plant growth and analyzing land usage, as well as government, commercial, and educational studies and needs. The strength of the Terra data has always been rooted in the five differing instruments and the ability to fuse that data together towards obtaining a greater quality of information compared to that obtained from any individual instrument. This has become increasingly important as the data volume has grown along with the need to perform large-scale analytics on the data. These data fusion products combining two or more sensor products are largely community based and must be brought together in order for broader communities to be able to make use of them, requiring storage near computational resources connected by high speed networks in order to pull the needed raw data, generate, and share the fused products.

  • Understanding the roles of sub-mesoscale processes and internal waves in the global ocean through multi-petabyte model and observational data analysis (MIT) – The role of kilometer-scale sub-mesoscale ocean processes and of internal wave processes in feedbacks to global-scale ocean dynamics is a growing area of research. Ocean dynamics at kilometer scale has emerged as an important piece in improving understanding of the mechanisms by which the ocean modulates Earth's climate and in reasoning about the behavior of marine microbe communities. Many important uncertainties in ocean heat and carbon uptake, which hold deep societal relevance, have components that arise from uncertainty about kilometer-scale dynamics and its interactions. Recent modeling collaborations between MIT, NASA JPL, and NASA Ames have aggressively targeted production of open, shareable sets of simulation output at kilometer-scale resolutions, for use in conjunction with field work and for application to theoretical analysis. These studies combine multi-petabyte (currently 4PB) volumes of simulation output with smaller but still petabyte-scale volumes of field data (satellite records of sea-surface height, spectral emissivity and reflectance, in-situ genetic sequence measurements, etc.). The MIT contribution to these sharable data sets will be housed in NESE at the Massachusetts Green High Performance Computing Center (MGHPCC), using iRODS and other frameworks, under the direction of Chris Hill, a co-PI on the NESE project and a co-author of the MIT General Circulation Model. A growing body of distributed research activities in the US and internationally are combining modeling and field work. The ability to scale to 10+PB via the NESE project, combined with the fast and convenient data sharing envisioned by the OSN project, will be transformative for this work, accelerating the research of multiple postdoctoral and graduate student projects regionally, nationally, and worldwide.

  • The HathiTrust Research Center collection contains over 14.7 million volumes comprising over 5.1 billion pages, with a total size of 658TB. The HathiTrust Digital Library (HTDL), one of the largest publicly accessible collections of cultural knowledge in the world, acts as a “photocopy” of some of the great research libraries in the U.S., and as such has both wide and deep relevance to many fields—from Astronomy to Zoology, Homer to Hawking. Approximately 61% of the texts (~3.1 billion pages) are under copyright and cannot be distributed outside of HTRC. HTRC is mandated to provide non-consumptive computational research access to this portion of the collection; for the HTRC, its chief challenge is providing cutting edge computational access to this world-class collection of cultural and scientific knowledge, while at the same time rigorously adhering to copyright restrictions and ensuring security of the collection. As HPC and HTC computing go hand-in-hand with much of HTRC’s work, situating this data near computational resources benefits the user community that currently includes digital humanists conducting large-scale mining of the HTDL, computational linguists conducting natural language processing (NLP) experiments, and economists and historians of science, tracking technology diffusion and innovation over the centuries. The dataset serves as a rich and valuable resource for data science research, serving as a testbed for machine learning, NLP algorithms, and indexing methodologies. The outputs of such efforts can serve research within ecology and ecological modeling in terms of historical data otherwise not obtainable, and data fusion in areas such as geoscience where data is heterogeneous and lacking metadata or consistent metadata (a topic being addressed within EarthCube).

  • Machine Learning - Machine learning, an active area of research for some time, has exploded onto the scientific stage in the last few years due to breakthroughs in Deep Learning [Le89, Kri12] coupled with advancements in computing and an increase in available training sets. In nearly all scientific domains, machine learning is being explored as a new analysis tool for making novel discoveries. Examples include phenotyping within biology, land feature extraction within geoscience, infrastructure/community impact within engineering, novel material discovery within material science, as well as exoplanet discovery and the building of faster, more effective software for signal analysis within astronomy. A key and challenging element of contemporary machine learning, especially as applied to specific scientific questions, is the identification of relevant training datasets. Novel techniques within Deep Learning additionally require relatively large training datasets compared to other machine learning approaches. To better enable scientific discoveries using machine learning, a nexus of high-value, cross-domain data, accessible on the proposed OSN, would be of significant value. In collaboration with the BD Hubs, the OSN will collect a variety of machine learning datasets to be shared on the OSN.

  • The Sloan Digital Sky Survey (SDSS) data, 6 million images making up 70 TB worth of data, is one of the first examples of a large open scientific data set, in the public domain for over a decade [Sky02, Sza00]. The project and its archive have changed astronomy forever, having shown that if the data is of high quality, and if it is presented in an intuitive fashion, the community is willing to change its traditional approach and use a virtual telescope. In addition to the data itself, the data management services around the SDSS have been built up to include a scripting interface, where Jupyter notebooks can extract data from a virtual data container, with over 100TB of calibrated flat-file based data, integrated with the database, and can run machine learning applications over a subset of this data selected by a database query. To produce a timely result, such analyses need to be performed on a powerful machine; there is benefit to locating the dataset near compute resources, such as on an OSN node. The SDSS is the first major open e-science archive. Tracking its usage and changes will help the science community to understand the long-term curation of such data sets, and to see what common lessons can be derived for other disciplines facing similar challenges.

  • The Large Synoptic Survey Telescope (LSST) is the flagship project of ground-based US astronomy for the next 15 years. Its dedicated telescope in Chile will produce over 100PB of raw data, with thousands of scans of the sky. It will reproduce the sky coverage of the SDSS (collected in 8 years) in less than 4 days. It will open up the time dimension on an unprecedented scale. The data will be transferred to NCSA for processing into a Tier 0 archive. The science collaborations are organized into many different working groups, by their special science interests. These working groups will produce a variety of topic-specific, value-added data sets that will be used for subsequent specialized analyses (cosmology, gravitational lensing, quasars, stellar systems, asteroids, variable stars, etc.). The relevant subsets of the raw data need to be moved around the US, to the various collaborations (Tier 1 groups), and then shared with an even broader community to do final analyses (Tier 2 centers). The considerable data movement is not part of the core MREFC project and will require a data infrastructure. As the LSST is a good example of other data-intensive resources of the future, it is an excellent driver to use in the OSN prototype. LSST will assist in the exploration of on-demand movement of large observational data sets and their analyses, for example:

  1. Use the calibrated SDSS data to build a new set of false color images with a custom parametrization and processing, e.g., build a large mosaic where all objects identified as stars are removed and replaced by blank sky, to create an image of what the extragalactic sky would look like.

  2. Take a significant dynamic subset of the 3 million SDSS spectra, defined by a SQL query, transmit the spectral data to NCSA, and perform a massively parallel machine learning task to classify the given subset of spectra at a much higher resolution, e.g., perform a large-scale PCA, then find the small number of regions in the spectra that contribute the most to the given classification, similar to the so-called Lick indices, found heuristically a few decades ago (a toy-scale sketch of this task follows this list).

  3. Take the data cubes from the SDSS MaNGA instrument (an integral field unit, IFU), which takes a spectrum of every coarse pixel of a given galaxy, together with all multicolor images of the same object, then reconstruct the data cube to the best of both the imaging and spectral resolutions using compressive sensing. This is an incredibly compute-intensive task, only possible on a supercomputer.

  4. Take a subset of the simulated LSST data and stream it to multiple locations, on demand.
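
As a toy-scale sketch of task 2 above, the snippet below runs a PCA over a set of spectra on a common wavelength grid and reports the wavelength regions that dominate the leading component, in the spirit of the Lick indices. The spectra here are synthetic stand-ins; in the real task the subset would be selected by an SQL query against the SDSS database and processed in parallel at NCSA.

    # Toy-scale sketch of task 2: PCA over a set of spectra on a common wavelength
    # grid, then locate the wavelength regions that dominate the leading component.
    # The spectra are synthetic stand-ins for an SQL-selected SDSS subset.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    wavelength = np.linspace(3800.0, 9200.0, 2000)      # Angstroms, common grid

    # Synthetic "spectra": flat continuum, noise, and an absorption feature whose
    # depth varies from object to object (the signal PCA should isolate).
    n_spectra = 500
    depth = rng.uniform(0.0, 0.5, size=(n_spectra, 1))
    feature = np.exp(-0.5 * ((wavelength - 5175.0) / 15.0) ** 2)   # Mg b-like line
    spectra = 1.0 - depth * feature + 0.02 * rng.standard_normal((n_spectra, wavelength.size))

    # PCA: each component is an "eigenspectrum" of the sample.
    pca = PCA(n_components=3)
    pca.fit(spectra - spectra.mean(axis=0))

    # Wavelength regions with the largest loadings in the first component are the
    # regions that contribute most to the variance-driven classification.
    loading = np.abs(pca.components_[0])
    top = wavelength[np.argsort(loading)[-10:]]
    print("explained variance:", pca.explained_variance_ratio_)
    print("most informative wavelengths (Angstrom):", np.sort(top))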

Cross Domain Need for Dataset Publication (NDS, NCSA, SDSC, ...)

...

The dataset is a series of snapshots in time from 4 ultra-high resolution 3D magnetohydrodynamic simulations of rapidly rotating stellar core collapse. The 3D domain for all simulations is in quadrant symmetry with dimensions 0 < x,y < 66.5km, -66.5km < z < 66.5km. It covers the newly born neutron star and its shear layer with a uniform resolution. The simulations were performed at 4 different resolutions [500m, 200m, 100m, 50m]. There are a total of 350 snapshots over the simulated time of 10ms, with 10 variables capturing the state of the magnetofluid. For the highest resolution simulation, a single 3D output variable for a single time is ~26GB in size. The entire dataset is ~90TB in size. The highest resolution simulation used 60 million CPU hours on Blue Waters. The dataset may be used to analyze the turbulent state of the fluid and to perform analysis going beyond the results published in Nature (doi:10.1038/nature15755).
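
A quick back-of-the-envelope check, using only the figures quoted above, shows how the ~90TB total follows from the per-variable size of the highest-resolution run:

    # Rough size estimate using only the figures quoted above: 350 snapshots x
    # 10 variables x ~26 GB per 3D variable for the highest (50 m) resolution run.
    # The three coarser runs add comparatively little, since their grids are at
    # least 8x smaller.
    snapshots = 350
    variables = 10
    gb_per_variable = 26                               # highest-resolution run only
    per_snapshot_gb = variables * gb_per_variable      # ~260 GB to move per time slice
    total_tb = snapshots * per_snapshot_gb / 1000
    print(f"per snapshot: ~{per_snapshot_gb} GB, full run: ~{total_tb:.0f} TB")  # ~91 TB, consistent with ~90 TB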

HathiTrust Research Center dataset (mirror)
Stephen Downie

Size: 658 TB

HathiTrust Research Center (HTRC) is the research arm of the HathiTrust Digital Library (HTDL). HTDL contains over 14.7 million volumes comprising over 5.1 billion pages, with a total size of 658 terabytes. The HTDL, one of the largest publicly accessible collections of cultural knowledge in the world, acts as a “photocopy” of some of the great research libraries in the U.S., and as such has both wide and deep relevance to many fields—from Astronomy to Zoology, Homer to Hawking. Approximately 61% of the texts (i.e., ~3.1 billion pages) are under copyright and cannot be distributed outside of HTRC. HTRC is thus mandated to provide non-consumptive computational research access to this portion of the collection; for the HTRC, its chief challenge is providing cutting edge computational access to this world-class collection of cultural and scientific knowledge, while at the same time rigorously adhering to copyright restrictions and ensuring security of the collection.  As HPC and HTC computing go hand-in-hand with much of HTRC’s work, having this data near computational resources is of benefit to the user community.  Current users include digital humanists conducting large-scale mining of the HTDL; computational linguists conducting natural language processing (NLP) experiments; and economists and historians of science tracking technology diffusion and innovation over the centuries represented in the corpus.

CARMA dataset
Athol Kemball

Size: 50 TB

...


The CARMA dataset, 50TB in size, contains the archived data for the Berkeley-Illinois-Maryland (BIMA) millimeter array consortium and that of its successor, the Combined Array for Research in Millimeter Astronomy (CARMA), whose consortium included UIUC, UMd, UCB, Caltech, and U. Chicago. The CARMA array ceased operations in 2015, primarily due to the advent of the Atacama Large Millimeter Array (ALMA), but both CARMA and BIMA data remain very useful to the millimeter-interferometry community as both a scientific and technical repository. The datasets contain visibility data in custom binary formats, images in the standard FITS format, and auxiliary data in XML and ASCII text-file format. The data can be processed using community data analysis packages such as MIRIAD and CASA.
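
Although full reduction requires packages such as MIRIAD or CASA, the FITS images in the archive can be inspected with generic tools. The sketch below (with a hypothetical file name) reads an image and its coordinate system with astropy.

    # Sketch: quick inspection of an archived CARMA/BIMA FITS image outside of
    # MIRIAD/CASA. The file name is a hypothetical placeholder.
    from astropy.io import fits
    from astropy.wcs import WCS

    with fits.open("carma_image.fits") as hdul:
        hdul.info()                               # list HDUs, dimensions, and types
        header = hdul[0].header
        data = hdul[0].data                       # radio images typically carry 4 axes
        print(WCS(header))                        # celestial/spectral coordinate mapping
        print("peak brightness:", float(data.max()), header.get("BUNIT", "unknown units"))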

...

National Optical Astronomy Observatory
Athol Kemball

Size: 500 TB - 1 PB

These data comprise the public archive of the National Optical Astronomy Observatory (NOAO), acquired by the suite of public optical telescopes and instruments operated by this national center. They have the potential to be critical to precursor survey science analysis in advance of LSST, in partnership with NOAO; the archive contains primarily data in FITS format and can be reduced with a variety of community software systems. These data would not be served to the community from NCSA, but would form the basis of a critical research collaboration with NOAO in advance of LSST.

Coupled and Uncoupled CESM Simulations with a 25km Atmosphere
Authors: Ryan Sriver, Hui Li
Size: ~75 TB

This data is from a suite of high-resolution climate model simulations using the Community Earth System Model (CESM, http://www.cesm.ucar.edu/models/). It consists of three multi-decadal, pre-industrial control simulations featuring different coupling configurations: a) a 25 km atmosphere with specified surface ocean temperatures; b) the 25 km atmosphere coupled to a non-dynamic slab ocean; and c) the 25 km atmosphere coupled to a fully dynamic 3-dimensional ocean model with 1-degree horizontal resolution. The output represents the first phase of a comprehensive model sensitivity and climate change experiment focusing on the representation of weather and climate extremes in CESM. The data has broad applications in a wide range of research areas related to climate change science, Earth system modeling, uncertainty quantification, extreme events, decision support, and risk analysis. The current total output is around 100 Terabytes and is in NetCDF format. The data includes gridded monthly, daily, and 6-hourly outputs of key relevant climate/weather variables for the atmosphere, land, ocean, and sea-ice models.
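
As a sketch of how a researcher might work with this output, the snippet below opens a set of monthly NetCDF history files with xarray and computes a simple climatology. The file pattern and the variable name are assumptions for illustration; actual CESM output follows the model's own naming conventions.

    # Sketch of working with the CESM NetCDF output described above.
    # The file pattern and variable name are hypothetical placeholders.
    import xarray as xr

    # Open a collection of monthly atmosphere history files as one dataset,
    # concatenated along the time dimension.
    ds = xr.open_mfdataset("casename.cam.h0.*.nc", combine="by_coords")

    ts = ds["TS"]                                        # surface temperature, assumed field name
    climatology = ts.groupby("time.month").mean("time")  # monthly climatology over the run
    annual_mean = ts.mean("time")

    annual_mean.to_netcdf("ts_annual_mean.nc")           # save the reduced product
    print(climatology)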

Critical Zone Observatories
Praveen Kumar

Size: 25 TB - 40 TB

Funded by the National Science Foundation Earth Science division, the Critical Zone Observatories (CZO) collect data for the study of the chemical, physical, geological, and biological processes that shape the Earth's surface and support most terrestrial life on the planet. The program was created to research what scientists call Earth's Critical Zone, the porous "near surface layer" that extends from the tops of the trees down to the deepest water sources. By researching how organisms, rock, air, water, and soil interact, scientists hope to better understand natural habitats as well as concerns like food availability and water quality. To understand these complex reactions, especially in light of global warming, researchers from a wide range of disciplines, including geosciences, hydrology, microbiology, ecology, soil science, and engineering, are needed. Further, a diversity of datasets is needed, spanning disciplines, spatial and temporal scales, in situ sensors, field instruments, remote sensing, and models. There are 10 CZO sites funded to date: Boulder Creek, Calhoun, Christina River Basin, Eel River, Intensively Managed Landscapes (IML), Jemez River Basin & Santa Catalina Mountains, Luquillo, Reynolds Creek, Susquehanna Shale Hills, and Southern Sierra. The combined data acquired from these 10 sites constitute the CZO dataset.

NASA TERRA Satellite (Mirror)
Larry Di Girolamo
Size: ?

...

ARPA TERRA-REF
David LeBauer
Size: ?

...

NIST Materials Data Facility
Ian Foster
Size: 20 TB

...


  • Integrating National Data and Watershed Models at the Process Scale: Shortcomings in the existing data and computing infrastructure available to support several current approaches to continental water models ([And06], [Kol06], [PIH11], [Tag04]) have motivated recent workshops to evaluate community model development and analysis infrastructure [CHY11], [CSD09], [CUA08], [CUA09], [NOA12]. To address modern continental-scale water science challenges, the community must come together to share resources and collaborate beyond current paradigms. A scalable, reusable, and extensible research data and analysis infrastructure for continental-scale water models is needed. Workflows that integrate existing national datasets and multiphysics models of the terrestrial water cycle provide a new community strategy for assessing the local impacts of climate and land use change in ungauged watersheds. The macroscopic representation of these processes is typically at scales ranging from sub-meter to hundreds of meters. Data in support of physics-based watershed models at these resolutions exist, but not in a form suited to efficiently carrying out the national-scale numerical computations. Geoscientists may spend weeks or months collecting and harmonizing the digital data necessary for a typical NSF research project, including soils, climate, geology, streams and river networks, etc. The basic data reside on multiple servers with different formats, resolutions, and map projections. The researcher must then decompose the topographic data into a quality numerical mesh and project physical parameters and climate forcing data onto each mesh element (a minimal sketch of this sampling step follows this list). The OSN will provide harvesting and geo-referencing of the national data, development of the data products that support the modeling, and development of the APIs and web services necessary for making the data accessible for desktop download and model implementation. The proposed OSN cyberinfrastructure capacity and capability, integrated with HydroShare, will enable a variety of re-analysis products by capturing storm data, transforming the way hydrologic science is done.

  • Collaborative Gene Matching (Universities of Maine, New Hampshire, Vermont, Delaware, and Rhode Island): Research teams in Maine, New Hampshire, Vermont, Delaware, and Rhode Island engage separately in water sampling, isolation of organisms, and generation of gene sequences. Leveraging high capacity network pathways that have been deployed over the past five years (catalyzed by CC* funding), these research teams have formed a regional partnership to share data and computing resources. When data sharing between sites was introduced, participants found that unknown organisms in one partner’s database were often known organisms in another partner’s database. This exchange of information has accelerated characterization of water samples at all sites. The OSN will facilitate information sharing nationally, adding new partners and linking similar small-scale partnerships.
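
To illustrate the data-harmonization step described in the watershed-modeling item above (projecting national gridded datasets onto a model mesh), the sketch below samples a raster at mesh element centroids with rasterio. The file name, coordinate reference system, and centroid coordinates are hypothetical placeholders.

    # Sketch of projecting a national gridded dataset onto watershed model mesh
    # elements: sample the raster at each element centroid. File names, CRS, and
    # centroid coordinates are hypothetical placeholders.
    import numpy as np
    import rasterio
    from rasterio.crs import CRS
    from rasterio.warp import transform

    def sample_at_centroids(raster_path, centroids_lonlat):
        """Return one raster value per mesh element centroid (lon, lat pairs)."""
        with rasterio.open(raster_path) as src:
            lons = [c[0] for c in centroids_lonlat]
            lats = [c[1] for c in centroids_lonlat]
            # Reproject centroid coordinates into the raster's own CRS.
            xs, ys = transform(CRS.from_epsg(4326), src.crs, lons, lats)
            values = np.array([v[0] for v in src.sample(zip(xs, ys))])
            if src.nodata is not None:
                values = np.where(values == src.nodata, np.nan, values)
        return values

    # Example: attach a soil property to three mesh elements (illustrative coordinates).
    centroids = [(-88.20, 40.10), (-88.21, 40.11), (-88.22, 40.12)]
    soil_values = sample_at_centroids("national_soils_grid.tif", centroids)
    print(soil_values)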