
Polar Geospatial Center (PGC)

At the University of Minnesota, we chose the PGC as an example of a science-driven use case. The Polar Geospatial Center is an NSF/GEO/PLR-funded national center (PLR-1559691) that provides geospatial and logistical support for the NSF-funded polar science community. PGC is also the NSF liaison with the National Geospatial-Intelligence Agency’s (NGA) sub-meter commercial imagery program and currently holds a global collection of sub-meter imagery of approximately 3.2 PB, comprising 7.8 million scenes covering an area of over 2 billion km2, that is growing at a rate of 2-10 TB daily. PGC also works with NGA to coordinate the tasking of 3 satellites over much of the Poles to address NSF PI science goals. The recent collaboration between PGC, NSF, and NGA has lowered the cost of sub-meter commercial imagery from tens of thousands of dollars to pennies per square kilometer. The imagery is now provided to PGC at no cost, with the expectation that PGC and NSF provide the infrastructure to retrieve, maintain, and transfer it to the science community. This increased access to a rich dataset fundamentally changes how federally funded research can be done. Almost any researcher would find this imagery useful for applications as broad as coastline erosion, land use/land cover change, the surface expression of earthquakes, and ice mass balance. PGC provides imagery to NSF-funded researchers working globally in three forms: raw imagery; value-added imagery, including orthorectified images, image mosaics, and digital elevation models; and geospatial web services.

Creating High Resolution Digital Elevation Models for the Poles: PGC and its partners at the Ohio State University (OSU) and Cornell University have been funded to produce a high-resolution, publicly available elevation model for the entire Arctic. This project was initiated by the White House to support the United States’ chairmanship of the Arctic Council for 2015–2017. The Arctic Council is an intergovernmental forum that promotes cooperation, coordination, and interaction among the Arctic states, communities, and inhabitants. As part of the U.S. chairmanship, the Office of Science and Technology Policy (OSTP) set a goal of producing a high-resolution Arctic Digital Elevation Model through collaboration between NSF and NGA. The DEMs will be provided at no cost to the science community and the public at large, and will fulfill the United States’ commitment to its Arctic partners. The ArcticDEM project adapts PGC’s DEM production capabilities from small-area, on-demand production to systematically processing and mosaicking the entire Arctic sub-meter stereo commercial imagery archive. The resources for obtaining and maintaining the imagery, tools for post-processing the results, software development, and distribution methods for the DEM have been secured. Through a 19 million node hour (600 million core hour) Petascale Computing Resource Allocation grant (ACI-1614673) from the NSF, PGC and its partners will use the Blue Waters HPC infrastructure at NCSA to process NGA commercial stereo imagery into an elevation model of the entire Arctic landmass poleward of 60°N, extended south to include all of Alaska, Greenland, and Kamchatka. This ArcticDEM will be constructed from stereo-photogrammetric DEMs extracted from pairs of sub-meter resolution WorldView satellite imagery and vertically registered using ground control from GPS-surveyed points, coordinated airborne LiDAR surveys by NASA’s Operation IceBridge, and historical elevations from the NASA ICESat mission.
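For illustration, the vertical registration step amounts to estimating and removing a vertical bias between a photogrammetric DEM and trusted control heights. Below is a minimal Python sketch of that idea, assuming hypothetical NumPy arrays for a DEM tile and for control points sampled at known pixel locations; it is not PGC's production pipeline, just the basic operation.

```python
import numpy as np

def vertical_register(dem, control_rows, control_cols, control_heights):
    """Shift a DEM tile so it agrees with ground-control heights.

    dem               : 2-D array of elevations in meters (NaN where no data)
    control_rows/cols : pixel indices of GPS / LiDAR / ICESat control points
    control_heights   : trusted elevations in meters at those points
    Returns the bias-corrected DEM and the estimated vertical offset.
    """
    diff = dem[control_rows, control_cols] - control_heights
    # Robust statistic so outliers (clouds, water, blunders) do not dominate.
    offset = np.nanmedian(diff)
    return dem - offset, offset

# Toy example: a synthetic 1 km x 1 km tile at 2 m posting with a 3.2 m bias.
rng = np.random.default_rng(0)
dem = rng.normal(350.0, 5.0, size=(500, 500)) + 3.2
rows = rng.integers(0, 500, size=50)
cols = rng.integers(0, 500, size=50)
truth = dem[rows, cols] - 3.2          # pretend these are surveyed heights
corrected, offset = vertical_register(dem, rows, cols, truth)
print(f"estimated vertical offset: {offset:.2f} m")
```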

Enabling high-resolution imagery and terrain web services: Once DEMs are extracted from stereo pairs and imagery is processed into orthorectified scenes and mosaics, the most useful service to NSF researchers is to make the data available through web services that transparently deliver it into standard GIS and remote sensing software. The data can be continuously updated by PGC as quality improves or more current imagery becomes available, so scientists would not have to repeatedly download new versions of the topography.
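As a sketch of what such a service looks like from the client side, the snippet below issues a standard OGC WMS GetMap request with the Python requests library. The endpoint URL, layer name, and bounding box are hypothetical placeholders rather than an existing PGC service.

```python
import requests

# Hypothetical endpoint and layer; a real PGC/ArcticDEM service would publish its own.
WMS_URL = "https://example.org/arcticdem/wms"

params = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "LAYERS": "arcticdem_hillshade",        # hypothetical layer name
    "CRS": "EPSG:3413",                     # polar stereographic north
    "BBOX": "-400000,-400000,400000,400000",
    "WIDTH": "1024",
    "HEIGHT": "1024",
    "FORMAT": "image/png",
}

resp = requests.get(WMS_URL, params=params, timeout=60)
resp.raise_for_status()
with open("arcticdem_tile.png", "wb") as f:
    f.write(resp.content)
```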

  • We plan to investigate how satellite imagery can be efficiently delivered to NCSA Blue Waters for processing and how the generated 3D model data can be stored in a distributed environment.
  • We will investigate how the proposed distributed storage system will enable efficient access by NSF researchers to the imagery and terrain data.

Observational Data from Astronomy

JHU is hosting the database for the Sloan Digital Sky Survey (SDSS) project, often called the “Cosmic Genome Project”. The data set was one of the first examples of a large, open scientific data set and has been in the public domain for over a decade [Sky02, Sza00]. It is fair to say that the project and its archive have changed astronomy forever. It has shown that a whole community is willing to change its traditional approach and use a virtual telescope, if the data is of high quality and if it is presented in an intuitive fashion. The continuous evolution and curation of the system over the years has been an intense effort, and has given us a unique perspective on the challenges involved in the broader problem of operating open archival systems over a decade.

The SDSS is special – it was the first major open e-science archive. Tracking its usage, as well as the changes that have occurred, will help the whole science community understand the long-term curation of such data sets and see what common lessons can be derived for other disciplines facing similar challenges. These archives not only serve flat files and simple digital objects, but also present complex services. Recently we have introduced a scripting interface on top of the data, where IPython/Jupyter notebooks can extract data from a virtual data container holding over 100 TB of calibrated flat-file data integrated with the database, and can run, e.g., machine learning scripts over a subset of this data selected by a database query. While we provide access at JHU to about 100 virtual machines, some computations may touch a large fraction of the 70 TB image repository, with over 6 million images. To produce a timely result, such analyses should be performed on a more powerful machine. With the right infrastructure, the 70 TB could be transmitted on demand to NCSA in less than two hours and the analysis run in a short time. The data can also be streamed there just-in-time.
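The last scenario (select a subset by database query, then run an analysis over it) looks roughly like the sketch below. It uses the public astroquery SDSS interface and scikit-learn as stand-ins for the JHU scripting environment described above; the query and the clustering step are purely illustrative.

```python
import numpy as np
from astroquery.sdss import SDSS
from sklearn.cluster import KMeans

# Select a small photometric subset with a database query (illustrative SQL).
sql = """
SELECT TOP 5000 p.u, p.g, p.r, p.i, p.z
FROM PhotoObj AS p
WHERE p.type = 3 AND p.clean = 1
"""
table = SDSS.query_sql(sql)

# Work on color indices derived from the five SDSS magnitudes.
colors = np.column_stack([
    table["u"] - table["g"],
    table["g"] - table["r"],
    table["r"] - table["i"],
    table["i"] - table["z"],
])

# A simple machine-learning step over the selected subset.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(colors)
print(np.bincount(labels))
```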

LSST: The Large Synoptic Survey Telescope: The flagship project of ground-based US astronomy for the next 15 years is the LSST project. Its dedicated telescope in Chile will produce over 100 PB of raw data, with thousands of scans of the sky. It will reproduce the sky coverage of the SDSS (collected over 8 years) in less than 4 days, opening up the time dimension on an unprecedented scale. The data will be transferred to NCSA for processing into a Tier 0 archive. The science collaborations are organized into many different working groups according to their specific science interests. These working groups will produce a variety of topic-specific, value-added data sets that will be used for subsequent specialized analyses (cosmology, gravitational lensing, quasars, stellar systems, asteroids, variable stars, etc.). The relevant subsets of the raw data need to be moved around the US to the various collaborations (Tier 1 groups) and then shared with an even broader community to do the final analyses (Tier 2 centers). All this data movement is not part of the core MREFC project and will require a data infrastructure (not necessarily a dedicated one). As the LSST is representative of the data-intensive resources of the future, it is an excellent case to use in our prototypes. We will use these projects to explore on-demand movement of large observational data sets and their analyses. Particular projects we have in mind:

  1. Use the calibrated SDSS data to build a new set of false-color images with custom parametrization and processing, e.g., build a large mosaic where all objects identified as stars are removed and replaced by blank sky, to create an image of what the extragalactic sky would look like.

  2. Take a significant dynamic subset of the 3 million SDSS spectra, defined by a SQL query, transmit the spectral data to NCSA, and perform a massively parallel machine learning task to classify the given subset of spectra to a much higher resolution. E.g., perform a large-scale PCA, then find the small number of regions in the spectra that contribute the most to the given classification, similar to the so-called Lick indices found heuristically a few decades ago (see the sketch after this list).

  3. Take the data cubes from the SDSS MaNGA instrument (an integral field unit, IFU), which takes spectra of every coarse pixel of a given galaxy, together with all multicolor images of the same object, then reconstruct the data cube to the best of both the imaging and spectral resolutions using compressive sensing. This is an incredibly compute-intensive task, only possible on a supercomputer.

  4. Take a subset of the simulated LSST data and stream it to multiple locations, on demand.
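As a minimal sketch of the PCA step in item 2, the snippet below works on a hypothetical array of spectra already resampled onto a common wavelength grid and reports the wavelength regions with the largest component loadings, the analogue of the heuristic Lick indices. The real task would of course run in parallel over millions of spectra at NCSA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: spectra resampled onto a common wavelength grid.
rng = np.random.default_rng(0)
wavelengths = np.linspace(3800.0, 9200.0, 4000)            # Angstroms
spectra = rng.normal(1.0, 0.05, size=(2000, wavelengths.size))

# PCA over the (mean-subtracted) spectra.
pca = PCA(n_components=5)
coeffs = pca.fit_transform(spectra)

# For each principal component, list the wavelength regions carrying the
# largest absolute loadings; these play the role of heuristic "indices".
for k, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[-5:]
    print(f"PC{k}: strongest wavelengths ~ {np.sort(wavelengths[top]).round(1)} A")
```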

Cross Domain Need for Dataset Publication (NDS, NCSA, SDSC, ...)

MHD Turbulence in Core-Collapse Supernovae
Authors: Philipp Moesta, Christian Ott
Size: 90 TB

The dataset is a series of snapshots in time from 4 ultra-high-resolution 3D magnetohydrodynamic simulations of rapidly rotating stellar core collapse. The 3D domain for all simulations is in quadrant symmetry with dimensions 0 < x, y < 66.5 km, -66.5 km < z < 66.5 km. It covers the newly born neutron star and its shear layer with uniform resolution. The simulations were performed at 4 different resolutions (500 m, 200 m, 100 m, and 50 m). There are a total of 350 snapshots over the simulated time of 10 ms, with 10 variables capturing the state of the magnetofluid. For the highest-resolution simulation, a single 3D output variable at a single time is ~26 GB in size; the entire dataset is ~90 TB. The highest-resolution simulation used 60 million CPU hours on Blue Waters. The dataset may be used to analyze the turbulent state of the fluid and to perform analyses going beyond the published results in Nature (doi:10.1038/nature15755).

HathiTrust Research Center dataset (mirror)
Stephen Downie

Size: 658 TB

HathiTrust Research Center (HTRC) is the research arm of the HathiTrust Digital Library (HTDL). HTDL contains over 14.7 million volumes comprising over 5.1 billion pages, with a total size of 658 terabytes. The HTDL, one of the largest publicly accessible collections of cultural knowledge in the world, acts as a “photocopy” of some of the great research libraries in the U.S., and as such has both wide and deep relevance to many fields, from Astronomy to Zoology, Homer to Hawking. Approximately 61% of the texts (i.e., ~3.1 billion pages) are under copyright and cannot be distributed outside of HTRC. HTRC is thus mandated to provide non-consumptive computational research access to this portion of the collection; its chief challenge is providing cutting-edge computational access to this world-class collection of cultural and scientific knowledge while rigorously adhering to copyright restrictions and ensuring the security of the collection. As HPC and HTC computing go hand-in-hand with much of HTRC’s work, having this data near computational resources benefits the user community. Current users include digital humanists conducting large-scale mining of the HTDL; computational linguists conducting natural language processing (NLP) experiments; and economists and historians of science tracking technology diffusion and innovation over the centuries represented in the corpus.

CARMA dataset
Athol Kemball

Size: 50 TB

UIUC/NCSA was a founding member of the Berkeley-Illinois-Maryland (BIMA) millimeter array consortium and its successor, the Combined Array for Research in Millimeter-wave Astronomy (CARMA). The CARMA consortium included UIUC, UMd, UCB, Caltech, and U. Chicago. NCSA/UIUC serves as the primary archive site and user access center for these two datasets. The CARMA array ceased operations in 2015, primarily due to the advent of the Atacama Large Millimeter/submillimeter Array (ALMA); however, both CARMA and BIMA data remain very useful to the millimeter-interferometry community as both a scientific and technical repository. The datasets total approximately 50 TB and contain visibility data in custom binary formats, images in the standard FITS format, and auxiliary data in XML and ASCII text-file formats. The data can be processed using community data analysis packages such as MIRIAD and CASA. The point of contact for the archive is Athol Kemball (NCSA).
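As a small illustration of working with the FITS portion of the archive, the snippet below opens a FITS image with astropy (a community package used here only as an example, not named above); the filename is a placeholder.

```python
from astropy.io import fits

# Placeholder filename; actual archive files would be retrieved from NCSA.
with fits.open("carma_image_example.fits") as hdul:
    hdul.info()                       # list the HDUs in the file
    image = hdul[0].data              # primary image array
    header = hdul[0].header
    print(header.get("OBJECT"), image.shape if image is not None else None)
```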

National Optical Astronomy Observatory
Athol Kemball

Size: 500 TB - 1 PB

These data constitute the public archive of the National Optical Astronomy Observatory (NOAO), acquired by the suite of public optical telescopes and instruments operated by this national center. They have the potential to be critical for precursor survey science analysis in advance of LSST, in partnership with NOAO; they are primarily in FITS format and can be reduced with a variety of community software systems. These data would not be served to the community from NCSA, but would form the basis of a critical research collaboration with NOAO in advance of LSST.

Coupled and Uncoupled CESM Simulations with a 25km Atmosphere
Authors: Ryan Sriver, Hui Li
Size: ~75 TB

This data is from a suite of high-resolution climate model simulations using the Community Earth System Model (CESM, http://www.cesm.ucar.edu/models/). It consists of three multi-decadal, pre-industrial control simulations featuring different coupling configurations: a) a 25 km atmosphere with specified surface ocean temperatures; b) a 25 km atmosphere coupled to a non-dynamic slab ocean; and c) the 25 km atmosphere coupled to a fully dynamic 3-dimensional ocean model with 1-degree horizontal resolution. The output represents the first phase of a comprehensive model sensitivity and climate change experiment focusing on the representation of weather and climate extremes in CESM. The data has broad applications in a wide range of research areas related to climate change science, Earth system modeling, uncertainty quantification, extreme events, decision support, and risk analysis. The current total output is around 100 TB in netCDF format. The data includes gridded monthly, daily, and 6-hourly outputs of key climate/weather variables for the atmosphere, land, ocean, and sea-ice models.
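For a sense of how such output is typically consumed, the sketch below opens one netCDF file with xarray and computes a simple extreme-value statistic. The filename is a placeholder, and TREFHT (the CESM near-surface air temperature field) is used as an assumed example variable.

```python
import xarray as xr

# Placeholder filename; the archive would expose many such files per configuration.
ds = xr.open_dataset("cesm_25km_atm_daily.nc")

# Assumed example variable: daily near-surface air temperature (time, lat, lon).
t2m = ds["TREFHT"]

# Example analysis: annual maximum at every grid point, a common starting
# point for studies of temperature extremes.
annual_max = t2m.groupby("time.year").max("time")
print(annual_max)
```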

Critical Zone Observatories
Praveen Kumar

Size: 25 TB - 40 TB

Funded by the National Science Foundation's Earth Sciences division, the Critical Zone Observatories (CZO) collect data for the study of the chemical, physical, geological, and biological processes that shape the Earth's surface and support most terrestrial life on the planet. The program was created to research what scientists call Earth's Critical Zone, the porous "near surface layer" that extends from the tops of the trees down to the deepest water sources. By researching how organisms, rock, air, water, and soil interact, scientists hope to better understand natural habitats as well as concerns like food availability and water quality. Understanding these complex interactions, especially in light of global warming, requires researchers from a wide range of disciplines, including geosciences, hydrology, microbiology, ecology, soil science, and engineering. Further, a diversity of datasets is needed, spanning disciplines, spatial and temporal scales, in situ sensors, field instruments, remote sensing, and models. There are 10 CZO sites funded to date: Boulder Creek, Calhoun, Christina River Basin, Eel River, Intensively Managed Landscapes (IML), Jemez River Basin & Santa Catalina Mountains, Luquillo, Reynolds Creek, Susquehanna Shale Hills, and Southern Sierra. The combined data acquired from these 10 sites constitute the CZO dataset.

NASA TERRA Satellite (Mirror)
Larry Di Girolamo
Size: ?

...

ARPA-E TERRA-REF
David LeBauer
Size: ?

...

NIST Materials Data Facility
Ian Foster
Size: 20 TB

...
