Collaborative Projects & Pilots
NASA ACCESS to Terra Data Fusion
Terra is the flagship of NASA’s Earth Observing System. Launched in 1999, Terra’s five instruments continue to gather data that enable scientists to address fundamental questions central to the six NASA Earth Science Research Focus Areas. Terra data are among the most popular NASA datasets, serving not only the scientific community but also governmental, commercial, and educational communities.
The strength of the Terra mission has always been rooted in its five instruments and in the ability to fuse their data, which yields higher-quality information for Earth Science than any single instrument can provide alone. As the data volume grows and the central Earth Science questions shift from process-oriented to climate-oriented, the need for data fusion and for scientists to perform large-scale analytics over long records has never been greater. The challenge is particularly acute for Terra, given its growing data volume (more than 1 petabyte), the storage of different instrument data at different archive centers, the different file formats and projection systems employed by each instrument, and the inadequate cyberinfrastructure for scientists to access and process whole-mission fusion data (including Level 1 data). Sharing newly derived Terra products with the rest of the world also poses challenges.
The ACCESS to Terra Data Fusion Products effort aims to resolve two long-standing problems:
- How do we efficiently generate and deliver Terra data fusion products?
- How do we facilitate the use of Terra data fusion products by the community in generating new products and knowledge through national computing facilities, and disseminate these new products and knowledge through national data sharing services?
The effort leverages national facilities and services managed by the National Center for Supercomputing Applications (NCSA), specifically the National Petascale Computing Facility, which houses the Blue Waters supercomputer, and the National Data Service (NDS). The key advantage of leveraging Blue Waters and the NDS for access, use, and distribution of Terra data fusion products and science results is that the Terra data and processing are local, while access and sharing are global. This represents a significant community-element addition to NASA’s system-of-systems infrastructure. ACCESS to Terra Data Fusion Products will initiate the development, access, and delivery of Level 1B radiance Terra Fusion files for the broader community. Level 1B fusion provides the necessary stepping-stone for developing higher-level products and the framework for other flavors of fusion. The project will also deliver enhancements to existing open-source code in the CyberGIS Toolkit for scalable map projections onto any grid for the new Terra Fusion files.
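As an informal illustration of what Level 1B fusion involves at its core, the sketch below resamples synthetic radiances from two instruments, each observed on its own swath geometry, onto one shared latitude/longitude grid. It is a minimal sketch with made-up data and a plain nearest-neighbor resampler, not the project's fusion code or the CyberGIS Toolkit.

```python
# Illustrative sketch only: placing Level 1B radiances from two instruments,
# observed on different native swath grids, onto one shared lat/lon grid,
# the core operation behind a Level 1B fusion file. All data are synthetic.
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

def regrid_radiance(lat, lon, radiance, grid_lat, grid_lon):
    """Nearest-neighbor resampling of swath radiances onto a regular grid."""
    points = np.column_stack([lat.ravel(), lon.ravel()])
    return griddata(points, radiance.ravel(), (grid_lat, grid_lon), method="nearest")

def synthetic_swath(n, offset):
    """Fake swath geolocation and radiance for one instrument."""
    lat = rng.uniform(30.0, 40.0, size=n)
    lon = rng.uniform(-100.0, -90.0, size=n)
    rad = 100.0 + 10.0 * np.sin(lat + offset) + rng.normal(0.0, 1.0, size=n)
    return lat, lon, rad

a_lat, a_lon, a_rad = synthetic_swath(5000, 0.0)   # stands in for one instrument's swath
b_lat, b_lon, b_rad = synthetic_swath(3000, 1.0)   # stands in for another instrument's swath

# Shared 0.05-degree grid covering the same region.
grid_lat, grid_lon = np.meshgrid(np.arange(30.0, 40.0, 0.05),
                                 np.arange(-100.0, -90.0, 0.05), indexing="ij")

fused = {
    "instrument_a": regrid_radiance(a_lat, a_lon, a_rad, grid_lat, grid_lon),
    "instrument_b": regrid_radiance(b_lat, b_lon, b_rad, grid_lat, grid_lon),
}
print({name: arr.shape for name, arr in fused.items()})
```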
NIH BD2K KnowEnG
KnowEnG (pronounced "knowing") is a National Institutes of Health-funded initiative that brings together researchers from the University of Illinois and the Mayo Clinic to create a Center of Excellence in Big Data Computing. It is part of the Big Data to Knowledge (BD2K) Initiative that NIH launched in 2012 to tap the wealth of information contained in biomedical Big Data. KnowEnG is one of 11 Centers of Excellence in Big Data Computing funded by NIH in 2014.
This four-year project will create a platform where biomedical scientists, clinical researchers, and bioinformaticians can bring their own data and perform common as well as advanced analysis tasks, guided by the “knowledge network”, a large compendium of public-domain data. The knowledge network embodies community data on genes, proteins, functions, species, and phenotypes, and relationships among them. Instead of analyzing their data set in an isolated fashion, researchers will be able to go straight to asking global questions. The infrastructure, capacity and tools will grow with the datasets.
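As a rough illustration of how a knowledge network can guide analysis, the sketch below builds a toy graph of genes, proteins, and a phenotype and ranks genes against a user-supplied seed gene using personalized PageRank, one common network-guided prioritization technique. The nodes, edges, and method here are illustrative only and are not KnowEnG's actual data or pipelines.

```python
# Illustrative sketch only: a toy "knowledge network" as a graph, plus a simple
# network-guided gene ranking. Node names and relationships are made up.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("geneA", "protein1"), ("geneB", "protein1"),    # gene-protein relations
    ("geneB", "geneC"),                              # gene-gene interaction
    ("protein1", "phenotypeX"), ("geneD", "phenotypeX"),
])

# A user-supplied gene set "seeds" the walk; high-scoring genes are network neighbors.
seeds = {node: (1.0 if node == "geneA" else 0.0) for node in G}
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)

ranked = sorted((n for n in G if n.startswith("gene")), key=scores.get, reverse=True)
print(ranked)
```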
- https://github.com/KnowEnG-Research
- https://hub.docker.com/u/knowengdev/
- knowdev@lists.illinois.edu (mail list)
NSF DIBBs Whole Tale
Scholarly publications today are still mostly disconnected from the underlying data and code used to produce the published results and findings, despite an increasing recognition of the need to share all aspects of the research process. As data become more open and transportable, a second layer of research output has emerged, linking research publications to the associated data, possibly along with its provenance. This trend is rapidly followed by a new third layer: communicating the process of inquiry itself by sharing a complete computational narrative that links method descriptions with executable code and data, thereby introducing a new era of reproducible science and accelerated knowledge discovery.
In the Whole Tale (WT) project, all of these components are linked and accessible from scholarly publications. The third layer is broad, encompassing numerous research communities through science pathways (e.g., in astronomy, life and earth sciences, materials science, social science), and deep, using interconnected cyberinfrastructure pathways and shared technologies.
The goal of this project is to strengthen the second layer of research output, and to build a robust third layer that integrates all parts of the story, conveying the holistic experience of reproducible scientific inquiry by (1) exposing existing cyberinfrastructure through popular frontends, e.g., digital notebooks (IPython, Jupyter), traditional scripting environments, and workflow systems; (2) developing the necessary 'software glue' for seamless access to different backend capabilities, including from DataNet federations and Data Infrastructure Building Blocks (DIBBs) projects; and (3) enhancing the complete data-to-publication lifecycle by empowering scientists to create computational narratives in their usual programming environments, enhanced with new capabilities from the underlying cyberinfrastructure (e.g., identity management, advanced data access and provenance APIs, and Digital Object Identifier-based data publications). The technologies and interfaces will be developed and stress-tested using a diverse set of data types, technical frameworks, and early adopters across a range of science domains.
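As a rough illustration of what a shareable computational narrative bundles together, the sketch below writes a minimal manifest linking narrative text, executable code, data references, and the environment needed to re-run the analysis. The field names here are hypothetical and are not the Whole Tale manifest format.

```python
# Illustrative sketch only: a minimal "computational narrative" manifest that ties
# together the write-up, code, data references, environment, and identity needed
# to reproduce a result. Field names are hypothetical placeholders.
import json

tale = {
    "title": "Example analysis",
    "narrative": "analysis.ipynb",                    # digital notebook with the write-up
    "code": ["scripts/prepare.py", "scripts/fit.py"],
    "data": [{"doi": "10.5072/example", "path": "data/observations.csv"}],  # test DOI prefix
    "environment": {"image": "jupyter/scipy-notebook", "python": "3"},
    "identity": {"orcid": "0000-0000-0000-0000"},     # placeholder identifier
}

with open("tale.json", "w") as f:
    json.dump(tale, f, indent=2)
```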
- http://wholetale.org/
- https://opensource.ncsa.illinois.edu/confluence/display/WT
- https://opensource.ncsa.illinois.edu/jira/projects/WT
NIST IN-CORE
The new NIST-funded Center of Excellence for Risk-Based Community Resilience Planning will collaborate with NIST toward its long-term goal of developing tools that individual communities can use to assess their resilience, including evaluating the effectiveness of alternative measures intended to improve performance and minimize post-disaster disruption and recovery time. These tools will improve decision-making so that communities can build a “business case” for the measures they take. The centerpiece of the center’s effort will be NIST-CORE, the NIST-Community Resilience Modeling Environment. NIST-CORE will be built on the Ergo software (http://ergo.ncsa.illinois.edu/), developed at NCSA for hazard assessment, response, and planning. Ergo is already used around the world, and according to NCSA’s Danny Powell, the collaboration with NIST will further expand the functionality and applications currently available through the software platform. The National Data Service consortium, of which NCSA is a founding member, will also be part of the project, working with NIST-CORE developers and researchers on data publishing.
NIST Materials Data Facility
The Materials Data Facility (MDF) is a collaboration between Globus at the University of Chicago, the National Center for Supercomputing Applications (NCSA-UIUC), and the Center for Hierarchical Materials Design (CHiMaD), a NIST-funded center of excellence.
MDF is developing key data services for materials researchers with the goal of promoting open data sharing, simplifying data publication and curation workflows, encouraging data reuse, and providing powerful data discovery interfaces for data of all sizes and sources. Specifically, MDF services will allow individual researchers and institutions to 1) publish large research datasets under flexible policies; 2) publish data directly from local storage, institutional data stores, or cloud storage, without third-party publishers; 3) build extensible domain-specific metadata and automated metadata-ingestion scripts for key data types; 4) develop publication workflows; 5) register a variety of resources for broader community discovery; and 6) search, interrogate, and build upon existing published data through a discovery model.
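To make the metadata and discovery pieces concrete, the sketch below shows what a domain-specific metadata record for an MDF-style publication and a simple discovery query might look like. The field names and query syntax here are hypothetical and are not MDF's actual schema or API.

```python
# Illustrative sketch only: a hypothetical domain-specific metadata record for a
# published materials dataset, plus an example discovery query string.
import json

record = {
    "title": "Al-Cu alloy tensile tests",
    "authors": ["Jane Researcher"],
    "material": {"composition": "Al 96%, Cu 4%"},
    "measurement": {"type": "stress-strain", "temperature_K": 298},
    "files": [{"globus_endpoint": "ncsa#mdf", "path": "/published/alcu/run1.csv"}],
    "license": "CC-BY-4.0",
}
print(json.dumps(record, indent=2))

# A discovery service would let researchers interrogate such records, e.g.:
query = "material.composition:Cu AND measurement.type:stress-strain"
```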
- https://www.materialsdatafacility.org/
- http://matsci.registry.nationaldataservice.org/
- http://nist.registry.nationaldataservice.org
- http://bipm.registry.nationaldataservice.org
- http://mgi.registry.nationaldataservice.org:8181/ (Materials Resource Registry)
- https://trial.publish.globus.org/ (Globus data publication trial instance)
- http://petrel.alcf.anl.gov/ (Petrel data service at the Argonne Leadership Computing Facility)
- https://trello.com/b/lmDf7NDa/materials-data-facility (Issue tracker)
- http://ndswiki.ncsa.illinois.edu/MaterialsMetadataDevelopment (Metadata)
- Globus endpoints: ncsa#mdf (141.142.208.128), ncsa#mdf-publish (141.142.193.28)
- Team Resources
NSF Midwest Big Data Hub
The nation faces increasing challenges in collecting, managing, serving, mining, and analyzing rapidly growing and increasingly complex data and information collections to create actionable knowledge and guide decision-making. All sectors of society are profoundly impacted and need novel solutions that leverage the breadth of expertise in academia, industry, and government. To address this need, a diverse and committed network of partners has created a nimble and flexible regional Midwest Big Data Hub (MBDH), responding to Big Data challenges and capturing special opportunities, interests, and resources unique to the Midwest.
iSEE Plants in silico
As the Earth’s population climbs toward 9 billion by 2050 — and the world climate continues to change, affecting temperatures, weather patterns, water supply, and even the seasons — future food security has become a grand world challenge. Accurate prediction of how food crops react to climate change will play a critical role in ensuring food security. An ability to computationally mimic the growth, development and response of plants to the environment will allow researchers to conduct many more experiments than can realistically be achieved in the field. Designing more sustainable crops to increase productivity depends on complex interactions between genetics, environment, and ecosystem. Therefore, creation of an in silico — computer simulation — platform that can link models across different biological scales, from cell to ecosystem level, has the potential to provide more accurate simulations of plant response to the environment than any single model could alone.
As a leader in plant biology, crop sciences and computer science, Illinois is uniquely positioned to head this initiative. Developments in high-performance computing, open-source version-controlled software, advanced visualization tools, and functional knowledge of plants make achieving the concept realistic. The interdisciplinary Plants in silico team will take advantage of resources in the National Center for Supercomputing Applications (NCSA) and the Institute for Sustainability, Energy, and Environment (iSEE) — and its academic and research expertise in plant biology, crop sciences, and bioengineering — to build a user-friendly platform for plant scientists around the globe who are working on the food security challenge.
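To make the idea of linking models across scales concrete, the sketch below couples a simplified leaf-level photosynthesis model to a canopy-level model that integrates leaf assimilation over canopy layers with exponential light extinction. The equations and parameter values are generic placeholders, not Plants in silico model code.

```python
# Illustrative sketch only: coupling two models at different biological scales,
# where a leaf-level photosynthesis model feeds a canopy-level model.
import numpy as np

def leaf_photosynthesis(par, a_max=25.0, k=0.005):
    """Leaf CO2 assimilation (umol m-2 s-1) as a saturating function of light (PAR)."""
    return a_max * (1.0 - np.exp(-k * par))

def canopy_assimilation(par_top, lai, k_ext=0.5, layers=10):
    """Scale the leaf model over canopy layers using exponential light extinction."""
    total = 0.0
    for i in range(layers):
        cum_lai = lai * (i + 0.5) / layers            # leaf area above the middle of layer i
        par_layer = par_top * np.exp(-k_ext * cum_lai)
        total += leaf_photosynthesis(par_layer) * (lai / layers)
    return total                                      # canopy assimilation per ground area

print(canopy_assimilation(par_top=1500.0, lai=3.0))
```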
- http://sustainability.illinois.edu/research/climate-solutions/plants-in-silico-project/
- https://github.com/rachelshekar/Cis_Repository
- https://github.com/rachelshekar/Psi-old-repository
- cis@lists.illinois.edu (mail list)
- Team Resources
ARPA-E TERRA
Phenotypes are measurable features of a plant that indicate how it will grow and respond to stresses such as heat, drought, and pathogens. Breeding is currently limited by the speed at which phenotypes can be measured and by the information that can be extracted from those measurements. Today, measurements used to predict yield include gauging leaf thickness with a caliper or height with a meter stick. More sophisticated instruments used to quantify plant architecture, carbon uptake, water use, and root growth do not scale to the thousands or tens of thousands of individual plants that need to be evaluated in a breeding program. TERRA-REF will develop an integrated phenotyping system for energy sorghum that leverages genetics and breeding, automation, remote plant sensing, genomics, and computational analytics.
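As one concrete example of turning sensor data into a phenotype, the sketch below computes NDVI and a fractional canopy-cover estimate from synthetic red and near-infrared reflectance arrays; a production phenotyping pipeline would run comparable analytics automatically over imagery of thousands of plots. The data and threshold here are illustrative stand-ins.

```python
# Illustrative sketch only: extracting one simple phenotype (fractional canopy
# cover via NDVI) from a multispectral plot image. Arrays are synthetic stand-ins
# for red and near-infrared reflectance bands.
import numpy as np

rng = np.random.default_rng(42)
red = rng.uniform(0.05, 0.3, size=(100, 100))    # red reflectance for one plot
nir = rng.uniform(0.2, 0.6, size=(100, 100))     # near-infrared reflectance

ndvi = (nir - red) / (nir + red + 1e-9)          # normalized difference vegetation index
canopy_cover = float((ndvi > 0.4).mean())        # fraction of pixels flagged as vegetation

print(f"mean NDVI = {ndvi.mean():.2f}, canopy cover = {canopy_cover:.2%}")
```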
- http://terraref.org/
- https://terraref.ncsa.illinois.edu/
- https://www.youtube.com/watch?v=Pp6IdkPtFC8&feature=youtu.be
- http://141.142.208.144/clowder/ (Production Clowder Instance)
- http://141.142.209.122/clowder/
- https://github.com/terraref (Repository)
- https://github.com/terraref/computing-pipeline/issues (Issue tracker)
- https://opensource.ncsa.illinois.edu/jira/issues/?jql=labels%20%3D%20TERRA (Issue tracker)