Container Analysis Environments Participants

Below are brief descriptions of the participating groups and projects along with presentation titles/topics when available.  Please let me know if I've made any mistakes or feel free to correct/add detail as you see fit.

Projects

SciServer

(Jai Won Kim, Gerard Lemson)

SciServer Compute supports interactive and batch access to multiple large public datasets across several domains (including the Sloan Digital Sky Survey) via containers. It provides RStudio/Jupyter/MATLAB interactive environments and a custom job scheduler for containers, each with supporting scripting libraries for integrating SciServer components and data. SciServer Compute supports SSO, user-defined group access to shared storage, and access to centralized datasets, some of which are held in relational databases.
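
As a rough illustration of the kind of container launch a platform like SciServer Compute performs, the sketch below uses the Docker SDK for Python to start a per-user Jupyter container with shared storage mounted. The image name, volume path, port handling, and token are placeholders, not SciServer's actual configuration.

    # Minimal sketch (not SciServer's code): start a per-user Jupyter container
    # with a shared group storage volume mounted, using the Docker SDK for Python.
    import docker

    client = docker.from_env()

    container = client.containers.run(
        "jupyter/scipy-notebook",          # placeholder image
        detach=True,
        ports={"8888/tcp": None},          # let Docker pick a free host port
        volumes={
            "/srv/shared/groupA": {        # hypothetical shared group storage
                "bind": "/home/jovyan/shared",
                "mode": "rw",
            }
        },
        environment={"JUPYTER_TOKEN": "example-token"},  # stand-in for SSO integration
    )
    print(container.id)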

See also:

CyVerse (Ian McEwen)

The CyVerse Discovery Environment (DE) uses containers (via Docker) to support customizable, non-interactive, reproducible workflows using data stored in the CyVerse Data Store, which is based on iRODS. The DE also supports using Agave to run analyses in HPC environments, which may themselves be containerized. The team is actively working on enabling interactive jobs through the same system, as well as a "bring your own compute" option for users with access to their own computational resources, all made more flexible and reproducible through containerization.
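
For a sense of how a containerized DE step might pull its inputs, here is a sketch using the python-irodsclient to read a file from an iRODS data store. The host, zone, credentials, and path are all placeholders, not the CyVerse Data Store's actual endpoints.

    # Minimal sketch (placeholder host, zone, and path): read an input file
    # from an iRODS data store, as a containerized analysis step might do.
    from irods.session import iRODSSession

    with iRODSSession(host="data.example.org", port=1247,
                      user="anonymous", password="", zone="examplezone") as session:
        obj = session.data_objects.get("/examplezone/home/shared/example/input.fasta")
        with obj.open("r") as f:
            data = f.read()
    print(len(data), "bytes read")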

See also:

TACC (John Fonner)

TACC has installed Singularity container support on all of its HPC systems and is working with BioContainers to make 2,400+ BioConda applications findable and accessible at TACC or on any HPC system that supports Singularity. The goal is to support all BioConda packages across all CyVerse infrastructure; this already works using Docker on the CyVerse Condor cluster, and Singularity will provide a solution for other HPC systems.
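
A hedged sketch of the basic pattern, running a BioContainers tool through Singularity by pulling the corresponding Docker image. The image name/tag below is illustrative only, not a specific published tag.

    # Minimal sketch: run a BioContainers tool via Singularity from Python.
    # The image tag is illustrative; real BioContainers tags are version-specific.
    import subprocess

    image = "docker://quay.io/biocontainers/samtools:latest"  # hypothetical tag
    result = subprocess.run(
        ["singularity", "exec", image, "samtools", "--version"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)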

Whole Tale, yt.Hub, RSL, Data Exploration Lab

(Matt Turk, Kacper Kowalik)

The yt Hub provides access to very large datasets (both observational and simulation-based) via the integration of Girder and Jupyter Notebook/Lab. The entire dataset is mounted locally on the compute nodes of a Docker Swarm cluster via NFS, but the physical location of the data is abstracted through a FUSE filesystem, which makes it possible to expose only the subset of data selected by the user inside the container running Jupyter Notebook/Lab.
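
As a rough illustration of the Girder side of this architecture, the sketch below uses the girder_client Python library to list a folder and download a selected item; the API URL, API key, and IDs are placeholders, and this is not the yt Hub's actual selection code.

    # Minimal sketch (placeholder URL, key, and IDs): select a subset of data
    # from a Girder server, roughly the kind of selection exposed to a user's
    # Jupyter container.
    import girder_client

    gc = girder_client.GirderClient(apiUrl="https://girder.example.org/api/v1")
    gc.authenticate(apiKey="MY_API_KEY")

    folder_id = "HYPOTHETICAL_FOLDER_ID"      # folder of simulation outputs
    for item in gc.listItem(folder_id):
        print(item["_id"], item["name"])

    gc.downloadItem("HYPOTHETICAL_ITEM_ID", "/tmp/selected_item")  # fetch one selection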

The basic architecture of the yt Hub (Girder plus a remote environment with data selection) is currently being extended as part of the Whole Tale project, which provides (among other things) the ability to launch containerized applications over a wide variety of *remote* datasets (e.g., via DataONE). The project addresses the complexity of exposing data to containers via a variety of underlying mechanisms (POSIX, S3, HTTP, Globus, etc.) through a data management framework. In contrast to the yt Hub, data is provided inside the computing environment on demand using a sync mechanism and a local cache, rather than being served locally. Containers also play a role in the provenance/preservation of scientific workflows and in the publication process.
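
The sketch below illustrates the general "fetch on demand, serve from a local cache" idea in plain Python; it is only a stand-in for the concept, not Whole Tale's data management framework, and the URL and cache path are hypothetical.

    # Minimal sketch of the general idea (not Whole Tale's implementation):
    # fetch a remote object on first access, serve it from a local cache after.
    import os
    import hashlib
    import urllib.request

    CACHE_DIR = "/tmp/wt-cache"            # hypothetical local cache location

    def fetch(url):
        os.makedirs(CACHE_DIR, exist_ok=True)
        cached = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
        if not os.path.exists(cached):     # first access: download and cache
            urllib.request.urlretrieve(url, cached)
        return cached                      # later accesses hit the local copy

    path = fetch("https://example.org/dataset/part-0001.h5")
    print(path)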

The Renaissance Labs project will leverage this same approach to provide access to the Renaissance Simulations at SDSC, adding the ability to move analysis to HPC resources along with a custom UI.

TERRA-REF

(David LeBauer, Max Burnette)

The TERRA-REF project provides access to a large reference dataset for plant biology. It supports interactive access to the data via the NDS Labs Workbench (see below), which allows users to deploy customized container environments near the data. It also uses containers for the data processing pipeline, which runs on a combination of VMs and the ROGER cluster.

See also:

Blue Waters (Greg Bauer)

The Blue Waters supercomputer now supports containers through NERSC's Shifter. Container support was added for three use cases (LHC/ATLAS, LIGO, LSST/DES). Technical challenges include containerizing applications, handling permissions, storage/IO, and MPI.

See also:

LSST/DES (Matias Kind)

The Dark Energy Survey (DES) project uses a Blue Waters allocation (not containerized) to process raw images and generate catalog data. The DES Labs service provides access to the resulting catalog and image data for over 500 collaborators. All services (DESCut, Jupyterhub, web client) run in containers on Nebula using Kubernetes. Catalog data (1 PB) is available via Oracle, and shared storage is provided via NFS/Cinder.

The Large Synoptic Survey Telescope (LSST) is planning to use a Kubernetes cluster for production (reliable, highly available services). With many collaborators, containerization of pipelines offers many benefits (isolation, deployment at scale). LSST will similarly offer Jupyterhub for managing notebooks.
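
For readers unfamiliar with this style of deployment, here is a sketch using the official Kubernetes Python client to create a small Deployment, the kind of containerized service described above. The service name, image, and namespace are placeholders, not DES Labs or LSST configuration.

    # Minimal sketch (placeholder names and image): create a Deployment with
    # the Kubernetes Python client.
    from kubernetes import client, config

    config.load_kube_config()              # or config.load_incluster_config()

    container = client.V1Container(
        name="catalog-api",                          # hypothetical service
        image="example/des-catalog-api:1.0",         # hypothetical image
        ports=[client.V1ContainerPort(container_port=8080)],
    )
    spec = client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "catalog-api"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "catalog-api"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    )
    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="catalog-api"),
        spec=spec,
    )
    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)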

See also:

LIGO/OSG (Roland Haas, Eliu Huerta)

Title: BOSS-LDG: A Novel Computational Framework that Brings Together Blue Waters, Open Science Grid, Shifter and the LIGO Data Grid to Accelerate Gravitational Wave Discovery

Abstract: We present a novel computational framework that connects Blue Waters, the NSF-supported, leadership-class supercomputer operated by NCSA, to the Laser Interferometer Gravitational-Wave Observatory (LIGO) Data Grid via Open Science Grid technology. To enable this computational infrastructure, we configured, for the first time, a LIGO Data Grid Tier-1 Center that can submit heterogeneous LIGO workflows using Open Science Grid facilities. In order to enable a seamless connection between the LIGO Data Grid and Blue Waters via Open Science Grid, we utilize Shifter to containerize LIGO’s workflow software. This work represents the first time Open Science Grid, Shifter, and Blue Waters are unified to tackle a scientific problem and, in particular, it is the first time a framework of this nature is used in the context of large scale gravitational wave data analysis. This new framework is designed to run the most computationally demanding gravitational wave search workflows on Blue Waters and accelerate discovery in the emergent field of gravitational wave astrophysics. We discuss the implications of this novel framework for a wider ecosystem of Higher Performance Computing users. 

NDS

(Craig Willis, Mike Lambert)

Presentation: NDS Labs Workbench and DataDNS

Description: Brief presentation of the Labs Workbench service and how it's being used to support in-place analysis of data, development, and education/training environments. We'll also introduce the DataDNS initiative.

The NDS Labs Workbench is a generic platform, built on Kubernetes, for launching containerized environments near remote datasets. Labs Workbench is deployed on OpenStack as a Kubernetes cluster with GlusterFS providing a shared user filesystem across containers (e.g., the home directory). Workbench is used by the TERRA-REF project and increasingly for training/education environments (hackathons, workshops, bootcamps, etc.). The DataDNS project is an emerging vision for supporting access to remote computational environments; Workbench is one optional component of the DataDNS framework.
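
A hedged sketch of the shared-filesystem aspect: launching a per-user pod with a persistent volume claim mounted as the home directory, via the Kubernetes Python client. The pod name, image, and claim name are placeholders, not Workbench's actual manifests.

    # Minimal sketch (placeholder names): a per-user environment pod with a
    # shared-filesystem volume mounted as the home directory.
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name="user-env-alice"),
        spec=client.V1PodSpec(
            containers=[client.V1Container(
                name="notebook",
                image="jupyter/minimal-notebook",   # placeholder image
                volume_mounts=[client.V1VolumeMount(
                    name="home", mount_path="/home/jovyan")],
            )],
            volumes=[client.V1Volume(
                name="home",
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="shared-home"),      # hypothetical PVC backed by GlusterFS
            )],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)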

See also:

CyberGIS

(Yan Liu, Jeff Terstriep)

The CyberGIS project recently developed CyberGIS-Jupyter to integrate cloud-based Jupyter notebooks with HPC resources. The project adopts Jupyter notebooks instead of web GIS as the front-end interface for both developers and users. Advanced GIS capabilities are provided in a pre-configured containerized environment. The system also supports on-demand provisioning to deploy multiple instances of gateway applications.
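
As an illustration of spawning notebooks into a pre-built containerized environment, here is a minimal jupyterhub_config.py sketch using DockerSpawner; the image name and paths are placeholders and this is not CyberGIS-Jupyter's actual configuration.

    # Minimal jupyterhub_config.py sketch (not CyberGIS's configuration):
    # spawn each user's notebook server inside a pre-built containerized
    # GIS environment via DockerSpawner.
    c = get_config()  # provided by JupyterHub when loading the config file

    c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
    c.DockerSpawner.image = "example/cybergis-notebook:latest"  # hypothetical image
    c.DockerSpawner.remove = True            # clean up containers when servers stop
    c.DockerSpawner.notebook_dir = "/home/jovyan/work"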

See also:

SDSC (Andrea Zonca)

Presentation: Jupyterhub + Singularity on HPC

Description of an experimental deployment of Jupyterhub with Globus authentication on the SDSC Cloud OpenStack, with notebooks spawning on Comet compute nodes via Singularity. Users should be able to choose a provided container or bring their own. Covers use cases of Jupyterhub in HPC for teaching, research, and Science Gateways, as well as Jupyterhub with Docker Swarm on the SDSC Cloud OpenStack for interactive analysis in support of HPC jobs on Comet.
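
A hedged jupyterhub_config.py sketch of the general pattern (illustrative only, not the SDSC deployment): Globus OAuth for authentication and a Slurm batch spawner whose job script launches the single-user server inside a Singularity image. Client credentials, partition, and image path are placeholders.

    # Minimal jupyterhub_config.py sketch (not the SDSC deployment): Globus
    # authentication plus a Slurm batch spawner that wraps the single-user
    # server in Singularity.
    c = get_config()  # provided by JupyterHub when loading the config file

    c.JupyterHub.authenticator_class = "oauthenticator.globus.GlobusOAuthenticator"
    c.GlobusOAuthenticator.client_id = "CLIENT_ID"          # placeholder
    c.GlobusOAuthenticator.client_secret = "CLIENT_SECRET"  # placeholder
    c.GlobusOAuthenticator.oauth_callback_url = "https://hub.example.org/hub/oauth_callback"

    c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"
    c.SlurmSpawner.batch_script = """#!/bin/bash
    #SBATCH --partition=compute
    #SBATCH --time=01:00:00
    singularity exec /share/images/datascience-notebook.img {cmd}
    """  # image path is a placeholder; {cmd} is filled in by batchspawner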

See also:

Coastal Resilience Collaboratory (Jian Tao)

The CRC is developing an integrated, coupled modeling framework for the coastal modeling community to facilitate the deployment of complex models on cloud and cloud-like architectures with negligible performance overhead. The Coastal Model Repository (CMR) enables quick deployment of coastal models and their working environments on such architectures, serving as a community repository for precompiled open source models widely used by coastal researchers. CMR distributes containerized coastal models via Docker Hub.
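
A rough sketch of the usage pattern this enables: pull a containerized model from Docker Hub and run it against a local case directory. The image name, model command, and paths are hypothetical, not actual CMR images.

    # Minimal sketch (hypothetical image, command, and paths): pull a
    # containerized coastal model from Docker Hub and run it in batch mode.
    import docker

    client = docker.from_env()
    client.images.pull("example/coastal-model", tag="latest")   # hypothetical CMR image

    logs = client.containers.run(
        "example/coastal-model:latest",
        command=["mpirun", "-np", "4", "run_model"],             # illustrative command
        volumes={"/data/case01": {"bind": "/case", "mode": "rw"}},
        working_dir="/case",
        remove=True,
    )
    print(logs.decode())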