
Below are brief descriptions of the participating groups that might be useful as you consider your presentation topics.  Please let me know if I've made any mistakes or feel free to correct/add detail as you see fit.

Projects and descriptions

SciServer

(Kim, Lemson)

SciServer Compute supports interactive and batch access to multiple large public datasets across several domains (including the Sloan Digital Sky Survey) via containers. It provides RStudio/Jupyter/MATLAB interactive environments and a custom job scheduler for containers, each with supporting scripting libraries for SciServer component and data integration. SciServer Compute supports SSO, user-defined group access to shared storage, and access to centralized datasets, some held in relational databases.
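To make the scheduling idea concrete, here is a minimal sketch of a FIFO job queue for container jobs. This is a hypothetical illustration, not SciServer's actual scheduler: the class and field names are invented, and a real implementation would launch containers, enforce quotas, and mount shared storage.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ContainerJob:
    user: str
    image: str      # e.g. a Jupyter, RStudio, or MATLAB environment image
    command: str
    status: str = "queued"

class FifoScheduler:
    """Toy first-in-first-out scheduler for container jobs."""
    def __init__(self):
        self.queue = deque()
        self.finished = []

    def submit(self, job):
        self.queue.append(job)

    def run_next(self):
        if not self.queue:
            return None
        job = self.queue.popleft()
        # A real scheduler would launch the container here (e.g. via a
        # container runtime API) and track it until completion.
        job.status = "done"
        self.finished.append(job)
        return job
```

A batch system like this would sit behind the interactive front ends, draining user-submitted jobs in arrival order.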


Cyverse

(McEwan, Fronner)

The Cyverse Discovery Environment (DE) uses containers to support customizable, non-interactive workflows for data stored in Cyverse, running on HTCondor. They are also working to support interactive access. Through Atmosphere, Cyverse supports provisioning cloud resources on demand for researchers, with access to HPC resources through TACC.

TACC has installed Singularity container support on all of its HPC systems and is working with BioContainers to make 2,400+ BioConda applications findable and accessible at TACC or on any HPC system that supports Singularity. The end goal of these efforts is to support all BioConda packages across all Cyverse infrastructure. This already works using Docker on the Cyverse Condor cluster and will also provide a solution for other HPC systems using Singularity.
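As a sketch of what "findable and accessible" looks like in practice: BioContainers publishes BioConda builds as images under the `quay.io/biocontainers` namespace, which Singularity can pull directly via `docker://` URIs. The helper below only builds the command line (it does not run anything), and the example tag is hypothetical — real tags encode a version and build string that must be looked up per package.

```python
def biocontainer_uri(package, tag):
    # BioContainers images for BioConda packages live under
    # quay.io/biocontainers; the tag is package-specific.
    return "docker://quay.io/biocontainers/%s:%s" % (package, tag)

def singularity_exec(package, tag, args):
    # Construct (but do not execute) a `singularity exec` command line
    # that runs a tool from a BioContainers image.
    return ["singularity", "exec", biocontainer_uri(package, tag)] + list(args)
```

On a cluster with Singularity installed, the resulting argument list could be handed to a job script or subprocess call.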


Whole Tale, yt.Hub, RSL, Data Exploration Lab

(Turk, Kowalik)

The yt Hub provides access to very large datasets (both observational and simulation-based) via the integration of Girder and Jupyter Notebook/Lab. All available data is mounted locally on the compute nodes of a Docker Swarm cluster via NFS. The physical location of the data is abstracted through a FUSE filesystem, which makes it possible to expose only the subset of data selected by the user inside the container running Jupyter Notebook/Lab.
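The subset-exposure idea can be illustrated without FUSE: the full catalog stays on the server side, and the container's view is restricted to the user's selection. This is a toy stand-in with invented names, not the yt Hub's actual filesystem layer.

```python
class SubsetView:
    """Expose only a user-selected subset of a larger data catalog,
    mimicking what the FUSE layer does for the container's mount."""
    def __init__(self, catalog, selection):
        self.catalog = catalog      # full server-side catalog: path -> bytes
        self.selection = selection  # paths the user selected

    def listdir(self):
        # The container only ever "sees" the selected paths.
        return sorted(p for p in self.catalog if p in self.selection)

    def read(self, path):
        if path not in self.selection:
            # Unselected data is invisible, not merely access-denied.
            raise FileNotFoundError(path)
        return self.catalog[path]
```

A real FUSE implementation would answer kernel filesystem calls the same way: directory listings and reads are filtered against the selection before touching the NFS-mounted data.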

The basic architecture of the yt Hub (Girder plus a remote environment with data selection) is currently being extended as part of the Whole Tale project, which provides (among other things) the ability to launch containerized applications over a wide variety of *remote* datasets (e.g., via DataONE). Whole Tale addresses the complexity of exposing data to containers via a variety of underlying mechanisms (POSIX, S3, HTTP, Globus, etc.) through a data management framework. In contrast to the yt Hub, data is provided inside the computing environment on demand using a sync mechanism and a local cache, rather than being served locally. Containers also play a role in the provenance/preservation of scientific workflows and in the publication process.
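The on-demand-with-local-cache pattern can be sketched as a fetch-through cache. This is an illustration of the general technique, not Whole Tale's implementation; in their framework the `fetch` callable would be backed by one of the underlying transfer mechanisms (POSIX, S3, HTTP, Globus, etc.).

```python
class CachingFetcher:
    """Fetch remote objects on first access; serve repeats from a
    local cache, so data moves only when the container touches it."""
    def __init__(self, fetch):
        self.fetch = fetch      # callable: remote path -> bytes
        self.cache = {}
        self.fetch_count = 0    # how many remote transfers actually happened

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.fetch(path)
            self.fetch_count += 1
        return self.cache[path]
```

The payoff is that a notebook re-reading the same file repeatedly costs one remote transfer, which is what makes remote datasets usable interactively.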

The Renaissance Labs project will leverage this same approach to provide access to the Renaissance Simulations at SDSC – adding the ability to move analysis to HPC resources and adding a custom UI.

TERRA-REF

(LeBauer, Burnette)

The TERRA-REF project provides access to a large reference dataset for plant biology. It supports interactive access to the data via the NDS Labs Workbench (see below), which allows users to deploy customized container environments near the data. Containers are also used for the data processing pipeline, which runs on a combination of VMs and the ROGER cluster.

Blue Waters

The Blue Waters supercomputer now supports containers through NERSC's Shifter. Container support was added for three use cases (LHC/ATLAS, LIGO, LSST/DES). Technical challenges include containerizing applications, handling permissions, storage/IO, and MPI.


LSST/DES

(Kind)

This project is one of the drivers for the Blue Waters Shifter implementation. They also use Docker/Kubernetes for a variety of other services, including DES Labs.


NDS 

(Willis, Lambert, Coakley)

The NDS Labs Workbench is a generic platform for launching containerized environments near remote datasets, leveraging Kubernetes. Labs Workbench is deployed on OpenStack as a Kubernetes cluster with GlusterFS for a shared user filesystem across containers (e.g., home directory). Workbench is used by the TERRA-REF project and increasingly for training/education environments (hackathons, workshops, bootcamps, etc). The DataDNS project is an emerging vision for supporting access to remote computational environments. Workbench is a single optional component of the DataDNS framework.
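To illustrate the "per-user environment near the data" pattern, here is a function that builds a minimal Kubernetes pod manifest with a GlusterFS-backed home volume. The structure follows the Kubernetes v1 Pod spec, but the names, image, and GlusterFS endpoint/path values are placeholders — this is not Workbench's actual manifest.

```python
def user_pod_manifest(user, image):
    """Build an illustrative Kubernetes Pod spec for a per-user
    containerized environment with a shared-filesystem home mount."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "workbench-%s" % user, "labels": {"user": user}},
        "spec": {
            "containers": [{
                "name": "env",
                "image": image,
                "volumeMounts": [
                    {"name": "home", "mountPath": "/home/%s" % user},
                ],
            }],
            "volumes": [{
                "name": "home",
                # Workbench backs home directories with GlusterFS; the
                # endpoints/path values here are hypothetical.
                "glusterfs": {
                    "endpoints": "glusterfs-cluster",
                    "path": "homes/%s" % user,
                },
            }],
        },
    }
```

Serialized to YAML or JSON, a manifest like this is what a platform submits to the Kubernetes API to place a user's environment on the cluster.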


CyberGIS

(Liu, Terstriep)

The CyberGIS project recently developed CyberGIS-Jupyter to integrate cloud-based Jupyter notebooks with HPC resources. The project adopts Jupyter notebooks instead of web GIS as the front-end interface for both developers and users. Advanced GIS capabilities are provided in a pre-configured containerized environment. The system also supports on-demand provisioning to deploy multiple instances of gateway applications.


SDSC

(Zonca)

SDSC deploys JupyterHub with Docker Swarm and batch spawner support in HPC environments, serving science gateways, research, and training/education.
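A batch spawner routes each user's single-user notebook server to the HPC scheduler instead of a local process. The fragment below shows what such a `jupyterhub_config.py` could look like using the open-source batchspawner project's Slurm spawner; the partition name and runtime are assumptions, and this is not SDSC's actual configuration.

```python
# Illustrative jupyterhub_config.py fragment (assumes the batchspawner
# package is installed alongside JupyterHub).
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# Hypothetical resource requests passed into the generated sbatch script:
c.SlurmSpawner.req_partition = "compute"   # placeholder partition name
c.SlurmSpawner.req_runtime = "02:00:00"    # wall-clock limit for the session
```

With this in place, logging into the hub submits a batch job, and the notebook server starts on whichever compute node the scheduler allocates.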

