Module 3: Containers and Reproducible Builds

Module Overview

"Works on my machine" is a common state of code developed by people lacking in software engineering background. It must be reproducible for code (and science) to work.

We've already learned about pipenv as a Python packaging tool, which goes a long way towards giving reproducible builds - but for even greater reproducibility (and deployability), containers are the tool of choice. A container is a minimal virtual operating system, complete with all the software needed to run the desired application. Because they pack everything together, they are identical to run regardless of host.

Docker is a common standard and tool for containers, and we will use it to build and run Linux containers with Python code.

Learning Objectives

Objective 01 - Launch Docker containers and execute programs on them

Overview

The purpose of containers is to run code, reliably and reproducibly. Even something as simple as “Hello World!”, achieved identically and independently of platform, is a remarkable and powerful thing.

What is a container? Just something that holds other things - in the context of computation, a system that holds programs. The difference between a container and the computer you're using right now is that a container is abstracted and virtualized - it is independent of the hardware and (external) operating system it runs on.

Follow Along

Once installed, Docker “Hello World!” is as simple as:

docker run hello-world

Try it out! You should see something like:

docker run hello-world

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
ca4f61b1923c: Pull complete
Digest: sha256:ca0eeb6fb05351dfc8759c20733c91def84cb8007aa89a5bf606bc8b315b9fc7
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
...

But what's really happening?

But the somewhat subtle and crazy thing here is that this message is the single purpose of an entire operating system, contained and reproduced exactly (byte-for-byte) whenever and wherever you run it.

So, while “Hello World!” may not exactly pay the bills, the general idea that it demonstrates is clearly powerful. And it's up to us to use that power to make our code reliable and reproducible.

Challenge

Read up on Docker documentation, and execute something on a Docker container beyond “Hello World!”

Additional Resources

Objective 02 - Create and customize a Dockerfile to build a basic custom container

Overview

Running containers that other people make is useful - there are a lot of premade containers out there. But the real power of Docker is in customizing your own container, and running your own reproducible code in a variety of environments.

To customize Docker, you must write a Dockerfile - a text file “recipe” that specifies the container/Linux distribution you are basing your container on, and then add additional environment setup steps.

Follow Along

Example MVP Dockerfile for Python:

FROM debian

### So logging/io works reliably w/Docker
ENV PYTHONUNBUFFERED=1
### UTF Python issue for Click package (pipenv dependency)
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
### Need to explicitly set this so `pipenv shell` works
ENV SHELL=/bin/bash

### Basic Python dev dependencies
RUN apt-get update && \
  apt-get upgrade -y && \
  apt-get install python3-pip curl -y && \
  pip3 install pipenv
  1. Put the above in a Dockerfile
  2. docker build . -t python
  3. Wait (may take awhile, especially if it's your first container)
  4. docker run -it python /bin/bash

You're now in a reproducible environment with pipenv! When you're done, exit - but note that if you run it again it will actually be a new copy of the container, i.e. you don't see what you did in the earlier version of the container.

If you want to reuse a single container, read up on docker restart and docker attach - but the idea of a container always being clean is actually quite powerful, as that is where reproducible builds come from. With a more elaborate Dockerfile (that specifies an actual package to install and run), you can have a contained and reproducible app that you know runs the same everywhere.

Challenge

Build on the Dockerfile above - add to the RUN command to install specific useful packages and execute them. You may have to read up on docker run, in particular the -p argument to expose a port from the Docker container to localhost - this lets you run a Jupyter notebook server inside Docker and access it from your external operating system.

To see how deep the rabbit hole goes, check out Docker Compose.

Additional Resources

Guided Project

In this guided project, we'll learn how to create Docker containers for reproducible Python environments. Open guided-project.md in the GitHub repository below to follow along with the guided project.

Module Assignment

For this assignment, you'll create Docker containers and Dockerfiles to demonstrate your understanding of containerization and reproducible builds.

Solution Video

Additional Resources

Docker Fundamentals

Advanced Tools

Machine Learning with Docker