Docker Containers As Machine Learning Environments
(Part One)

Machine Learning Is Difficult. Installing Packages and Managing Environments Should Be Fast and Intuitive.

The computational tasks associated with machine learning (ML) are complex. Building even a simple linear classifier requires acquiring a large amount of labeled data, which must then be cleaned and split into a training set and a test set. Various algorithms are then employed to tune the model parameters to achieve a good fit, and the resulting model is evaluated on how well it predicts labels for the test set.

Another layer of complexity is environment management. A classifier, such as the one described above, may test well on your machine, but that does not mean that it will run at all on a colleague’s computer or in the cloud. The process of managing environments should be transparent and configurable.

While many popular solutions exist (e.g., Anaconda, virtualenv), this guide focuses on using containers to manage environments. It is aimed at ML practitioners who are interested in containers for this purpose. You will need Docker Desktop installed and should be comfortable using a shell.

The benefits of virtualization

As stated above, data scientists must be able to easily switch between application contexts without affecting other projects on the same machine.

Fortunately, virtualization has been around for a long time and provides a mechanism for running many applications (each with its own context) on the same machine. Virtualization simulates hardware architectures with software so that applications built for one architecture can run on another architecture.

Analogy: MagicAirlines

A magic jumbo jet allows any guest pilot to fly by translating the pilot’s actions into equivalent actions on the (host) jumbo jet. So, a turboprop pilot can fly a MagicAirlines jumbo jet without caring about mechanical differences between the two planes.
The pilot has turned on the fasten all seatbelts sign.

Strategies for virtualization, such as virtual machines (VMs) and containers, have been successfully employed in industry for decades. Virtualization is an important concept that will help illustrate what containers are for and what problems they solve.

Are containers like virtual machines?

They are similar. Containers share the host OS and kernel instead of providing their own. This ability to share the kernel makes containers lightweight because they need only the binaries, libraries and application code to run. Virtual machines are typically larger because they contain an entire guest OS in addition to all binaries, libraries and application code. Containers are often described as lightweight and fast in comparison to the heavier, but more secure virtual machine.

A comprehensive comparison of containers vs virtual machines is not in scope for this guide.

For brevity, I present the common view that containers offer better performance while virtual machines are more secure but use more resources[1]. The distinction is not that simple: containers and VMs were developed in tandem, and their features and histories are intertwined[2]. Many vendors attempt to combine the strengths of both approaches (speed, security and portability).

Containers vs container images

Let’s get started with containers. It is important to understand the difference between a container image and a container. You can think of a container image as a complete environment containing everything necessary to run a process. The running process is the container.

An image contains all of the scripts, libraries and application code needed to do a job. A Docker container image is built from a Dockerfile, which provides the instructions to execute when building the image (we will build a complete Dockerfile in part two). The Dockerfile may specify which packages to install and which scripts to run during the build. It can also include a special instruction, called ENTRYPOINT, that sets the main command to execute when the container is started. Images can be shared, and new images can be created by extending existing ones.
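
In the meantime, here is a minimal, hypothetical sketch of the idea; the base image, package version and image name my-ml-env:v1 are purely illustrative:

# Dockerfile (hypothetical): extend the official Python image,
# bake in a pinned dependency, and set python3 as the main command
FROM python:3.9
RUN pip install numpy==1.21.1
ENTRYPOINT ["python3"]

> docker build -t my-ml-env:v1 .   # build the image from the Dockerfile in the current folder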

It is important to note that the container image by itself does nothing. It is essentially a file and is idle. It must be started by the runtime engine to become a container in the “running” state.

A container is a process running in isolation in the context of the container image. It is run by passing the image name to the runtime engine (e.g., Docker). Containers usually start very quickly because all of the packages and scripts were installed as part of the build process. Containers have state and can be started, stopped, paused or killed. They are meant to be ephemeral and stateless by design[3].
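
As a quick preview of that lifecycle (the CLI commands themselves are covered in the sections below), the following starts a long-running container in the background, pauses and resumes it, then stops and removes it. The container name sleepy is just a placeholder:

> docker run -d --name sleepy python:latest python3 -c "import time; time.sleep(600)"
> docker ps              # sleepy is listed with status "Up"
> docker pause sleepy    # freeze the container's process
> docker unpause sleepy  # resume it
> docker stop sleepy     # ask it to exit (SIGTERM, then SIGKILL after a grace period)
> docker rm sleepy       # remove the stopped container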

Why would I use containers for machine learning?

As a concrete example, suppose you are building a model using Python 3.6 and NumPy 1.19.5. Later in the day, you are asked to review a colleague’s Python notebook. You get the code and run it on your machine. Everything works fine until you get to this cell:

import numpy as np
rng = np.random.default_rng()
x = np.arange(24).reshape(3, 8)
y = rng.permuted(x, axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.random._generator.Generator' object has no attribute 'permuted'

You discover that this notebook was written with NumPy 1.21.1; the permuted function was added in v1.20.0. Your version of NumPy (1.19.5) does not have it, which is why you get the error.
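
If you want to confirm which version you have locally, a quick check from your shell (assuming python3 and NumPy are on your path) looks like this:

> python3 -c "import numpy; print(numpy.__version__)"
1.19.5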

It would be nice if you could easily "swap" your dependencies to match your colleague's environment, then swap back.

This is what containers allow you to do: create and share environments. If you and your colleague had decided to utilize containers, running your colleague’s code on your machine would be trivial:

docker pull py-numpy:v1 # colleague's image
docker run -it --rm py-numpy:v1

The first command pulls the image from a repository and the second runs it as a container. Goodbye dependency issues: the code will run exactly as it did on your colleague’s machine.

Getting started with the Docker CLI

To demonstrate how containers are used, we will run a container that does a very simple job. Make sure Docker Desktop is installed and running. We will be using the Docker CLI available in your shell. You can check that the Docker CLI is installed by typing the following command:

> docker --version

The above command should return something like Docker version 20.10.7
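
Note that docker --version only confirms that the CLI is installed. To check that the Docker daemon itself is up (Docker Desktop manages it for you), you can also run:

> docker info   # fails with a "Cannot connect to the Docker daemon" error if it is not running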

Getting a container image

Now that Docker is running, let’s grab a container image from Docker Hub, a registry of container images provided by Docker. It is free and easy to use, and there are thousands of images for you to browse later. Let’s grab the official Python container image.

> docker pull python

You should see that your system is downloading this container image. Once this is done, list the images on your machine by typing:

> docker images

You should see the python image with the tag latest in your list of images.

REPOSITORY                     TAG               IMAGE ID       CREATED       SIZE
python                         latest            59433749a9e3   1 day ago    886MB
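
The latest tag is just the default. Since this guide is all about pinning dependencies, note that you can pull a specific version of an image by naming its tag (3.9 is one of the tags published for the official Python image):

> docker pull python:3.9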

Run the container

Now that you have an image, let’s run it like so:

> docker run -it --rm python:latest

By default, the docker run command attaches your console to the container process’s standard output and error. For interactive processes (e.g., a shell or an interpreter), you must use the -i and -t flags together (-it) so that standard input stays open, a pseudo-TTY is allocated, and signals are passed from your terminal to the container. The --rm option tells Docker to automatically clean up the container and remove its file system when the container exits.

You should see the python interpreter’s command prompt:

[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Try executing some Python code here to see if Python is working:

>>> 4 + 4
8

So, even though you are in your shell, you are connected to the container, which is running an interactive process (the Python interpreter). To exit, press Ctrl+D.
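
If you are curious, you can open a second terminal while the interpreter is still running and list containers. Once you exit, the container disappears from the list because of the --rm flag; without it, the exited container would linger until you removed it yourself:

> docker ps       # while the interpreter is running: shows the python container
> docker ps -a    # after exiting: with --rm, nothing is left behind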

The default container process

How does the container know which command to execute? This is provided by the image’s default command (the CMD). You can find out what the default command of a container image is by using docker inspect as follows:

> docker inspect --format='{{.Config.Cmd}}' python:latest
[python3]

So you can see that the default command for the python image is python3. It can also be overridden on the docker run command line. If we wanted Python to just print the date and then exit, we could pass a python3 command with the code to execute:

> docker run -it --rm python:latest \
  python3 -c "from datetime import datetime; print(datetime.now().ctime())"
  # output: Wed Jun 11 16:29:59 2021

You will notice that the process exits right away. This is because the command running in the container is no longer interactive: the code executes, the process exits, and the container dies.
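
Because nothing interactive happens in a one-shot command like this, you can also drop the -it flags. For example, to print the interpreter version and exit (the exact version string depends on the image you pulled):

> docker run --rm python:latest python3 --version
  # output: something like Python 3.9.6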

Conclusion

While the final example was not very exciting, hopefully it demonstrates the power of containers. In part two of this guide, I’ll show you how to build your own container image with the dependencies you need, how to mount a folder into a container when you run it, and how to share your container image with the world.


  1. Containers vs. Virtual Machines (VMs): What’s the Difference?

  2. Randal, A. (2019). The Ideal Versus the Real: Revisiting the History of Virtual Machines and Containers. arXiv:1904.12226 [cs.NI], Apr 2019.

  3. Best practices for writing Dockerfiles