What are we trying to do?
We would like a consistent environment for developing and deploying applications across devices so that, in theory, the code runs everywhere and just works - not "it works on my machine". Different devices may have different packages installed at different versions, and we don't want the application to break because we updated the system. We may also not want to configure servers on a development machine. We need a way to repeatably specify what a system should look like in order to run the application; we could then build such a system on any device and run the application inside it without worrying about what's outside it. We would also like to introduce separation between applications for security reasons - sandboxing
Operating system
An OS is fundamentally a resource manager, which allocates resources such as processor time, memory, files, or device access to different processes. The kernel is the actual program loaded by the bootloader which provides these services.
Virtualization and emulation
- Emulation: Software which imitates the behaviour of a piece of hardware
- Virtualization: Logical abstraction of the hardware
One approach would be to run multiple kernels, with a different environment running in each. The problem is that the kernel has special access to the CPU. When an x86 based system starts, the processor is in real mode and the kernel has full control of the hardware; it uses privileged instructions and registers that are not available in the unprivileged mode it switches to in order to run userspace code. This separation, in theory, limits what applications can do to interfere with each other or the system. It obviously causes a problem for trying to run two kernels at the same time: the second one also needs that privileged access, and even then it would overwrite the setup done by the first.
One way of addressing this is emulation: we go a step up and write a program which acts like a processor, then use it to interpret the binary files. The problem is that this is slow. An improvement is to natively execute the parts of the code which can safely run unprivileged, and replace the privileged instructions with a different set of instructions which emulates just that behaviour in our virtualization program. This is the difference between emulation and virtualization: virtualization is just an abstraction and doesn't require the hardware to be emulated. More recently, processor features like VT-x added instructions for entering a virtual execution mode which the guest kernel sees as running with full privileges.
Chroot
Another approach would be to only have one kernel running but try to separate the project from the rest of the system.
The chroot command changes the apparent root directory for the current process and its children. It is useful for trying to repair a system from a bootable drive and is also used to package software for a distribution's repositories
sudo debootstrap --variant=minbase jammy . http://archive.ubuntu.com/ubuntu/
This command fetches the Ubuntu jammy base system and writes the files to the current directory.
sudo chroot .
This command only changes the root path so that / now refers to what was the current directory.
We can still access other resources, such as processes and network devices:
ps -a
apt install iproute2 -y
ip addr
At first it may look like we've at least hidden the filesystem, we can't ls or cd above the root, but we can get around that.
apt install python3
>>> import os
>>> os.getcwd()
'/'
>>> os.listdir()
['proc', 'home', 'mnt', 'usr', 'sbin', 'media', 'srv', 'boot', 'opt', 'bin', 'lib32', 'libx32', 'tmp', 'root', 'run', 'lib64', 'lib', 'dev', 'etc', 'sys', 'var']
If we are currently in the root directory of the chroot, it looks as you would expect. But what if we chroot again?
>>> os.chroot('home')
>>> os.getcwd()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory
>>> os.listdir()
['proc', 'home', 'mnt', 'usr', 'sbin', 'media', 'srv', 'boot', 'opt', 'bin', 'lib32', 'libx32', 'tmp', 'root', 'run', 'lib64', 'lib', 'dev', 'etc', 'sys', 'var']
We use /home as an existing directory we can change the root to. Our current working directory did not change - this is also why we couldn't use the chroot command itself, which rather than just changing the root runs a command in the new root and returns. We are now at an invalid path, above the root, but we can still list it. Can we go higher?
>>> os.chdir('..')
>>> os.listdir()
['jail']
Since our working directory is outside the new root, we are no longer constrained to stay within it. Note that this particular method must be run from a root shell, otherwise we wouldn't be able to perform the chroot into the subdirectory.
Containers
A container is a sandboxed group of processes that are isolated from the host system. In the early 2000s Linux added namespaces, cgroups (control groups), and seccomp (secure computing mode), which are the main functionalities containers are built on, though they weren't really feature complete until around 2013.
Namespaces provide the ability to create separate sets of resources (of supported types) for different processes. For example, creating a new PID namespace for a process allows it and its children to have and interact with PID numbers separate from the system running in the default namespace.
Cgroups allow the kernel to limit resources, such as processor, memory, disk bandwidth, etc... for a group of processes.
Seccomp allows the kernel to restrict a process's access to system calls.
With the ability to create a separate set of common resources, limit resource usage, and block access to system calls which manipulate resources we don't have namespaces for, we have everything we need to isolate a set of processes from the rest of the system.
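As a rough sketch of namespaces in action, the unshare tool from util-linux can create them directly (run as root; --mount-proc remounts /proc so process listings only show the new PID namespace):
sudo unshare --pid --fork --mount-proc bash
ps aux   # only shows this bash and ps itself, with bash as PID 1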
Container vs virtualization
Every running container shares the same OS kernel with the host. This makes it easy to share resources between them, since they are all running within the same resource manager. In contrast, virtualization requires creating an emulated hardware interface the guest can use to access the outside world. While this has improved, it can still be a pain to do something as simple as sharing files with the guest. The higher overhead of virtualization also reduces performance.
The main downsides of containers are that, since they share the kernel, a kernel exploit can lead to a container escape, and that only resources which have namespace support can be isolated from the host and other containers. For example, time and the kernel keyring do not have namespaces. By default Docker blocks the relevant syscalls so that the container remains isolated; if they were allowed, changes to these resources would affect the host and other containers.
Images and containers
While a container is a running group of processes, an image is a filesystem (similar to our chroot) to use as the environment as well as other metadata such as the user to run as, working directory, environment variables, etc... A container can be thought of as a running instance of an image.
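For example, we can list the images stored locally and the containers currently running from them:
docker image ls
docker ps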
Running a container
- We can run a container using the official ubuntu image
docker run ubuntu
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
2ab09b027e7f: Pull complete
Digest: sha256:67211c14fa74f070d27cc59d69a7fa9aeff8e28ea118ef3babc295a0428a6d21
Status: Downloaded newer image for ubuntu:latest
- Notice that it doesn't find an image called ubuntu locally, so it pulls it from Docker Hub
- It then just exited - by default no input is attached, so the default command (bash) exits immediately
- We can run it interactively which will connect our shell's standard streams to the container
docker run -i ubuntu
ls
bin  boot  dev  etc  home  lib  lib32  lib64  libx32  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
- But we don't get a shell prompt because bash is not aware it's running in a terminal
- Docker can be told to set up a pseudo-terminal with the -t flag
docker run -it ubuntu
root@a6b978a183b9:/# ls
bin  boot  dev  etc  home  lib  lib32  lib64  libx32  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
- You can see that you are running as the root user in the container and the hostname is the container id
- If you look around the file system you will see that it's different to your system
- There is also only 1 process running - bash
root@d0d4b391e433:/# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4624  3612 pts/0    Ss   04:32   0:00 /bin/bash
root           9  0.0  0.0   7060  1600 pts/0    R+   04:38   0:00 ps aux
Deleting containers
- Stopped containers are still around and can be started again
- -a will list stopped containers, by default only running containers are listed
docker ps -a
CONTAINER ID   IMAGE     COMMAND       CREATED          STATUS          PORTS     NAMES
d0d4b391e433   ubuntu    "/bin/bash"   31 seconds ago   Up 30 seconds             hardcore_euler
- We can delete containers we're done with using docker rm
- This works with either the container name or ID
docker rm d0d4b391e433
- Note that you need to stop a container before it can be deleted
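For example, using the auto-generated name from the listing above:
docker stop hardcore_euler
docker rm hardcore_euler
Alternatively, docker rm -f will stop and remove a running container in one step.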
Pruning containers
If you have multiple stopped containers that you want to clean up you can use prune, which will delete every stopped container
docker container prune
Naming containers
We can name our container to make it easier to refer to later. Names must be unique - we cannot run a new container with the same name before deleting the old one.
docker run -it --name=ubuntu-container ubuntu
Attaching and detaching from containers
We can use ctrl-p followed by ctrl-q to detach the terminal's streams from the container. We can reattach with
docker attach ubuntu-container
Starting and stopping containers
We can stop a container with docker stop and restart it with docker start. This is not a suspend - docker will send a SIGTERM to the container's main process, followed by a SIGKILL if it doesn't stop after a grace period.
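For example, with the named container from earlier - the -a and -i flags of docker start reattach the output and input streams:
docker stop ubuntu-container
docker start -ai ubuntu-container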
Executing a shell in a container
We can run a command in a running container with docker exec
docker exec -it ubuntu-container bash
Publishing ports
- Let's try running a server in the container
root@d0d4b391e433:/# apt update && apt install ncat
root@d0d4b391e433:/# ncat -l 1337
- And try connecting to it from the host
ncat 127.0.0.1 1337
Ncat: Connection refused.
- The container has a network namespace with its own interfaces, routing tables, firewall rules, etc...
- We need to publish ports for them to be accessible from outside the container
- Format: ip:host_port:container_port
- If the IP isn't specified, the ports will be published on every interface
- If the host port isn't specified, the container port will be mapped to a random host port
docker run -itp 1337:1337 ubuntu
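We can also bind to a specific address - for example, publishing only on loopback and using an arbitrary host port such as 8080:
docker run -itp 127.0.0.1:8080:1337 ubuntu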
Mounting
- We can mount part of the filesystem into the container to share files with it
- -v for volume
- -v /host_path:/container_path
docker run -itv ~/:/volume ubuntu
- Note that the slash after the tilde is required to make a valid tilde prefix - see tilde expansion in the bash manual
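Bind mounts also accept options after a further colon - for instance, to mount a hypothetical ~/project directory read-only inside the container:
docker run -itv ~/project:/volume:ro ubuntu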
Creating an image
To tell docker how to create an image we write a Dockerfile, which is a series of instructions for docker to follow to create the image.
Create a file called Dockerfile
The Dockerfile must start by inheriting from an existing image
FROM ubuntu
We can run commands to install dependencies
RUN apt update && apt install -y ncat
It's recommended to not run your application as root in the container
RUN useradd -ms /bin/bash app
USER app
We copy the application files into the container
COPY . /app
We can set the directory further commands should run in
WORKDIR /app
Document a port that should be published
EXPOSE 1337
And specify a command to run when the container starts
CMD ncat -l 1337
We can build the image using
docker build . -t test-image
[+] Building 48.8s (10/10) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 188B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:latest 0.0s
=> CACHED [1/5] FROM docker.io/library/ubuntu 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 32B 0.0s
=> [2/5] RUN apt update && apt install -y ncat 47.5s
=> [3/5] RUN useradd -ms /bin/bash app 0.6s
=> [4/5] COPY . /app 0.1s
=> [5/5] WORKDIR /app 0.1s
=> exporting to image 0.4s
=> => exporting layers 0.4s
=> => writing image sha256:d63525149978da6ac92d55fbdfde54f1be80aee2ce77ba542e7ecb45595776af 0.0s
=> => naming to docker.io/library/test-image
-t specifies a tag we can use to refer to the image later, rather than having to query the generated id
We can then run our image as before, note that EXPOSE in the dockerfile doesn't actually publish the port
docker run -itp 1337:1337 --name=test-container test-image
There are two different forms of the CMD instruction. Shell form
CMD ncat -l 1337
And exec form
CMD ["ncat", "-l", "1337"]
Shell form will be run with /bin/sh -c and will also interpolate environment variables.
Exec form will treat the first value as the program to run followed by arguments to pass.
If an entrypoint is defined with the ENTRYPOINT instruction then CMD in exec form will pass the values as arguments to that instead. This allows an image to specify e.g. a server to start with ENTRYPOINT and default arguments using CMD which can be overridden by the user.
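A minimal sketch of that pattern, reusing the ncat setup from above:
ENTRYPOINT ["ncat"]
CMD ["-l", "1337"]
Running docker run test-image -l 8080 would then replace the default arguments while keeping ncat as the program that is executed.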
Layers and caching
Images are stored as layers, each of which stores the changes made relative to the previous layer. Image layers are read-only; containers add another filesystem layer which is writable by that container. If a file from the image is modified, it is copied up into the container layer.
Each instruction in the Dockerfile creates a new layer, which can contain filesystem and metadata changes. Metadata includes the user to run as, working directory, environment variables, etc... We can see the layers using the following command
docker image history test-image
IMAGE CREATED CREATED BY SIZE COMMENT
d63525149978 About an hour ago CMD ["/bin/sh" "-c" "ncat -l 1337"] 0B buildkit.dockerfile.v0
<missing> About an hour ago EXPOSE map[1337/tcp:{}] 0B buildkit.dockerfile.v0
<missing> About an hour ago WORKDIR /app 0B buildkit.dockerfile.v0
<missing> About an hour ago COPY . /app # buildkit 149B buildkit.dockerfile.v0
<missing> About an hour ago USER app 0B buildkit.dockerfile.v0
<missing> About an hour ago RUN /bin/sh -c useradd -ms /bin/bash app # b… 334kB buildkit.dockerfile.v0
<missing> About an hour ago RUN /bin/sh -c apt update && apt install -y … 47MB buildkit.dockerfile.v0
<missing> 7 weeks ago /bin/sh -c #(nop) CMD ["/bin/bash"] 0B
<missing> 7 weeks ago /bin/sh -c #(nop) ADD file:c8ef6447752cab254… 77.8MB
<missing> 7 weeks ago /bin/sh -c #(nop) LABEL org.opencontainers.… 0B
<missing> 7 weeks ago /bin/sh -c #(nop) LABEL org.opencontainers.… 0B
<missing> 7 weeks ago /bin/sh -c #(nop) ARG LAUNCHPAD_BUILD_ARCH 0B
<missing> 7 weeks ago /bin/sh -c #(nop) ARG RELEASE 0B
Docker uses the layers to cache the build process: when building, docker only recreates the changed layer and all the layers after it. If we re-run the build you can see that the layers were found in the cache.
docker build . -t test-image
[+] Building 0.1s (10/10) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 188B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:latest 0.0s
=> [1/5] FROM docker.io/library/ubuntu 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 32B 0.0s
=> CACHED [2/5] RUN apt update && apt install -y ncat 0.0s
=> CACHED [3/5] RUN useradd -ms /bin/bash app 0.0s
=> CACHED [4/5] COPY . /app 0.0s
=> CACHED [5/5] WORKDIR /app 0.0s
=> exporting to image 0.0s
=> => exporting layers 0.0s
=> => writing image sha256:d63525149978da6ac92d55fbdfde54f1be80aee2ce77ba542e7ecb45595776af 0.0s
=> => naming to docker.io/library/test-image
Docker will rerun a step if its line in the Dockerfile changes; in the case of ADD and COPY it also checks the hashes of the files being copied into the container and will re-copy if they have changed. Also note that this cache is shared between images and is not keyed on the Dockerfile's location or the image tag.
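For example, with the test-image build above, changing any file in the build context invalidates the COPY layer and everything after it, while the earlier RUN layers stay cached:
touch newfile
docker build . -t test-image   # [2/5] and [3/5] stay CACHED, [4/5] COPY onwards is re-run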
Deploying a node app
Let's create a basic express server
Make a new node project with
npm init
Install express
npm install express
Add the following to src/index.js
const express = require("express");
const server = express();
server.get("/", (req, res) => {
res.send("Hello world");
});
server.listen(3000);
Add a start script to package.json
{
"name": "app",
"version": "1.0.0",
"description": "",
"main": "index.js",
"scripts": {
"start": "node old/index.js"
},
"author": "",
"license": "ISC",
"dependencies": {
"express": "^4.18.2"
}
}
Check that the server is working by running
npm start
And visiting 127.0.0.1:3000 in a web browser
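Or from a terminal, assuming curl is installed:
curl 127.0.0.1:3000   # should respond with "Hello world"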
Now let's create a basic Dockerfile
FROM node
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]
docker build . -t nodeapp
docker run -itp 3000:3000 nodeapp
This works, but we can improve it.
Firstly, we are using a larger image than necessary, and have not specified a version which means that next time we build we could get a different result if there's an update.
Let's switch to the alpine3.17 tag
FROM node:alpine3.17
WORKDIR /app
COPY . .
RUN npm install
CMD ["npm", "start"]
We shouldn't run the process inside the container as root. The node image creates a user called node which we can switch to. We also need to chown the copied files so they are owned by the node user, which we can do with COPY's --chown flag.
FROM node:alpine3.17
USER node
WORKDIR /app
COPY --chown=node:node . .
RUN npm install
CMD ["npm", "start"]
We should also stop copying the node_modules folder into the container. We can add a .dockerignore file which contains patterns specifying files that should not be copied.
echo 'node_modules/' > .dockerignore
Next, notice that every time we change the source file we need to reinstall the dependencies, since we invalidated the layer cache.
echo -e "\n" >> src/index.js
docker build . -t nodeapp
[+] Building 2.7s (9/9) FINISHED
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 110B 0.0s
=> [internal] load metadata for docker.io/library/node:alpine3.17 0.9s
=> [1/4] FROM docker.io/library/node:alpine3.17@sha256:cc4e8f3d78a276fa05eae1803b6f8cbb43145441f54c828ab14e0c19dd95c6fd 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 27.77kB 0.0s
=> CACHED [2/4] WORKDIR /app 0.0s
=> [3/4] COPY . . 0.1s
=> [4/4] RUN npm install 1.4s
=> exporting to image 0.1s
=> => exporting layers 0.1s
=> => writing image sha256:7ea580abd1c10a42d81c81fa44045408bc6fedb4eda999851cdce7467510a524 0.0s
=> => naming to docker.io/library/nodeapp
We could copy the package.json and package-lock.json in first and run the install before copying the source.
FROM node:alpine3.17
USER node
WORKDIR /app
COPY --chown=node:node package.json package-lock.json .
RUN npm install
COPY --chown=node:node . .
CMD ["npm", "start"]
Notice that this time the npm install is cached
echo -e "\n" >> src/index.js
docker build . -t nodeapp
[+] Building 1.2s (10/10) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 196B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 54B 0.0s
=> [internal] load metadata for docker.io/library/node:alpine3.17 1.0s
=> [1/5] FROM docker.io/library/node:alpine3.17@sha256:cc4e8f3d78a276fa05eae1803b6f8cbb43145441f54c828ab14e0c19dd95c6fd 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 351B 0.0s
=> CACHED [2/5] WORKDIR /app 0.0s
=> CACHED [3/5] COPY --chown=node:node package.json package-lock.json . 0.0s
=> CACHED [4/5] RUN npm install 0.0s
=> [5/5] COPY --chown=node:node . . 0.1s
=> exporting to image 0.1s
=> => exporting layers 0.1s
=> => writing image sha256:f2833dc99e0d12bee5d8961ca6e4adc7aeef93441eb12d9b39db27b89bba2567 0.0s
=> => naming to docker.io/library/nodeapp
You may have noticed that node doesn't seem particularly happy with a keyboard interrupt, or with docker stopping the container.
> app@1.0.0 start
> node src/index.js
npm ERR! path /app
npm ERR! command failed
npm ERR! signal SIGTERM
npm ERR! command sh -c node src/index.js
npm ERR! A complete log of this run can be found in: /home/node/.npm/_logs/2023-04-26T12_25_00_931Z-debug-0.log
Node is not designed to run as PID 1; that role is usually reserved for an init system like systemd, which is responsible for setting up services and user sessions, reaping orphaned child processes, and forwarding signals, among other things. We can use dumb-init, a simplified init system that will handle the responsibilities of PID 1 correctly.
FROM node:alpine3.17
RUN apk add dumb-init
USER node
WORKDIR /app
COPY --chown=node:node package.json package-lock.json .
RUN npm install
COPY --chown=node:node . .
CMD ["dumb-init", "npm", "start"]
npm install will use semantic versioning to decide if it should update to a newer package version, and will update package-lock.json. We do not want this in the container - builds could suddenly start failing if a package updates. We can use npm ci to install exactly the versions in the package-lock, and pass --only=production to skip development dependencies
FROM node:alpine3.17
RUN apk add dumb-init
USER node
WORKDIR /app
COPY --chown=node:node package.json package-lock.json .
RUN npm ci --only=production
COPY --chown=node:node . .
CMD ["dumb-init", "npm", "start"]
We should set the environment variable NODE_ENV to production so that packages can use more efficient code rather than code designed for debugging. Not using production mode can also be a security issue if the software allows the client more access for debugging.
FROM node:alpine3.17
RUN apk add dumb-init
USER node
WORKDIR /app
ENV NODE_ENV production
COPY --chown=node:node package.json package-lock.json .
RUN npm ci --only=production
COPY --chown=node:node . .
CMD ["dumb-init", "npm", "start"]
You should delete temporary files (e.g. downloaded archives) in the same RUN command that creates them, so they are not added to the layer.
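As a sketch of this pattern on the Ubuntu-based image from earlier, the apt package lists can be cleaned up in the same layer that downloads them:
RUN apt update && apt install -y ncat && rm -rf /var/lib/apt/lists/*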
Multistage builds
We can use webpack to do a multistage build. First, install webpack as a dev dependency
npm install --save-dev webpack webpack-cli
We can now build the app using
npx webpack-cli --entry ./src/index.js --mode production --target node
This will bundle all the dependencies in a single file at ./dist/main.js
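We can check that the bundle runs on its own - it should listen on port 3000 as before:
node ./dist/main.js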
Add the dist directory to the .dockerignore as well
echo 'dist/' >> .dockerignore
Let's put this into a Dockerfile
FROM node:alpine3.17 as builder
USER node
WORKDIR /app
ENV NODE_ENV production
COPY --chown=node:node package.json package-lock.json .
RUN npm ci
COPY --chown=node:node . .
RUN npx webpack-cli --entry ./src/index.js --mode production --target node
FROM node:alpine3.17
RUN apk add dumb-init
USER node
WORKDIR /app
ENV NODE_ENV production
COPY --from=builder --chown=node:node /app/dist/main.js .
CMD ["dumb-init", "node", "/app/main.js"]
We create two images: one that we refer to as builder, which runs the build step, and another which runs the application. This allows us to publish a smaller image for the application. It also means that any secrets needed in the build process can be kept out of what is published.
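To see the effect, we can build the multistage Dockerfile and compare image sizes - nodeapp-multi here is just an illustrative tag:
docker build . -t nodeapp-multi
docker image ls   # compare the size of nodeapp-multi with the earlier nodeapp image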