container-fu
| by jpic | linux devops docker

Humans have used containers for at least 100,000 years, and possibly for millions of years. The first containers were probably invented for storing food, allowing early humans to preserve more of their food for a longer time, to carry it more easily, and to protect it from other animals. The development of food storage containers was “of immense importance to the evolving human populations”, and “was a totally innovative behavior” not seen in other primates.
Docker has donated its code to the Cloud Native Computing Foundation (CNCF); as such, containers are now widely compatible, and what differs is really the “userland” or “toolchain”, that is, the set of tools built on top. Competing systems such as rkt also donated their code, and all of that was refactored into libcontainer, so no matter what tool you use, this is what is going to be used behind the scenes.
The famous “Demystifying containers” talks and blog posts by Sascha Grunert are great for gaining deeper knowledge about containers and how they interact with the system.
We’ll have a little tutorial here, as a prerequisite for the article about pipelines.
Control Groups
Control groups allow you to manage the resources of a collection of processes: resource limits, prioritization of CPU utilization or disk I/O throughput, accounting of a group’s resource usage, and the ability to freeze/checkpoint/restart groups of processes. You are usually not going to worry about that, especially if you’re just working on pipelines.
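For illustration, container engines translate resource options into cgroup limits for you; a minimal sketch, with a hypothetical image and command name:
podman run --rm --memory 256m --cpus 0.5 some/image some-command   # cap at 256 MB of RAM and half a CPU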
More information about control groups
Namespaces
Namespaces are what isolate processes from the host: processes in different containers are isolated from the host, and from each other, in several ways.
You can list the namespaces on a system with the lsns command.
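For example, to list only the PID namespaces (output varies per system):
lsns --type pid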
More information about Linux namespaces
Network
A process in a network namespace does not see the network interfaces defined in the host system; instead, it sees the network interfaces and routes that have been created inside the namespace.
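You can see this without any container engine; a quick sketch using ip netns (requires root):
sudo ip netns add demo
sudo ip netns exec demo ip link   # only a loopback interface is visible
sudo ip netns delete demo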
A container started with the non-default --network=host option will not be confined in a network namespace but will see the interfaces defined by the host:
docker run --network=host some/container
This is a quick and dirty solution to fix the issue of “containers do not have access to the network”, aka “NATing is broken”, which is much more likely to happen with a podman setup than with a docker one.
The problem is that the community is in transition from iptables to nftables: nftables is supported by firewalld, but not supported by docker… Long story short, you might run into NATing problems in some configurations, and --network=host might do the trick.
More information about network namespace
PID
A process in a pid namespace does not see the processes that have been started outside of that namespace.
The first process created in the container has PID1.
This means that when PID1 exits, the container stops.
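You can demonstrate this with unshare alone; a minimal sketch (requires root):
sudo unshare --pid --fork --mount-proc sh -c 'echo $$; ps -ef'
The shell reports PID 1, and ps only lists the processes inside the new namespace.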
The command used to start the default PID1 process of a container is stored in the metadata file, and defined with the CMD or ENTRYPOINT statements in a Dockerfile.
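For example, a hypothetical Dockerfile ending with:
ENTRYPOINT ["/usr/bin/myapp"]
CMD ["--help"]
would give the container a PID1 of /usr/bin/myapp --help, where /usr/bin/myapp is a made-up binary for illustration.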
But you can also override that with your own when starting a container, with the --entrypoint option or the command argument:
docker run -it --entrypoint /bin/sh some/container
docker run -it some/container /bin/sh
More information about PID namespace
User
A process in a user namespace does not see the user ids defined in the host OS: they are re-mapped.
This means you can map user 0 (root) in a container to user 27382: no privilege at all on the host system.
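You can observe such a mapping without any container engine; a quick sketch with unshare (on kernels that allow unprivileged user namespaces):
unshare --user --map-root-user sh -c 'id; cat /proc/self/uid_map'
Inside, id reports uid 0 (root), while /proc/self/uid_map shows which unprivileged host uid it is actually mapped to.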
This is not enabled by default in docker, which runs everything through a docker daemon that has root privileges:
$ docker run -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'
$ sudo cat /root/test
hello
With podman rootless containers, the above weakness is not exploitable:
$ podman run --tls-verify=false -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'
sh: /mnt/root/test: Permission denied
However, with podman in its default configuration, the weakness remains exploitable if your containers are started as root:
$ sudo podman run --tls-verify=false -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'
$ sudo cat /root/test
hello
Although you should be able to set up subordinate UID/GID mappings (subuid/subgid) to mitigate this.
Details:
- https://docs.docker.com/engine/security/userns-remap/
- https://www.redhat.com/sysadmin/rootless-podman-user-namespace-modes
- https://www.redhat.com/sysadmin/building-container-namespaces
UTS
A process in a UTS namespace does not see the host’s hostname but its own, as defined in that namespace.
You can also define the hostname at runtime with the --hostname option.
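For example (using the rhel image we build later in this tutorial; any image works):
podman run --rm --hostname box42 rhel hostname
This should print box42 rather than the host’s hostname.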
More information about UTS namespace
Mount
A process in a mount namespace does not see the same mounts as the host; it sees the mounts that have been made in that namespace. In that respect, it works like chroot.
More information about mount namespace
IPC
This namespace prevents processes from accessing shared memory from processes outside the namespace.
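A quick sketch with the util-linux IPC tools: create a shared memory segment on the host, then check that a container does not list it (assuming ipcs is installed in the image; the rhel image built later in this tutorial works):
ipcmk -M 4096                  # create a shared memory segment on the host
ipcs -m                        # lists it
podman run --rm rhel ipcs -m   # the container lists nothing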
Conclusion
So that’s basically how containers work behind the scenes:
- a bunch of namespaces and control groups are created
- the container image is extracted and mounted as /
- a bunch of network interfaces and so on are created in there
- the container stops when its PID1 exits
You could even invent a container system with a single bash script, and some people actually have! And it’s really interesting to see what kind of commands they use to work with the namespaces.
Images
A container image is basically a tarball containing a root filesystem and metadata files.
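You can convince yourself of that by exporting a local image and listing the tarball contents, for example (using the rhel image we build later in this tutorial; any local image works):
podman save --output image.tar rhel
tar -tf image.tar   # layer tarballs plus manifest and config files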
It all starts with a container image specification: a string pretty much like a URL, that specifies the image repository, name, and version.
For example, docker pull some.gitlab.com/some-repo will attempt to download the image some-repo from some.gitlab.com; as you can see, the name also serves as a specification of where to push and pull a container image.
When the repository is not specified, it’s up to the container system to decide which default registry (hostname) to use: for docker it’s docker.io, and for podman it’s whatever registries.conf lists, typically quay.io among others.
Doing docker pull nginx is the same as docker pull docker.io/nginx, and of course that won’t work in constrained environments such as those we work with daily here.
The solution for that is using the docker hub proxies defined in Artifactory.
Instead of docker pull nginx, do docker pull your.registry/docker/nginx, because your.registry/docker proxies docker.io.
Anybody can push images to the public docker hub, so you should avoid using images hosted there at all costs, except for base OS images that are published by the vendors themselves, such as Red Hat or Ubuntu images.
Other vendor-published images, such as MySQL and the like, should also be fine.
Building images with Containerfile
The easiest way to build an image is to create a Dockerfile or Containerfile.
From the man page:
The Containerfile is a configuration file that automates the steps of creating a container image. It is similar to a Makefile. Container engines (Podman, Buildah, Docker) read instructions from the Containerfile to automate the steps otherwise performed manually to create an image. To build an image, create a file called Containerfile.
The Containerfile describes the steps taken to assemble the image. When the Containerfile has been created, call the buildah bud, podman build, docker build command, using the path of context directory that contains Containerfile as the argument. Podman and Buildah default to Containerfile and will fall back to Dockerfile. Docker only will search for Dockerfile in the context directory.
You won’t be creating an image from scratch this way, but you’ll be inheriting from a base image, such as RHEL or Fedora or whatever.
For this, you’ll have to build an image with working repositories, which means setting up Satellite or Artifactory repositories. So that’s exactly what we’re going to do here.
Let’s create a Containerfile:
FROM your.registry/docker/redhat/ubi8
RUN rm -rf /etc/yum.repos.d/*
COPY your.repo /etc/yum.repos.d
We will also need the mentioned your.repo file, a standard yum repository definition in INI format.
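The exact contents depend on your Satellite or Artifactory setup; a minimal sketch with hypothetical URLs:
[baseos]
name=BaseOS
baseurl=https://your.registry/rpm/rhel8/baseos
enabled=1
gpgcheck=0

[appstream]
name=AppStream
baseurl=https://your.registry/rpm/rhel8/appstream
enabled=1
gpgcheck=0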
podman build
Build it with the following command:
podman build .
FROM
First, podman is going to pull the base image, in our case from your.registry; this step does not produce a new image, it only downloads one into the local cache:
STEP 1/3: FROM your.registry/docker/redhat/ubi8
Trying to pull your.registry/docker/redhat/ubi8:latest...
Getting image source signatures
Copying blob b92727ef7443 done
Copying config 270f760d3d done
Writing manifest to image destination
Storing signatures
RUN
Then, it’s going to execute the command we have as the second step, specified after the RUN keyword.
STEP 2/3: RUN rm -rf /etc/yum.repos.d/*
--> ea466a73415
And the resulting image hash is ea466a73415.
COPY
In the third step, it copies a file from the build directory into the container directory /etc/yum.repos.d, producing the resulting image hash 2f131c88217.
STEP 3/3: COPY your.repo /etc/yum.repos.d
COMMIT
--> 2f131c88217
2f131c882177f3912dc7c9a2fb9e4225a2bc91624ca093d3e7ac6f74ae14b0c8
The build command always sends the contents of the whole directory we're building from to the builder as "context". This may not always be desirable, especially if there's a lot of data in the directory that the build doesn't need; in this case, create a [`.containerignore` file](https://github.com/containers/common/blob/main/docs/containerignore.5.md).
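A minimal sketch of such a file, with hypothetical entries, one exclusion pattern per line:
.git
*.tar
docs/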
The resulting image short hash is 2f131c88217.
Create a container with podman run
We can run that image in a temporary container that will be removed when we exit:
$ podman run --rm -it 2f131c88217 bash
[root@666eadf6b817 /]# yum install -y vim
Note that we can run a container from any of the hashes that the builder output, which is especially useful during development and debugging.
Layer caching
A container image consists of layers. A layer is a tarball with contents and metadata. An image can contain many layers, extracted on top of each other to produce the final rootfs. This is what allows layer caching.
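You can list the layers of a local image with podman history, for example:
podman history 2f131c88217   # one line per layer, with the instruction that created it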
Let’s add the following to the Containerfile:
RUN yum install -y net-tools iputils iproute
Then, rebuild the container:
podman build .
As you can see, it’s not rebuilding from scratch, it’s starting at the last valid layer from cache:
STEP 1/4: FROM your.registry/docker/redhat/ubi8
STEP 2/4: RUN rm -rf /etc/yum.repos.d/*
--> Using cache ea466a73415f707af68764e15d9d6c792bf81009f9cae73cc4a1e7c79a134ffd
--> ea466a73415
STEP 3/4: COPY your.repo /etc/yum.repos.d
--> Using cache 2f131c882177f3912dc7c9a2fb9e4225a2bc91624ca093d3e7ac6f74ae14b0c8
--> 2f131c88217
STEP 4/4: RUN yum install -y net-tools iputils iproute
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
[...]
Installed:
iproute-5.18.0-1.el8.x86_64 iputils-20180629-10.el8.x86_64
libbpf-0.5.0-1.el8.x86_64 libmnl-1.0.4-6.el8.x86_64
net-tools-2.0-0.52.20160912git.el8.x86_64 psmisc-23.1-5.el8.x86_64
Complete!
--> d0216453d6a
d0216453d6ab00bcc6bcea32be90a28918f3c967e05419c29e5284ba5977bed0
Note that if you change a file that is copied inside the container, such as your.repo in our case, the container builder will consider the cached layer corresponding to the COPY statement invalid: it will rebuild that layer and all the following ones.
Other Containerfile instructions
Read the Containerfile manual for the complete list of instructions, which we’re not going to cover here.
Tag an image with podman tag
Instead of referring to images by their hashes, we can use arbitrary tags, which we introduced earlier in the container image introduction.
Let’s tag the image we have built as rhel:
podman tag d0216453d6a rhel
And we may now run a container with that image by tag name:
$ podman run --rm -it rhel bash
[root@c6e94dc73b69 /]# exit
You can see the container images you have with the podman images command:
$ podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/rhel latest 2f131c882177 2 hours ago 214 MB
your.registry/docker/redhat/ubi8 latest 270f760d3d04 3 weeks ago 214 MB
As you can see, the repository for your rhel image was automatically set to localhost.
Publishing images with podman push
First, you’ll have to authenticate against the repository you want to push to. GitLab provides a docker registry, so I’ll be using that for the example.
$ podman login --tls-verify=false your.gitlab.registry.hostname
Username: jpic
Password:
Login Succeeded!
Now, let’s tag the rhel image to be part of a repository:
podman tag rhel your.gitlab.registry.hostname/groupname/containers/rhel-tutorial
We can now push it to the repository:
$ podman push --tls-verify=false your.gitlab.registry.hostname/groupname/containers/rhel-tutorial
Getting image source signatures
Copying blob 43a34f146a69 done
Copying blob 6fbe1af03e06 done
Copying blob bccb88911f57 done
Copying config 2f131c8821 done
Writing manifest to image destination
Storing signatures
Container lifecycle
Containers are started on the command line with the run command. Let’s run a container in the background for 60 seconds:
$ podman run --detach rhel sh -c 'sleep 60'
525f8d0b9a03b72a82897fb2a3beec8a8e5265952f8f867ccdcf8b7dde283a8d
podman ps
This outputs the container id. Run the ps command:
$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
525f8d0b9a03 localhost/rhel:latest sh -c sleep 60 6 seconds ago Up 5 seconds ago romantic_jang
As you can see, the container is up and running, and has been assigned a random name. Why? Because we don’t care about containers. They are not supposed to hold any persistent data: we’re supposed to be able to trash a container and re-create it at any time. Actually, that’s even how we do version upgrades! No more “can’t upgrade because it will break the system”, because containers are isolated from the system!
podman ps -a
Anyway, wait a minute and run the ps command again:
$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
It’s gone! Well, it’s not running anymore; to see all stopped containers, we run:
$ podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
525f8d0b9a03 localhost/rhel:latest sh -c sleep 60 1 minute ago Exited (0) 1 minute ago romantic_jang
podman start
To start a stopped container, pass the container name or id to the start command:
podman start romantic_jang
Now, you’ll see the container is up and running again:
podman ps
podman rm
That stopped container is still taking up disk space, and disk space is the resource that containers purposely sacrifice, because it’s cheap. Nonetheless, it’s not infinite, so we have to clean up.
To remove a stopped container, run:
podman rm romantic_jang
Run podman ps -a again and you’ll see it’s gone:
$ podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
podman run --restart
When we deploy a persistent service in a container, we want it to restart automatically, in case of failure, or after a reboot. To do this, we need to set the restart policy when we create the container:
$ podman run --restart always --detach rhel sh -c 'sleep 30'
4de3ffb42c7cfdf330e56d3e2aa38b26eb23b818d9d6335d28132495de12ef38
$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4de3ffb42c7c localhost/rhel:latest sh -c sleep 30 7 seconds ago Up 6 seconds ago gallant_mahavira
$ sleep 30; podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4de3ffb42c7c localhost/rhel:latest sh -c sleep 30 42 seconds ago Up 9 seconds ago gallant_mahavira
As you can see, the container is restarting automatically.
podman run arguments
`podman run` basically has the syntax `podman run <run-options> <repository/image> <container command>`, which means that you **MUST** pass options for the run command **before** the image name.
This won't work: `$ podman run rhel sh -c 'sleep 30' --restart always`
This will: `$ podman run --restart always rhel sh -c 'sleep 30'`
podman run -it
If you want to enter a bash shell in a new container, this won’t do the trick:
podman run rhel bash
Because bash will think that we are not running in interactive mode, parse the script from stdin, figure there is none, and exit with a 0 status.
To run an interactive command such as a shell in a new container, use -it:
15/03 2023 16:17:16 jpic@TGSHZABDDDEV04I ~
$ podman run -it rhel bash
[root@f28a6591251f /]# exit
15/03 2023 16:18:31 jpic@TGSHZABDDDEV04I ~
$
podman run --rm
If you want the container to be deleted automatically after exit, use the --rm option of podman run:
$ podman run -it --rm rhel bash
[root@40483dd07829 /]# exit
exit
As you can see, the container is definitely gone:
$ podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$
podman exec
We can also enter a running container by executing a new process in it. Run this container in detached mode:
$ podman run --detach rhel sh -c 'sleep 300'
95843b20ec34913782176dd1fe86fe99fd2ee48051366a9464cbe34506bb8238
You can see it in ps:
$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
95843b20ec34 localhost/rhel:latest sh -c sleep 300 5 seconds ago Up 4 seconds ago funny_cori
And enter the container with the exec command:
$ podman exec -it funny_cori bash
[root@95843b20ec34 /]#
And we can see that we are indeed in the container whose PID1 is a sleep of 300 seconds:
[root@95843b20ec34 /]# cat /proc/1/cmdline
/usr/bin/coreutils--coreutils-prog-shebang=sleep/usr/bin/sleep300
Volumes
Earlier, we said that containers should not contain any persistent data: they are considered expendable.
To persist data, containers need volumes. You can use container volumes, which can be interesting if they are backed by some network storage.
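For example, with a named volume managed by podman (stored under podman’s own storage directory by default):
podman volume create mydata
podman run --rm -it --volume mydata:/data rhel bash
Anything written under /data persists across containers that mount the mydata volume.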
Otherwise, the easiest is to just mount a local directory into a container.
We can do so with the --volume option of the run command. With the following command, you will be in the same directory as before entering the container:
podman run -it --rm --volume $(pwd):$(pwd) --workdir $(pwd) rhel bash
Try to run pwd and ls in the container.
Networks
I’m not really sure what to say here.
First, you probably won’t be dealing with networks if you’re just using containers to make pipelines.
As I was writing this part of the tutorial, it struck me that podman network isolation is not working, so I thought I had it all wrong, and tried with docker, and indeed: network isolation works in docker.
The theory is that you can create a container network, with docker or podman network create, and start containers within that network:
- they are able to see each other, as they are on the same IP network
- they are able to resolve the IP by hostname for each other
- they can’t see anything about containers that are not on that network.
And indeed, that’s how it works with docker, and I guess, how it’s supposed to work with podman.
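For reference, a sketch of that with docker, assuming a hypothetical image that has ping installed:
docker network create test-network
docker run --detach --name one --network test-network some/image sleep 300
docker run --rm --network test-network some/image ping -c1 one   # resolves the container name and pings it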
But with podman, none of this works as expected:
- containers are unable to resolve any other container by hostname
- containers can ping any other container even if not on the same network; probably the bridge interface is not filtering
See for yourself, in podman:
$ podman network create test-network
$ podman run -it --network test-network rhel bash
[root@8aa0d653bae7 /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether a2:47:ec:c2:f2:6f brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.89.0.2/24 brd 10.89.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::a047:ecff:fec2:f26f/64 scope link
valid_lft forever preferred_lft forever
$ podman run -it --network test-network rhel bash
[root@e0e73678a656 /]# ping -c1 10.89.0.2
PING 10.89.0.2 (10.89.0.2) 56(84) bytes of data.
64 bytes from 10.89.0.2: icmp_seq=1 ttl=64 time=0.107 ms
Even the first container can see the other one:
[root@8aa0d653bae7 /]# ping -c1 10.0.2.100
PING 10.0.2.100 (10.0.2.100) 56(84) bytes of data.
64 bytes from 10.0.2.100: icmp_seq=1 ttl=64 time=0.163 ms
--- 10.0.2.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.163/0.163/0.163/0.000 ms
After some research, it turns out this is being discussed in podman issue #17061; bottom line is:
- use --internal with podman network create to have isolation
- this part of podman is being completely rewritten anyway
podman inspect
Last but not least, the inspect command dumps the container configuration, which, I hope, after reading this tutorial, will be mostly understandable to you.
Try different podman run commands and see how the output of podman inspect differs.