
container-fu

| by jpic | linux devops docker

Humans have used containers for at least 100,000 years, and possibly for millions of years. The first containers were probably invented for storing food, allowing early humans to preserve more of their food for a longer time, to carry it more easily, and to protect it from other animals. The development of food storage containers was “of immense importance to the evolving human populations”, and “was a totally innovative behavior” not seen in other primates.

Docker has donated its code to the Cloud Native Computing Foundation (CNCF); as a result, containers are now widely compatible, and what really differs is the “userland”, or “toolchain”, which is what we call “the set of tools”. Competing systems such as rkt also donated their code, and all of it was refactored into libcontainer, so no matter what tool you use, that is what runs behind the scenes.

The famous “Demystifying containers” talks and blog posts by Sascha Grunert are great for gaining deeper knowledge about containers and how they interact with the system.

We’ll have a little tutorial here, as a prerequisite for the article about pipelines.

Control Groups

Control groups allow you to manage the resources of a collection of processes: resource limits, prioritization of CPU utilization or disk I/O throughput, accounting of a group’s resource usage, and the ability to freeze/checkpoint/restart groups of processes. You are usually not going to worry about them, especially if you’re just working on pipelines.
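
You don’t need to dig into the kernel to see them in action: podman and docker expose control groups through resource-limit options of the run command. A quick sketch with arbitrary limits (on a cgroup v2 host, the container should print back its own memory limit):

# cap memory at 512MB and CPU at one core for this container
podman run --rm --memory=512m --cpus=1 some/container cat /sys/fs/cgroup/memory.max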

More information about control groups

Namespaces

Namespaces are what isolate processes from the host, so that processes in different containers are separated from the host, and from each other, in several ways.

You can list the namespaces on a system with the lsns command.
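
For example, to list only the network namespaces, pass the type with the -t option (lsns is part of util-linux):

lsns -t net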

More information about Linux namespaces

Network

A process in a network namespace does not see the network interfaces defined in the host system; instead, it sees the network interfaces and routes that have been created inside the namespace.

A container started with the non-default --network=host option will not be confined in a network namespace, and will see the interfaces defined by the host:

docker run --network=host some/container

This is a quick and dirty solution to the issue of “containers do not have access to the network”, aka “NATing is broken”, which is much more likely to happen with a podman setup than with a docker one.

The problem is that the community is in transition from iptables to nftables: nftables is supported by firewalld, but not by docker… Long story short, you might run into NATing problems in some configurations, and --network=host might do the trick.

More information about network namespace

PID

A process in a pid namespace does not see the processes that have been started outside of that namespace.

The first process created in the container has PID 1.

This means that when PID 1 exits, the container stops.

The command used to start the default PID 1 process of a container is stored in the image metadata, and defined with the CMD or ENTRYPOINT statements in a Dockerfile.

But you can also override that with your own when starting a container, with the --entrypoint option or the command argument:

docker run -it --entrypoint /bin/sh some/container
docker run -it some/container /bin/sh
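
You can see the isolation for yourself: inside the container, ps shows only the container’s own processes, with the entrypoint as PID 1 (assuming ps is available in the image):

podman run --rm some/container ps -ef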

More information about PID namespace

User

A process in a user namespace does not see the user ids defined in the host OS: they are re-mapped.

This means you can map user 0 (root) in a container to, say, user 27382: no privileges at all on the host system.
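
With rootless podman, you can observe the mapping with the podman unshare command, which runs a command inside podman’s user namespace: uid 0 maps to your own uid, and further uids map to your subuid range (the ranges below are illustrative):

$ podman unshare cat /proc/self/uid_map
         0       1000          1
         1     100000      65536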

This is not enabled by default in docker, which runs everything through a docker daemon that has root privileges:

$ docker run -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'
$ sudo cat /root/test
hello

With podman rootless containers, the above weakness is not exploitable:

$ podman run  --tls-verify=false -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'
sh: /mnt/root/test: Permission denied

However, with podman in its default configuration, you could still do it if your containers are started as root:

$ sudo podman run  --tls-verify=false -v /:/mnt --rm --user root your.gitlab.registry.hostname/groupname/containers/rhel sh -c 'echo hello > /mnt/root/test'

$ sudo cat /root/test
hello

Although you should be able to set up subuid/subgid mappings to re-map users even in that case.
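
Those mappings live in /etc/subuid and /etc/subgid on the host; each entry allocates a range of host uids or gids to a user, in user:start:count format (the values below are illustrative):

jpic:100000:65536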

More information about user namespaces

uts

A process in a UTS namespace does not see the host’s hostname but its own, as defined in that namespace.

You can also define the hostname at runtime with the --hostname option.
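
A quick illustration, reading the hostname from /proc so that it works with any image (the hostname itself is arbitrary):

$ podman run --rm --hostname tutorial.example.com some/container cat /proc/sys/kernel/hostname
tutorial.example.com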

More information about UTS namespace

mnt

A process in an mnt namespace does not see the same mounts as the host; it sees the mounts that have been made inside that namespace.

In that respect, it works like chroot.
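
You can check that from inside a container: the mount table is rooted on the container image and is much shorter than the host’s (output truncated here):

podman run --rm some/container head -3 /proc/mounts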

More information about mount namespace

ipc

This namespace prevents processes from accessing shared memory from processes outside the namespace.

More information about IPC

Conclusion

So that’s basically how containers work behind the scenes:

  • a bunch of namespaces and control groups are created
  • the container image is extracted and mounted as /
  • a bunch of network interfaces and so on are created in there
  • the container stops when its PID1 exits

You could even invent a container system with a single bash script, and some people actually have! And it’s really interesting to see what kind of commands they use to work with the namespaces.
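
To get a feel for it, here is a minimal sketch of the core trick, assuming you have already extracted a container image’s root filesystem into ./rootfs: unshare(1) creates the namespaces, and chroot switches to the new root (mounting /proc inside is left as an exercise):

# new PID, UTS, network and mount namespaces, then switch root
sudo unshare --fork --pid --uts --net --mount chroot ./rootfs /bin/sh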

Docker in bash

Images

A container image is basically a tarball containing a root filesystem and metadata files.

It all starts with a container image specification: a string pretty much like a URL, which specifies the image repository, name, and version.

For example, docker pull some.gitlab.com/some-repo will attempt to download the image some-repo from some.gitlab.com; as you can see, the name also specifies where to push a container image to and pull it from.

When the repository is not specified, it’s up to the container system to decide which default repository (hostname) to use: for docker it’s docker.io (the Docker Hub), and for podman it depends on the registries listed in registries.conf, typically quay.io among others.

Doing docker pull nginx is the same as docker pull docker.io/nginx, and of course that won’t work in constrained environments such as those we work with daily here.

The solution for that is to use the docker hub proxies defined in Artifactory. Instead of docker pull nginx, do docker pull your.registry/docker/nginx, because your.registry/docker proxies docker.io.

Everybody can push images to the public docker hub, so you should avoid using images hosted there at all costs, except for base OS images that are published by the companies themselves, such as the RedHat or Ubuntu images.

Other vendor-published images, such as those for MySQL and the like, should also be fine.

Building images with Containerfile

The easiest way to build an image is to create a Dockerfile or Containerfile.

From the man page:

The Containerfile is a configuration file that automates the steps of creating a container image. It is similar to a Makefile. Container engines (Podman, Buildah, Docker) read instructions from the Containerfile to automate the steps otherwise performed manually to create an image. To build an image, create a file called Containerfile.

The Containerfile describes the steps taken to assemble the image. When the Containerfile has been created, call the buildah bud, podman build, docker build command, using the path of context directory that contains Containerfile as the argument. Podman and Buildah default to Containerfile and will fall back to Dockerfile. Docker only will search for Dockerfile in the context directory.

You won’t be creating an image from scratch this way, but you’ll be inheriting from a base image, such as RHEL or Fedora or whatever.

For this, you’ll have to build an image with working package repositories, which means setting up Satellite or Artifactory repositories. So that’s exactly what we’re going to do here.

Let’s create a Containerfile:

FROM your.registry/docker/redhat/ubi8
RUN rm -rf /etc/yum.repos.d/*
COPY your.repo /etc/yum.repos.d

We will also need the your.repo file mentioned in the COPY statement; a minimal sketch of what it could look like follows (the section name and baseurl are illustrative and depend on your Artifactory setup):
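
[your-repo]
name=Your RHEL8 mirror in Artifactory
baseurl=https://your.registry/artifactory/rhel8/
enabled=1
gpgcheck=0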

podman build

Build with command:

podman build .

FROM

First, podman is going to pull the base image, in our case from your.registry; this step does not produce a new image, it only downloads one into the local cache:

STEP 1/3: FROM your.registry/docker/redhat/ubi8
Trying to pull your.registry/docker/redhat/ubi8:latest...
Getting image source signatures
Copying blob b92727ef7443 done
Copying config 270f760d3d done
Writing manifest to image destination
Storing signatures

RUN

Then, it executes the command we have as the second step, specified after the RUN keyword.

STEP 2/3: RUN rm -rf /etc/yum.repos.d/*
--> ea466a73415

And the resulting image hash is ea466a73415.

COPY

In the third step, it copies a file from the build directory into the container directory /etc/yum.repos.d:

STEP 3/3: COPY your.repo /etc/yum.repos.d
COMMIT
--> 2f131c88217
2f131c882177f3912dc7c9a2fb9e4225a2bc91624ca093d3e7ac6f74ae14b0c8

The build command always sends the contents of the whole directory we’re building from to the builder as “context”. This may not always be desirable, especially if there’s a lot of data in the directory that the build doesn’t need; in that case, create a [.containerignore file](https://github.com/containers/common/blob/main/docs/containerignore.5.md).

The resulting image short hash is 2f131c88217.

Create a container with podman run

We can run that image in a temporary container that will be removed when we exit:

$ podman run --rm -it 2f131c88217 bash
[root@666eadf6b817 /]# yum install -y vim

Note that we can run a container from any of the hashes the builder output, which is especially useful during development and debugging.

Layer caching

A container image consists of layers. A layer is a tarball with contents and metadata. An image can contain many layers, extracted on top of each other to produce the final rootfs. This is what allows layer caching.

Let’s add the following to Containerfile:

RUN yum install -y net-tools iputils iproute

Then, rebuild the container:

podman build .

As you can see, it’s not rebuilding from scratch; it starts from the last valid layer found in the cache:

STEP 1/4: FROM your.registry/docker/redhat/ubi8
STEP 2/4: RUN rm -rf /etc/yum.repos.d/*
--> Using cache ea466a73415f707af68764e15d9d6c792bf81009f9cae73cc4a1e7c79a134ffd
--> ea466a73415
STEP 3/4: COPY your.repo /etc/yum.repos.d
--> Using cache 2f131c882177f3912dc7c9a2fb9e4225a2bc91624ca093d3e7ac6f74ae14b0c8
--> 2f131c88217
STEP 4/4: RUN yum install -y net-tools iputils iproute
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
[...]

Installed:
  iproute-5.18.0-1.el8.x86_64                  iputils-20180629-10.el8.x86_64
  libbpf-0.5.0-1.el8.x86_64                    libmnl-1.0.4-6.el8.x86_64
  net-tools-2.0-0.52.20160912git.el8.x86_64    psmisc-23.1-5.el8.x86_64

Complete!
--> d0216453d6a
d0216453d6ab00bcc6bcea32be90a28918f3c967e05419c29e5284ba5977bed0

Note that if you change a file that is copied into the container, such as your.repo in our case, the container builder will consider the cached layer corresponding to the COPY statement invalid: it will rebuild that layer and all subsequent ones.

Other Containerfile instructions

Read the Containerfile manual for the complete list of instructions, which we’re not going to cover here.
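
Just as a taste, here is a hypothetical Containerfile using a few of them: ENV sets an environment variable, WORKDIR sets the working directory, EXPOSE documents a listening port, and CMD defines the default command:

FROM your.registry/docker/redhat/ubi8
ENV LANG=en_US.UTF-8
WORKDIR /app
EXPOSE 8080
CMD ["sleep", "infinity"]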

Tag an image with podman tag

Instead of referring to images by their hashes, we can use arbitrary tags, which we introduced earlier in the container image introduction.

Let’s tag the image we have built as rhel:

podman tag d0216453d6a rhel

And we may now run a container with that image by tag name:

$ podman run --rm -it rhel bash
[root@c6e94dc73b69 /]# exit

You can see container images you have with the podman images command:

$ podman images
REPOSITORY                            TAG         IMAGE ID      CREATED      SIZE
localhost/rhel                        latest      2f131c882177  2 hours ago  214 MB
your.registry/docker/redhat/ubi8  latest      270f760d3d04  3 weeks ago  214 MB

As you can see, the repository for your rhel image was automatically set to localhost.

Publishing images with podman push

First, you’ll have to authenticate against the repository you want to push to. GitLab provides a docker registry, so I’ll be using that for the example.

$ podman login --tls-verify=false your.gitlab.registry.hostname
Username: jpic
Password:
Login Succeeded!

Now, let’s tag the rhel image to be part of a repository:

podman tag rhel your.gitlab.registry.hostname/groupname/containers/rhel-tutorial

We can now push it to the repository:

$ podman push --tls-verify=false your.gitlab.registry.hostname/groupname/containers/rhel-tutorial
Getting image source signatures
Copying blob 43a34f146a69 done
Copying blob 6fbe1af03e06 done
Copying blob bccb88911f57 done
Copying config 2f131c8821 done
Writing manifest to image destination
Storing signatures

Container lifecycle

Containers are started on the command line with the run command. Let’s run a container in the background for 60 seconds:

$ podman run --detach rhel sh -c 'sleep 60'
525f8d0b9a03b72a82897fb2a3beec8a8e5265952f8f867ccdcf8b7dde283a8d

podman ps

This outputs the container id. Run the ps command:

$ podman ps
CONTAINER ID  IMAGE                  COMMAND         CREATED         STATUS             PORTS       NAMES
525f8d0b9a03  localhost/rhel:latest  sh -c sleep 60  6 seconds ago   Up 5 seconds ago               romantic_jang

As you can see, the container is up and running, and has been assigned a random name. Why? Because we don’t care about containers: they are not supposed to hold any persistent data. We’re supposed to be able to trash a container and re-create it at any time. Actually, that’s even how we do version upgrades! No more “can’t upgrade because it will break the system”, because containers are isolated from the system!

podman ps -a

Anyway, wait a minute and run the ps command again:

$ podman ps
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES

It’s gone! Well, it’s not running anymore; to also see stopped containers, we run:

$ podman ps -a
CONTAINER ID  IMAGE                  COMMAND         CREATED       STATUS                   PORTS       NAMES
525f8d0b9a03  localhost/rhel:latest  sh -c sleep 60  1 minute ago  Exited (0) 1 minute ago              romantic_jang

podman start

To start a stopped container, pass the container name or id to the start command:

podman start romantic_jang

Now, you’ll see the container is up and running again:

podman ps

podman rm

That stopped container is still taking up disk space, and disk space is the resource that containers purposely sacrifice, because it’s cheap. Nonetheless, it’s not infinite, so we have to clean up.

To remove a stopped container, run:

podman rm romantic_jang

Run podman ps -a again and you’ll see it’s gone:

$ podman ps -a
CONTAINER ID  IMAGE       COMMAND     CREATED     STATUS      PORTS       NAMES

podman run --restart

When we deploy a persistent service in a container, we want it to restart automatically, in case of failure, or after a reboot. To do this, we need to set the restart policy when we create the container:

$ podman run --restart always --detach rhel sh -c 'sleep 30'
4de3ffb42c7cfdf330e56d3e2aa38b26eb23b818d9d6335d28132495de12ef38

$ podman ps
CONTAINER ID  IMAGE                  COMMAND         CREATED        STATUS            PORTS       NAMES
4de3ffb42c7c  localhost/rhel:latest  sh -c sleep 30  7 seconds ago  Up 6 seconds ago              gallant_mahavira

$ sleep 30; podman ps
CONTAINER ID  IMAGE                  COMMAND         CREATED         STATUS            PORTS       NAMES
4de3ffb42c7c  localhost/rhel:latest  sh -c sleep 30  42 seconds ago  Up 9 seconds ago              gallant_mahavira

As you can see, the container is restarting automatically.

podman run arguments

podman run basically has the syntax podman run <run-options> <repository/image> <container command>, which means that you MUST pass options for the run command before the image name.

This won’t work: podman run rhel sh -c 'sleep 30' --restart always

This will: podman run --restart always rhel sh -c 'sleep 30'

podman run -it

If you want to enter a bash shell in a new container, this won’t do the trick:

podman run rhel bash

Because bash will think that we are not running in interactive mode, parse the script from stdin, figure there is none, and exit with a 0 status.
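
You can verify this by checking the exit code that podman propagates; 0 is what you should see:

$ podman run rhel bash
$ echo $?
0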

To run an interactive command, such as a console shell, in a new container, use -it:

$ podman run -it rhel bash
[root@f28a6591251f /]# exit
$

podman run --rm

If you want the container to be deleted automatically after exit, use the --rm option of podman run:

$ podman run -it --rm rhel bash
[root@40483dd07829 /]# exit
exit

As you can see, the container is definitely gone:

$ podman ps -a
CONTAINER ID  IMAGE                  COMMAND         CREATED      STATUS             PORTS       NAMES

$

podman exec

We can also enter a running container by executing a new process in it. Run this container in detached mode:

$ podman run --detach rhel sh -c 'sleep 300'
95843b20ec34913782176dd1fe86fe99fd2ee48051366a9464cbe34506bb8238

You can see it in ps:

$ podman ps
CONTAINER ID  IMAGE                  COMMAND          CREATED         STATUS             PORTS       NAMES
95843b20ec34  localhost/rhel:latest  sh -c sleep 300  5 seconds ago   Up 4 seconds ago               funny_cori

And enter the container with the exec command:

$ podman exec -it funny_cori bash
[root@95843b20ec34 /]#

And we can see that we are indeed in the container whose PID 1 is a sleep of 300:

[root@95843b20ec34 /]# cat /proc/1/cmdline
/usr/bin/coreutils--coreutils-prog-shebang=sleep/usr/bin/sleep300

Volumes

Earlier, we said that containers should not contain any persistent data: they are considered expendable.

To persist data, containers need volumes. You can use container volumes, which can be interesting if they are backed by some network storage.
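
A minimal sketch with a named volume (the names are arbitrary): create it once, then mount it at some path in as many containers as you like:

podman volume create mydata
podman run --rm -it --volume mydata:/data rhel bash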

Otherwise, the easiest is to just mount a local directory into a container.

We can do so with the --volume option of the run command. With the following command, you’ll be inside the container in the same directory you were in before entering it:

podman run -it --rm --volume $(pwd):$(pwd) --workdir $(pwd) rhel bash

Try to run pwd and ls in the container.

Networks

I’m not really sure what to say here.

First, you probably won’t be dealing with networks if you’re just using containers to make pipelines.

As I was writing this part of the tutorial, it struck me that podman network isolation was not working, so I thought I had got it all wrong, and tried with docker, and indeed: network isolation works in docker.

The theory is that you can create a container network, with docker or podman network create, and start containers within that network:

  • they are able to see each other, as they are on the same IP network
  • they are able to resolve the IP by hostname for each other
  • they can’t see anything about containers that are not on that network.

And indeed, that’s how it works with docker, and I guess, how it’s supposed to work with podman.

But with podman, none of this works as expected:

  • containers are unable to resolve any other container by hostname
  • containers can ping any other container, even those not on the same network; probably the bridge interface is not filtering

See for yourself, in podman:

$ podman network create test-network
$ podman run -it --network test-network rhel bash
[root@8aa0d653bae7 /]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether a2:47:ec:c2:f2:6f brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.89.0.2/24 brd 10.89.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a047:ecff:fec2:f26f/64 scope link
       valid_lft forever preferred_lft forever

$ podman run -it --network test-network rhel bash
[root@e0e73678a656 /]# ping -c1 10.89.0.2
PING 10.89.0.2 (10.89.0.2) 56(84) bytes of data.
64 bytes from 10.89.0.2: icmp_seq=1 ttl=64 time=0.107 ms

Even the first container can see the other one:

[root@8aa0d653bae7 /]# ping -c1 10.0.2.100
PING 10.0.2.100 (10.0.2.100) 56(84) bytes of data.
64 bytes from 10.0.2.100: icmp_seq=1 ttl=64 time=0.163 ms

--- 10.0.2.100 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.163/0.163/0.163/0.000 ms

After some research, it turns out this is being discussed in podman issue #17061; the bottom line is:

  • use --internal with podman network create to have isolation (see the example below)
  • this part of podman is being completely rewritten anyway
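
So, if you need isolation today, something along these lines should do (per that discussion; the network name is arbitrary):

podman network create --internal isolated-network
podman run -it --network isolated-network rhel bash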

podman inspect

Last but not least, the inspect command dumps the container configuration, which, I hope, after reading this tutorial, will be mostly understandable to you.

Try different podman run commands and see how the output of podman inspect differs.
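
For example, to extract a single field with a Go template instead of scrolling through the whole JSON dump (the container name is the one from the exec example above):

podman inspect --format '{{ .State.Status }}' funny_cori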

more

More about Podman
