BigSudo is a command line generator wrapping around Ansible, the excellent tool for automating operations, which has proven itself in an extremely heterogeneous ecosystem over the last few years and is currently maintained by Red Hat.
eXtreme DevOps is when code traditionally known as network and infrastructure operations automation meets continuous integration and merges with continuous delivery. This makes it almost trivial to deploy a per-branch ephemeral deployment on each git push, say on test-$GIT_BRANCHNAME.ci.example.com, so that the product team can review a feature during development without forcing the developer to merge unfinished code into master. This keeps the master branch clean and deployable at any moment.
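Branch names often contain characters that are not valid in a hostname (slashes, underscores, uppercase letters). As a minimal sketch, a hostname like the one above could be derived as follows. The function name branch_host and the sanitising rules are assumptions made for illustration; only the test- prefix and the ci.example.com domain come from the example above.

```shell
# Build the hostname of an ephemeral deployment from a git branch name.
# Hypothetical helper: the sanitising rules are an assumption of this sketch.
# $1: branch name, $2: domain (defaults to ci.example.com)
branch_host() (
    branch=$1
    domain=${2:-ci.example.com}
    # Lowercase, turn anything that is not a-z, 0-9 or "-" into "-",
    # squeeze repeated dashes, and keep at most 58 characters
    # (a DNS label is limited to 63 and "test-" already uses 5)
    slug=$(printf '%s' "$branch" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -c 'a-z0-9-' '-' \
        | tr -s '-' \
        | cut -c1-58)
    # Trim a leading and a trailing dash
    slug=${slug#-}
    slug=${slug%-}
    printf 'test-%s.%s\n' "$slug" "$domain"
)
```

In a CI job this could be called as branch_host "$GIT_BRANCHNAME", reusing the variable name from the example above.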
Hacking Operations on network and infrastructure means that we are going to integrate bash scripts into an orchestrator with many extensible strategies. The orchestrator itself offers an amazing number of native plugins and has a thriving hacking ecosystem: you can even start from a copy of the code of many of our BigSudo Ansible roles.
This article focuses on a practice where 99.9% uptime is plenty, and as such no HA layer is introduced. HA compatibility is achievable with a bit of code, but it is not a concern of our current userbase anyway.
However, we adopt eXtreme Programming basics plus custom research or development in the field, because we need fast and efficient iterations to create more, and faster, than the competition, or just to spend more time hacking on personal projects!
Our plan:
The CI server is meant to be used by the team, and when reviewing with the product team. It should have one persistent production-like deployment called “staging”. “Staging” is not meant to be more stable than the CI server itself, so if the product / business team is going to show the software outside of production then that deployment should be persistent and happen on the production server.
The complete installation must be fully automated to keep both CI and production servers as symmetric as possible, even though their usage will differ: production needs more disk space, but development needs to be faster.
As such, the partitioning has to be done manually on both servers unless the disks are the same size. Using fdisk, mdadm and btrfs on the command line is perfect.
/var/lib/docker: the btrfs volume for Docker. It might not be necessary, but this ended up becoming a regular automation, so I would advise going through btrfs on /var/lib/docker.
/home: all human user (bob, alice) and project deployment (test, staging, production…) data. This should be the largest partition in our setup.
/: well, we don’t need /boot because we’re booting the default Linux kernel, which probably shows the default config in /proc anyway, so we gain no protection if /boot is not mounted. In this setup 20G is most likely plenty, but 5 should also work since only docker runs in this setup and data accumulates only in /var/lib/docker and /home as described above.
swap: you want disk-time before you can’t SSH connect to your server anymore when memory becomes insufficient.
A backup partition: in production, you would want your database dumps, which involve a disk copy, to go to a different RAID array, especially if you have HDDs (production: for disk space rather than speed).
Choose a server with sufficient size for 6 months of production, based on traffic estimates and the like. You need to be able to re-install a server in 30 minutes, or after 30 minutes at least have a definitive ETA (for example because a file copy from the remote backup site is in progress).
Both of your servers should share the same domain name (“example.com”) and have different hostnames (ci, production). The FQDN (Fully Qualified Domain Name) of each server is composed of hostname dot domain: ci.example.com and production.example.com. The latter might also be named prod.example.com, where prod stands for production. You might want to use the bigsudo yourlabs.fqdn @host command to sort it out with an interactive wizard.
It does not really matter how you bootstrap your server as long as it’s reproducible. My favorite command, something along the lines of bigsudo yourlabs.ssh someuser@somehost, does that:
yourlabs.ssh also provides a command to add a user given a GitHub username. This is the command you could run to add me on your server for support:
bigsudo yourlabs.ssh adduser usergroups=sudo username=jpic @somehost
If your user has a different username on GitHub or published their public key elsewhere, you will need to type a little more:
bigsudo yourlabs.ssh adduser usergroups=sudo username=test key=https://gitlab.com/foobar.keys @somehost
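For comparison, the key-installation half of that operation can be sketched by hand. This is not the implementation of the yourlabs.ssh role, only an illustration: it relies on GitHub publishing a user's public keys at https://github.com/USERNAME.keys, and the helper names github_keys_url and append_keys are hypothetical.

```shell
# URL where GitHub publishes the public SSH keys of a user
github_keys_url() {
    printf 'https://github.com/%s.keys\n' "$1"
}

# Append public keys read from stdin to an authorized_keys file,
# skipping empty lines and keys that are already present.
# $1: path of the authorized_keys file
append_keys() (
    file=$1
    mkdir -p "$(dirname "$file")"
    touch "$file"
    chmod 600 "$file"
    while IFS= read -r key; do
        [ -n "$key" ] || continue
        # -x: whole line match, -F: fixed string, so the key is not a regex
        grep -qxF -- "$key" "$file" || printf '%s\n' "$key" >> "$file"
    done
)

# Usage (the target file is an example):
#   curl -fsSL "$(github_keys_url jpic)" | append_keys ~/.ssh/authorized_keys
```

Because already-present keys are skipped, the function can be run repeatedly without duplicating entries.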
Or, you can still do that manually, but doing the same operation manually over and over might motivate you to script it.
The same load balancer should run on both servers. We use traefik because it’s really well made for configuring itself dynamically based on container activity in docker.sock, and it handles HTTPS thanks to Let’s Encrypt.
Again, it doesn’t matter how you deploy as long as it’s reproducible. Here’s how I like to do it: bigsudo yourlabs.traefik @somehost, which does the following:
The only difference we want: if we don’t want to maintain a wildcard certificate from Let’s Encrypt, then the ci server must not redirect all HTTP to HTTPS. Test deployments whose URL depends on the branch name would otherwise exhaust the Let’s Encrypt rate limit for certificate issuance as soon as there’s a bit of development activity!
Docker is used in this setup because it has a great userland, and we’re going to try to rely on default features as much as possible. If you have deployed Traefik with bigsudo yourlabs.traefik then you already have a firewall and docker is ready, because yourlabs.docker is a dependency of yourlabs.traefik and was pulled as such.
yourlabs.traefik creates a docker network named web, which you can also create manually, and starts traefik in it. As such, any container spawning in the web network will become usable by traefik.
We’re going to split our configuration into multiple files so that we can maintain and reuse them:
docker-compose.yml: basic service configuration, core dependency of the others
a file that exposes the stack behind the traefik LB
docker-compose.persist.yml: persists deployments (staging, production …) on volumes and also enables restart on each service
docker-compose.override.yml: for local development, startup command overrides, mounting source volumes …
docker-compose.basicauth.yml: enables HTTP basic auth (non-production)
docker-compose.maildev.yml: optional, adds maildev to the stack
So, the following combinations will be usable:
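docker-compose merges several files when each one is passed with its own -f option, in order. As a sketch of how such combinations might be assembled, the helper below maps a kind of deployment to a set of -f arguments. The function name compose_files, the kind names and the choice of files for each kind are assumptions made for illustration, not the author's list, and only the file names given above are used.

```shell
# Print the docker-compose "-f" arguments for a kind of deployment.
# Hypothetical mapping: adapt the kinds and files to your own project.
compose_files() {
    case "$1" in
        dev)
            echo "-f docker-compose.yml -f docker-compose.override.yml" ;;
        ephemeral)
            echo "-f docker-compose.yml -f docker-compose.basicauth.yml -f docker-compose.maildev.yml" ;;
        staging)
            echo "-f docker-compose.yml -f docker-compose.persist.yml -f docker-compose.basicauth.yml" ;;
        production)
            echo "-f docker-compose.yml -f docker-compose.persist.yml" ;;
        *)
            echo "unknown deployment kind: $1" >&2
            return 1 ;;
    esac
}

# Usage: the unquoted expansion is intentional, so that each word
# becomes a separate argument:
#   docker-compose $(compose_files staging) up -d
```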
Persistent deployments will be done in a /home directory, but ephemeral ones should not depend on having a directory. Since we still need the docker-compose file to control ephemeral stacks, we store it somewhere, such as ~/.yourlabs.compose if using bigsudo yourlabs.compose.
Now for ephemeral deployments, you may add a cron job or something to clean old deployments, but otherwise just restarting the CI server and then running docker system prune -af should clean it up, because non-persistent deployments should not have restart: always.
Still, bigsudo yourlabs.compose supports a lifetime argument which will create a file named removeat next to the copied compose backup, containing the timestamp after which the stack should be destroyed. yourlabs.compose will also deploy a systemd timer (with the yourlabs.timer ansible role, thanks to bigsudo’s automatic dependency resolution) which checks daily for these lifetime files on the system and destroys the stacks whose lifetime has been reached:
#!/bin/bash -eu
# Destroy every compose stack whose removeat timestamp has passed
for home in /root /home/*; do
    [ -d "$home/.yourlabs.compose" ] || continue
    pushd "$home/.yourlabs.compose" &> /dev/null
    for project in *; do
        [ -f "$project/removeat" ] || continue
        if [[ $(date +%s) -gt $(<"$project/removeat") ]]; then
            pushd "$project" &> /dev/null
            # Stop the stack and remove its images, volumes and orphans
            docker-compose down --rmi all --volumes --remove-orphans
            popd &> /dev/null
            rm -rf "$project"
        fi
    done
    popd &> /dev/null
done
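The other half of this mechanism, writing the removeat file, can be sketched as follows. This is not the code of the yourlabs.compose role: the helper names are hypothetical, and the lifetime is assumed to be given in seconds, which the text above does not specify. The only thing taken from the script is that removeat holds a Unix timestamp comparable with date +%s.

```shell
# Write the timestamp after which a stack may be destroyed.
# $1: project directory, $2: lifetime in seconds (assumed unit)
write_removeat() {
    echo $(( $(date +%s) + $2 )) > "$1/removeat"
}

# Succeed if the project has a removeat file whose timestamp is in the past.
# $1: project directory
is_expired() {
    [ -f "$1/removeat" ] && [ "$(date +%s)" -gt "$(cat "$1/removeat")" ]
}

# Usage (hypothetical project directory, one week of lifetime):
#   write_removeat ~/.yourlabs.compose/test-mybranch 604800
```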
The ansible directory should contain a deploy.yml playbook (you might find it named “site.yml”) which orchestrates the deployment work:
It is based on a handful of scripts:
backup.sh: dumps data from the containers, adds them to the encrypted backup repo, mirrors the repo
restore.sh: downloads the repo from the mirror if the repo is absent and prints the list of backups
restore.sh <backup-id>: extracts the backup from the repo, re-creates the containers and loads data into them
prune.sh: applies the retention policy, cleans old backups.
So, obviously the backup recipe should install missing packages (restic? lftp?), create the persistent directories, generate and upload the backup lifecycle management scripts. Also: automate a backup so that you can still roll back after deploying a problematic data migration.
A basic check at the end of a deployment will help eliminate false positives, i.e. the container stack started, but the service is not actually starting right. This is something we want to catch with a non-infinite loop, at least with curl, so that we can return the proper exit code from deploy.yml or whatever we name our deployment script. The point is integration in the continuous integration pipeline, and as such the return code is critical: work on eliminating false positives.
A nice technique is to generate a random id and add it to the URL we are going to test; your check recipe can then grep it out of the load balancer logs. Try to automate as much as possible along the way: if you find yourself always doing the same thing after a failed deployment, like connecting and grepping logs, go ahead and automate it too.
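A bounded retry loop and a random check id could be sketched like this. The helper names retry and check_id, the URL and the query parameter in the usage comment are all hypothetical; the curl options used (--fail, --silent, --output) are standard ones.

```shell
# Run a command until it succeeds, at most $1 times, sleeping $2 seconds
# between attempts. Returns 0 on the first success, 1 if all attempts fail.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # No need to sleep after the last attempt
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Print 16 random hexadecimal characters, to be added to the checked URL
# so that the request can later be found in the load balancer logs
check_id() {
    od -An -N8 -tx1 /dev/urandom | tr -d ' \n'
}

# Usage at the end of a deployment script (hypothetical URL and parameter):
#   id=$(check_id)
#   retry 30 2 curl --fail --silent --output /dev/null \
#       "https://staging.example.com/?check=$id" || exit 1
```

Since retry returns a non-zero status when every attempt has failed, the deployment script can propagate that status to the CI pipeline.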
As a developer, I am responsible for the deliveries I make myself to the “production” platform in the hands of end users. As such, my work on a feature finishes not after I push a patch in a branch, not after it has passed tests, not after it was deployed into production, but after I have investigated the effects of my patch in production: did it cause a significant performance regression? How can I measure the impact on the user base?
But, prior to doing deployments, ensure that your systems have properly integrated new upstream versions: do the package update and upgrade operations and ensure services are functional after a reboot, first on the staging server and, once that fully succeeds, on the production server.
Deployments must be as small and frequent as possible (objective: 10 deploys a day), rather than big and once a month or week.
When you write data migrations, there is a good chance you will enjoy the script after a data migration meets unexpected input or something (i.e. we have injected legacy data in production).
In the pipeline, any break or step that takes more than 5 minutes should be reviewed and considered an “impediment”. A typical pipeline for a full project should not exceed 25 minutes from master to production.
It’s okay if you wrap the line that connects through SSH in a one-liner in a CI script, something like this for example would be fine:
mkdir -p ~/.ssh; echo "$SSH_KEY" > ~/.ssh/id_ed25519; echo "$SSH_FINGERPRINTS" > ~/.ssh/known_hosts; chmod 700 ~/.ssh; chmod 600 ~/.ssh/*
You can use the yourlabs/ansible image which contains ansible, bigsudo and the whole yourlabs role distribution.
The default stages proposed by GitLab are “build”, “test” and “deploy”: any job can run in parallel in one stage, but one stage starts only if the previous one completes with success. The system will be similar in all CI executors except for minor trade-offs: CircleCI has multi-root pipelines but is proprietary, for example.
Instead, deploy a CI executor manually on the ci server which we already have configured for btrfs. It doesn’t matter if you deploy GitLab CI or Drone CI or anything else: only the practice itself matters.
Sharing the docker instance between the many deployments happening on the CI server itself saves a lot of network connections and, thanks to Copy-on-Write, is blazing fast:
On merge to master:
On click “deploy to training”:
It doesn’t matter what you use as long as it’s reproducible / scripted between your two servers. I maintain my own in bigsudo yourlabs.netdata @somehost: it will ask for your webhook URLs if you use something like Slack, so that you can get alerting from Slack and Telegram for example. Netdata will warn you if it calculates that your disk drive will be full in 7 hours, for example.
Ideally, you’d configure each netdata to monitor the other, then you don’t need an external uptime monitoring service. Netdata is great at alerting prior to catastrophes, but how you respond to them is another story.
Netdata is great at discovering services running on a server: it will monitor every postgresql database it finds. Note that it won’t look for databases inside containers, which is one of the reasons why you’d rather have postgresql running on the host server for your production deployment.
Database processes such as postgres are not meant for running inside containers: they don’t support being killed after N seconds, which is what Docker sometimes does to containers, so that’s something to be pretty careful with. This is rarely a problem in most projects, but always be careful to verify your backups prior to operating on a database or file system.
It’s a bit tricky to set up monitoring on databases running inside docker too, so you should favor not containerizing your production database.
You can find source code matching this practice on github.com/betagouv/mrs
I’d like to have fewer configuration files in the future, if only I could have a nice little Python framework to replace my Dockerfile and docker-compose files which would support ephemeral deployments and backup/restore as well as custom operations out of the box… well, that’s what I’m currently working on in yourlabs.io/oss/podctl