LPI 305 Container Virtualization Concepts

Container virtualization is here to stay, and you should get familiar with it. Nowadays there are several solutions for dealing with containerized environments, for instance Docker, LXC, buildah, podman and many others. And we should not forget about container orchestration with tools like Mesos, Kubernetes, Docker Swarm, Rancher and OpenShift, or through managed cloud offerings such as EKS, GKE and AKS. In this post we're going to focus on understanding the base concepts behind containers, using the LPIC 305 - Container Virtualization Concepts topics as our guide.

So, let’s roll… :)

For the LPIC 305 Container Virtualization Concepts topic we have:

Weight: 7

Description: Candidates should understand the concept of container virtualization. This includes understanding the Linux components used to implement container virtualization as well as using standard Linux tools to troubleshoot these components.

Key Knowledge Areas:

  • Understand the concepts of system and application container
  • Understand and analyze kernel namespaces
  • Understand and analyze control groups
  • Understand and analyze capabilities
  • Understand the role of seccomp, SELinux and AppArmor for container virtualization
  • Understand how LXC and Docker leverage namespaces, cgroups, capabilities, seccomp and MAC
  • Understand the principle of runc
  • Understand the principle of CRI-O and containerd
  • Awareness of the OCI runtime and image specifications
  • Awareness of the Kubernetes Container Runtime Interface (CRI)
  • Awareness of podman, buildah and skopeo
  • Awareness of other container virtualization approaches in Linux and other free operating systems, such as rkt, OpenVZ, systemd-nspawn or BSD Jails

The following is a partial list of the used files, terms and utilities:

  • nsenter
  • unshare
  • ip (including relevant subcommands)
  • capsh
  • /sys/fs/cgroups
  • /proc/[0-9]+/ns
  • /proc/[0-9]+/status

Understanding Containers

Understanding the basics of containers is not that complicated. You could go through any basic tutorial, or the Getting Started guide from the Docker documentation (which I highly recommend), and in an hour or two you would be fine running containers. But if you stop there, you'll only understand how to use a frontend tool, not the underlying technologies that actually let a container run - and by that I mean understanding what a cgroup is, how namespaces isolate your host's network from the containers' networks, and what a runtime is and why you need one.

So, let's start with the following:

  • Control Groups
  • Kernel Namespaces
  • Container Capabilities

Linux Control Groups, cgroups

From the GNU/Linux man page, cgroups(7), we have that cgroups are a:

Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored

Which means that with cgroups we can control, monitor and limit a process's resources, such as CPU time, memory and bandwidth. For example, if you're running a task that is going to take some time to finish, but you want to guarantee that it will not consume all your memory or CPU, you can easily run it within a cgroup that limits memory and CPU time usage.
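
As a minimal sketch of that idea (assuming a systemd-based distribution; ./long-task.sh is just a placeholder for your own command, and on an older cgroup v1-only setup you may need MemoryLimit= instead of MemoryMax=), you can ask systemd to run a command inside a transient scope with resource limits:

~$ sudo systemd-run --scope -p MemoryMax=256M -p CPUQuota=50% ./long-task.sh

systemd creates a transient .scope unit for the command and translates those properties into the matching cgroup controls, so the task cannot use more than 256 MiB of memory or more than half of one CPU.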

In your GNU/Linux environment you can take a look at which cgroup controllers you already have by listing the content of the /sys/fs/cgroup/ folder:

~$ ls /sys/fs/cgroup/
blkio  cpuacct      cpuset   freezer  memory   net_cls,net_prio  perf_event  rdma     unified
cpu    cpu,cpuacct  devices  hugetlb  net_cls  net_prio          pids        systemd
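
A listing like this one, with a separate directory per controller, is the cgroup v1 (or hybrid) layout. As a quick check, hedged because the exact output depends on your distribution, you can look at the filesystem type mounted on /sys/fs/cgroup/:

~$ stat -fc %T /sys/fs/cgroup/
tmpfs

On a system booted with the unified hierarchy (cgroup v2) this prints cgroup2fs instead, and the per-controller directories are replaced by a single tree.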

If we take the memory controller as an example:

~$ ls -1F /sys/fs/cgroup/memory/
cgroup.clone_children
cgroup.event_control
cgroup.procs
cgroup.sane_behavior
docker/
foo/
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
release_agent
system.slice/
tasks
user.slice/

Here we find the list of controls that can be applied to the processes in this hierarchy. For example, memory.limit_in_bytes limits the amount of memory the processes can use, and memory.swappiness sets how aggressively the kernel swaps out memory for this group (a hands-on sketch of memory.limit_in_bytes follows the list below). The folders inside this directory correspond to the systemd unit types used for resource control. In this case we have:

  • Service: a process or group of processes started by systemd based on a unit file
  • Scope: a group of externally created processes, for example user sessions, containers and virtual machines
  • Slice: organizes a hierarchy in which scopes and services are placed
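
To see memory.limit_in_bytes in action, here is a rough sketch, assuming a cgroup v1 memory controller mounted as above, root privileges, and a made-up cgroup name demo:

~$ sudo mkdir /sys/fs/cgroup/memory/demo                                   # new child cgroup in the memory hierarchy
~$ echo 256M | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes   # cap it at 256 MiB
~$ echo $$ | sudo tee /sys/fs/cgroup/memory/demo/tasks                     # move the current shell into it

From now on, everything started from that shell is charged against the 256 MiB limit; if the group goes over it, the kernel reclaims or swaps its memory and, as a last resort, triggers the OOM killer inside the cgroup.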

This way, inside the user.slice folder we find a user-1000.slice directory which, besides the same group of control files, also contains a session-1.scope:

~$ ls -1F /sys/fs/cgroup/memory/user.slice/user-1000.slice/
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
session-1.scope/
tasks
'user@1000.service'/

If you run a cat command on the tasks file inside this folder, you will be presented with all the processes that your session is holding. In my case I can see PID 1746, which runs the ssh-agent:

~$ cat /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-1.scope/tasks 
1440
1441
1442
1467
1468
1470
1472
1473
1474
1582
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1612
1665
1666
1668
1746
1844
1845
1856
~$ cat /proc/1746/comm
ssh-agent
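
You can also go the other way around and ask a process which cgroups it belongs to by reading /proc/<pid>/cgroup. A hedged example for the ssh-agent PID above (the hierarchy number and path will differ on your system):

~$ grep memory /proc/1746/cgroup
6:memory:/user.slice/user-1000.slice/session-1.scope

On cgroup v1 each line has the format hierarchy-ID:controller:path, where the path is relative to the mount point of that controller under /sys/fs/cgroup/.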

Another way to check this hierarchy is by running the systemd-cgls command:

~$ systemd-cgls
...
...
...
│   └─1768 /usr/libexec/gvfs-afc-volume-monitor
│   └─session-1.scope 
│     ├─1440 gdm-session-worker [pam/gdm-autologin]
│     ├─1467 /usr/bin/gnome-keyring-daemon --daemonize --login
│     ├─1472 /usr/lib/gdm3/gdm-x-session --run-script env GNOME_SHELL_SESSION_MODE=pop /usr/bin/gnome-ses>
│     ├─1474 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -no>
│     ├─1668 /usr/libexec/gnome-session-binary --systemd --systemd --session=pop
│     └─1746 /usr/bin/ssh-agent /usr/bin/im-launch env GNOME_SHELL_SESSION_MODE=pop /usr/bin/gnome-sessio>
├─init.scope 
...
...
...

And for monitoring the cgroups' resource consumption you can use the systemd-cgtop command, which returns an output like the one below:

Control Group                               Tasks   %CPU   Memory  Input/s Output/s
user.slice                                    759    8.7     4.1G        -        -
/                                            1184    4.0     6.5G        -        -
system.slice                                  163    2.1     1.6G        -        -
system.slice/acpid.service                      1    1.3   684.0K        -        -
system.slice/systemd-logind.service             1    0.7     7.3M        -        -
system.slice/systemd-journald.service           1    0.1    82.6M        -        -
system.slice/containerd.service                42    0.0   103.9M        -        -
system.slice/system76-power.service            17    0.0     6.0M        -        -
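
Since systemd already manages these slices, scopes and services, you can also adjust a unit's cgroup limits on the fly with systemctl set-property. A hedged example, where foo.service is a placeholder and the accepted properties depend on your systemd version:

~$ sudo systemctl set-property foo.service MemoryMax=1G CPUQuota=20%

systemd applies the change immediately to the unit's cgroup and persists it as a drop-in file, so it survives service restarts; add --runtime if you only want it to last until the next reboot.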

Namespaces

Boundaries of a process

Namespaces set the boundaries of what a process can see: each namespace type gives a group of processes its own isolated view of a system resource, such as process IDs, network interfaces, mount points, hostnames, users or IPC objects. The main tools and files for inspecting and manipulating namespaces are:

  • lsns - lists the namespaces currently in use on the system
  • /proc/*/ns - per-process symlinks to the namespaces that process belongs to
  • unshare - runs a program in a namespace unshared from its parent process
  • nsenter - enters the namespaces of one or more processes and then executes the specified program
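
As a quick illustration of these tools, here is a hedged sketch (the hostname lpic-demo and the PID 4242 are made up, and most namespace types require root to create):

~$ sudo unshare --uts --fork /bin/bash      # start a shell inside a new UTS namespace
~# hostname lpic-demo                       # changes the hostname only inside that namespace
~# echo $$                                  # note this shell's PID, e.g. 4242

From another shell on the host, the original hostname is untouched, and you can jump into the new namespace by PID:

~$ hostname                                 # still shows the host's own name
~$ sudo nsenter --target 4242 --uts hostname
lpic-demo
~$ ls -l /proc/$$/ns                        # namespaces the current shell belongs to
~$ lsns --type uts                          # UTS namespaces in use and their processes

This per-resource isolation (UTS, PID, network, mount, IPC, user) is exactly what container runtimes combine, together with cgroups and capabilities, to build a container.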